An Interesting Problem with ODI: Unable to retrieve user GUID

One of my customers had a problem logging in to Oracle Data Integrator (ODI) Studio. Their ODI implementation is configured to use external authentication (Microsoft Active Directory). The configuration was done years ago and has not been modified since; in fact, most people do not even remember how it was set up. Everything was fine until they started to get the “ODI-10192: Unable to retrieve user GUID” error.

They said they got the first error about (more...)
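ODI handles this lookup internally, but conceptually the failing step is an LDAP search that fetches the user’s objectGUID attribute from Active Directory. A minimal sketch of that kind of lookup with the Python ldap3 library (the host, bind user, filter, and search base are placeholders, not the customer’s real configuration, and this is not ODI’s actual code path):

    # Rough idea of the lookup behind ODI-10192: asking Active Directory
    # for a user's objectGUID. All connection details are placeholders.
    from ldap3 import Server, Connection, ALL

    server = Server('ad.example.com', get_info=ALL)
    conn = Connection(server, user='EXAMPLE\\binduser', password='***',
                      auto_bind=True)

    # Search for the user and request only the objectGUID attribute
    conn.search('dc=example,dc=com', '(sAMAccountName=jsmith)',
                attributes=['objectGUID'])

    if conn.entries:
        print(conn.entries[0].objectGUID)  # the GUID ODI expects to find
    else:
        print('No such user - the kind of failure behind ODI-10192')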

PySpark Examples #5: Discretized Streams (DStreams)

This is the fourth blog post in which I share sample scripts from my presentation about “Apache Spark with Python”. Spark supports two different ways of streaming: Discretized Streams (DStreams) and Structured Streaming. DStreams is the basic abstraction in Spark Streaming: a continuous sequence of RDDs representing a stream of data. Structured Streaming is the newer way of streaming, built on the Spark SQL engine. In the next blog post, I’ll also (more...)
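The full script is in the post itself; as a quick taste of the DStreams API, here is a minimal word count over a socket stream. The host, port, and 10-second batch interval are arbitrary choices for this sketch (you can feed it with “nc -lk 9999”):

    # Minimal DStreams sketch: word count over a socket stream.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="DStreamWordCount")
    ssc = StreamingContext(sc, 10)            # 10-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                           # print each batch's counts

    ssc.start()
    ssc.awaitTermination()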

PySpark Examples #3-4: Spark SQL Module

In this blog post, I’ll share examples #3 and #4 from my presentation to demonstrate the capabilities of the Spark SQL module. As I already explained in my previous blog posts, the Spark SQL module provides DataFrames (and Datasets, though Python doesn’t support Datasets because it’s a dynamically typed language) to work with structured data.

First, let’s start by creating a temporary table from a CSV file and running a query on it. Like I did in my (more...)
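As a rough sketch of that pattern (the file name “people.csv” and its columns are made up for illustration, not the post’s actual data):

    # Load a CSV into a DataFrame, register it as a temporary table,
    # and query it with plain SQL. File and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

    df = spark.read.csv("people.csv", header=True, inferSchema=True)
    df.createOrReplaceTempView("people")

    spark.sql("SELECT country, COUNT(*) AS cnt FROM people "
              "GROUP BY country ORDER BY cnt DESC").show()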

PySpark Examples #2: Grouping Data from CSV File (Using DataFrames)

I continue to share example code related to my “Spark with Python” presentation. In my last blog post, I showed how we use RDDs (the core data structures of Spark). This time, I will use DataFrames instead of RDDs. DataFrames are distributed collections of data organized into named columns (in a structured way). They are similar to tables in relational databases. They also provide a domain-specific language API to manipulate your distributed (more...)
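To make that concrete, here is a small hypothetical grouping with the DataFrame API (“sales.csv” and its columns are assumptions for the sketch, not the post’s actual data):

    # Group a CSV by one column and aggregate another, like GROUP BY in SQL.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("GroupWithDataFrames").getOrCreate()

    df = spark.read.csv("sales.csv", header=True, inferSchema=True)
    df.groupBy("region").agg(F.sum("amount").alias("total_amount")).show()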

PySpark Examples #1: Grouping Data from CSV File (Using RDDs)

During my presentation about “Spark with Python”, I said that I would share example code (with detailed explanations). So this is my first example. In this code, I read data from a CSV file to create a Spark RDD (Resilient Distributed Dataset). RDDs are the core data structures of Spark. I explained the features of RDDs in my presentation, so in this blog post, I will focus only on the example code.

For (more...)
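Before the full example, here is a minimal sketch of the same kind of grouping with the low-level RDD API (the file name and column positions are illustrative assumptions):

    # Read a CSV as plain text, drop the header, and sum amounts per region.
    from pyspark import SparkContext

    sc = SparkContext(appName="GroupWithRDDs")

    rdd = sc.textFile("sales.csv")
    header = rdd.first()
    rows = rdd.filter(lambda line: line != header)   # skip the header line

    totals = (rows.map(lambda line: line.split(","))
                  .map(lambda cols: (cols[0], float(cols[1])))
                  .reduceByKey(lambda a, b: a + b))

    for region, total in totals.collect():
        print(region, total)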

Introduction to Apache Spark with Python

Today, I spoke about “Apache Spark with Python” at the Big Talk #2 meet-up in Istanbul Teknokent ARI-3, another event organized by Komtas for the big data community. The room was almost full. Mine was the last session of the day, but the audience was still focused and eager to listen to the talks, so for me, the event was great.

By the way, I also enjoyed the other speakers’ sessions: Zekeriya Beşioğlu spoke about Data (more...)

Using Spark to Process Data From Cassandra for Analytics

After my presentation about Apache Cassandra, most people asked whether they could run analytical queries on Cassandra and how they could integrate Spark with Cassandra. So I decided to write a blog post to demonstrate how we can process data from Cassandra using Spark. In this blog post, I’ll show how to build a testing environment on Oracle Cloud (Spark + Cassandra), load sample data into Cassandra, and query the data using Spark.

Let (more...)
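As a sketch of the Spark side, reading a Cassandra table into a DataFrame with the DataStax spark-cassandra-connector looks roughly like this (the keyspace, table, and column names are placeholders, and the connector package has to be supplied to spark-submit with --packages):

    # Read a Cassandra table into a DataFrame and run an analytical query.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("SparkCassandra")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="test_keyspace", table="test_table")
          .load())

    df.groupBy("some_column").count().show()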

Build a Cassandra Cluster on Docker

In this blog post, I’ll show how we can build a three-node Cassandra cluster on Docker for testing. I’ll use the official Cassandra images instead of creating my own, so the whole process will take only a few minutes (depending on your network connection). I assume that you have Docker installed on your PC, have an internet connection (I was born in 1976, so it’s normal for me to ask these kinds of questions) and your PC (more...)
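Once the three nodes are up, a quick sanity check from Python might look like this (assuming the cassandra-driver package is installed and a node’s port 9042 is reachable from your PC):

    # Connect to one node and list the peers it knows about.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])   # any reachable node works as contact point
    session = cluster.connect()

    # system.peers lists the other nodes in the cluster
    for row in session.execute("SELECT peer, data_center FROM system.peers"):
        print(row.peer, row.data_center)

    cluster.shutdown()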

Introduction to Apache Cassandra

On Friday, I gave a presentation about Apache Cassandra at the Big Talk event organized by the Komtaş Information Management company. Cassandra is a top-level Apache project that was born at Facebook. It is a distributed database for managing large amounts of structured data. It provides a highly available service with no single point of failure, even when running on commodity hardware.

At my previous company, we used Cassandra to store our social platform data. It performs well (more...)

Using Spark to Join Data from CSV and MySQL Table

Yesterday, I explained how we can access a MySQL database from Zeppelin, which comes with Oracle Big Data Cloud Service Compute Edition (BDCSCE). Although we can use Zeppelin to access MySQL, we still need something more powerful to combine data from two different sources (for example, data from a CSV file and RDBMS tables). Spark is a great choice for processing data. In this blog post, I’ll write a simple PySpark (Python for Spark) script that will (more...)
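The shape of such a join is roughly the following (connection details, file and table names, and the join column are all illustrative assumptions, not the post’s actual code):

    # Build one DataFrame from a CSV file and one from a MySQL table over
    # JDBC, then join them on a shared key.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("JoinCsvAndMySQL").getOrCreate()

    csv_df = spark.read.csv("orders.csv", header=True, inferSchema=True)

    mysql_df = (spark.read.format("jdbc")
                .option("url", "jdbc:mysql://dbhost:3306/mydb")
                .option("dbtable", "customers")
                .option("user", "dbuser")
                .option("password", "***")
                .option("driver", "com.mysql.jdbc.Driver")
                .load())

    joined = csv_df.join(mysql_df, csv_df.customer_id == mysql_df.id)
    joined.show()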