Amazon QLDB and the Missing Command Line Client

Amazon Quantum Ledger Database (QLDB) is a fully managed ledger database that tracks all changes to user data and maintains a verifiable history of those changes over time. It was announced at AWS re:Invent 2018 and is now available in five AWS regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo).

You may ask why you would want to use QLDB (a ledger database) instead of your traditional (more...)

Sample AWS Lambda Function to Monitor Oracle Database

I wrote a very simple AWS Lambda function to demonstrate how to connect to an Oracle database, gather the tablespace usage information, and send these metrics to CloudWatch. First, I wrote this Lambda function in Python, and then I had to rewrite it in Java. As you may know, you need to use the cx_Oracle module to connect to Oracle databases with Python. This extension module requires some libraries which are shipped with Oracle Database Client (oh God! (more...)
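
As a rough illustration of the "gather tablespace usage and send it to CloudWatch" step, here is a minimal Python sketch that only builds the metric payload. The namespace, metric name, dimension name, and sample rows are all assumptions made for this example and are not taken from the original function; the database query itself is left out.

```python
from typing import Dict, List, Tuple

# Hypothetical namespace; the post does not say which name it used.
NAMESPACE = "Custom/OracleTablespaces"


def build_metric_data(rows: List[Tuple[str, float, float]]) -> List[Dict]:
    """Turn (tablespace_name, used_mb, total_mb) rows into CloudWatch metric entries."""
    metric_data = []
    for name, used_mb, total_mb in rows:
        # Guard against division by zero for an empty or offline tablespace.
        used_pct = round(100.0 * used_mb / total_mb, 2) if total_mb else 0.0
        metric_data.append(
            {
                "MetricName": "TablespaceUsedPercent",  # assumed metric name
                "Dimensions": [{"Name": "Tablespace", "Value": name}],
                "Value": used_pct,
                "Unit": "Percent",
            }
        )
    return metric_data


# Made-up rows, shaped like the result of a tablespace usage query.
sample_rows = [("USERS", 512.0, 1024.0), ("SYSTEM", 750.0, 1000.0)]
payload = build_metric_data(sample_rows)
```

Inside a Lambda handler, a payload like this could then be passed to boto3's CloudWatch client with `put_metric_data(Namespace=NAMESPACE, MetricData=payload)`.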

An Interesting Problem with ODI: Unable to retrieve user GUID

One of my customers had a problem logging in to Oracle Data Integrator (ODI) Studio. Their ODI implementation is configured to use external authentication (Microsoft Active Directory). The configuration was done years ago. No one has modified it since; in fact, most people do not even remember how it was configured. Everything was fine until they started to get the “ODI-10192: Unable to retrieve user GUID” error.

They said they got the first error about (more...)

PySpark Examples #5: Discretized Streams (DStreams)

This is the fourth blog post in which I share sample scripts from my presentation about “Apache Spark with Python”. Spark supports two different ways of streaming: Discretized Streams (DStreams) and Structured Streaming. DStreams are the basic abstraction in Spark Streaming. A DStream is a continuous sequence of RDDs representing a stream of data. Structured Streaming is the newer way of streaming, and it’s built on the Spark SQL engine. In the next blog post, I’ll also (more...)
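
To make the idea of a "continuous sequence of RDDs" a bit more concrete, here is a plain-Python sketch of discretizing a stream into fixed-interval batches. It is only an illustration of the concept and does not use the Spark Streaming API; the function name, the event format, and the batch interval are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def discretize(events: List[Tuple[float, str]], batch_interval: float) -> List[List[str]]:
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    if not events:
        return []
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, value in events:
        # Every event falls into the batch whose time window contains it.
        buckets[int(timestamp // batch_interval)].append(value)
    # Emit one batch per interval, including empty ones, so the sequence has no gaps.
    first, last = min(buckets), max(buckets)
    return [buckets.get(i, []) for i in range(first, last + 1)]


# Made-up events: (seconds since start, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_interval=1.0)
```

Each inner list plays the role of one small batch of the stream; in Spark, each batch would be an RDD that the same transformations are applied to.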

PySpark Examples #3-4: Spark SQL Module

In this blog post, I’ll share examples #3 and #4 from my presentation to demonstrate the capabilities of the Spark SQL module. As I already explained in my previous blog posts, the Spark SQL module provides DataFrames (and Datasets – but Python doesn’t support Datasets because it’s a dynamically typed language) to work with structured data.

First, let’s start by creating a temporary table from a CSV file and running a query on it. Like I did my (more...)

PySpark Examples #2: Grouping Data from CSV File (Using DataFrames)

I continue to share example code related to my “Spark with Python” presentation. In my last blog post, I showed how we use RDDs (the core data structures of Spark). This time, I will use DataFrames instead of RDDs. DataFrames are distributed collections of data organized into named columns (in a structured way). They are similar to tables in relational databases. They also provide a domain-specific language API to manipulate your distributed (more...)

PySpark Examples #1: Grouping Data from CSV File (Using RDDs)

During my presentation about “Spark with Python”, I said that I would share example code (with detailed explanations). So this is my first example. In this code, I read data from a CSV file to create a Spark RDD (Resilient Distributed Dataset). RDDs are the core data structures of Spark. I explained the features of RDDs in my presentation, so in this blog post, I will only focus on the example code.
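
As a rough idea of what grouping CSV data with RDDs can look like, here is a minimal sketch. It is not the code from the post: the two-column `category,amount` file layout and the function names are assumptions made for this example, and the SparkContext is expected to be created by the caller.

```python
from operator import add
from typing import List, Tuple


def parse_line(line: str) -> Tuple[str, float]:
    """Split a 'category,amount' CSV line into a (key, value) pair."""
    category, amount = line.split(",")
    return category.strip(), float(amount)


def total_by_category(sc, path: str) -> List[Tuple[str, float]]:
    """Sum amounts per category using RDD transformations.

    `sc` is an existing pyspark.SparkContext; `path` points to a
    headerless CSV file with the assumed two-column layout.
    """
    lines = sc.textFile(path)        # RDD of raw text lines
    pairs = lines.map(parse_line)    # RDD of (category, amount) pairs
    totals = pairs.reduceByKey(add)  # combine values that share a key
    return sorted(totals.collect())  # bring the small result back to the driver
```

With a running SparkContext `sc`, calling `total_by_category(sc, "sales.csv")` would return one `(category, total)` tuple per distinct category; the file name here is just a placeholder.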

For (more...)

Introduction to Apache Spark with Python

Today, I spoke about “Apache Spark with Python” at the Big Talk #2 meet-up in Istanbul Teknokent ARI-3, another event organized by Komtas for the big data community. We had an almost full room. Mine was the last session of the day, but the audience was still very focused and eager to listen, so for me, the event was great.

By the way, I also enjoyed the sessions of other speakers: Zekeriya Beşioğlu spoke about Data (more...)

Using Spark to Process Data From Cassandra for Analytics

After my presentation about Apache Cassandra, most people asked whether they could run analytical queries on Cassandra, and how they could integrate Spark with Cassandra. So I decided to write a blog post to demonstrate how we can process data from Cassandra using Spark. In this blog post, I’ll show how I can build a testing environment on Oracle Cloud (Spark + Cassandra), load sample data into Cassandra, and query the data using Spark.

Let (more...)

Build a Cassandra Cluster on Docker

In this blog post, I’ll show how we can build a three-node Cassandra cluster on Docker for testing. I’ll use the official Cassandra images instead of creating my own, so the whole process will take only a few minutes (depending on your network connection). I assume that you have Docker installed on your PC, have an internet connection (I was born in 1976, so it’s normal for me to ask these kinds of questions) and your PC (more...)