How to Use an AWS S3 Bucket for the Spark History Server

Uncategorized
| Nov 18, 2019

Since EMR release 5.25, it’s possible to debug and monitor your Apache Spark jobs by logging directly into the off-cluster, persistent Apache Spark History Server using the EMR Console. You do not need to do anything extra to enable it, and you can access the Spark history even after the cluster is terminated. The logs are available for active clusters and are retained for 30 days after the cluster is terminated.

Although this is a (more...)

Query an HBase table through Hive using PySpark on EMR

Uncategorized
| Oct 15, 2019

In this blog post, I’ll demonstrate how we can access an HBase table through Hive from a PySpark script/job on an AWS EMR cluster. First, I created an EMR cluster (EMR 5.27.0, Hive 2.3.5, HBase 1.4.0). Then I connected to the master node, executed “hbase shell”, created an HBase table, and inserted a sample row:

create 'mytable','f1'
put 'mytable', 'row1', 'f1:name', 'Gokhan'

I logged in to Hive and created (more...)
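The rest of the post is behind the link, but the standard Hive–HBase mapping the setup implies can be sketched as follows (my illustration, not the author’s exact code; the table and column names come from the shell commands above):

```python
# Sketch only - assumes an EMR cluster where Hive's HBase storage handler
# jars are available, and a job submitted with spark-submit.
#
# Step 1 (in the hive shell): map the HBase table 'mytable' into Hive:
#
#   CREATE EXTERNAL TABLE mytable_hive (rowkey STRING, name STRING)
#   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
#   WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f1:name')
#   TBLPROPERTIES ('hbase.table.name' = 'mytable');
#
# Step 2: query the mapped table from PySpark with Hive support enabled.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hbase-via-hive")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SELECT rowkey, name FROM mytable_hive").show()
```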

How to manage GDPR compliance with Snowflake’s Time Travel and Disaster Recovery 

Uncategorized
| Jul 16, 2019
As organizations work to bring their data practices into compliance with GDPR, one question comes up repeatedly: How does Snowflake enable my organization to be GDPR compliant?

Thinking of setting up a Data Lake for your org – let’s read this!

Uncategorized
| Jul 13, 2019
It is not a new fact that data-driven outcomes have been among the most electrifying revelations of the 20th century. Almost every organization considers data the primary driver of its business and growth. Several strategies have evolved to manage the storage, analytics, and consumption of data. Not just for regulatory purposes, but data helps… Read More

The Elephant in the Data Lake and Snowflake

Uncategorized
| May 31, 2019
Originally posted on Jeffrey Jacobs, Consulting Data Architect:
Let’s talk about the elephant in the data lake, Hadoop, and the constant evolution of technology. Hadoop, (symbolized by an elephant), was created to handle massive amounts of raw data that were beyond the capabilities of existing database technologies. At its core, Hadoop is simply a distributed…

Schema-on-what? How to model JSON

Uncategorized
| Nov 25, 2018
How do you make sense out of schema-on-read? This post shows you how to turn a JSON document into a relational data model that any modeler or relational database person could understand.
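As a toy illustration of that idea (my own example, not taken from the post), a nested JSON document maps naturally onto a parent table plus a child table linked by a key:

```python
import json

# A nested JSON document: one order with repeated line items.
doc = json.loads("""
{
  "order_id": 1001,
  "customer": "Acme",
  "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B7", "qty": 1}
  ]
}
""")

# Relational shape: an ORDERS row, plus ORDER_ITEMS rows carrying the
# order_id as a foreign key back to the parent.
orders = [{"order_id": doc["order_id"], "customer": doc["customer"]}]
order_items = [
    {"order_id": doc["order_id"], "sku": i["sku"], "qty": i["qty"]}
    for i in doc["items"]
]

print(orders)
print(order_items)
```

The repeated `items` array is exactly the part that becomes its own table; everything scalar stays on the parent row.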

Automatic Clustering, Materialized Views and Automatic Maintenance in Snowflake 

Uncategorized
| Nov 15, 2018
Boy, are things going bananas at Snowflake these days. The really big news a few weeks back was another round of funding! This week we announced two new major features.

Why document databases are old news…

Uncategorized
| Nov 9, 2018

“We’re going to store data the way it’s stored naturally in the brain.”

This is a phrase being heard more often today. This blog post is inspired by a short rant Babak Tourani (@2ndhalf_oracle) and I had on Twitter today.

How cool is that!!

This phrase is used by companies like MongoDB and graph database vendors to explain why they choose to store information/data in an unstructured format. It is (more...)

Get Started Faster with Snowflake Partner Connect

Uncategorized
| Aug 8, 2018
Got data you want to load and analyze in the cloud? Would you like to do it today? Then check out this announcement about getting up and running with your data in Snowflake faster. #YourDataNoLimits Ending the struggle for data starts with getting to your data faster. Snowflake already streamlines the path to get up […]

#Apress “Practical Enterprise Data Lake Insights” – Published!

Uncategorized
| Jun 29, 2018
Hello all! It gives me immense pleasure to announce the release of our book, “Practical Enterprise Data Lake Insights”, with Apress. The book takes an end-to-end solution approach to a data lake environment, covering data capture, processing, security, and availability. Credit to the co-author of the book, Venkata Giri, and technical reviewer, Sai Sundar. The … Continue reading

Big Data Introduction – Workshop

Uncategorized
| May 28, 2018
Our focus was clear - this was a level 101 class for IT professionals in Bangalore who had heard of Big Data, were interested in Big Data, but were unsure how and where to dip their toes into the world of analytics and Big Data. A one-day workshop - with a mix of slides, white-boarding, a case study, a small game, and a mini-project - was, we felt, the ideal vehicle for getting people to wrap their (more...)

PySpark Examples #5: Discretized Streams (DStreams)

Uncategorized
| Apr 18, 2018

This is the fourth blog post in which I share sample scripts from my presentation about “Apache Spark with Python“. Spark supports two different ways of streaming: Discretized Streams (DStreams) and Structured Streaming. DStreams is the basic abstraction in Spark Streaming: a continuous sequence of RDDs representing a stream of data. Structured Streaming is the newer way of streaming, and it’s built on the Spark SQL engine. In my next blog post, I’ll also (more...)
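A minimal DStreams word-count sketch (my illustration, with a placeholder host and port, not necessarily the post’s script) looks like this when submitted with spark-submit:

```python
# Count words arriving on a TCP socket in 5-second micro-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-wordcount")
ssc = StreamingContext(sc, batchDuration=5)  # one RDD per 5-second batch

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print the counts computed for each batch

ssc.start()
ssc.awaitTermination()
```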

PySpark Examples #3-4: Spark SQL Module

Uncategorized
| Apr 17, 2018

In this blog post, I’ll share examples #3 and #4 from my presentation to demonstrate the capabilities of the Spark SQL module. As I already explained in my previous blog posts, the Spark SQL module provides DataFrames (and DataSets – but Python doesn’t support DataSets, because it’s a dynamically typed language) to work with structured data.

First, let’s start by creating a temporary table from a CSV file and running a query on it. Like I did in my (more...)
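A minimal version of that first example might look like this (file name and columns are placeholders, since the post’s actual data is behind the link):

```python
# Register a CSV file as a temporary view and query it with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-example").getOrCreate()

df = spark.read.csv("people.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("people")  # queryable only in this session

spark.sql("SELECT name, age FROM people WHERE age > 30").show()
```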

PySpark Examples #2: Grouping Data from CSV File (Using DataFrames)

Uncategorized
| Apr 16, 2018

I continue to share example code related to my “Spark with Python” presentation. In my last blog post, I showed how to use RDDs (the core data structures of Spark). This time, I will use DataFrames instead of RDDs. DataFrames are distributed collections of data organized into named columns (in a structured way). They are similar to tables in relational databases. They also provide a domain-specific language API to manipulate your distributed (more...)
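A grouping example in that style might look like this (placeholder file and column names, not the presentation’s actual data):

```python
# Group a CSV file by one column and aggregate with the DataFrame API.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("df-grouping").getOrCreate()

df = spark.read.csv("sales.csv", header=True, inferSchema=True)

(df.groupBy("region")
   .agg(F.sum("amount").alias("total_amount"),
        F.count("*").alias("orders"))
   .orderBy("region")
   .show())
```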

Data Warrior #HappyDance! Guess who joined @Snowflakedb

Uncategorized
| Apr 15, 2018
It’s going to be a big day at Snowflake. Two of my good friends are joining my team.

Ingesting data into Hive using Spark

Uncategorized
| Feb 6, 2018

Before heading off for my hols at the end of this week to soak up some sun with my gorgeous princess, I thought I would write another blog post. Two posts in a row – that has not happened for ages!

 

My previous post was about loading a text file into Hadoop using Hive. Today, I am looking to load the same file into a Hive table, but using Spark (more...)
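The steps after the cut are not shown; a sketch of the usual pattern (placeholder paths, columns, and table names, not the author’s exact code) would be:

```python
# Read a delimited text file and persist it as a Hive table.
# Requires Hive support on the cluster; submitted with spark-submit.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ingest-into-hive")
         .enableHiveSupport()
         .getOrCreate())

df = (spark.read
      .option("delimiter", ",")
      .csv("/data/input.txt")
      .toDF("id", "name", "city"))  # assign column names explicitly

df.write.mode("overwrite").saveAsTable("mydb.people")
```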

Loading data in Hadoop with Hive

Uncategorized
| Jan 31, 2018

It’s been a busy month but 2018 begins with a new post.

 

A common Big Data scenario is to use Hadoop for transforming and ingesting data – in other words, using Hadoop for ETL.

In this post, I will show an example of how to load a comma-separated values (CSV) text file into HDFS. Once the file is moved into HDFS, use Apache Hive to create a table and load the data into (more...)
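The commands behind the link are truncated, but the standard pattern (a generic sketch, not necessarily the post’s exact statements) is:

```sql
-- First, from the shell, copy the CSV file into HDFS:
--   hdfs dfs -mkdir -p /user/hive/input
--   hdfs dfs -put people.csv /user/hive/input/

-- Then, in Hive, create a table over comma-delimited text...
CREATE TABLE people (id INT, name STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- ...and move the file into the table's storage.
LOAD DATA INPATH '/user/hive/input/people.csv' INTO TABLE people;
```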

Big Data Marathon

Uncategorized
| Nov 16, 2017

This week there is a Big Data event in London, gathering Big Data clients, geeks, and vendors from all over to speak on the latest trends, projects, platforms, and products. It helps everyone stay on the same page, align the steering wheel, and get a feel for where the fast-paced technology world is going. The event is massive, but I am glad I could make it, even if only for one hour (more...)

Hadoop for Database Professionals class at NoCOUG Fall Conference on 9th Nov

Uncategorized
| Oct 27, 2017

If you happen to be in the Bay Area on Thursday, 9th November, then come check out the NoCOUG Fall Conference at California State University in downtown Oakland, CA.

Gluent is delivering a Hadoop for Database Professionals class as a separate track there (with myself and Michael Rainey as speakers) where we’ll explain the basics & concepts of modern distributed data processing platforms and then show a bunch of Hadoop demos too (mostly SQL-on-Hadoop stuff (more...)

Hadoop for Database Professionals – St. Louis (Sep 7)

Uncategorized
| Aug 28, 2017

Here’s some more free stuff by Gluent!

We are running another half-day course together with Cloudera, this time in St. Louis on September 7, 2017.

We will use our database background and explain, using database professionals’ terminology, why “new world” technologies like Hadoop will take over some parts of enterprise IT, why those platforms are so much better for advanced analytics over big datasets, and how to use the right tool from the Hadoop ecosystem (more...)