#Apress “Practical Enterprise Data Lake Insights” – Published!

Hello all, it gives me immense pleasure to announce the release of our book “Practical Enterprise Data Lake Insights” with Apress. The book takes an end-to-end solution approach in a data lake environment that includes data capture, processing, security, and availability. Credits to the co-author of the book, Venkata Giri, and the technical reviewer, Sai Sundar. The … Continue reading

The Forrester Wave™: Big Data Fabric, Q2 2018

The @Forrester Wave™: Big Data Fabric, Q2 2018. @Oracle continues to broaden its Big Data Fabric solution and leads the pack.

Big Data Introduction – Workshop

Our focus was clear - this was a level 101 class, for IT professionals in Bangalore who had heard of Big Data, were interested in Big Data, but were unsure how and where to dip their toe into the world of analytics and Big Data. A one-day workshop - with a mix of slides, white-boarding, a case study, a small game, and a mini-project - we felt, was the ideal vehicle for getting people to wrap their (more...)

Time for #GLOC, #SQLSatDallas, #DataSummit18

Over the next nine days, I’m traveling to three cities for four events. We’ll just call this the 9-3-4 gauntlet of speaker life. I booked this travel as four one-way flights to get the itinerary I needed to make the most of my schedule, and I will have breaks between each event to make sure I don’t kill myself in my last two weeks at Delphix.

GLOC

Today I’m heading to the Great Lakes Oracle Conference, (https://gloc.neooug. (more...)

Broadening Your Audience

I spent this week speaking at two conferences that may not be familiar to my usual crowd:
Techwell StarEast Testing Conference in Orlando, FL
Interop ITX Data Conference in Las Vegas, NV

StarEast Testing Conference

Techwell’s event is attended by testers and had over 2000 attendees at the Hyatt Regency Orlando’s Convention Center. This is a huge convention center and I won’t lie: I did try to first register at the Mazda event (more...)

StarEast, InteropITX and GDPR

I’m getting ready to get on a plane between two events today and have been so busy that there’s been a break in blogging. That’s right folks, Kellyn has let a few things slide….

For those people on top of all the happenings in Kevlar’s life, I’ve been busy removing 15 years of possessions from my home so we can sell it in the next month, along with the purchase, upgrade and consolidation into a (more...)

PySpark Examples #5: Discretized Streams (DStreams)

This is the fourth blog post in which I share sample scripts from my presentation about “Apache Spark with Python”. Spark supports two different ways of streaming: Discretized Streams (DStreams) and Structured Streaming. DStreams is the basic abstraction in Spark Streaming. It is a continuous sequence of RDDs representing a stream of data. Structured Streaming is the newer way of streaming and it’s built on the Spark SQL engine. In the next blog post, I’ll also (more...)

PySpark Examples #3-4: Spark SQL Module

In this blog post, I’ll share examples #3 and #4 from my presentation to demonstrate the capabilities of the Spark SQL module. As I already explained in my previous blog posts, the Spark SQL module provides DataFrames (and DataSets – but Python doesn’t support DataSets because it’s a dynamically typed language) to work with structured data.

First, let’s start by creating a temporary table from a CSV file and running a query on it. Like I did my (more...)

PySpark Examples #2: Grouping Data from CSV File (Using DataFrames)

I continue to share example code related to my “Spark with Python” presentation. In my last blog post, I showed how we use RDDs (the core data structures of Spark). This time, I will use DataFrames instead of RDDs. DataFrames are distributed collections of data organized into named columns (in a structured way). They are similar to tables in relational databases. They also provide a domain-specific language API to manipulate your distributed (more...)

Data Warrior #HappyDance! Guess who joined @Snowflakedb

It’s going to be a big day at Snowflake. Two of my good friends are joining my team.

PySpark Examples #1: Grouping Data from CSV File (Using RDDs)

During my presentation about “Spark with Python”, I said that I would share example code (with detailed explanations). So this is my first code example. In this code, I read data from a CSV file to create a Spark RDD (Resilient Distributed Dataset). RDDs are the core data structures of Spark. I explained the features of RDDs in my presentation, so in this blog post, I will only focus on the example code.

For (more...)

Introduction to Apache Spark with Python

Today, I spoke about “Apache Spark with Python” at the Big Talk #2 meet-up in Istanbul Teknokent ARI-3, another event organized by Komtas for the big data community. We had an almost full room. Mine was the last session of the day but the audience was still very focused and eager to listen to the subjects, so for me, the event was great.

By the way, I also enjoyed the sessions of other speakers: Zekeriya Beşioğlu spoke about Data (more...)

Using Spark to Process Data From Cassandra for Analytics

After my presentation about Apache Cassandra, most people asked if they could run analytical queries on Cassandra, and how they could integrate Spark with Cassandra. So I decided to write a blog post to demonstrate how we can process data from Cassandra using Spark. In this blog post, I’ll show how I can build a testing environment on Oracle Cloud (Spark + Cassandra), load sample data into Cassandra, and query the data using Spark.

Let (more...)

Ingesting data into Hive using Spark

Before heading off for my hols at the end of this week to soak up some sun with my gorgeous princess, I thought I would write another blog post. Two posts in a row – that has not happened for ages.

My previous post was about loading a text file into Hadoop using Hive. Today what I am looking to do is to load the same file into a Hive table but using Spark (more...)

Loading data in Hadoop with Hive

It’s been a busy month but 2018 begins with a new post.

 

A common Big Data scenario is to use Hadoop for transforming data and data ingestion – in other words using Hadoop for ETL.

In this post, I will show an example of how to load a comma-separated values text file into HDFS. Once the file is moved into HDFS, use Apache Hive to create a table and load the data into (more...)

Using Spark to join data from CSV and MySQL Table

Yesterday, I explained how we can access a MySQL database from Zeppelin, which comes with Oracle Big Data Cloud Service Compute Edition (BDCSCE). Although we can use Zeppelin to access MySQL, we still need something more powerful to combine data from two different sources (for example, data from a CSV file and RDBMS tables). Spark is a great choice to process data. In this blog post, I’ll write simple PySpark (Python for Spark) code which will (more...)

Okta SSO with Snowflake 

Ever wonder how to secure a cloud data warehouse?

Big Data Marathon

This week there is a Big Data event in London, gathering Big Data clients, geeks and vendors from all over to speak on the latest trends, projects, platforms and products, which helps everyone to stay on the same page and align the steering wheel as well as get a feeling of where the fast-paced technology world is going. The event is massive but I am glad I could make it, even if only for one hour (more...)

Join the Cloud Analytics Academy

Maybe not as cool as Star Fleet Academy, but this is pretty cool. Snowflake and a number of our partners have come together to create the first self-paced, vendor-agnostic, online training academy for analytics in the cloud. This academy will get you up to speed on what is happening today in the cloud with […]

Hadoop for Database Professionals class at NoCOUG Fall Conference on 9th Nov

If you happen to be in the Bay Area on Thursday 9th November, then come check out the NoCOUG Fall Conference at California State University in downtown Oakland, CA.

Gluent is delivering a Hadoop for Database Professionals class as a separate track there (with myself and Michael Rainey as speakers) where we’ll explain the basics & concepts of modern distributed data processing platforms and then show a bunch of Hadoop demos too (mostly SQL-on-Hadoop stuff (more...)