How to manage GDPR compliance with Snowflake’s Time Travel and Disaster Recovery 

As organizations work to bring their data practices into compliance with GDPR, one question comes up repeatedly: How does Snowflake enable my organization to be GDPR compliant?

Thinking of setting up a Data Lake for your org? Let's read this!

It is nothing new that data-driven outcomes have been among the most electrifying revelations of the 20th century. Almost every organization considers data the primary driver of its business and growth. Several strategies have evolved to manage the storage, analytics, and consumption of data. Not just for regulatory purposes, but data help…

The Elephant in the Data Lake and Snowflake

Originally posted on Jeffrey Jacobs, Consulting Data Architect:
Let’s talk about the elephant in the data lake, Hadoop, and the constant evolution of technology. Hadoop (symbolized by an elephant) was created to handle massive amounts of raw data that were beyond the capabilities of existing database technologies. At its core, Hadoop is simply a distributed…

Examples of using Machine Learning on Video and Photo in Public

Over the past 18 months or so, most of the examples of using machine learning have focused on looking at images and identifying objects in them. There are the typical examples of examining pictures looking for a cat or a dog, or some famous person, etc. Most of these examples are fairly trivial, although they do illustrate important concepts.

But what if this same technology were used to monitor people going about their daily lives? (more...)

Announcing The Kafka Pilot with Rittman Mead

Rittman Mead is today pleased to announce the launch of its Kafka Pilot service, focusing on engaging with companies to help fully assess the capabilities of Apache Kafka for event streaming use cases, with both a technical and business focus.

Our 30 day Kafka Pilot includes:

  • A comprehensive assessment of your use cases for event streaming and Kafka
  • A full assessment of connectors
  • A transformation path from your current state to a future-state architecture
  • Delivers (more...)

Schema-on-what? How to model JSON

How do you make sense out of schema-on-read? This post shows you how to turn a JSON document into a relational data model that any modeler or relational database person could understand.
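As a rough illustration of the idea (not taken from the post itself, and using made-up field names), the sketch below flattens a nested JSON document into a parent table and a child table, which is essentially what a relational model of such a document looks like:

    import json
    import sqlite3

    doc = json.loads("""
    {"order_id": 1001,
     "customer": "Acme",
     "lines": [{"sku": "A-1", "qty": 2},
               {"sku": "B-7", "qty": 5}]}
    """)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute("CREATE TABLE order_lines (order_id INTEGER, sku TEXT, qty INTEGER)")

    # Top-level scalar attributes become the parent row.
    conn.execute("INSERT INTO orders VALUES (?, ?)", (doc["order_id"], doc["customer"]))

    # Each element of the nested array becomes a child row keyed by the parent.
    for line in doc["lines"]:
        conn.execute("INSERT INTO order_lines VALUES (?, ?, ?)",
                     (doc["order_id"], line["sku"], line["qty"]))

    print(list(conn.execute("SELECT * FROM order_lines")))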

Automatic Clustering, Materialized Views and Automatic Maintenance in Snowflake 

Boy are things going bananas at Snowflake these days. The really big news a few weeks back was another round of funding! This week we announced two new major features.

Why document databases are old news…

We’re going to store data the way it’s stored naturally in the brain.

This is a phrase being heard more often today. This blog post is inspired by a short rant that Babak Tourani (@2ndhalf_oracle) and I had on Twitter today.

How cool is that!!

This phrase is used by companies like MongoDB and graph database vendors to explain why they choose to store information/data in an unstructured format. It is (more...)

Get Started Faster with Snowflake Partner Connect

Got data you want to load and analyze in the cloud? Would you like to do it today? Then check out this announcement about getting up and running with your data in Snowflake faster. #YourDataNoLimits Ending the struggle for data starts with getting to your data faster. Snowflake already streamlines the path to get up […]

#Apress “Practical Enterprise Data Lake Insights” – Published!

Hello all, it gives me immense pleasure to announce the release of our book “Practical Enterprise Data Lake Insights” with Apress. The book takes an end-to-end solution approach in a data lake environment that includes data capture, processing, security, and availability. Credit goes to the co-author of the book, Venkata Giri, and the technical reviewer, Sai Sundar. The …

Big Data Introduction – Workshop

Our focus was clear - this was a level 101 class, for IT professionals in Bangalore who had heard of Big Data, were interested in Big Data, but were unsure how and where to dip their toes into the world of analytics and Big Data. A one-day workshop - with a mix of slides, whiteboarding, a case study, a small game, and a mini-project - was, we felt, the ideal vehicle for getting people to wrap their (more...)

PySpark Examples #5: Discretized Streams (DStreams)

This is the fourth blog post in which I share sample scripts from my presentation about “Apache Spark with Python”. Spark supports two different ways of streaming: Discretized Streams (DStreams) and Structured Streaming. DStreams is the basic abstraction in Spark Streaming. It is a continuous sequence of RDDs representing a stream of data. Structured Streaming is the newer way of streaming and is built on the Spark SQL engine. In the next blog post, I’ll also (more...)
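For readers who have not seen DStreams before, here is a minimal sketch of the idea (not the scripts from the presentation): a word count over a socket stream, assuming something such as "nc -lk 9999" is writing lines to port 9999 on localhost.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "DStreamWordCount")
    ssc = StreamingContext(sc, batchDuration=5)   # one micro-batch (RDD) every 5 seconds

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()   # print a sample of each batch's results

    ssc.start()
    ssc.awaitTermination()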

PySpark Examples #3-4: Spark SQL Module

In this blog post, I’ll share examples #3 and #4 from my presentation to demonstrate the capabilities of the Spark SQL module. As I already explained in my previous blog posts, the Spark SQL module provides DataFrames (and DataSets – but Python doesn’t support DataSets because it’s a dynamically typed language) to work with structured data.

First, let’s start by creating a temporary table from a CSV file and running a query on it. Like I did in my (more...)
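A minimal sketch of that first step (the file name and columns below are placeholders, not the presentation's actual data): read a CSV into a DataFrame, register it as a temporary view, and query it with SQL.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

    # Read the CSV into a DataFrame, letting Spark infer the column types.
    df = spark.read.csv("people.csv", header=True, inferSchema=True)

    # Register the DataFrame as a temporary view so it can be queried with SQL.
    df.createOrReplaceTempView("people")

    spark.sql("""
        SELECT city, COUNT(*) AS cnt
        FROM people
        GROUP BY city
        ORDER BY cnt DESC
    """).show()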

PySpark Examples #2: Grouping Data from CSV File (Using DataFrames)

I continue to share example code related to my “Spark with Python” presentation. In my last blog post, I showed how we use RDDs (the core data structures of Spark). This time, I will use DataFrames instead of RDDs. DataFrames are distributed collections of data organized into named columns (in a structured way). They are similar to tables in relational databases. They also provide a domain-specific language API to manipulate your distributed (more...)
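A minimal sketch of grouping with the DataFrame API (the file and column names are placeholders rather than the post's dataset):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("GroupWithDataFrames").getOrCreate()

    df = spark.read.csv("sales.csv", header=True, inferSchema=True)

    # Group by a key column, aggregate, then sort by the aggregate.
    (df.groupBy("region")
       .agg(F.sum("amount").alias("total_amount"),
            F.count("*").alias("orders"))
       .orderBy(F.desc("total_amount"))
       .show())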

Data Warrior #HappyDance! Guess who joined @Snowflakedb

It’s going to be a big day at Snowflake. Two of my good friends are joining my team.

PySpark Examples #1: Grouping Data from CSV File (Using RDDs)

During my presentation about “Spark with Python”, I said that I would share example code (with detailed explanations). So this is my first example. In this code, I read data from a CSV file to create a Spark RDD (Resilient Distributed Dataset). RDDs are the core data structures of Spark. I explained the features of RDDs in my presentation, so in this blog post, I will only focus on the example code.

For (more...)
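For readers skimming the digest, here is a minimal sketch of the technique (the file name and column positions are placeholders, not the post's data): read a CSV as text, drop the header, and sum a numeric column per key with reduceByKey.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "GroupWithRDDs")

    lines = sc.textFile("sales.csv")
    header = lines.first()

    rows = (lines.filter(lambda line: line != header)   # drop the header row
                 .map(lambda line: line.split(",")))

    # Assume column 0 is the key (e.g. region) and column 2 is a numeric amount.
    totals = (rows.map(lambda cols: (cols[0], float(cols[2])))
                  .reduceByKey(lambda a, b: a + b))

    for key, total in totals.collect():
        print(key, total)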

Introduction to Apache Spark with Python

Today, I spoke about “Apache Spark with Python” at the Big Talk #2 meet-up in Istanbul Teknokent ARI-3, another event organized by Komtas for the big data community. We had an almost full room. Mine was the last session of the day, but the audience was still very focused and eager to listen to the subjects, so for me the event was great.

By the way, I also enjoyed the sessions of other speakers: Zekeriya Beşioğlu spoke about Data (more...)

Using Spark to Process Data From Cassandra for Analytics

After my presentation about Apache Cassandra, most people asked whether they could run analytical queries on Cassandra, and how they could integrate Spark with Cassandra. So I decided to write a blog post to demonstrate how we can process data from Cassandra using Spark. In this blog post, I’ll show how to build a testing environment on Oracle Cloud (Spark + Cassandra), load sample data into Cassandra, and query the data using Spark.

Let (more...)
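As a hedged sketch of the integration (the keyspace, table, and column names are placeholders, and the spark-cassandra-connector package has to be supplied to Spark, e.g. via --packages), reading a Cassandra table into a DataFrame and querying it looks roughly like this:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("CassandraAnalytics")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    # Read a Cassandra table through the DataStax Spark-Cassandra connector.
    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="mykeyspace", table="events")
          .load())

    df.createOrReplaceTempView("events")
    spark.sql("SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type").show()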

Ingesting data into Hive using Spark

Before heading off for my hols at the end of this week to soak up some sun with my gorgeous princess, I thought I would write another blog post. Two posts in a row – that has not happened for ages!

 

My previous post was about loading a text file into Hadoop using Hive. Today, what I am looking to do is load the same file into a Hive table, but using Spark (more...)
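A minimal sketch of one way to do this (paths and table names are placeholders, and Spark must be built with Hive support): read the file with Spark and write it out as a Hive table.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("IngestIntoHive")
             .enableHiveSupport()     # talk to the Hive metastore
             .getOrCreate())

    df = spark.read.csv("/data/input.csv", header=True, inferSchema=True)

    # saveAsTable creates (or overwrites) a managed Hive table from the DataFrame.
    df.write.mode("overwrite").saveAsTable("default.my_table")

    spark.sql("SELECT COUNT(*) FROM default.my_table").show()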

Loading data in Hadoop with Hive

It’s been a busy month but 2018 begins with a new post.

 

A common Big Data scenario is to use Hadoop for transforming and ingesting data – in other words, using Hadoop for ETL.

In this post, I will show an example of how to load a comma-separated values (CSV) text file into HDFS. Once the file is moved into HDFS, we use Apache Hive to create a table and load the data into (more...)
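The flow described above can be sketched through PySpark's Hive support rather than the Hive CLI the post uses (the paths, table, and column names below are placeholders; the CSV is assumed to already sit in HDFS, e.g. after an "hdfs dfs -put" of the file):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("LoadCsvIntoHive")
             .enableHiveSupport()
             .getOrCreate())

    # Create a Hive table over comma-separated text data.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS demo_people (
            id INT, name STRING, city STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
    """)

    # Load the file that is already sitting in HDFS into the table.
    spark.sql("LOAD DATA INPATH '/user/demo/data.csv' INTO TABLE demo_people")

    spark.sql("SELECT * FROM demo_people LIMIT 10").show()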