Using Spark to join data from CSV and MySQL Table

Yesterday, I explained how we can access MySQL database from Zeppelin which comes with Oracle Big Data Cloud Service Compute Edition (BDCSCE). Although we can use Zeppelin to access MySQL, we still need something more powerful to combine data from two different sources (for example data from CSV file and RDBMS tables). Spark is a great choice to process data. In this blog post, I’ll write a simple PySpark (Python for Spark) code which will (more...)

Using Zeppelin to Access MySQL

If you want to access MySQL Cloud Service using Zeppelin of Oracle Big Data Cloud Service Compute Edition (BDCSCE), you can use Spark DataFrames or Zeppelin interpreters. In this blog post, I’ll show how we can edit JDBC interpreter to connect MySQL Cloud Service.

First login to Oracle Big Data Cloud console, and go to “settings” tab, and open “notebook” section. You’ll see the interpreter settings. Search for “jdbc” and click “edit” button to edit (more...)

Oracle Cloud Day Istanbul

Yesterday, I spoke at the Oracle Cloud Day Istanbul. It was an amazing event. The venue (Swissotel the Bosphorus) was great, the conference rooms were comfortable, the presentations are attractive and well-balanced (DB, Middleware, Development), and the audience was great. This year, the event was much more crowded than previous years.

As usual, Oracle Turkey set a separate track for TROUG (Turkish Oracle User Group) presentations, and I was one of the speakers of TROUG. (more...)

Python for Data Science – Importing table data from a web page

This is another blog post about using Pandas package. This time, I’ll show you how to import table data from a web page. To be able to get table data, there should be a table defined with table tags (table,td,tr) in the web page we access. Unfortunately most web sites do not use “tables” anymore. They usually prefer to use “div” tags, so if this code doesn’t work, check HTML source code of the page.

(more...)

Python for Data Science – Importing XML to Pandas DataFrame

In my previous post, I showed how easy to import data from CSV, JSON, Excel files using Pandas package. Another popular format to exchange data is XML. Unfortunately Pandas package does not have a function to import data from XML so we need to use standard XML package and do some extra work to convert the data to Pandas DataFrames.

Here’s a sample XML file (save it as test.xml):

<?xml version="1.0"? (more...)

Python for Data Science – Importing CSV, JSON, Excel Using Pandas

Although I think that R is the language for Data Scientists, I still prefer Python to work with data. In this blog post, I will show you how easy to import data from CSV, JSON and Excel files using Pandas libary. Pandas is a Python package designed for doing practical, real world data analysis.

Here is the content of the sample CSV file (test.csv):

name,email
"gokhan","gokhan@gmail.com"
"mike","mike@gmail.com"

Here is the content of (more...)

Oracle Big Data Cloud Service CE: Working with Hive, Spark and Zeppelin 0.7

In my previous post, I mentioned that Oracle Big Data Cloud Service – Compute Edition started to come with Zeppelin 0.7 and the version 0.7 does not have HIVE interpreter. It means we won’t be able to use “%hive” blocks to run queries for Apache Hive. Instead of “%hive” blocks, we can use JDBC interpreter (“%jdbc” blocks) or Spark SQL (“%sql” blocks).

The JDBC interpreter lets you create a JDBC connection to any (more...)

Oracle BDCSCE Upgraded: Zeppelin 0.7 and Spark 2.1

Last week, Oracle Big Data Cloud Service – Compute Edition was upgraded from 17.2.5 to 17.3.1-20. I do not know if the new version is still in testing phase and available to only trial users, but sooner or later the new version will be available to all Oracle Cloud users.

The new version is still based on HDP 2.4.2 but it contains upgrades on two key components: Zeppelin and (more...)

Introduction to Oracle Big Data Cloud Service – Compute Edition (Part VI) – Hive

I though I would stop writing about “Oracle Big Data Cloud Service – Compute Edition” after my fifth blog post, but then I noticed that I didn’t mention about the Apache Hive, another important component of the Big Data. Hive is a data warehouse infrastructure built on top of Hadoop, designed to work with large datasets. Why is it so important? Because it includes support for SQL (SQL:2003 and SQL:2011), and helps users to utilize (more...)

Introduction to Oracle Big Data Cloud Service – Compute Edition (Part V) – Pig

This is my last blog post of my introduction series for Oracle Big Data Cloud Service – Compute Edition. In this blog post, I’ll mention “Apache Pig”. It’s a tool/platform created by “Yahoo!” to analyze large data sets without the complexities of writing a traditional MapReduce program. It’s designed to process any kind of data (structured or unstructured) so it’s a great tool for ETL jobs. Pig comes installed and ready to use with (more...)