This past week I spent some time setting up and running various Hadoop workloads on my CDH cluster. After some Hadoop jobs had been running for several minutes, I noticed something quite alarming: the system CPU percentages were extremely high.
The cluster consists of dual-socket, 8-core, 16-thread (2s8c16t) Xeon L5630 nodes with 96 GB of RAM, running CentOS Linux 6.2 with Java 1.6.0_30. The details:
$ cat /etc/redhat-release
CentOS release 6.2 (Final)
$ uname -a
Linux chaos 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 (more...)
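For context, a quick way to see the user/system CPU split on a node like this is a sketch against /proc/stat, which any CentOS 6 kernel exposes; for live per-core deltas over an interval, sysstat's mpstat -P ALL is the usual tool:

```shell
#!/bin/sh
# Read the aggregate "cpu" line (first line) from /proc/stat:
#   cpu  user nice system idle iowait irq softirq ...
read -r _ user nice system idle rest < /proc/stat
total=$((user + nice + system + idle))
# Coarse system-CPU percentage accumulated since boot;
# mpstat/vmstat report deltas over a sampling interval instead.
echo "sys%: $((100 * system / total))"
```

A persistently high system (kernel) percentage here, as opposed to user time, is the symptom described above and usually points at something below the JVM rather than the Hadoop job itself.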
The other day I watched the Oracle Big Data forum, now available here. It was a half-day event with various speakers on the subject of Big Data, including Tom Kyte, a mentor I admire!
The forum covered Oracle's approach to Big Data, which I summarise below:
- Acquire - Collect Big Data: identify it and where it lives, then store it in Oracle NoSQL Database, a key-value store
- Organise - Stage Big Data in a transient, elastic database, then reduce and distil it using Oracle Data Integrator and the Oracle Hadoop connector.
- Analyse - Start Analytics on (more...)
If, like me, you want to play around with data in a Hadoop cluster without having to write hundreds or thousands of lines of Java MapReduce code, you will most likely use either Hive (with the Hive Query Language, HQL) or Pig.
Hive offers a SQL-like language that compiles to Java MapReduce code, while Pig is a data-flow language that lets you specify your MapReduce pipelines using high-level abstractions.
The way I like to think of it is that writing Java MapReduce is like programming in assembler: you need to manually construct every low-level (more...)
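To make the contrast concrete, here is the same toy job, counting hits per status code in a web log, written both ways. This is illustrative only: the table name, file path, and field names are made up.

```
-- Hive (HQL): declarative, SQL-like
SELECT status, COUNT(*) AS hits
FROM web_logs
GROUP BY status;

-- Pig Latin: an explicit data-flow pipeline
logs    = LOAD 'web_logs' USING PigStorage('\t') AS (ip:chararray, status:int);
grouped = GROUP logs BY status;
hits    = FOREACH grouped GENERATE group AS status, COUNT(logs) AS hits;
DUMP hits;
```

Both compile down to the same kind of MapReduce job; the Hive version states the result you want, while the Pig version spells out each stage of the pipeline.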