Map Reduce: File compression and Processing cost

| Aug 18, 2015

Recently while working with a customer we ran into an interesting situation concerning file compression and processing time. For a system like Hadoop, file compression has been always a good way to save on space, especially when Hadoop replicates the data multiple times.

All Hadoop compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of space savings.  For more details about how compression is used, see https://documentation.altiscale. (more...)

Accessing HDFS files on local File system using mountableHDFS – FUSE

Hi All

Recently we had one requirement wherein we had to merge the files post Map and Reducer job. Since the file needed to be given to the outbound team outside of Hadoop development team, having these files on local system would have been ideal. The customer IT team worked with cloudera and gave us a mount point using a utility/concept called “mountableHDFS” aka FUSE (Filesystem in Userspace)  .

mountableHDFS, helps allowing HDFS to be mounted (more...)

How to read HDFS fsImage file

During one of the sizing exercise the  ask for server capacity  was more than the actual usage of cluster . Knowing the data and usage, I was not convinced that we should be asking for more memory space. That triggered the thought of

Conceptually FSIMG file is the balancesheet of all the file and their existence and location. If somehow we could read the metadata withing the file and make sence out of it, than it could help (more...)

Permissions for both HDFS and local fileSystem paths

| Jul 18, 2014

Hi All,

Permission issues is one of the key error , while setting up Hadoop Cluster, while debugging some error found below table on . It’s a good scorecard to keep handy.


Permissions for both HDFS and local fileSystem paths

The following table lists various paths on HDFS and local filesystems (on all nodes) and recommended permissions:

Filesystem Path User:Group Permissions
local hdfs:hadoop drwx——
local (more...)

Need for Defining Reference Architecture For Big Data

Hi Fellow Big Data Admirers ,

With big data and analytics playing an influential role helping organizations achieve a competitive advantage, IT managers are advised not to deploy big data in silos but instead to take a holistic approach toward it and define a base reference architecture even before contemplating positioning the necessary tools. 

My latest print media article (5th in the series) for CIO magazine (ITNEXT) talks extensively about need of reference architecture in (more...)

How to find out a table type in Hive Metastore.

| Apr 10, 2014

Hi All

As Hive metastore is getting into the center of nervous system for the different type of  SQL engines like Shark and Impala. It getting equally difficult to distinguish type of table created in Hive metastore. Eg. if we create a impala table using impala shell you will see the same table on hive prompt and vice versa. See the below example


Step 1 : “Create Table” in Impala Shell and “Show Table” (more...)

How To Create External Hive Table on HBase

| Mar 28, 2014

Hi All,

While building a data flow for replacing one of the EDW’ workflow using Big Data technology stack , came across some interesting findings and issues.  Due to  UPSERT ( INSERT new records or UPDATE existing records depending) nature of data we had to use Hbase, but to expose the outbound feed we need to do some calculation on HBase and publish that to Hive as external. Even though conceptually , its easy to create an (more...)

Hbase : Co-relation between RegionServer and Region

| Mar 20, 2014

Hi All

While looking into HBase performance issue, one of the suggestion was to have more region for a larger table. There was some confusion around, “Region” vs “RegionServer” . While doing some digging, found a simple text written below.

The basic unit of scalability and load balancing in HBase is called a region. Regions are essentially contiguous ranges of rows stored together. They are dynamically split by the system when they become too large. Alternatively, they may (more...)

HDFS Free Space Command

| Mar 17, 2014

Hi All

With increasing data  volume , in HDFS space could be continued challenge. While running into some space related issue, following command came very handy, hence thought of sharing with extended virtual community.

hadoop dfsadmin -report

Post running the command, below is the result, it takes all the nodes in the cluster and gives the detail break-up based on the space availability and spaces used.

Configured Capacity: 13965170479105 (12.70 TB)
Present Capacity: 4208469598208  (more...)

Big Data : Right Approach Right Solution


Hi All,

Past few months I have been meeting with clients and discussing their potential need of Big Data. The discuss gets to the bottom of , do they really need the Big Data ? The below link to my ITNext article talks about As big data goes bigger,IT managers are challenged with the task of identifying data that qualifies for big and finding appropriate solutions to process it.

Click Here To Read Full Article (more...)