Looking back over some of my truly ancient Rittman Mead blogs (so old in fact that they came with me when I joined the company soon after Rittman Mead was launched), I see recurrent themes on why people “do” BI and what makes for successful implementations. After all, why would an organisation wish to invest serious money in a project if it does not give value either in terms of cost reduction or increasing profitability (more...)
A few months ago I wrote a post on the blog around using Apache Flume to trickle-feed log data into HDFS and Hive, using the Rittman Mead website as the source for the log entries. Flume is a good technology to use for this type of capture requirement as it captures log entries, HTTP calls, JMS queue entries and other “event” sources easily, has a resilient architecture and integrates well with HDFS and Hive. (more...)
Last week I posted an article on the blog around analysing Twitter data using Datasift, MongoDB and Pig, where I used the Datasift service to stream tweets about Rittman Mead into a MongoDB NoSQL database, and then queried the dataset using Pig. The context for this is the idea of a “data reservoir”, where we supplement the more traditional file and relational datasets we find in data warehouses with other data, typically machine generated, (more...)
I will give a presentation on 24 September at the Jury’s Inn in Dublin on the next generation of Big Data 2.0 tools and architecture.
Over the last two years there have been significant changes and improvements in the various Big Data frameworks. With the release of Yarn (Hadoop 2.0) the most popular of these platforms now allows you to run mixed workloads. Gone are the days when Hadoop was only good for (more...)
For an organization to respond in real-time it needs to acquire or develop systems
that can respond in real-time. Such systems need to be able to rapidly
determine that a response is required and determine also what the
appropriate and relevant response should be – they need to decide when
and how to act. These kinds of decision-making systems are known as
Decision Management Systems. To ensure that a response is delivered in
real-time, more (more…)
Both ODI and the Hadoop ecosystem share a common design philosophy. Bring the processing to the data rather than the other way around. Sounds logical, doesn’t it? Why move Terabytes of data around your network if you can process it all in the one place. Why invest millions in additional servers and hardware just to transform and process your data?
In the ODI world this approach is known as ELT. ELT is a marketing concept (more...)
Permission issues is one of the key error , while setting up Hadoop Cluster, while debugging some error found below table on http://hadoop.apache.org/ . It’s a good scorecard to keep handy.
The following table lists various paths on HDFS and local filesystems (on all nodes) and recommended permissions:
You may have wondered why we were quiet over the last couple of weeks? Well, we locked ourselves into the basement and did some research and a couple of projects and PoCs on Hadoop, Big Data, and distributed processing frameworks in general. We were also looking at Clickstream data and Web Analytics solutions. Over the next couple of weeks we will update our website with our new offerings, products, and services. The article below summarises (more...)
A PDF version of the article can be downloaded here http://www.proligence.com/art/oracle_scene_summ14_hadoop.pdf
Hi Fellow Big Data Admirers ,
With big data and analytics playing an influential role helping organizations achieve a competitive advantage, IT managers are advised not to deploy big data in silos but instead to take a holistic approach toward it and define a base reference architecture even before contemplating positioning the necessary tools.
My latest print media article (5th in the series) for CIO magazine (ITNEXT) talks extensively about need of reference architecture in (more...)
As Hive metastore is getting into the center of nervous system for the different type of SQL engines like Shark and Impala. It getting equally difficult to distinguish type of table created in Hive metastore. Eg. if we create a impala table using impala shell you will see the same table on hive prompt and vice versa. See the below example
Step 1 : “Create Table” in Impala Shell and “Show Table” (more...)
While building a data flow for replacing one of the EDW’ workflow using Big Data technology stack , came across some interesting findings and issues. Due to UPSERT ( INSERT new records or UPDATE existing records depending) nature of data we had to use Hbase, but to expose the outbound feed we need to do some calculation on HBase and publish that to Hive as external. Even though conceptually , its easy to create an (more...)
While looking into HBase performance issue, one of the suggestion was to have more region for a larger table. There was some confusion around, “Region” vs “RegionServer” . While doing some digging, found a simple text written below.
The basic unit of scalability and load balancing in HBase is called a region. Regions are essentially contiguous ranges of rows stored together. They are dynamically split by the system when they become too large. Alternatively, they may (more...)