Big Data 2.0 and Agile BI all at Irish BI OUG (24 September).

I will give a presentation on 24 September at the Jury’s Inn in Dublin on the next generation of Big Data 2.0 tools and architecture.

Over the last two years there have been significant changes and improvements in the various Big Data frameworks. With the release of YARN (Hadoop 2.0), the most popular of these platforms now allows you to run mixed workloads. Gone are the days when Hadoop was only good for (more...)

Oracle Data Integrator and Hadoop. Is ODI the only ETL tool for Big Data that works?

Both ODI and the Hadoop ecosystem share a common design philosophy: bring the processing to the data rather than the other way around. Sounds logical, doesn’t it? Why move terabytes of data around your network if you can process it all in one place? Why invest millions in additional servers and hardware just to transform and process your data?

In the ODI world this approach is known as ELT. ELT is a marketing concept (more...)
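
As a rough illustration of bringing the processing to the data, the sketch below loads raw rows into the target database first and then transforms them with one set-based SQL statement that runs inside that database, which is the ELT pattern. SQLite stands in for the target engine, and the function, table and column names (`run_elt`, `raw_sales`, `sales_by_region`) are hypothetical; this is not ODI-generated code.

```python
import sqlite3


def run_elt(rows):
    """Load raw rows, then transform them inside the database (ELT)."""
    conn = sqlite3.connect(":memory:")  # stand-in for the target engine
    # Extract + Load: copy the raw data into a staging table untouched.
    conn.execute("CREATE TABLE raw_sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    # Transform: one set-based statement executed where the data lives,
    # so no rows travel to a separate transformation server.
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total "
        "FROM raw_sales GROUP BY region"
    )
    result = conn.execute(
        "SELECT region, total FROM sales_by_region ORDER BY region"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run_elt([("EMEA", 10.0), ("APAC", 5.0), ("EMEA", 2.5)]))
```

An ETL tool would instead pull `raw_sales` out to its own server, aggregate there, and write the result back; here the only data that leaves the database is the final result.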

Permissions for both HDFS and local fileSystem paths

Jul 18, 2014

Hi All,

Permission issues are among the most common errors when setting up a Hadoop cluster. While debugging one such error, I found the table below on http://hadoop.apache.org/. It’s a good scorecard to keep handy.

 

Permissions for both HDFS and local fileSystem paths

The following table lists various paths on HDFS and local filesystems (on all nodes) and recommended permissions:

Filesystem Path User:Group Permissions
local dfs.namenode.name.dir hdfs:hadoop drwx------
local dfs.datanode.data.dir (more...)
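
A check against the table can be scripted for the local paths. The sketch below compares a directory's `ls -l` style mode string with an expected value such as `drwx------`. The helper name `has_expected_mode` is hypothetical, it assumes a POSIX system, it covers only the local filesystem (not HDFS), and it does not check the User:Group column.

```python
import os
import stat
import tempfile


def has_expected_mode(path, expected):
    """Return True if the mode string of `path` (ls -l style) equals `expected`."""
    return stat.filemode(os.stat(path).st_mode) == expected


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real Hadoop path.
    with tempfile.TemporaryDirectory() as d:
        os.chmod(d, 0o700)  # owner-only access
        print(has_expected_mode(d, "drwx------"))
```

In practice the path would be whatever `dfs.namenode.name.dir` or `dfs.datanode.data.dir` resolves to in the cluster configuration.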

Big Data doom mongers need to look outside of the marketing department

In every change there are hype machines that overplay and sages who call doom. Into the Big Data arena steps David Searls to proclaim, in an article over at ZDNet, that Big Data is a myth and simply hype that is set to burst. But big data, he said, is nothing more than the myth that collecting vast amounts of data can help companies know customers better than those customers even know

Editor’s Choice award at ODTUG Kscope14: NoSQL and Big Data for the Oracle Professional

My paper on NoSQL and Big Data won the Editor’s Choice award at ODTUG Kscope14. Here are some key points from the paper: The relational camp made serious mistakes that limited the performance and usefulness of the relational model. NoSQL is based on the incorrect premise that tables in the relational model must be mapped to […]

MDM isn’t about data quality, it’s about collaboration

I'm going to state a sacrilegious position for a moment: the quality of data isn't a primary goal in Master Data Management. Now, before the perfectly correct 'Garbage In, Garbage Out' statement, let me explain. Data Quality is certainly something that MDM can help with, but it's not actually the primary aim of MDM. MDM is about enabling collaboration, collaboration is about the cross-reference

Lipstick on the iceberg – why the local view matters for IT evolution

There is a massive amount of IT hype that is focused on what people see: it's about the agile delivery of interfaces, about reporting, visualisation and interactional models. If you could weigh hype then it is quite clear that 95% of all IT is about this area. It's why we need development teams working hand-in-hand with the business, it's why animations and visualisation are massively important.

How to select a Hadoop distro – stop thinking about Hadoop

Sqoop, Flume, Pig, ZooKeeper. Do these mean anything to you? If they do then the odds are you are looking at Hadoop. The thing is that while that was cool a few years ago, it really is time to face the fact that HDFS is a commodity, MapReduce is interesting but not feasible for most users, and the real question is how we turn all that raw data in HDFS into something we can actually (more...)

Need for Defining Reference Architecture For Big Data

Hi Fellow Big Data Admirers,

With big data and analytics playing an influential role helping organizations achieve a competitive advantage, IT managers are advised not to deploy big data in silos but instead to take a holistic approach toward it and define a base reference architecture even before contemplating positioning the necessary tools. 

My latest print media article (5th in the series) for CIO magazine (ITNEXT) talks extensively about the need for a reference architecture in (more...)

Data Lakes will replace EDWs – a prediction

Over the last few years there has been a trend of increased spending on BI, and that trend isn't going away. The analyst predictions, however, have understandably been based on the mentality that the choice was between a traditional EDW/DW model or Hadoop. With the new 'Business Data Lake' type of hybrid approach it's pretty clear that the shift is underway for all vendors to have a hybrid

How to find out a table type in Hive Metastore.

Apr 10, 2014

Hi All

As the Hive metastore becomes the central nervous system for different SQL engines such as Shark and Impala, it is getting equally difficult to distinguish the type of table created in the Hive metastore. For example, if we create a table using the Impala shell, the same table is visible at the Hive prompt, and vice versa. See the example below.

 

Step 1 : “Create Table” in Impala Shell and “Show Table” (more...)

How To Create External Hive Table on HBase

Mar 28, 2014

Hi All,

While building a data flow to replace one of the EDW’s workflows using the Big Data technology stack, I came across some interesting findings and issues. Due to the UPSERT nature of the data (INSERT new records or UPDATE existing records, depending on whether the record already exists) we had to use HBase, but to expose the outbound feed we needed to do some calculations on HBase and publish the result to Hive as an external table. Even though conceptually it’s easy to create an (more...)
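
For reference, Hive's HBase integration exposes an existing HBase table through the `HBaseStorageHandler` storage handler, with the `hbase.columns.mapping` and `hbase.table.name` properties tying Hive columns to HBase columns. The sketch below only builds the DDL string. The helper `hbase_external_table_ddl` and every table, column family and column name in the usage are hypothetical, and the statement the post actually used is not shown in this excerpt.

```python
def hbase_external_table_ddl(hive_table, hbase_table, key_col, columns):
    """Build HiveQL for an external Hive table over an existing HBase table.

    key_col: (hive_column, hive_type) pair mapped to the HBase row key.
    columns: list of (hive_column, hive_type, "family:qualifier") tuples.
    """
    col_defs = [f"{key_col[0]} {key_col[1]}"]
    mapping = [":key"]  # special entry that maps a Hive column to the row key
    for name, hive_type, hbase_col in columns:
        col_defs.append(f"{name} {hive_type}")
        mapping.append(hbase_col)
    col_list = ", ".join(col_defs)
    mapping_str = ",".join(mapping)  # no spaces: entries are matched literally
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({col_list})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping_str}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )


if __name__ == "__main__":
    print(
        hbase_external_table_ddl(
            "outbound_feed",          # hypothetical Hive table name
            "feed_hbase",             # hypothetical existing HBase table
            ("id", "string"),
            [("total", "double", "cf:total")],
        )
    )
```

Because the table is declared `EXTERNAL`, dropping it in Hive removes only the metastore entry and leaves the HBase table in place.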

HBase: Correlation between RegionServer and Region

Mar 20, 2014

Hi All

While looking into an HBase performance issue, one of the suggestions was to have more regions for a larger table. There was some confusion around “Region” vs “RegionServer”. While doing some digging, I found the simple explanation below.

The basic unit of scalability and load balancing in HBase is called a region. Regions are essentially contiguous ranges of rows stored together. They are dynamically split by the system when they become too large. Alternatively, they may (more...)
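
The idea of regions as contiguous, sorted row ranges that split when they grow too large can be sketched with a toy model. This is not HBase code: the class `ToyTable` is hypothetical, it splits on a row count rather than on size in bytes, and it ignores how regions are assigned to RegionServers.

```python
import bisect


class ToyTable:
    """Toy model: a table as contiguous, sorted row-key ranges (regions)."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        self.start_keys = [""]  # start key of each region, kept sorted
        self.regions = [[]]     # sorted row keys held by each region

    def region_index(self, row_key):
        # The owning region is the last one whose start key <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key):
        i = self.region_index(row_key)
        rows = self.regions[i]
        pos = bisect.bisect_left(rows, row_key)
        if pos < len(rows) and rows[pos] == row_key:
            return  # key already present; nothing to add in this toy model
        rows.insert(pos, row_key)
        if len(rows) > self.max_rows:
            self._split(i)

    def _split(self, i):
        # Split at the middle key so both halves stay contiguous ranges.
        rows = self.regions[i]
        mid = len(rows) // 2
        self.regions[i] = rows[:mid]
        self.regions.insert(i + 1, rows[mid:])
        self.start_keys.insert(i + 1, rows[mid])


if __name__ == "__main__":
    table = ToyTable(max_rows_per_region=2)
    for key in ["a", "b", "c", "d", "e"]:
        table.put(key)
    print(table.start_keys, table.regions)
```

A larger table therefore ends up with more regions, and a RegionServer is the process that serves some number of those regions, which is the distinction the confusion above was about.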

What is real-time? Depends on who you ask

"Real-time": it's a word that gets thrown about a lot in IT, and it's worth documenting a few of the different ways it gets used. Hard Real-time: This is what Real-time Java was created to address (along with Soft Real-time). What is this? The easiest way to say it is that often in Hard Real-time environments the following statement is true: "If it doesn't finish in X milliseconds then people might die". So

What are the types of Data Scientist?

There are various views going around on what a Data Scientist is, what their value is to an organisation and the salaries they command. To me, however, asking 'what is a Data Scientist?' is like asking 'What is a Physicist?' Sure, 'someone who studies Physics' might be a factually accurate definition, but it is a pointless one. How does that separate someone who did Physics in High School from Albert

BI change is coming, time to get over it and get on with the job

One of the things that always stuns me in IT is how people don't appear to like change. Whether it was the EAI folks pushing back on Web Services in 2000 in favour of their old-school approaches, the package guys pushing back against SaaS, or now the BI guys pushing back against the new wave of BI technologies and approaches, the message is always the same: We are happy doing what we are doing,

The next big wave of IT is Software Development

I can smell a change coming: the last few years have seen cloud and SaaS on the rise and seen a fragmentation in application development (thanks in large part to the appalling stewardship of Java) and a real focus of budgets around BI and 'vanilla' package approaches. Now this is a good thing, both because I jumped out of the Java boat onto the BI boat a few years ago but also because its

Software Development Wave 4: back to the package

The end of the next Software Development wave will be when Software development again 'eats itself', as it did with technologies like Hadoop showing a new value in information, with platforms like SFDC showing new pre-built services, where people like GoodData have turned BI into SaaS. So we will see the same evolution again and a new generation of commoditisation which drives

Datafication of Compensation Distribution

Is your data science providing you enough indications that challenge your existing compensation strategy? Does it reveal that the art of compensation distribution performed by your managers is not in accordance with your compensation strategy? Old habits die hard, so you need to make sure that your plan for data-driven decision-making is not getting overridden by compensation managers’ belief systems and that they are not ignoring data science recommendations.

Even today the challenge is to effectively distribute (more...)

Big Data? Start with Right Data

I’m wearing a Nike Fuelband – one of those fitness/activity tracker gizmos. Nike is offering both a website and an app showing my daily activity. As a customer, I am expecting these two to contain the same data. After all, my bank balance is the same in my mobile banking app, in an ATM or in a web browser.

Unfortunately, Nike does not have a proper infrastructure behind their gadget, so the numbers do not (more...)