Oracle Scene (the publication of the United Kingdom Oracle Users Group) has published my article "Hadoop for Oracle Professionals", in which I have attempted, like many others, to demystify terms such as Hadoop, Map/Reduce and Flume. If you are interested in Big Data and everything that comes with understanding it, you might find it useful.
I'm going to state a sacrilegious position for a moment: the quality of data isn't a primary goal in Master Data Management.
Now, before the perfectly correct 'Garbage In, Garbage Out' statement, let me explain. Data Quality is certainly something that MDM can help with, but it's not actually the primary aim of MDM.
MDM is about enabling collaboration; collaboration is about the cross-reference
There is a massive amount of IT hype that is focused on what people see: it's about the agile delivery of interfaces, about reporting, visualisation and interactional models. If you could weigh hype then it is quite clear that 95% of all IT is about this area. It's why we need development teams working hand-in-hand with the business, and it's why animations and visualisation are massively important.
Sqoop, Flume, Pig, ZooKeeper. Do these mean anything to you? If they do then the odds are you are looking at Hadoop. The thing is that while that was cool a few years ago, it really is time to face the fact that HDFS is a commodity, MapReduce is interesting but not feasible for most users, and the real question is how we turn all that raw data in HDFS into something we can actually (more...)
With big data and analytics playing an influential role in helping organizations achieve a competitive advantage, IT managers are advised not to deploy big data in silos but instead to take a holistic approach and define a base reference architecture before even considering where the necessary tools fit.
My latest print media article (5th in the series) for CIO magazine (ITNEXT) talks extensively about the need for a reference architecture in (more...)
Over the last few years there has been a trend of increased spending on BI, and that trend isn't going away. The analyst predictions, however, have understandably been based on the mentality that the choice was between a traditional EDW/DW model or Hadoop. With the new 'Business Data Lake' type of hybrid approach it's pretty clear that the shift is underway for all vendors to have a hybrid
As the Hive metastore becomes the center of the nervous system for different types of SQL engines such as Shark and Impala, it is getting equally difficult to distinguish the type of table created in the Hive metastore. For example, if we create an Impala table using the Impala shell, you will see the same table at the Hive prompt, and vice versa. See the example below.
Step 1: “Create Table” in Impala Shell and “Show Table” (more...)
While building a data flow to replace one of the EDW workflows using the Big Data technology stack, I came across some interesting findings and issues. Due to the UPSERT (INSERT new records or UPDATE existing records, as appropriate) nature of the data we had to use HBase, but to expose the outbound feed we needed to do some calculation on HBase and publish the result to Hive as an external table. Even though, conceptually, it's easy to create an (more...)
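To make the UPSERT behaviour mentioned above concrete, here is a minimal sketch in plain Python. It is not the HBase client API: the `upsert` helper and the dictionary standing in for a table are assumptions made purely for illustration.

```python
def upsert(table, row_key, values):
    """Insert the row if row_key is new, otherwise update the existing row.

    `table` is a plain dict keyed by row key (a stand-in for a real store).
    Returns "inserted" or "updated" so the caller can see which path ran.
    """
    if row_key in table:
        # Existing record: merge the new column values into it.
        table[row_key].update(values)
        return "updated"
    # New record: store a copy so later changes to `values` do not leak in.
    table[row_key] = dict(values)
    return "inserted"
```

Calling `upsert` twice with the same row key leaves a single row holding the merged columns, which is the property that makes a key-addressed store convenient for this kind of feed.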
While looking into an HBase performance issue, one of the suggestions was to have more regions for a larger table. There was some confusion around “Region” vs “RegionServer”. While doing some digging, I found the simple explanation below.
The basic unit of scalability and load balancing in HBase is called a region. Regions are essentially contiguous ranges of rows stored together. They are dynamically split by the system when they become too large. Alternatively, they may (more...)
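The idea of regions as contiguous row ranges that split when they grow too large can be sketched with a toy model. This is only an illustration in plain Python, not how HBase is implemented: the `RegionTable` class, its row-count threshold and the split-at-the-middle rule are all assumptions made for the example.

```python
import bisect


class RegionTable:
    """Toy table split into contiguous, sorted ranges of row keys."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        # start_keys[i] is the first row key that region i is responsible for.
        self.start_keys = [""]
        # Each region is a sorted list of the row keys it currently holds.
        self.regions = [[]]

    def region_index(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key):
        i = self.region_index(row_key)
        region = self.regions[i]
        pos = bisect.bisect_left(region, row_key)
        if pos < len(region) and region[pos] == row_key:
            return  # row already present, nothing to add
        region.insert(pos, row_key)
        if len(region) > self.max_rows:
            self._split(i)

    def _split(self, i):
        # Cut the oversized region in the middle; both halves stay contiguous.
        region = self.regions[i]
        mid = len(region) // 2
        self.regions[i] = region[:mid]
        self.regions.insert(i + 1, region[mid:])
        self.start_keys.insert(i + 1, region[mid])
```

Because every region covers one unbroken key range, finding the region for a row is a binary search over the start keys, and a split only ever affects the one region that grew too large.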
With increasing data volume, space in HDFS can be a continuing challenge. While running into some space-related issues, the following command came in very handy, so I thought of sharing it with the extended virtual community.
hadoop dfsadmin -report
After running the command, below is the result. It covers all the nodes in the cluster and gives a detailed break-up of the space available and the space used.
"Real-time": it's a word that gets thrown about a lot in IT, and it's worth documenting a few of the different ways it gets used
This is what Real-time Java was created to address (along with Soft Real-time). What is this? The easiest way to say it is that in Hard Real-time environments the following statement is often true:
If it doesn't finish in X milliseconds then people might die
The Big Data presentation I gave yesterday is now available for download. In this presentation I define some common features of Big Data use cases, explain what the big deal about Big Data is, and explore the impact of Big Data on the traditional data warehouse framework.
There are various views going around on what a Data Scientist is, what their value is to an organisation and the salaries they command. To me, however, asking 'What is a Data Scientist?' is like asking 'What is a Physicist?' Sure, 'someone who studies Physics' might be a factually accurate definition, but it is a pointless one. How does that separate someone who did Physics in High School from Albert
One of the things that always stuns me in IT is how people don't appear to like change. Whether it was the EAI folks pushing back on Web Services in 2000 in favour of their old-school approaches, the package guys pushing back against SaaS, or now the BI guys pushing back against the new wave of BI technologies and approaches, the message is always the same:
We are happy doing what we are doing,
I can smell a change coming. The last few years have seen cloud and SaaS on the rise, a fragmentation in application development (thanks in large part to the appalling stewardship of Java) and a real focus of budgets around BI and 'vanilla' package approaches. Now this is a good thing, both because I jumped out of the Java boat onto the BI boat a few years ago but also because its
The end of the next Software Development wave will be when software development again 'eats itself', as it did with technologies like Hadoop showing a new value in information, with platforms like SFDC showing new pre-built services, and where people like GoodData have turned BI into SaaS. So we will see the same evolution again and a new generation of commoditisation which drives
This is the stage at which software development begins to commoditise itself; it's no surprise that underneath all that Salesforce.com scripting lurked rather a lot of Java code. This wave sees the rise of the libraries, the utilities and above all the commoditisation of software in a way that enables the majority of developers to be useful in the enterprise. This was the goal of Spring, JEE
The problem with Wave 1 was that it didn't scale. I mean, sure, lots of the personal developers claimed it did scale, often laughing at large-scale developments and going 'Me and four mates could do that in a couple of weeks'. Often they attempted to do that and suddenly realised that when you get a few people together it gets a bit more complicated, and when that few gets over 20 it begins to (more...)
This is the wave we are in at the moment, and it's the wave that we last saw in the late 90s: this is where technologies enabled single people to build small, specific things really quickly. Java and its applets really were the peak of this first wave back then, but now we are seeing people use technologies such as R, Python and others to create small solutions that offer really good point value.