Both ODI and the Hadoop ecosystem share a common design philosophy. Bring the processing to the data rather than the other way around. Sounds logical, doesn’t it? Why move Terabytes of data around your network if you can process it all in the one place. Why invest millions in additional servers and hardware just to transform and process your data?
In the ODI world this approach is known as ELT. ELT is a marketing concept (more...)
Oracle announced their Big Data SQL product a couple of weeks ago, which effectively extends Exadata’s query-offloading to Hadoop data sources. I covered the launch a few days afterwards, focusing on how it implements Exadata’s SmartScan on Hive and NoSQL data sources and provides a single metadata catalog over both relational, and Hadoop, data sources. In a Twitter conversation later in the day though, I made the comment that in my opinion, the biggest (more...)
Permission issues is one of the key error , while setting up Hadoop Cluster, while debugging some error found below table on http://hadoop.apache.org/ . It’s a good scorecard to keep handy.
Permissions for both HDFS and local fileSystem paths
The following table lists various paths on HDFS and local filesystems (on all nodes) and recommended permissions:
Oracle launched their Oracle Big Data SQL product earlier this week, and it’ll be of interest to anyone who saw our series of posts a few weeks ago about the updated Oracle Information Management Reference Architecture, where Hadoop now sits alongside traditional Oracle data warehouses to provide what’s termed a “data reservoir”. In this type of architecture, Hadoop and its underlying technologies HDFS, Hive and schema-on-read databases provide an extension to the more structured (more...)
You may have wondered why we were quiet over the last couple of weeks? Well, we locked ourselves into the basement and did some research and a couple of projects and PoCs on Hadoop, Big Data, and distributed processing frameworks in general. We were also looking at Clickstream data and Web Analytics solutions. Over the next couple of weeks we will update our website with our new offerings, products, and services. The article below summarises (more...)
In every change there are hype machines that over play and sages who call doom. Into the Big Data arena steps David Searls to proclaim that Big Data is a myth and simply hype which is set to burst in an article over at ZDNet.
But big data, he said, is nothing more than the myth that collecting vast amounts of data can help companies know customers better than those customers even know
My paper on NoSQL and Big Data won the Editor’s Choice award at ODTUG Kscope14. Here are some key points from the paper: The relational camp made serious mistakes that limited the performance and usefulness of the relational model. NoSQL is based on the incorrect premise that tables in the relational model must be mapped to […]
In my previous post on our updated Oracle Information Management Reference Architecture, jointly-developed with Oracle’s Enterprise Architecture team, we went through a conceptual and logical view of the information architecture, introducing new concepts like the Raw Data Reservoir, the Data Factory and the Discovery Lab. I said that the Data Factory, Data Reservoir, Enterprise Information Store and Reporting, together with the Discovery Lab, comprised the Information Platform, which could be used to create data (more...)
Last month I have been attending the RittmanMead BI Forum 2014. In the wrap-up I mentioned a presentation by Andrew Bond & Stewart Bryson. They had a very nice presentation about the Oracle Information Management Reference Architecture. This needed some further investigation from my part. This blogpost is a first summary of the information I found online so far. There is […]
One of the things at Rittman Mead that we’re really interested in, is the architecture of “information management” systems and how these evolve over time as thinking, and product capabilities, evolve. In fact we often collaborate with the Enterprise Architecture team within Oracle, giving input into the architecture designs they come up with, and more recently working on a full-blown collaboration with them to come up with a next-generation Information Management architecture. I these two (more...)
Oracle Scene (the publication of United Kingdom Oracle Users Group) has published my article "Hadoop for Oracle Professionals", where I have attempted, like many others, to demystify the terms such as Hadoop, Map/Reduce and Flume. If you were interested in Big Data and what all comes with understanding it, you might find it useful.
A PDF version of the article can be downloaded here http://www.proligence.com/art/oracle_scene_summ14_hadoop.pdf
I'm going to state a sacrilegious position for a moment: the quality of data isn't a primary goal in Master Data Management
Now before the perfectly correct 'Garbage In, Garbage Out' statement let me explain. Data Quality is certainly something that MDM can help with but its not actually the primary aim of MDM.
MDM is about enabling collaboration, collaboration is about the cross-reference
There is a massive amount of IT hype that is focused on what people see, its about the agile delivery of interfaces, about reporting, visualisation and interactional models. If you could weight hype then it is quite clear that 95% of all IT is about this area. Its why we need development teams working hand-in-hand with the business, its why animations and visualisation are massively important.
Scoop, Flume, PIG, Zookeeper. Do these mean anything to you? If they do then the odds are you are looking at Hadoop. The thing is that while that was cool a few years ago it really is time to face it that HDFS is a commodity, Map Reduce is interesting but not feasible for most users and the real question is how we turn all that raw data in HDFS into something we can actually (more...)
Hi Fellow Big Data Admirers ,
With big data and analytics playing an influential role helping organizations achieve a competitive advantage, IT managers are advised not to deploy big data in silos but instead to take a holistic approach toward it and define a base reference architecture even before contemplating positioning the necessary tools.
My latest print media article (5th in the series) for CIO magazine (ITNEXT) talks extensively about need of reference architecture in (more...)
Over the last few years there has been a trend of increased spending on BI, and that trend isn't going away. The analyst predictions however have, understandably, been based on the mentality that the choice was between a traditional EDW/DW model or Hadoop. With the new 'Business Data Lake' type of hybrid approach its pretty clear that the shift is underway for all vendors to have a hybrid
As Hive metastore is getting into the center of nervous system for the different type of SQL engines like Shark and Impala. It getting equally difficult to distinguish type of table created in Hive metastore. Eg. if we create a impala table using impala shell you will see the same table on hive prompt and vice versa. See the below example
Step 1 : “Create Table” in Impala Shell and “Show Table” (more...)
While building a data flow for replacing one of the EDW’ workflow using Big Data technology stack , came across some interesting findings and issues. Due to UPSERT ( INSERT new records or UPDATE existing records depending) nature of data we had to use Hbase, but to expose the outbound feed we need to do some calculation on HBase and publish that to Hive as external. Even though conceptually , its easy to create an (more...)
While looking into HBase performance issue, one of the suggestion was to have more region for a larger table. There was some confusion around, “Region” vs “RegionServer” . While doing some digging, found a simple text written below.
The basic unit of scalability and load balancing in HBase is called a region. Regions are essentially contiguous ranges of rows stored together. They are dynamically split by the system when they become too large. Alternatively, they may (more...)
With increasing data volume , in HDFS space could be continued challenge. While running into some space related issue, following command came very handy, hence thought of sharing with extended virtual community.
hadoop dfsadmin -report
Post running the command, below is the result, it takes all the nodes in the cluster and gives the detail break-up based on the space availability and spaces used.
Configured Capacity: 13965170479105 (12.70 TB)
Present Capacity: 4208469598208 (more...)