OLAP Over Hadoop
In the last few years Hadoop has really come forward as a massively scalable distributed computing platform. Most of us are aware that it uses MapReduce jobs to perform computation over Big Data, which is mostly unstructured. Of course, such a platform cannot be compared with a relational database storing structured data with a defined schema. While Hadoop lets you perform deep analytics with complex computations, it seems to lag behind when it comes to multidimensional analytics over data. You might argue that Hadoop was not even built for such uses. But when users start putting their historical data in Hadoop, they also start expecting multidimensional analytics over it in real time. Here “real time” is really important.
Some of you might think that you can define an OLAP-friendly star schema in Hive for your data in Hadoop and use a ROLAP tool. But there is a catch. Even on partially aggregated data, the ROLAP queries will be too slow for real-time OLAP. Because Hive structures the data at read time, the fixed startup time taken by each Hive query makes Hadoop practically unusable for real-time multidimensional analytics.
That leaves you with two options. The first is to aggregate the data in Hadoop and bring the partially aggregated data into an RDBMS. You can then use any standard OLAP tool to connect to your RDBMS and perform multidimensional analytics using ROLAP or MOLAP. While ROLAP fires its queries directly against the database, MOLAP further summarizes and aggregates the multidimensional data into the cuboids that make up a cube.
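To make the idea of cuboids concrete, here is a minimal sketch in plain Python. It is not how any particular MOLAP tool works internally; the function name `compute_cuboids`, the sample rows, and the column names (`region`, `product`, `year`, `sales`) are all made up for illustration. It simply computes one summed aggregate for every combination of dimensions, which is what a full cube with n dimensions amounts to (2^n cuboids).

```python
from collections import defaultdict
from itertools import combinations


def compute_cuboids(rows, dimensions, measure):
    """Return {group_by_dimensions: {dimension_values: summed_measure}}.

    Every subset of `dimensions` is one cuboid; the empty subset is the
    grand total (the apex cuboid).
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


# Hypothetical, partially aggregated fact rows as they might arrive from Hadoop.
rows = [
    {"region": "EU", "product": "A", "year": 2012, "sales": 10.0},
    {"region": "EU", "product": "B", "year": 2012, "sales": 5.0},
    {"region": "US", "product": "A", "year": 2013, "sales": 7.0},
]

cube = compute_cuboids(rows, ["region", "product", "year"], "sales")
eu_sales = cube[("region",)][("EU",)]  # sales rolled up to region level
```

Once the cuboids are precomputed like this, answering a roll-up query is a dictionary lookup rather than a scan, which is where MOLAP gets its speed. The price is that the number of cuboids doubles with every added dimension.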
The other option is to use a MOLAP tool that can compute the aggregates over the data in Hadoop and fetch the computed cube locally. This gives you truly real-time OLAP. Moreover, if the aggregation can be performed in Hadoop itself, the cube computation becomes scalable and fast.
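As a rough illustration of why pushing the aggregation into Hadoop scales, the sketch below simulates the map and reduce steps in plain Python. It does not use any Hadoop API; `map_record`, `reduce_pairs`, `run_job`, the dimension names, and the input splits are hypothetical. The point is only that sums can be computed per split and then merged, so each node works on its own share of the data.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimensions of the cube; "sales" is the assumed measure.
DIMENSIONS = ("region", "product")


def map_record(record):
    """Map step: emit one (cuboid key, measure) pair per cuboid."""
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            values = tuple(record[d] for d in dims)
            yield (dims, values), record["sales"]


def reduce_pairs(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(splits):
    """Aggregate each split on its own, then merge the partial results."""
    partials = [
        reduce_pairs(pair for record in split for pair in map_record(record))
        for split in splits
    ]
    return reduce_pairs(item for partial in partials for item in partial.items())


# Two input splits, standing in for blocks of data on different nodes.
splits = [
    [
        {"region": "EU", "product": "A", "sales": 10.0},
        {"region": "EU", "product": "B", "sales": 5.0},
    ],
    [
        {"region": "US", "product": "A", "sales": 7.0},
        {"region": "EU", "product": "A", "sales": 3.0},
    ],
]

cube = run_job(splits)
eu_total = cube[(("region",), ("EU",))]
```

This works because a sum is distributive: the total over all the data equals the sum of the per-split totals. Measures such as a distinct count do not merge this simply and would need a different approach.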