
Monday 27 June 2016

Real-Time Deep Analytics from Unstructured Data


One of the biggest challenges in Big Data analytics is unstructured data. As we saw earlier, most of the big data generated is semi-structured or unstructured. Structured data is inherently relational and record oriented, with a defined schema that makes it easy to query and analyze. To analyze unstructured data, however, you first need to extract structure from it.

Now the problem is that the process of structuring the data can itself be very complex given the sheer volume involved. Sometimes the computations required to structure the data are complex, say entity extraction from natural-language text. Sometimes the data is generated at a faster pace than your ETL tool can structure it.

Moreover, sometimes you don't even know what the structure of the data should be. You know that the big unstructured data you have collected holds a lot of value, but you don't know where that value is, so it becomes difficult to structure the data at the time of collection and loading. Rather, you want to delay the structuring of the data until you actually understand your exact analytics needs.

Another challenge is carrying out complex computations over big data. Sometimes your analysis will involve querying the data for simple summaries and statistics, or multidimensional analytics over big data. Sometimes, however, you actually want to perform complex computations to carry out deep analytics. You might want your system to mine your data and extract knowledge out of it, so that you are not only aware of what has happened or what is happening, but are also able to predict what will happen in the future.

Moreover, you always want to keep the latency of the analytical queries as low as possible. You want the time required to process this huge amount of data to be as low as possible: you want to reduce days into hours, hours into minutes, and minutes into seconds. You almost want near-real-time analytics. So on one side your data is continuously increasing, and on the other side you want to reduce the processing time, two contradictory demands as such.

Approaches for Big Data Analytics


In general, for big data analytics you will need a BI tool over one of the storage options that we discussed earlier. The BI tool provides a visual interface to query the data and extract information and knowledge out of it so as to make intelligent decisions. Let us see the possible approaches one by one:

Direct Analytics over MPP DW


The first approach for big data analytics is using a BI tool directly over any of the MPP DWs. Generally these DWs allow a BI tool to connect to them using a JDBC or ODBC interface, with SQL as the means to get the data for analytics. For any analytical request by the user, the BI tool sends SQL queries to these DWs. The DWs execute the queries in parallel across the cluster and return the data to the BI tool for further analytics.

Some of these DWs also allow you to write MapReduce UDFs that can be used within SQL to perform procedural computations over big data in a parallel manner. This is also called in-database analytics, which means that the BI tool does not need to take the data out of the DW to perform complex computations over it; rather, the computations can be performed in the form of UDFs inside the database.
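As a rough single-node sketch of the in-database pattern, the snippet below uses SQLite (via Python's sqlite3 module) as a stand-in for an MPP DW; the extract_domain function and the raw_contacts table are invented for illustration. A real MPP engine would register the UDF through its own framework and run it on every node in parallel.

```python
import sqlite3
import re

# SQLite stands in for the MPP DW: register a custom function that the
# database calls during query execution, so data never leaves the engine.
conn = sqlite3.connect(":memory:")

def extract_domain(email):
    """Toy 'structuring' UDF: pull the domain out of a raw email address."""
    match = re.search(r"@([\w.-]+)$", email or "")
    return match.group(1) if match else None

conn.create_function("extract_domain", 1, extract_domain)

conn.execute("CREATE TABLE raw_contacts (email TEXT)")
conn.executemany(
    "INSERT INTO raw_contacts VALUES (?)",
    [("alice@example.com",), ("bob@shop.org",), ("carol@example.com",)],
)

# The UDF runs inside the SQL query itself (in-database analytics).
rows = conn.execute(
    "SELECT extract_domain(email) AS domain, COUNT(*) "
    "FROM raw_contacts GROUP BY domain ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('example.com', 2), ('shop.org', 1)]
```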

An important point to note here is that the data needs to be structured before a BI tool can do analytics over it. Either you use an ETL tool to extract the structure, or you load the unstructured data into a column and use in-database computations in the form of MR functions to structure it.
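The MR-style structuring step can be sketched in plain Python. The log format, field names, and the map/reduce functions below are all hypothetical; a real job would run as a distributed MapReduce program rather than over in-process lists.

```python
from collections import defaultdict

# Hypothetical raw, unstructured log lines (the kind you would land in HDFS).
raw_logs = [
    "2016-06-27 10:01 user=alice action=login",
    "2016-06-27 10:02 user=bob action=view",
    "2016-06-27 10:05 user=alice action=view",
]

def map_phase(line):
    """Map: extract a (user, 1) pair from each unstructured line."""
    fields = dict(kv.split("=") for kv in line.split() if "=" in kv)
    yield (fields["user"], 1)

def reduce_phase(pairs):
    """Reduce: sum counts per user, i.e. derive a structured summary."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

mapped = [pair for line in raw_logs for pair in map_phase(line)]
structured = reduce_phase(mapped)
print(structured)  # {'alice': 2, 'bob': 1}
```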

Generally, if the data is structured then this can prove to be a good approach, as an MPP database enjoys all the performance enhancement techniques of the relational world, like indexing, aggregations, compression, materialized views, and result caching. However, the cost of such a solution is at the higher end, which is worth considering.
 
There can be a big fight over the point that Hadoop is not a DBMS, but when Hadoop reaches users and organizations who look to use it just because it is a buzzword, they expect almost anything out of it that a DBMS can do. You should see such solutions growing in the near future.






Indirect Analytics over Hadoop



Another interesting approach that might suit you is analytics over Hadoop data, not directly but indirectly: by first processing, transforming, and structuring it inside Hadoop, and then exporting the structured data into an RDBMS. The BI tool works with the RDBMS to provide the analytics.

Generally one would go for such an approach when the generated data is huge and unstructured, the computations required to derive structure from it are complex and time consuming, and it is possible to partially process and summarize the data before doing the actual analytics. In such cases the huge amount of unstructured or semi-structured data can be stored in the Hadoop system. MR jobs take care of structuring and summarizing it, after which it can easily be put into any standard RDBMS over which a BI tool can work.
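A toy end-to-end version of this pipeline, with an in-process loop standing in for the Hadoop MR jobs and SQLite standing in for the RDBMS (the table and column names are invented for the example):

```python
import sqlite3

# Step 1 (stand-in for MR jobs in Hadoop): summarize raw events offline.
raw_events = [("alice", "view"), ("bob", "view"), ("alice", "buy")]
summary = {}
for user, action in raw_events:
    summary[(user, action)] = summary.get((user, action), 0) + 1

# Step 2: export the now-small structured summary into an RDBMS
# (SQLite stands in for any standard RDBMS here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity_summary (usr TEXT, action TEXT, cnt INTEGER)")
conn.executemany(
    "INSERT INTO activity_summary VALUES (?, ?, ?)",
    [(u, a, c) for (u, a), c in summary.items()],
)

# Step 3: the BI tool queries the RDBMS with ordinary low-latency SQL.
rows = conn.execute(
    "SELECT usr, SUM(cnt) FROM activity_summary GROUP BY usr ORDER BY usr"
).fetchall()
print(rows)  # [('alice', 2), ('bob', 1)]
```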

Please note that if the structured and summarized data is still too big to fit in an RDBMS, then the RDBMS can be replaced by an MPP DW as well. If an RDBMS is used here, then at a moderate cost it can provide you real-time analytics over your data.




Direct Analytics over Hadoop



The last approach is performing analytics directly over Hadoop. In this case, all the queries that a BI tool wants to execute against the data are executed as MR jobs over the big unstructured data placed in Hadoop. The complication with this approach is how a BI tool connects to the Hadoop system, as MR jobs are the only way to process the data in Hadoop. However, components of the Hadoop ecosystem like Hive and Pig allow one to connect to Hadoop using high-level interfaces. Hive allows you to define a structured meta layer over Hadoop and supports a SQL-like query language called HiveQL. It also implements interfaces like JDBC that a BI tool can easily use to connect to it. Hive is also extensible enough to allow implementing custom UDFs to work on the data and SerDe classes to structure the data at run time.

Such an approach has low cost, but it is bound to be a high-latency approach for analytics over big unstructured data, as it requires transforming the data and extracting structure out of it at run time. However, the good thing is that one does not need to worry about the data schema and modeling until he or she is clear about the analytics needs.

As opposed to the other approaches, here the data is structured at read time rather than write time. So if one has big unstructured data and batch analysis can suffice for his or her needs, then this is a good solution. One surely enjoys the scalability and fault tolerance of Hadoop, and that too with a cluster of commodity servers which do not necessarily need to be homogeneous.
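Schema-on-read can be illustrated in a few lines of Python. The csv_serde function below is a made-up, drastically simplified analogue of a Hive SerDe: it projects a schema onto raw bytes only at the moment they are read.

```python
import csv
import io

# Raw bytes sit untyped in storage; structure is applied only when read,
# by a deserializer chosen at query time (the role Hive's SerDe classes play).
raw_file = "alice,28\nbob,34\ncarol,41\n"

def csv_serde(stream):
    """Toy 'SerDe': project a (name TEXT, age INT) schema onto raw lines."""
    for name, age in csv.reader(stream):
        yield {"name": name, "age": int(age)}

# Schema-on-read: the same bytes could be re-read tomorrow with a different
# SerDe, and therefore a different schema -- nothing was fixed at write time.
records = list(csv_serde(io.StringIO(raw_file)))
adults_over_30 = [r["name"] for r in records if r["age"] > 30]
print(adults_over_30)  # ['bob', 'carol']
```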

Which approach should I go with?



The biggest question that would come to everybody's mind after reading this blog is which approach he or she should go with. So, as you can see, if you want a highly scalable, fault-tolerant, and low-cost solution that allows you to do complex analytics over unstructured data, then you might opt to go with Direct Analytics over Hadoop.

If you are looking for an easy-to-use solution that allows you to do complex and near-real-time analytics over huge structured data with minimal IT effort, then you might opt to go with Direct Analytics over MPP DW. Finally, if you are looking for a solution that provides flexibility in terms of the structure of the data and allows you to do real-time analytics over it, then you might opt to go with Indirect Analytics over Hadoop.

 

Big Data Storage & Processing


Let's see the purpose-built storage options that allow you to store and process big data in a scalable, fault-tolerant, and efficient manner. You know what, this has been the most innovative sector of the business intelligence industry: database vendors, both new and old, have shipped a number of new products for big data storage and processing in the last few years. A lot of progress has also been made on open source platforms. Here is a high-level categorization of these products.

The first category includes massively parallel processing (MPP) data warehouses that are designed to store huge amounts of structured data across a cluster of servers and perform parallel computations over it. Most of these solutions follow a shared-nothing architecture, which means that every node has its own dedicated disk, memory, and processor. All the nodes are connected via high-speed networks. As they are designed to hold structured data, generally you would use an ETL tool to extract the structure from the data and populate these data sources with the structured data.
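A minimal sketch of shared-nothing parallel aggregation, with threads standing in for cluster nodes and hard-coded lists standing in for each node's local partition (an actual MPP DW plans and ships such partial aggregates automatically):

```python
from concurrent.futures import ThreadPoolExecutor

# Each "node" in a shared-nothing cluster holds its own partition of the
# table and aggregates it independently; only the small partial results
# travel over the network to be merged. Threads stand in for nodes here.
partitions = [
    [("east", 100), ("west", 50)],   # node 1's local data
    [("east", 25), ("north", 75)],   # node 2's local data
    [("west", 10)],                  # node 3's local data
]

def local_aggregate(rows):
    """Runs on one node, over its own disk-resident partition."""
    partial = {}
    for region, sales in rows:
        partial[region] = partial.get(region, 0) + sales
    return partial

def merge(partials):
    """The coordinator merges the small per-node partial aggregates."""
    combined = {}
    for partial in partials:
        for region, sales in partial.items():
            combined[region] = combined.get(region, 0) + sales
    return combined

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(local_aggregate, partitions))
total = merge(partials)
print(total)  # {'east': 125, 'west': 60, 'north': 75}
```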

These MPP Data Warehouses include:

MPP Databases — generally distributed systems designed to run on a cluster of commodity servers.
Examples: Aster nCluster, Greenplum, DATAllegro, IBM DB2, Kognitio WX2, Teradata, etc.

Appliances — purpose-built machines with preconfigured MPP hardware and software designed for analytical processing.
Examples: Oracle Optimized Warehouse, Teradata machines, Netezza Performance Server, and Sun's Data Warehousing Appliance.

Columnar Databases — these store data in columns instead of rows, allowing greater compression and faster query performance.
Examples: Sybase IQ, Vertica, InfoBright Data Warehouse, and ParAccel.

Most of them provide SQL and UDFs to process the data.

Another category includes distributed file systems like Hadoop, which allow us to store huge unstructured data and perform MapReduce computations on it over a cluster built of commodity hardware.
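The compression benefit of the columnar layout can be seen with simple run-length encoding; the table, columns, and RLE scheme below are illustrative only (real columnar engines use more sophisticated encodings such as dictionary and delta compression):

```python
from itertools import groupby

# Row store: each record is stored whole. A column store flips the layout
# so all values of one column sit together, which compresses far better.
rows = [("IN", 2016), ("IN", 2016), ("IN", 2015), ("US", 2015), ("US", 2015)]

# Column-oriented layout: one sequence per column.
country_col = [r[0] for r in rows]
year_col = [r[1] for r in rows]

def run_length_encode(values):
    """Simple RLE: adjacent duplicates collapse to (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

encoded_country = run_length_encode(country_col)
print(encoded_country)  # [('IN', 3), ('US', 2)]

# A query touching one column also reads only that column's blocks:
rows_in_2015 = sum(count for year, count in run_length_encode(year_col)
                   if year == 2015)
print(rows_in_2015)  # 3
```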

Big Data - The Data Deluge


In today's world, almost every enterprise is seeing an explosion of data, with huge amounts of digital data generated daily. Almost every growing organization wants to automate most of its business processes and is using IT to support every conceivable business function. This results in huge amounts of data being generated in the form of transactions and interactions. The web has become an important interface for interactions with suppliers and customers, generating huge amounts of data in the form of emails and the like. Besides this, there is a huge amount of data emitted automatically in the form of logs, like network logs and web server logs.

Various telecom service providers get huge amounts of data in the form of conversations and Call Data Records. Social networking sites have started getting terabytes of data every day in the form of tweets, blogs, comments, photos, and videos; Facebook generates 4 TB of compressed data every day. Web companies like these get huge amounts of clickstream data daily as well. Hospitals have data about patients and their diseases, plus the data generated by various medical devices. Sensors in production machinery generate large volumes of event data every second. Almost every sector, like transport and finance, is seeing a tsunami of data.

Such huge amounts of data need to be stored for various reasons. Sometimes compliance demands that more historical data be stored. Sometimes organizations want to store, process, and analyse this data for intelligent decision making, to gain a competitive advantage. For example, analyzing CDR data can help a service provider understand their quality of service and then make the necessary improvements. A credit card company can analyze customer transactions for fraud detection. Server logs can be analyzed for fault detection. Web logs can help understand user navigation patterns. Customer emails can help understand customer behavior and interests, and sometimes problems with the products as well. Now the important question that arises at this point is: how do we store and process such a huge amount of data, most of which is semi-structured or unstructured?


OLAP Over Hadoop


In the last few years Hadoop has really come forward as a massively scalable distributed computing platform. Most of us are aware that it uses MapReduce jobs to perform computation over big data, which is mostly unstructured. Of course such a platform cannot be compared with a relational database storing structured data with a defined schema. While Hadoop allows you to perform deep analytics with complex computations, when it comes to performing multidimensional analytics over data, Hadoop seems to lag. You might argue that Hadoop was not even built for such uses. But when users start putting their historical data in Hadoop, they also start expecting multidimensional analytics over it in real time. Here "real time" is really important.

Some of you might think that you can define an OLAP-friendly warehousing star schema using Hive for your data in Hadoop and then use a ROLAP tool. But there comes the catch. Even on partially aggregated data, the ROLAP queries will be too slow to make it real-time OLAP. As Hive structures the data at read time, the fixed initial time taken for each Hive query makes Hadoop practically unusable for real-time multidimensional analytics.

One option left to you is to aggregate the data in Hadoop and bring the partially aggregated data into an RDBMS. You can then use any standard OLAP tool to connect to your RDBMS and perform multidimensional analytics using ROLAP or MOLAP. While ROLAP fires the queries directly against the database, MOLAP further summarizes and aggregates the multidimensional data in the form of cuboids for a cube.

The other option is to use a MOLAP tool that can compute the aggregates for the data in Hadoop and bring the computed cube locally. This allows you to do truly real-time OLAP. Moreover, if the aggregates can be computed in Hadoop itself, that will make cube computations scalable and fast.
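Cube computation, i.e. materializing one aggregate per cuboid, can be sketched in a few lines; the fact rows and dimension names below are invented, and a real MOLAP tool (or a Hadoop job) would do this at scale:

```python
from itertools import combinations
from collections import defaultdict

# Fact rows: (region, product, sales). A MOLAP cube pre-aggregates every
# subset of the dimensions; each subset is one "cuboid".
facts = [
    ("east", "book", 10),
    ("east", "pen", 5),
    ("west", "book", 7),
]
dimensions = ("region", "product")

def compute_cuboids(rows, dims):
    """Aggregate SUM(sales) for every combination of grouping dimensions."""
    cuboids = {}
    for k in range(len(dims) + 1):
        for group_dims in combinations(range(len(dims)), k):
            agg = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in group_dims)
                agg[key] += row[-1]
            name = tuple(dims[i] for i in group_dims)
            cuboids[name] = dict(agg)
    return cuboids

cube = compute_cuboids(facts, dimensions)
print(cube[()])           # {(): 22} -- the grand total cuboid
print(cube[("region",)])  # {('east',): 15, ('west',): 7}
```

With the cuboids materialized up front, any group-by query becomes a dictionary lookup, which is what makes MOLAP responses feel instantaneous.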

Big Data stretching the scope of BI

Sometimes I wonder, looking at how "Business Intelligence" is moving today. Experts in the field are trying their best to stretch the scope of this domain as much as possible. When it comes to storing the data at the back end, we are moving as far backward as possible, while when it comes to displaying the information, we are moving as far forward as possible.

Remember those days when you used to store the data in flat files, possibly on your local system. Then you faced the problems of data management and size. You moved further backwards and started to store the data in an RDBMS set up on remote machines and distributed across multiple nodes. Now you see even bigger data, and you find it ever more difficult to manage on your own premises. Guess what, now you decide to go further backwards, possibly out of your premises. You start looking for big data storage options located remotely, either in the form of data centres or as clouds. Clouds sound like an attractive option, as we can find many cheaper offerings providing gigantic storage space with big data processing frameworks already set up; Amazon Elastic MapReduce is a good example. Even if we want to use some other commercial solution for big data processing, setting it up on the cloud should not be a big problem. The safety and security challenges associated with clouds still remain, though, and we can still argue for hours over whether it is a good strategic decision to move to clouds, just like people do today in the enterprise. Leaving behind all these arguments and challenges, clouds are gaining more and more popularity day by day. So don't be surprised if your kid modifies his or her understanding of the cloud. Gone are the days when clouds were found only in the sky.

While on one side we see data moving further backwards, on the other side we can see the information moving further forward. Earlier you used to get information in the form of reports on paper. Somebody would prepare the reports for you, get them printed, and finally bring them to you. Then you started getting reports on your computer screen by connecting a thick desktop-based viewer on your terminal to the reporting solution. Then you got ad hoc analytics over browsers that allowed you to play with your data, and that too from any location over the web. Now you want real-time interactive ad hoc analytics on handheld devices: mobiles and tablets. It's amazing to see the BI solutions in the market today allowing you to do real-time ad hoc analytics, on your iPad, over your big data stored in some cloud. It feels great to see the important yet horrible big data appearing in the form of really pretty charts, widgets, and dashboards, that too on devices like iPads. So now you don't need to be worried. Just go wherever you want to go; you are still never far from making important strategic decisions instantly.

Saturday 25 June 2016

Big Data


Big Data


The term Big Data is being used increasingly almost everywhere on the planet, online and offline. And it is not related only to computers. It comes under the blanket term Information Technology, which is now part of almost all other technologies, fields of study, and businesses. Big Data is not a big deal in itself, but the hype surrounding it is surely a big enough deal to confuse you. This article takes a look at what Big Data is. It also contains an example of how NetFlix used its data, or rather its Big Data, to better serve its clients' needs.


What is Big Data


The data lying in the servers of your company was just data until yesterday, sorted and filed. Suddenly the term Big Data got popular, and now the data in your company is Big Data. The term covers each and every piece of data your organization has stored till now. It includes data stored in clouds and even the URLs that you bookmarked. Your company might not have digitized all the data. You may not have structured all of it yet. But all the digital, paper, structured, and unstructured data with your company is now Big Data.

In short, all the data, whether or not categorized, present in your servers is collectively called BIG DATA. All this data can be used to get different results using different types of analysis. It is not necessary that all analyses use all the data. Different analyses use different parts of the BIG DATA to produce the results and predictions necessary.

Big Data is essentially the data that you analyze for results that you can use for predictions and other purposes. When using the term Big Data, suddenly your company or organization is working with top-level information technology to deduce different types of results from the same data that you stored, intentionally or unintentionally, over the years.


How big is Big Data


Essentially, all the data combined is Big Data, but many researchers agree that Big Data as such cannot be manipulated using normal spreadsheets and regular database management tools. It needs special analysis tools like Hadoop (we'll study this in a separate post) so that all the data can be analyzed in one go (which may include iterations of analysis).

Contrary to the above, though I am not an expert on the subject, I would say that the data with any organization, big or small, organized or unorganized, is Big Data for that organization, and that the organization may choose its own tools to analyze it.


Big Data Concepts


This is another point where most people don't agree. Some experts say that the Big Data concepts are three Vs:
1. Volume
2. Velocity
3. Variety
Others add a few more Vs to the concept:
4. Visualization
5. Veracity (Reliability)
6. Variability
7. Value

What are the Uses of Big Data


Businesses have long depended on whatever data they had to analyze trends, behavior (of goods and/or users), impacts, overall profits, and so on. With the kind of data they now possess, thanks to the Internet, the computing goes beyond simple spreadsheets to provide them with much more accurate results. Furthermore, Big Data enables them to perform more kinds of analysis to keep the business healthy, profitable, and always on a path to growth.


Big Data Consumption


Industries Already Using Big Data: They Started Early

A] Financial Institutions: Dealing mainly with your money, these industries rely on Big Data to check previous trends and make predictions. Earlier, data was scarce, so predictions came with a bigger margin of risk. That risk is now reduced thanks to access to more data. Share markets, banks, and other financial institutions may also be checking your spending habits to derive some sort of equation that helps you retain maximum profits. The following chart will assist you in understanding how financial institutions use Big Data. It will also give you an idea of how Big Data can be used.




B] Retail Marketing: The first thing that strikes the mind when talking about retail is the consumption of goods, area-wise or age-wise. Yes, you can use Big Data to tell how your goods are being used, by whom, and which types of goods. More than that, you can also focus on improving products and even introducing new products based on the ones that are succeeding. The other side of using Big Data in retail marketing is figuring out prospects (don't forget the online window shoppers), the prospect-to-client conversion rate and techniques, client retention, and similar areas.

C] Government and Public Sector: How can we forget the government when it comes to data? Government and public sector units collect more data than any other sector. You could say they are drowning in data even as they digitize it and store it on their servers or clouds worldwide. According to a whitepaper by IDC:
"As government leaders across the spectrum strive to become a data-driven organization to successfully accomplish their missions, they are laying the groundwork to correlate dependencies across events and track dependencies across people, processes, and information."
Overall, this sector gains in terms of productivity, as it can track the speed and accuracy of the different projects it runs. It can then analyze the data to find better methods of improving performance. There are quite a few other benefits too, such as tracking people to serve them better with healthcare, employment, etc.

D] Communications Sector: Another area where Big Data plays an important role, from acquiring customers to enhancing, or at the least maintaining, the class of service provided to them, and recovering bad debts too!
Since they want their services always up and running, they can use Big Data both for the above and within their own infrastructure to project potential growth as the years go by. They come to know the bandwidth requirements, they can identify fake customers and customers no longer using their services (which helps in bringing them back), and they can mitigate risk in case of a sudden increase in demand, and much more; virtually any part of the business you can think of.

E] Media and Entertainment Businesses: The main focus here is customer retention, which is sometimes more important than customer acquisition. Big Data at hand helps in checking out what kinds of media different users enjoy, and based on that, the media houses develop better content of those types.
They focus on age groups and divide the production of artifacts according to the analysis results. At the same time, they have to find out what kind of advertising the different age groups actually engage with, instead of simply watching. Earlier it was not possible to get that much data, but thanks to Internet marketing agencies and the compilation of data over the years instead of simply flushing it, they can make real-time decisions and take appropriate actions for both customers and staff. And this is just the beginning. There are no limits on what you might wish to know. With the right kind of data in hand, you can always get accurate results.