How to Analyze Big Data with Hadoop Technologies
With rapid technological innovation and a fast-growing internet population, systems and enterprises are generating enormous amounts of data, running into terabytes and even petabytes. Because this data arrives in huge volumes, at great velocity and in many multi-structured formats such as images, videos, weblogs and sensor data, and from many different sources, there is a strong demand to store, process and analyze it efficiently so that it can actually be used.
Hadoop is the preferred choice for such a requirement because it is reliable, flexible, economical and scalable. While Hadoop provides the ability to store this large-scale data on HDFS (Hadoop Distributed File System), there are multiple solutions available for analyzing it, such as MapReduce, Pig and Hive. As these data analysis technologies have matured, different schools of thought have emerged about which one should be used when, and which is the most efficient.
A well-executed big data analysis can uncover hidden markets, reveal unfulfilled customer demands and cost-reduction opportunities, and drive game-changing improvements in everything from telecommunication efficiency and medical treatment to social media campaigns and related digital marketing promotions.
What is Big Data Analysis?
Big data is mostly generated from social media websites, sensors, devices, video/audio, networks, log files and the web, and much of it is generated in real time and on a very large scale. Big data analytics is the process of examining these large volumes of varied data types, or big data, to uncover hidden patterns, unknown correlations and other useful information.
Advantages of Big Data Analysis
Big data analysis allows market analysts, researchers and business users to develop deep insights from the available data, resulting in numerous business advantages. Business users can analyze the data precisely, and the key early indicators that emerge from this analysis can mean fortunes for the business. Some example use cases are as follows:
- Whenever users browse travel portals or shopping sites, search for flights or hotels, or add a particular item to their cart, ad-targeting companies can analyze this wide variety of data and activity to provide better recommendations on offers, discounts and deals, based on the user's browsing and purchase history.
- In the telecommunications space, if customers are moving from one service provider to another, analyzing huge volumes of call data records can unearth the issues they face. These issues can be as wide-ranging as a significant increase in call drops or network congestion problems. By analyzing them, a telecom company can identify whether it needs to place a new tower in a particular urban area, or whether it needs to revise its marketing strategy for a region where a new player has emerged. In this way, customer churn can be proactively minimized.
Case Study – Stock Market Data
Now let's look at a case study on analyzing stock market data. We will evaluate various big data technologies against a sample 'New York Stock Exchange' dataset, calculate the covariance for the stock data, and aim to solve both the storage and processing problems that come with a huge volume of data.
Covariance is a financial term that represents the degree or amount
that two stocks or financial instruments move together or apart from each
other. With covariance, investors have the opportunity to seek out different
investment options based upon their respective risk profile. It is a
statistical measure of how one investment moves in relation to the other.
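In formula terms, this is the population covariance identity, which is also what the Hive query later in this article computes, month by month, on the daily high prices of two stocks X and Y: Cov(X, Y) = E[XY] - E[X]*E[Y], i.e. the average of the day-by-day products minus the product of the averages.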
A positive covariance means that
asset returns moved together. If investment instruments or stocks tend to be up
or down during the same time periods, they have positive covariance.
A negative covariance means returns
move inversely. If one investment instrument tends to be up while the other is
down, they have negative covariance.
This will help a stock broker recommend stocks to his customers.
Dataset: The sample dataset provided is a comma-separated values (CSV) file named 'NYSE_daily_prices_Q.csv' that contains stock information from the New York Stock Exchange, such as daily quotes, the stock opening price, the stock's highest price, and so on.
The dataset provided is just a small sample of around 3,500 records, but in a real production environment the stock data could run into gigabytes or terabytes. So our solution must also hold up in a real production environment.
Hadoop Data Analysis Technologies
Let's have a look at the existing open-source Hadoop data analysis technologies we could use to analyze the huge volumes of stock data being generated so frequently.
| Feature | MapReduce | Pig | Hive |
| --- | --- | --- | --- |
| Language | Algorithms expressed as map and reduce functions (can be implemented in C, Python, Java) | Pig Latin (scripting language) | SQL-like |
| Schemas/Types | No | Yes (implicit) | Yes (explicit) |
| Partitions | No | No | Yes |
| Server | No | No | Optional (Thrift) |
| Lines of code | More lines of code | Fewer (around 10 lines of Pig = 200 lines of Java) | Fewer than MapReduce and Pig due to its SQL-like nature |
| Development time | More development effort | Rapid development | Rapid development |
| Abstraction | Lower level of abstraction (rigid procedural structure) | Higher level of abstraction (scripts) | Higher level of abstraction (SQL-like) |
| Joins | Hard to achieve join functionality | Joins can be written easily | Joins are easy |
| Structured vs. semi-structured vs. unstructured data | Handles all of these data types | Works on all of these data types | Deals mostly with structured and semi-structured data |
| Complex business logic | More control for writing complex business logic | Less control for writing complex business logic | Less control for writing complex business logic |
| Performance | A fully tuned MapReduce program would be faster than Pig or Hive | Slower than a fully tuned MapReduce program, but faster than badly written MapReduce code | Slower than a fully tuned MapReduce program, but faster than badly written MapReduce code |
Which Data Analysis Technology Should Be Used?
The available sample dataset has the following properties:
- The data has a structured format
- Calculating the stock covariance requires joins
- The data can be organized into a schema
- In a real environment, the data size would be very large
Comparing these criteria against the feature analysis above, we can conclude the following:
- If we use MapReduce, complex business logic has to be written to handle the joins. We would have to think in terms of map and reduce, and decide which piece of code goes on the map side and which on the reduce side. A lot of development effort goes into deciding how map-side and reduce-side joins will take place. We would also not be able to map the data onto a schema; everything would have to be handled programmatically.
- If we use Pig, we would not be able to partition the data, which is useful for processing a subset of the data for a particular stock symbol, date or month. In addition, Pig is more of a scripting language, better suited to prototyping and rapidly developing MapReduce-based jobs. It also does not provide the ability to map our data onto an explicit schema, which seems more suitable for this case study.
- Hive not only provides a familiar programming model for people who know SQL, it also eliminates a lot of the boilerplate and sometimes tricky coding that we would have to do in MapReduce. If we use Hive to analyze the stock data, we can leverage the SQL capabilities of Hive-QL and manage the data in a defined schema. It also reduces development time, and joins between stock data can be expressed in Hive-QL, which is of course quite difficult in MapReduce. Hive also has a Thrift server, through which we can submit Hive queries from anywhere to the Hive server, which in turn executes them. Hive SQL queries are converted into MapReduce jobs by the Hive compiler, freeing programmers from complex low-level programming and letting them focus on the business problem.
So based on the above discussion, Hive
seems the perfect choice for the aforementioned case study.
Problem Solution with Hive
Apache Hive is a data warehousing package built on top of Hadoop that provides data summarization, query and analysis. The query language used by Hive is called Hive-QL and is very similar to SQL.
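For example, once the 'NYSE' table has been created in the steps below, an aggregation over it reads almost exactly like SQL (this particular query is just an illustration, not part of the case study):
hive> select stock_symbol, max(stock_price_high) from NYSE group by stock_symbol;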
Now that we have zeroed in on the data analysis technology, it's time to get our feet wet deriving a solution for the case study.
- Hive Configuration on Cloudera
Follow the steps mentioned in my
previous blog How to Configure
Hive On Cloudera:
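Once Hive is configured, you can launch the Hive shell by typing 'hive' in a terminal and confirm that it responds. This is just a quick sanity check, not a step from the linked post:
hive> show databases;
hive> show tables;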
Use the 'create table' Hive command to create the Hive table for the provided CSV dataset:
hive> create table NYSE (exchange String, stock_symbol String, stock_date String, stock_price_open double, stock_price_high double, stock_price_low double, stock_price_close double, stock_volume double, stock_price_adj_close double) row format delimited fields terminated by ',';
This creates a Hive table named 'NYSE' whose rows are delimited and whose fields are terminated by commas. The schema is stored in the embedded Derby database, as configured in the Hive setup. By default, Hive stores metadata in an embedded Apache Derby database, but it can be configured to use other databases such as MySQL, SQL Server or Oracle.
- Load CSV Data into Hive Table
Use the following Hive command to load the CSV data file into the Hive table:
hive> load data local inpath '/home/cloudera/NYSE_daily_prices_Q.csv' into table NYSE;
This loads the dataset from the given location into the Hive table 'NYSE' created above. The data is stored in the Hive-controlled file system namespace on HDFS, so it can be further batch-processed by MapReduce jobs or Hive queries.
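Since the ability to partition data was one of the reasons for choosing Hive over Pig, the loaded data could also be kept in a partitioned table. The following sketch is illustrative rather than part of the original steps; the table name 'NYSE_partitioned' is hypothetical, and the two 'set' commands enable Hive's dynamic partitioning:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> create table NYSE_partitioned (exchange String, stock_date String, stock_price_open double, stock_price_high double, stock_price_low double, stock_price_close double, stock_volume double, stock_price_adj_close double) partitioned by (stock_symbol String) row format delimited fields terminated by ',';
hive> insert overwrite table NYSE_partitioned partition (stock_symbol) select exchange, stock_date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close, stock_symbol from NYSE;
With this layout, a query that filters on a single stock symbol reads only the matching partition instead of scanning the whole table.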
We can calculate the covariance for the provided stock dataset for a given year using the following Hive select query:
select a.STOCK_SYMBOL, b.STOCK_SYMBOL, month(a.STOCK_DATE),
(AVG(a.STOCK_PRICE_HIGH*b.STOCK_PRICE_HIGH) - (AVG(a.STOCK_PRICE_HIGH)*AVG(b.STOCK_PRICE_HIGH)))
from NYSE a join NYSE b on a.STOCK_DATE=b.STOCK_DATE
where a.STOCK_SYMBOL<b.STOCK_SYMBOL and year(a.STOCK_DATE)=2008
group by a.STOCK_SYMBOL, b.STOCK_SYMBOL, month(a.STOCK_DATE);
This Hive select query triggers a MapReduce job, and the covariance results of the stock data analysis are as follows:
The covariance has been calculated between each pair of stocks, for every month of the available year.
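If you want to keep these results for later inspection rather than just reading them off the console, the same query can be wrapped in a CREATE TABLE ... AS SELECT statement. This is only a sketch; the table name 'stock_covariance' and the column aliases are illustrative and not part of the original walkthrough:
hive> create table stock_covariance as
select a.STOCK_SYMBOL as symbol_a, b.STOCK_SYMBOL as symbol_b, month(a.STOCK_DATE) as stock_month,
(AVG(a.STOCK_PRICE_HIGH*b.STOCK_PRICE_HIGH) - (AVG(a.STOCK_PRICE_HIGH)*AVG(b.STOCK_PRICE_HIGH))) as covariance
from NYSE a join NYSE b on a.STOCK_DATE=b.STOCK_DATE
where a.STOCK_SYMBOL<b.STOCK_SYMBOL and year(a.STOCK_DATE)=2008
group by a.STOCK_SYMBOL, b.STOCK_SYMBOL, month(a.STOCK_DATE);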
From the covariance results, stock brokers or fund managers can make the following recommendations:
- Stocks QRR and QTM show positive covariance more often than negative covariance, so there is a high probability that these stocks will move together in the same direction.
- Stocks QRR and QXM mostly show negative covariance, so there is a greater probability of their prices moving in opposite directions.
- Stocks QTM and QXM show positive covariance in most months, so they tend to move in the same direction most of the time.
In a similar way, we can analyze other big data use cases, explore the possible solutions for each, and then use the comparison chart to narrow down the best one.
Conclusion/Benefits
This case study achieves the following two important goals of big data technologies:
By storing the huge stock data in HDFS, the solution is robust, reliable, economical and scalable. Whenever the data size increases, you simply add more nodes and configure them into the Hadoop cluster. If a node goes down, other nodes are ready to take over its work thanks to data replication.
By managing the Hive schema in the embedded database or any other standard SQL database, we are able to utilize the power of SQL as well.
Because the Hive schema is managed in a standard SQL database, you get the benefit of writing SQL-style queries over the huge dataset and can process gigabytes or terabytes of data with simple queries. Since the actual data resides in HDFS, these Hive queries are converted into MapReduce jobs, and these parallelized jobs process the huge volume of data to deliver a scalable and fault-tolerant solution.