
Saturday, 23 July 2016

Difference between Pig and Hive: The Two Key Components of the Hadoop Ecosystem

Pig and Hive are two key components of the Hadoop ecosystem. What problem do they solve? They share a goal: both spare developers from writing complex Java MapReduce programs by hand. The question most developers have is when to use Pig Latin and when to use HiveQL. This post briefly introduces both Apache Hive and Apache Pig. In diagrams of the Hadoop ecosystem, Hive and Pig occupy the same layer, which naturally raises the question of which one is better. It’s Pig vs Hive (Yahoo vs Facebook).

There is no simple way to compare Pig and Hive without looking in detail at how each helps process large amounts of data. This post compares some of the prominent features of Pig and Hive to help users understand the similarities and differences between them.

Hadoop is the buzzword these days, but many IT professionals are still not aware of the key components that make up the Hadoop ecosystem. Some even think that Big Data and Hadoop are one and the same.
Before we move on to a detailed discussion of the key components of the Hadoop ecosystem and the differences between them, let us first understand what Hadoop is and what Big Data is.

What is Big Data and Hadoop?

Data is generally categorized into three types: structured, semi-structured and unstructured.
Structured data fits a fixed schema and can be stored directly in relational database tables, for instance the transaction records of an online purchase. Semi-structured data carries some organizing markers, such as tags, but no rigid schema, so it maps only partially onto database tables; XML records are an example.
Any data that is neither structured nor semi-structured is referred to as unstructured data, for instance free text and media from social networking websites, which cannot easily be stored or analyzed in a traditional database. Web logs are often cited as another example, although they usually have enough regularity to be treated as semi-structured.
Big Data is not simply another name for unstructured data: the term refers to data sets whose volume, velocity and variety exceed what traditional databases handle well, and unstructured data makes up a large share of them. Hadoop is one of the most popular frameworks for storing and processing Big Data.

The Hadoop ecosystem comprises the following key components:

1) MapReduce Framework
2) HDFS (Hadoop Distributed File System)
3) Hive
4) HBase
5) Pig
6) Flume
7) Sqoop
8) Oozie
9) ZooKeeper



Is the battle of Hive vs Pig real? Do the two have the same advantages and disadvantages when processing enormous amounts of data? The answer is no. There is no Hive vs Pig in the real world, only the initial uncertainty about which tool suits the need. The Hive query language (HiveQL) suits analytics and reporting, while Pig suits large-scale data transformation pipelines. Pig was developed as an abstraction that avoids the verbose Java code needed for MapReduce. HiveQL, on the other hand, is based on SQL, which makes it easier to learn for those who know SQL. Both tools can read and write Avro, a compact serialization format. When it comes down to choosing between Pig and Hive, consider how well each component suits the business logic at hand and decide on that basis.
Just as there is a Hive vs Pig debate, there is continued discussion of HBase vs Hive. In the ecosystem diagram, HBase spans more layers than Hive, so there is no real HBase vs Hive. Looking deeper, Hive compiles queries into MapReduce jobs that run as batch operations over the data, whereas HBase is a data store that works directly on top of HDFS and serves individual reads and writes. Both present data in a structured, table-like form.

HIVE Hadoop

Hive was developed at Facebook, where Jeff Hammerbacher led the data team at the time. The team realized that Facebook received huge amounts of data on a daily basis and needed a mechanism to store, mine and analyze it. This need gave birth to Hive, which has enabled Facebook to deal with tens of terabytes of data a day with ease.

HiveQL is modeled on SQL: its SELECT, WHERE, GROUP BY and ORDER BY clauses are similar to those of SQL for relational databases. The trade-off is that users give up some control over how a query executes, because they rely on the Hive optimizer.
Hive acts as a SQL interface to Hadoop. Data stored in the HBase component of the Hadoop ecosystem can also be accessed through Hive. Hive is of great use to developers who are not well versed in the MapReduce framework, since the queries they write are transformed into MapReduce jobs on Hadoop.
We can consider Hive a data warehousing package built on top of Hadoop for analyzing huge amounts of data. Hive is mainly aimed at users who are comfortable with SQL. The best thing about Hive is that it abstracts away the complexity of Hadoop: users need not write MapReduce programs, so anyone who is not familiar with Java programming and the Hadoop APIs can still make good use of Hive.
We can summarize Hive as:
a) A data warehouse infrastructure
b) The definer of a query language known as HiveQL (similar to SQL)
c) A provider of tools for easy extraction, transformation and loading (ETL) of data
d) A framework that lets users embed custom mappers and reducers

What makes Hive Hadoop popular?

  • Hive Hadoop provides users with powerful statistics functions.
  • Hive Hadoop is like SQL, so for any SQL developer the learning curve for Hive is almost negligible.
  • Hive Hadoop can be integrated with HBase so that HBase data is queried as a Hive table. Pig takes a different route: a load function named HBaseStorage is used to load the data from HBase.
  • Hive Hadoop has gained popularity as it is supported by Hue.
  • Hive Hadoop has various users such as CNET, Last.fm, Facebook and Digg.



PIG Hadoop

Pig was developed at Yahoo in 2006 to provide an ad-hoc way of creating and executing MapReduce jobs on huge data sets. The main motive behind Pig was to cut down development time through its multi-query approach. Pig is a high-level data flow system that provides a simple language, known as Pig Latin, for manipulating and querying data.
Pig is reportedly used at companies such as Microsoft, Yahoo and Google to process large data sets such as web crawls, click streams and search logs. Pig is also used for ad-hoc analysis and processing of information.

What makes Pig Hadoop popular?

  • Pig Hadoop follows a multi-query approach, which cuts down on the number of times the data is scanned.
  • Pig Hadoop is easy to learn, read and write if you are familiar with SQL.
  • Pig provides users with nested data types such as Maps, Tuples and Bags that are not present in MapReduce, along with major data operations such as Ordering, Filters and Joins.
  • Performance of Pig is on par with the performance of raw MapReduce.
  • Pig has a wide user base: 90% of Yahoo’s MapReduce work is done with Pig, 80% of Twitter’s MapReduce work is also done with Pig, and companies such as Salesforce, LinkedIn, AOL and Nokia employ Pig as well.

The table below gives an overview of the basic differences between Pig and Hive:
[Image: Pig vs Hive comparison table]

Instead of writing Java code to implement MapReduce, one can choose between Pig Latin and HiveQL to construct MapReduce programs. The benefit of coding in Pig or Hive is far fewer lines of code, which reduces overall development and testing time.
One difference between Pig and Hive is that Pig requires some mental adjustment from SQL users. Pig Latin has many of the usual data processing concepts that SQL has, such as filtering, selecting, grouping and ordering, but the syntax is a little different from SQL (particularly the GROUP BY and FLATTEN statements!).
Hive is commonly used at Facebook for analytical purposes, and Facebook promotes the Hive language. Yahoo!, however, is a big advocate of Pig Latin. Yahoo! has one of the biggest Hadoop clusters in the world, and its data engineers use Pig for data processing on those clusters. If your organization has set no standard, you may choose either Pig or Hive.
Data engineers, especially those with a procedural language background, have better control over dataflow (ETL) processes using Pig Latin. Data analysts, especially those with previous SQL experience, can ramp up on Hadoop faster by using Hive. If you really want to become a Hadoop expert, you should learn both Pig and Hive for the ultimate flexibility.

Pig vs. Hive



Depending on your purpose and type of data, you can choose either the Hive component or the Pig component based on the differences below:
1) Hive is used mainly by data analysts whereas Pig is generally used by researchers and programmers.
2) Hive is used for completely structured data whereas Pig is also used for semi-structured data.
3) Hive has a declarative, SQL-like language (HiveQL) whereas Pig has a procedural data flow language (Pig Latin).
4) Hive is mainly used for creating reports whereas Pig is mainly used for programming.
5) Hive operates on the server side of a cluster whereas Pig operates on the client side.
6) Hive is helpful for ETL, but Pig is the stronger ETL tool for big data because of its powerful transformation and processing capabilities.
7) Hive can start an optional Thrift-based server, so that clients anywhere can send queries directly to the Hive server for execution; Pig has no such feature.
8) Hive directly leverages SQL expertise and can therefore be learnt easily; Pig is also SQL-like but differs to a great extent, so it takes some time and effort to master.
9) Hive uses a variation of the SQL DDL: tables are defined beforehand and the schema details are stored in a metastore database. Pig has no dedicated metadata database; schemas and data types are defined in the script itself.
10) Hive has a provision for partitions, so that you can process a subset of the data, for example by date or by alphabetical range. Pig has no notion of partitions, though a similar effect can be achieved with filters.
11) Both Pig and Hive can work with Avro: Pig through its AvroStorage load/store function and Hive through its Avro SerDe.
12) Pig is easier to install than Hive, as it is based entirely on shell interaction.
13) Pig shows users sample data for each step of a script through its ILLUSTRATE operator; Hive has no equivalent feature.
14) Hive has built-in features that speed up access to raw data; with Pig Latin scripts it is less certain that raw data access is as fast as with HiveQL.
15) You can join, order and sort data with both Hive and Pig; Pig additionally provides a COGROUP operator, which groups two or more relations by key and can be used to build outer joins.
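To make the COGROUP idea in point 15 concrete, here is a minimal Python sketch of what the operator does conceptually. It is not Pig code: the `cogroup` function and the `owners` and `visits` sample rows are made up for illustration. For every key it keeps one bag of rows per input, and a key missing from one input simply gets an empty bag, which is why COGROUP can serve as the basis for outer joins.

```python
from collections import defaultdict

def cogroup(left, right, key_left, key_right):
    """Group two row lists by key, keeping one bag (list) per input."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[key_left(row)][0].append(row)
    for row in right:
        groups[key_right(row)][1].append(row)
    return dict(groups)

# Hypothetical sample relations, keyed on the first field.
owners = [("alice", "cat"), ("bob", "dog")]
visits = [("alice", 3), ("carol", 1)]

result = cogroup(owners, visits, lambda r: r[0], lambda r: r[0])
for key in sorted(result):
    print(key, result[key])
```

Keys such as "bob" and "carol" appear in the result with one empty bag each, so no row from either side is lost.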
To conclude, both Hive and Pig will help you achieve the same goals. Pig appeals to those who like to script their data flows, while Hive comes naturally to database developers. When it comes to access choices, Hive is said to have more features than Pig. Both projects reportedly have about the same number of committers, and we are likely to see great advancements in both in the near future.



Friday, 22 July 2016

TECHNIQUES TO ANALYZE DATA

There are many techniques being used to analyze datasets. In this article, we provide a list of some techniques applicable across a range of industries. This list is by no means exhaustive. Indeed, researchers continue to develop new techniques and improve on existing ones, particularly in response to the need to analyze new combinations of data. We note that not all of these techniques strictly require the use of big data—some of them can be applied effectively to smaller datasets (e.g., A/B testing, regression analysis). However, all of the techniques we list here can be applied to big data and, in general, larger and more diverse datasets can be used to generate more numerous and insightful results than smaller, less diverse ones.

1. A/B testing: A technique in which a control group is compared with a variety of test groups in order to determine what treatments (i.e., changes) will improve a given objective variable, e.g., marketing response rate. This technique is also known as split testing or bucket testing. An example application is determining what copy text, layouts, images, or colors will improve conversion rates on an e-commerce Web site. Big data enables huge numbers of tests to be executed and analyzed, ensuring that groups are of sufficient size to detect meaningful (i.e., statistically significant) differences between the control and treatment groups (see statistics). When more than one variable is simultaneously manipulated in the treatment, the multivariate generalization of this technique, which applies statistical modeling, is often called “A/B/N” testing.

2. Association rule learning: A set of techniques for discovering interesting relationships, i.e., “association rules,” among variables in large databases. These techniques consist of a variety of algorithms to generate and test possible rules. One application is market basket analysis, in which a retailer can determine which products are frequently bought together and use this information for marketing (a commonly cited example is the discovery that many supermarket shoppers who buy diapers also tend to buy beer). Used for data mining.
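As a rough illustration of how an association rule is scored, the Python sketch below computes the two standard measures, support and confidence, for the rule "diapers implies beer". The `baskets` data is invented for the example and the function names are our own; real rule-mining algorithms add an efficient search over candidate item sets on top of these measures.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets, one set of products per purchase.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

Here half of all baskets contain both products, and two out of the three baskets with diapers also contain beer.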

Thursday, 21 July 2016

 

BIG DATA ANALYSIS WITH HIVE

What is HIVE?

Hive is a data warehousing infrastructure for Hadoop. Its primary responsibility is to provide data summarization, query and analysis. It supports analysis of large datasets stored in Hadoop’s HDFS as well as on the Amazon S3 filesystem. The best part of Hive is that it supports SQL-like access to structured data through HiveQL (or HQL), as well as big data analysis with the help of MapReduce. Hive is not built for quick responses to queries; it is built for data mining applications, which can take from several minutes to several hours to analyze the data.

HIVE Organization

Data in Hive is organized into three kinds of units.

Tables: These are very similar to RDBMS tables and contain rows and columns. Hive is layered over the Hadoop Distributed File System (HDFS), so tables map directly to directories in the filesystem. Hive also supports tables stored in other native file systems.
Partitions: A Hive table can have more than one partition. Each partition maps to a subdirectory of the table’s directory.
Buckets: Within a table or partition, data may be further divided into buckets. Each bucket is stored as a file in the partition’s directory in the underlying file system.
Hive also has a metastore, which stores all the metadata. It is a relational database holding information about the Hive schema (column types, owners, key-value data, statistics, etc.). A MySQL database can be used for it.
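The mapping from tables, partitions and buckets to directories and files can be pictured with a small Python sketch. The `partition_path` helper, the `sales` table and its partition columns are hypothetical. The `/user/hive/warehouse` root and the `col=value` subdirectory naming follow Hive's usual defaults, while the bucket file name format used here is an assumption that may differ between versions and settings.

```python
def partition_path(warehouse, table, partition_spec, bucket=None):
    """Build the directory (or bucket file) path for a partitioned table."""
    parts = [warehouse.rstrip("/"), table]
    # Each partition column becomes one "col=value" subdirectory level.
    parts += [f"{col}={val}" for col, val in partition_spec]
    path = "/".join(parts)
    if bucket is not None:
        # Assumed bucket file naming: zero-padded bucket number plus "_0".
        path += f"/{bucket:06d}_0"
    return path

# Hypothetical table partitioned by year and month.
spec = [("year", 2016), ("month", 7)]
print(partition_path("/user/hive/warehouse", "sales", spec))
print(partition_path("/user/hive/warehouse", "sales", spec, bucket=3))
```

A query that filters on year and month only needs to read the files under the matching subdirectory, which is where the benefit of partitioning comes from.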



What is HiveQL (HQL)?

The Hive query language provides basic SQL-like operations. Here are a few of the tasks that HQL handles easily:
  • Create and manage tables and partitions
  • Support various relational, arithmetic and logical operators
  • Evaluate functions
  • Write the contents of a table to a local directory, or the result of a query to an HDFS directory
Here are two example HQL queries:

SELECT upper(name), salesprice
FROM sales;
SELECT category, count(1) 
FROM products 
GROUP BY category;

As you can see, the above queries are very similar to ordinary SQL queries.
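To get a feel for what happens when such a query becomes a MapReduce job, here is a simplified Python sketch of the second query (count per category). It only imitates the map, shuffle and reduce phases in a single process; the function names and the `products` rows are invented for illustration, and the plan that Hive actually generates is more elaborate.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit one (category, 1) pair per input row."""
    for row in rows:
        yield row["category"], 1

def shuffle(pairs):
    """Collect all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values per key, which gives count(1) per category."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical rows of the products table.
products = [
    {"name": "pen", "category": "office"},
    {"name": "stapler", "category": "office"},
    {"name": "kettle", "category": "kitchen"},
]

counts = reduce_phase(shuffle(map_phase(products)))
print(counts)
```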


Monday, 11 July 2016

How to Analyze Big Data with Hadoop Technologies

With rapid innovations, frequent evolutions of technologies and a rapidly growing internet population, systems and enterprises are generating huge amounts of data to the tune of terabytes and even petabytes of information. Since data is being generated in very huge volumes with great velocity in all multi-structured formats like images, videos, weblogs, sensor data, etc. from all different sources, there is a huge demand to efficiently store, process and analyze this large amount of data to make it usable.
Hadoop is a popular choice for such a requirement because it is reliable, flexible, economical and scalable. While Hadoop provides the ability to store this large-scale data on HDFS (Hadoop Distributed File System), there are multiple technologies available for analyzing it, such as MapReduce, Pig and Hive. As these data analysis technologies have advanced, many different schools of thought have emerged about which one should be used when, and which is most efficient.
A well-executed big data analysis provides the possibility to uncover hidden markets, discover unfulfilled customer demands and cost reduction opportunities and drive game-changing, significant improvements in everything from telecommunication efficiencies and surgical or medical treatments, to social media campaigns and related digital marketing promotions.


What is Big Data Analysis?
Big data is mostly generated from social media websites, sensors, devices, video/audio, networks, log files and web, and much of it is generated in real time and on a very large scale. Big data analytics is the process of examining this large amount of different data types, or big data, in an effort to uncover hidden patterns, unknown correlations and other useful information.

Advantages of Big Data Analysis
Big data analysis allows market analysts, researchers and business users to develop deep insights from the available data, resulting in numerous business advantages. Business users are able to make a precise analysis of the data and the key early indicators from this analysis can mean fortunes for the business. Some of the exemplary use cases are as follows:
  • Whenever users browse travel portals, shopping sites, search flights, hotels or add a particular item into their cart, then Ad Targeting companies can analyze this wide variety of data and activities and can provide better recommendations to the user regarding offers, discounts and deals based on the user browsing history and product history.
  • In the telecommunications space, if customers are moving from one service provider to another, then analyzing huge volumes of call data records can unearth the issues those customers faced. Issues could be as wide-ranging as a significant increase in call drops or network congestion problems. Based on this analysis, a telecom company can identify whether it needs to place a new tower in a particular urban area, or whether it needs to revise the marketing strategy for a particular region because a new player has come up there. That way customer churn can be proactively minimized.
Case Study – Stock market data
Now let’s look at a case study for analyzing stock market data. We will evaluate various big data technologies to analyze this stock market data from a sample ‘New York Stock Exchange’ dataset and calculate the covariance for this stock data and aim to solve both storage and processing problems related to a huge volume of data.

Covariance is a statistical measure, widely used in finance, of the degree to which two stocks or financial instruments move together or apart. With covariance, investors can seek out different investment options based on their respective risk profiles.

A positive covariance means that asset returns moved together. If investment instruments or stocks tend to be up or down during the same time periods, they have positive covariance.

A negative covariance means returns move inversely. If one investment instrument tends to be up while the other is down, they have negative covariance.
This will help a stock broker in recommending the stocks to his customers.
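The sign behaviour described above can be checked with a few lines of Python. This sketch computes the population covariance as the mean of the products minus the product of the means; the function name and the price series are made-up examples, not values from the NYSE dataset.

```python
def covariance(xs, ys):
    """Population covariance: mean(x*y) - mean(x) * mean(y)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty series of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return mean_xy - mean_x * mean_y

# Hypothetical daily prices: the first pair rises together,
# the second pair moves in opposite directions.
together = covariance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
apart = covariance([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])
print(together, apart)
```

The first value is positive and the second negative, matching the two cases described above.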

Dataset: The sample dataset provided is a comma separated file (CSV) named NYSE_daily_prices_Q.csv that contains the stock information such as daily quotes, Stock opening price, Stock highest price, etc. on the New York Stock Exchange.

The dataset provided is just a sample small dataset having around 3500 records, but in the real production environment there could be huge stock data running into GBs or TBs. So our solution must be supported in a real production environment.

Hadoop Data Analysis Technologies
Let’s have a look at the existing open source Hadoop data analysis technologies to analyze the huge stock data being generated very frequently.





| Feature | MapReduce | Pig | Hive |
| --- | --- | --- | --- |
| Language | Algorithm of map and reduce functions (can be implemented in C, Python, Java) | Pig Latin (scripting language) | SQL-like |
| Schemas/Types | No | Yes (implicit) | Yes (explicit) |
| Partitions | No | No | Yes |
| Server | No | No | Optional (Thrift) |
| Lines of code | More lines of code | Fewer (around 10 lines of Pig = 200 lines of Java) | Fewer than MapReduce and Pig due to SQL-like nature |
| Development time | More development effort | Rapid development | Rapid development |
| Abstraction | Lower level of abstraction (rigid procedural structure) | Higher level of abstraction (scripts) | Higher level of abstraction (SQL-like) |
| Joins | Hard to achieve join functionality | Joins can be easily written | Easy for joins |
| Structured vs semi-structured vs unstructured data | Can handle all these kinds of data | Works on all these kinds of data | Deals mostly with structured and semi-structured data |
| Complex business logic | More control for writing complex business logic | Less control for writing complex business logic | Less control for writing complex business logic |
| Performance | Fully tuned MapReduce program would be faster than Pig/Hive | Slower than fully tuned MapReduce program, but faster than badly written MapReduce code | Slower than fully tuned MapReduce program, but faster than badly written MapReduce code |

Which Data Analysis Technology should be used?
The available sample dataset has the following properties:
  • The data has a structured format
  • Joins are required to calculate the stock covariance
  • The data can be organized into a schema
  • In a real environment, the data size would be very large
Based on these criteria and the feature comparison above, we can conclude:
  • If we use MapReduce, complex business logic needs to be written to handle the joins. We would have to think in terms of map and reduce, deciding which piece of code goes into the map side and which goes into the reduce side. A lot of development effort would go into deciding how the map-side and reduce-side joins take place. We would not be able to map the data into a schema, so everything would have to be handled programmatically.
  • If we use Pig, we would not be able to partition the data, which would be useful for processing a subset of the data by a particular stock symbol, date or month. In addition, Pig is more of a scripting language, better suited to prototyping and rapidly developing MapReduce-based jobs. It also does not let us map our data into an explicit schema, which seems more suitable for this case study.
  • Hive not only provides a familiar programming model for people who know SQL, it also eliminates much of the boilerplate and sometimes tricky coding that MapReduce programming requires. If we apply Hive to the stock data, we can use the SQL capabilities of HiveQL and manage the data in a defined schema. It also reduces development time and handles joins between stock data in HiveQL, which is of course quite difficult in MapReduce. Hive also has its Thrift server, through which we can submit Hive queries from anywhere to the Hive server, which in turn executes them. The Hive compiler converts Hive SQL queries into MapReduce jobs, freeing programmers from complex programming so they can focus on the business problem.
So based on the above discussion, Hive seems the best choice for this case study.
Problem Solution with Hive
Apache Hive is a data warehousing package built on top of Hadoop for providing data summarization, query and analysis. The query language being used by Hive is called Hive-QL and is very similar to SQL.
Since we are now done zeroing in on the data analysis technology part, now it’s time to get your feet wet with deriving solutions for the mentioned case study.
  • Hive Configuration on Cloudera
Follow the steps mentioned in my previous blog, How to Configure Hive On Cloudera.
  • Create Hive Table
Use the ‘create table’ Hive command to create the Hive table for the provided CSV dataset:
hive> create table NYSE (exchange String, stock_symbol String, stock_date String, stock_price_open double, stock_price_high double, stock_price_low double, stock_price_close double, stock_volume double, stock_price_adj_close double) row format delimited fields terminated by ',';
This creates a Hive table named ‘NYSE’ whose rows are delimited text and whose fields are separated by commas. The schema is stored in the embedded Derby database, as configured in the Hive setup. By default, Hive stores metadata in an embedded Apache Derby database, but it can be configured to use other databases such as MySQL, SQL Server or Oracle.
  • Load CSV Data into Hive Table
Use the following Hive command to load the CSV data file into the Hive table:
hive> load data local inpath '/home/cloudera/NYSE_daily_prices_Q.csv' into table NYSE;
This loads the dataset from the given location into the Hive table ‘NYSE’ created above. The data itself is stored in the Hive-controlled file system namespace on HDFS, so that it can be batch processed further by MapReduce jobs or Hive queries.
  • Calculate the Covariance
We can calculate the covariance for the provided stock dataset for a given year using the Hive select query below:
select a.STOCK_SYMBOL, b.STOCK_SYMBOL, month(a.STOCK_DATE),
(AVG(a.STOCK_PRICE_HIGH*b.STOCK_PRICE_HIGH) - (AVG(a.STOCK_PRICE_HIGH)*AVG(b.STOCK_PRICE_HIGH)))
from NYSE a join NYSE b on
a.STOCK_DATE=b.STOCK_DATE where a.STOCK_SYMBOL<b.STOCK_SYMBOL and year(a.STOCK_DATE)=2008
group by a.STOCK_SYMBOL, b.STOCK_SYMBOL, month(a.STOCK_DATE);
This Hive select query will trigger the MapReduce job as below:


The covariance results from the above stock data analysis are as follows:

The covariance has been calculated between each pair of different stocks, for each month of the selected year.
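For readers who want to trace the logic of the query without a Hive installation, the following Python sketch mirrors its steps on a tiny in-memory dataset: pair up quotes that share a date, keep each pair of symbols once, group by symbol pair and month, and compute the mean of the products minus the product of the means. The function name and the sample rows are invented for illustration and are not taken from NYSE_daily_prices_Q.csv.

```python
from collections import defaultdict
from datetime import date

def monthly_covariance(rows, year):
    """rows: iterable of (symbol, trading_date, high_price) tuples."""
    # Equivalent of the self-join on STOCK_DATE plus the year filter.
    by_date = defaultdict(list)
    for symbol, day, high in rows:
        if day.year == year:
            by_date[day].append((symbol, high))
    # Running totals per (symbol_a, symbol_b, month): n, sum_a, sum_b, sum_ab.
    sums = defaultdict(lambda: [0, 0.0, 0.0, 0.0])
    for day, quotes in by_date.items():
        for sym_a, high_a in quotes:
            for sym_b, high_b in quotes:
                if sym_a < sym_b:  # keep each pair of symbols once
                    acc = sums[(sym_a, sym_b, day.month)]
                    acc[0] += 1
                    acc[1] += high_a
                    acc[2] += high_b
                    acc[3] += high_a * high_b
    # AVG(a*b) - AVG(a) * AVG(b) for every group.
    return {key: s_ab / n - (s_a / n) * (s_b / n)
            for key, (n, s_a, s_b, s_ab) in sums.items()}

# Hypothetical quotes for two symbols; the 2007 rows are filtered out.
rows = [
    ("QRR", date(2008, 1, 2), 1.0), ("QTM", date(2008, 1, 2), 2.0),
    ("QRR", date(2008, 1, 3), 2.0), ("QTM", date(2008, 1, 3), 4.0),
    ("QRR", date(2008, 1, 4), 3.0), ("QTM", date(2008, 1, 4), 6.0),
    ("QRR", date(2007, 1, 4), 9.0), ("QTM", date(2007, 1, 4), 9.0),
]

result = monthly_covariance(rows, 2008)
print(result)
```

In Hive the same work is spread across mappers and reducers, which is what lets the query scale beyond a dataset that fits in memory.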
From the covariance results, stock brokers or fund managers can make the following recommendations:
  • Stocks QRR and QTM show positive covariance more often than negative covariance, so there is a high probability that they will move together in the same direction.
  • Stocks QRR and QXM mostly show negative covariance, so there is a greater probability of their prices moving in opposite directions.
  • Stocks QTM and QXM show positive covariance in most months, so they tend to move in the same direction most of the time.
In the same way, we can analyze other big data use cases: explore the possible solutions for each and then use the comparison chart to narrow them down to the best one.
Conclusion/Benefits
This case study addresses the following two important goals of big data technologies:
  • Storage
By storing the huge stock data in HDFS, the solution becomes robust, reliable, economical and scalable. Whenever the data size increases, you can just add more nodes and configure them into the Hadoop cluster. If a node goes down, other nodes take over its work thanks to data replication.
By managing the Hive schema in the embedded database or any other standard SQL database, we are able to use the power of SQL as well.
  • Processing
Since the Hive schema is kept in a standard SQL database, you can run SQL queries on the huge dataset and process GBs or TBs of data with simple SQL queries. Because the actual data resides in HDFS, these Hive SQL queries are converted into MapReduce jobs, which process the huge volume of data in parallel and give a scalable, fault-tolerant solution.