

Saturday, 23 July 2016

Difference between Pig and Hive-The Two Key Components of Hadoop Ecosystem

Pig and Hive are the two key components of the Hadoop ecosystem. What problems do Pig and Hive solve? Both have a similar goal: they are tools that ease the complexity of writing Java MapReduce programs. However, when to use Pig Latin and when to use HiveQL is a question most developers face. This post briefly introduces the Apache Hive and Apache Pig components of the Hadoop ecosystem. If we look at a diagrammatic representation of the Hadoop ecosystem, Hive and Pig cover the same verticals, which certainly raises the question: which one is better? It's Pig vs Hive (Yahoo vs Facebook).

There is no simple way to compare Pig and Hive without digging deep into how each helps in processing large amounts of data. This post compares some of the prominent features of Pig and Hive to help users understand their similarities and differences.

Hadoop is the technology buzzword these days, but many IT professionals are still unaware of the key components that make up the Hadoop ecosystem. Some people even believe that Big Data and Hadoop are one and the same.
Before we jump into a detailed discussion of the key components of the Hadoop ecosystem and try to understand the differences between them, let us first understand what Hadoop and Big Data are.

What is Big Data and Hadoop?

Generally, data to be stored in a database is categorized into three types: structured data, semi-structured data, and unstructured data.
Structured data is data that can be stored in databases as-is; for instance, the transaction records of any online purchase you make can be stored in a database. Data that can only be partially stored in a database is referred to as semi-structured data; for instance, data present in XML records can be stored in a database only partially.
Any other form of data that cannot be categorized as structured or semi-structured is referred to as unstructured data; for instance, data from social networking websites and web logs, which cannot readily be analyzed or stored in databases for processing, are examples of unstructured data.
Unstructured data is what we generally (if loosely) refer to as "Big Data", and the framework popularly used for processing it is Hadoop.

The Hadoop ecosystem comprises the following key components:

1) MapReduce framework

2) HDFS (Hadoop Distributed File System)

3) Hive

4) HBase

5) Pig

6) Flume

7) Sqoop

8) Oozie

9) ZooKeeper



Is the battle Hive vs Pig real? Do the two have the same advantages and disadvantages when processing enormous amounts of data? The answer is no: there is no Hive vs Pig in the real world, only initial ambiguity in deciding which tool suits the need. The Hive Query Language (HiveQL) suits the specific demands of analytics, while Pig supports large-scale data pipeline operations. Pig was developed as an abstraction to avoid the complicated syntax of Java programming for MapReduce. HiveQL, on the other hand, is based on SQL, which makes it easier to learn for those who know SQL. Pig supports Avro, which makes serialization faster. When it really boils down to deciding between Pig and Hive, the suitability of each component for the given business logic must be considered, and the decision taken accordingly.
Just as there is a Hive vs Pig debate, there is a continuing discussion about HBase vs Hive. This uncertainty can easily be resolved by looking at a representation of the Hadoop ecosystem: HBase covers more verticals than Hive, so there is no real HBase vs Hive either. Looking deeper, Hive converts its queries into MapReduce jobs that operate on the data, while HBase works on HDFS directly, although both HBase and Hive deal with structured data.

HIVE Hadoop

Hive was founded by Jeff Hammerbacher while he was working at Facebook. There he realized that Facebook receives huge amounts of data on a daily basis, and that there needed to be a mechanism to store, mine, and help analyze that data. This idea of mining and analyzing huge amounts of data gave birth to Hive. It is Hive that has enabled Facebook to deal with tens of terabytes of data daily with ease.

Hive's select, where, group by, and order by clauses are similar to those of SQL for relational databases. Users give up some ability to hand-optimize queries and instead rely on the Hive optimizer.
Hive is essentially a SQL interface to Hadoop. Data stored in the HBase component of the Hadoop ecosystem can be accessed through Hive. Hive is of great use to developers who are not well-versed in the MapReduce framework: they write data queries that are transformed into MapReduce jobs in Hadoop.
We can consider Hive a data warehousing package constructed on top of Hadoop for analyzing huge amounts of data. Hive is mainly developed for users who are comfortable with SQL. The best thing about Hive is that it conceals the complexity of Hadoop: users need not write MapReduce programs, so anyone who is not familiar with Java programming and the Hadoop APIs can still make the best use of Hadoop.
We can summarize Hive as:
a) A data warehouse infrastructure
b) The definer of a query language popularly known as HiveQL (similar to SQL)
c) A provider of various tools for easy extraction, transformation, and loading of data
d) A framework that allows its users to embed customized mappers and reducers (sketched below)
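
As a quick illustration of point (d), HiveQL's TRANSFORM clause streams rows through an external script that acts as a custom mapper or reducer. This is only a minimal sketch; the table, columns, and script name are hypothetical:

-- Hypothetical sketch: stream rows of a sales table through a custom script.
ADD FILE parse_sales.py;             -- ship the (hypothetical) script to the cluster

SELECT TRANSFORM (name, salesprice)  -- columns fed to the script on stdin
USING 'python parse_sales.py'        -- the script acts as a custom mapper
AS (name_upper, price_band)          -- columns read back from its stdout
FROM sales;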

What makes Hive Hadoop popular?

  • Hive Hadoop provides users with strong and powerful statistical functions.
  • Hive Hadoop is like SQL, so for any SQL developer the learning curve for Hive is almost negligible.
  • Hive Hadoop can be integrated with HBase for querying data in HBase directly, whereas this is not possible with Pig; in Pig, a loader function named HBaseStorage() is used instead to load the data from HBase (see the sketch after this list).
  • Hive Hadoop has gained popularity because it is supported by Hue.
  • Hive Hadoop has various user groups such as CNET, Last.fm, Facebook, Digg, and so on.
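
For reference, loading from HBase in Pig typically looks like the following Pig Latin sketch; the table name, column family, and columns are hypothetical:

-- Hypothetical sketch: load two columns from an HBase table into Pig.
users = LOAD 'hbase://users'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'info:name info:age', '-loadKey true')
        AS (id:bytearray, name:chararray, age:int);
DUMP users;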



PIG Hadoop

Pig Hadoop was developed by Yahoo in 2006 to provide an ad-hoc way of creating and executing MapReduce jobs on huge data sets. The main motive behind developing Pig was to cut down the time required for development via its multi-query approach. Pig is a high-level data flow system that gives you a simple language platform, popularly known as Pig Latin, for manipulating data and queries.
Pig is used by companies such as Microsoft, Yahoo, and Google to collect and store large data sets in the form of web crawls, click streams, and search logs. Pig also finds use in ad-hoc analysis and processing of information.
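
To give a feel for Pig Latin's data flow style, here is a minimal sketch of a click-stream aggregation; the file name and schema are hypothetical assumptions:

-- Hypothetical sketch: count page visits per user from a click-stream log.
logs    = LOAD 'clickstream.tsv' USING PigStorage('\t')
          AS (user:chararray, url:chararray, ts:long);
valid   = FILTER logs BY url IS NOT NULL;    -- drop malformed records
grouped = GROUP valid BY user;               -- one bag of records per user
counts  = FOREACH grouped GENERATE group AS user, COUNT(valid) AS visits;
top     = ORDER counts BY visits DESC;
STORE top INTO 'visits_per_user';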

What makes Pig Hadoop popular?

  • Pig Hadoop follows a multi-query approach, thus cutting down on the number of times the data is scanned.
  • Pig Hadoop is very easy to learn, read, and write if you are familiar with SQL.
  • Pig provides users with a wide range of nested data types such as maps, tuples, and bags that are not present in MapReduce, along with major data operations such as ordering, filtering, and joins.
  • Performance of Pig is on par with the performance of raw MapReduce.
Pig has various user groups; for instance, 90% of Yahoo's MapReduce and 80% of Twitter's MapReduce is done through Pig, and various other companies such as Salesforce, LinkedIn, AOL, and Nokia also employ Pig.


Instead of writing Java code to implement MapReduce, one can opt for Pig Latin or HiveQL to construct MapReduce programs. The benefit of coding in Pig or Hive is many fewer lines of code, which reduces overall development and testing time.
One difference between Pig and Hive is that Pig needs some mental adjustment for SQL users. Pig Latin has many of the usual data processing concepts that SQL has, such as filtering, selecting, grouping, and ordering, but the syntax is a little different from SQL (particularly the group by and flatten statements; see the sketch below).
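
As an illustration of how grouping differs, the same aggregation in HiveQL and in Pig Latin might look like this (table and column names are hypothetical):

-- HiveQL: count products per category, SQL-style.
SELECT category, count(1) FROM products GROUP BY category;

-- Pig Latin: GROUP yields a bag of records per key; the count is then
-- computed in a separate FOREACH step.
products = LOAD 'products.tsv' USING PigStorage('\t')
           AS (name:chararray, category:chararray);
by_cat   = GROUP products BY category;
counts   = FOREACH by_cat GENERATE group AS category, COUNT(products) AS n;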
Hive is commonly used at Facebook for analytical purposes, and Facebook promotes the Hive language. Yahoo!, however, is a big advocate of Pig Latin. Yahoo! has one of the biggest Hadoop clusters in the world, and its data engineers use Pig for data processing on those clusters. Alternatively, if no standards are set at your organization, you may choose either Pig or Hive.
Data engineers have better control over dataflow (ETL) processes using Pig Latin, especially those with a procedural language background, while data analysts find they can ramp up on Hadoop faster by using Hive, especially with previous SQL experience. If you really want to become a Hadoop expert, you should learn both Pig and Hive for ultimate flexibility.

Pig vs. Hive



Depending on your purpose and the type of data, you can choose between the Hive Hadoop component and the Pig Hadoop component based on the differences below:
1) Hive is used mainly by data analysts, whereas Pig is generally used by researchers and programmers.
2) Hive is used for completely structured data, whereas Pig is used for semi-structured data.
3) Hive has a declarative SQL-like language (HiveQL), whereas Pig has a procedural data flow language (Pig Latin).
4) Hive is mainly used for creating reports, whereas Pig is mainly used for programming.
5) Hive operates on the server side of a cluster, whereas Pig operates on the client side.
6) Hive is helpful for ETL, but Pig is considered the stronger ETL tool for big data because of its powerful transformation and processing capabilities.
7) Hive can start an optional Thrift-based server that can receive queries from anywhere and execute them; this feature is not available with Pig.
8) Hive directly leverages SQL expertise and thus can be learnt easily, whereas Pig, though also SQL-like, varies enough that it takes some time and effort to master.
9) Hive uses an exact variation of SQL DDL, defining tables beforehand and storing the schema details in a local database, whereas Pig has no dedicated metadata database; schemas and data types are defined in the script itself.
10) Hive has a provision for partitions so that you can process a subset of data, for example by date, whereas Pig has no notion of partitions, though something similar can be achieved through filters (see the sketch after this list).
11) Pig supports Avro, whereas Hive does not.
12) Pig can be installed more easily than Hive, as it is completely based on shell interaction.
13) Pig gives users sample data for each scenario and each step through its ILLUSTRATE function, a feature not available in Hive.
14) Hive has smart built-in features for accessing raw data, but with Pig Latin scripts we cannot be sure that accessing raw data is as fast as with HiveQL.
15) You can join, order, and sort data dynamically in an aggregated manner with both Hive and Pig; however, Pig additionally provides the COGROUP feature for performing outer joins (also sketched below).
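
As a sketch of point 10, here is how Hive partitioning lets a query touch only one date's data (table and column names are hypothetical):

-- Hypothetical sketch: a partitioned Hive table.
CREATE TABLE sales (name STRING, salesprice DOUBLE)
PARTITIONED BY (dt STRING);

SELECT name, salesprice
FROM sales
WHERE dt = '2016-07-23';   -- only the matching partition is scanned

And for point 15, Pig's COGROUP groups two relations by a common key (relation names and files are hypothetical):

-- Hypothetical sketch: cogroup orders and returns by customer id.
orders  = LOAD 'orders.tsv'  AS (cust:chararray, amount:double);
returns = LOAD 'returns.tsv' AS (cust:chararray, amount:double);
both    = COGROUP orders BY cust, returns BY cust;
-- Each tuple of 'both' holds the key plus a bag from each relation;
-- empty bags are kept, which is what enables outer-join style results.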
To conclude, after having understood the difference between Pig and Hive: both components will help you achieve the same goals. We might say that Pig suits those who think in scripts, while Hive feels innate to natural-born database developers. When it comes to access choices, Hive is said to have more features than Pig. Both the Hive and Pig projects reportedly have about the same number of committers, and in the near future we are likely to see great advancements in both on the development front.



Friday, 22 July 2016

TECHNIQUES TO ANALYZE DATA

There are many techniques being used to analyze datasets. In this article, we provide a list of some techniques applicable across a range of industries. This list is by no means exhaustive. Indeed, researchers continue to develop new techniques and improve on existing ones, particularly in response to the need to analyze new combinations of data. We note that not all of these techniques strictly require the use of big data—some of them can be applied effectively to smaller datasets (e.g., A/B testing, regression analysis). However, all of the techniques we list here can be applied to big data and, in general, larger and more diverse datasets can be used to generate more numerous and insightful results than smaller, less diverse ones.

1. A/B testing: A technique in which a control group is compared with a variety of test groups in order to determine what treatments (i.e., changes) will improve a given objective variable, e.g., marketing response rate. This technique is also known as split testing or bucket testing. An example application is determining what copy text, layouts, images, or colors will improve conversion rates on an e-commerce Web site. Big data enables huge numbers of tests to be executed and analyzed, ensuring that groups are of sufficient size to detect meaningful (i.e., statistically significant) differences between the control and treatment groups (see statistics). When more than one variable is simultaneously manipulated in the treatment, the multivariate generalization of this technique, which applies statistical modeling, is often called "A/B/N" testing.

2. Association rule learning: A set of techniques for discovering interesting relationships, i.e., "association rules," among variables in large databases. These techniques consist of a variety of algorithms to generate and test possible rules. One application is market basket analysis, in which a retailer can determine which products are frequently bought together and use this information for marketing (a commonly cited example is the discovery that many supermarket shoppers who buy diapers also tend to buy beer). These techniques are widely used in data mining.

Thursday, 21 July 2016

 

BIG DATA ANALYSIS WITH HIVE

What is HIVE?

Hive is a data warehousing infrastructure for Hadoop. Its primary responsibility is to provide data summarization, querying, and analysis. It supports analysis of large datasets stored in Hadoop's HDFS as well as in the Amazon S3 filesystem. The best part of Hive is that it supports SQL-like access to structured data, via a language known as HiveQL (or HQL), as well as big data analysis with the help of MapReduce. Hive is not built to give quick responses to queries; rather, it is built for data mining applications, which can take from several minutes to several hours to analyze the data, and that is where Hive is primarily used.

HIVE Organization

Data is organized in three different formats in Hive.

Tables: These are very similar to RDBMS tables and contain rows and columns. Hive is layered over the Hadoop Distributed File System (HDFS), so tables are directly mapped to directories of the filesystem. Hive also supports tables stored in other native file systems.
Partitions: A Hive table can have more than one partition. Partitions are mapped to subdirectories of the table's directory in the underlying file system.
Buckets: In Hive, data may be further divided into buckets. Buckets are stored as files within the partition directories in the underlying file system.
Hive also has a metastore, which stores all the metadata. It is a relational database containing various information related to Hive schemas (column types, owners, key-value data, statistics, etc.). MySQL can be used as the metastore database here.
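
A minimal HiveQL sketch of how these three levels fit together (the table, columns, and bucket count are hypothetical choices):

-- Hypothetical sketch: a table partitioned by date and bucketed by product id.
CREATE TABLE product_sales (
    product_id INT,
    salesprice DOUBLE
)
PARTITIONED BY (dt STRING)                   -- each date becomes a subdirectory
CLUSTERED BY (product_id) INTO 16 BUCKETS;   -- each bucket becomes a file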



What is HiveQL (HQL)?

The Hive query language provides the basic SQL-like operations. Here are a few of the tasks HQL can do easily:
  • Create and manage tables and partitions
  • Support various relational, arithmetic, and logical operators
  • Evaluate functions
  • Download the contents of a table to a local directory, or write the results of queries to an HDFS directory (sketched at the end of this post)
Here are examples of HQL queries:

SELECT upper(name), salesprice
FROM sales;
SELECT category, count(1) 
FROM products 
GROUP BY category;

Looking at the queries above, you can see they are very similar to SQL queries.
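
And as a sketch of the "download the contents of a table to a local directory" task from the list above (the paths and filter are hypothetical):

-- Hypothetical sketch: write query results to a local directory.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/high_value_sales'
SELECT name, salesprice
FROM sales
WHERE salesprice > 100;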