Pig Notes

Pig:

Pig was originally created at Yahoo! to answer a similar need to Hive:

  • Many developers did not have the Java or MapReduce knowledge required to write MapReduce programs, but still needed to query data.

Pig is a dataflow language:

  1. The Pig language is called Pig Latin.
  2. It has a relatively simple syntax.
  3. Under the covers, Pig Latin scripts are turned into MapReduce jobs and executed on the cluster.
  4. Installing Pig requires no modification to the cluster.
  5. The Pig interpreter runs on the client machine.
  6. It turns Pig Latin into standard Java MapReduce jobs, which are then submitted to the JobTracker.
  7. There is (currently) no shared metadata, so no shared metastore is needed.

Pig Concepts:

  1. In Pig, a single element of data is an atom.
  2. A collection of atoms, such as a row or a partial row, is a tuple.
  3. Tuples are collected together into bags (see the notation sketch below).
  4. Typically, a Pig Latin script starts by loading one or more datasets into bags, and then creates new bags by modifying those it already has.
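For example, using the weather records that appear later in these notes, the three concepts look like this (plain data notation, not a runnable script):

atom:  1950
tuple: (1950, 0, 1)
bag:   {(1950, 0, 1), (1950, 22, 1), (1950, -11, 1)}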

Pig features:

Pig supports many features which allow developers to perform sophisticated data analysis without writing Java MapReduce code (a short sketch follows the feature list below):

  • Joining datasets
  • Grouping data
  • Referring to elements by position rather than name
  • Loading non-delimited data using a custom loader (Pig's equivalent of a SerDe)
  • Creating user-defined functions, written in Java, and more
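A minimal sketch of the first three features, assuming hypothetical employees.txt and depts.txt files and made-up field names:

-- joining two datasets on dept_id
emps  = LOAD 'employees.txt' AS (id:int, name:chararray, dept_id:int, salary:int);
depts = LOAD 'depts.txt' AS (dept_id:int, dept_name:chararray);
joined = JOIN emps BY dept_id, depts BY dept_id;

-- grouping the joined data and summing salaries per department
by_dept = GROUP joined BY depts::dept_name;
totals  = FOREACH by_dept GENERATE group, SUM(joined.emps::salary);

-- referring to an element by position rather than name ($1 is the second field)
raw   = LOAD 'employees.txt';
names = FOREACH raw GENERATE $1;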

A sample Pig script:

emps = LOAD 'people.txt' AS (id, name, salary);

rich = FILTER emps BY salary > 100000;

srtd = ORDER rich BY salary DESC;

STORE srtd INTO 'rich-people';

  1. Here, we load a file into a bag called emps.
  2. Then we create a new bag called rich, which contains just those records where the salary field is greater than 100000.
  3. Next, the ORDER statement sorts the rich bag by salary in descending order into a new bag called srtd.
  4. Finally, we write the contents of the srtd bag to a new directory in HDFS.
  5. By default, the data will be written in tab-separated format. Alternatively, to write the contents of a bag to the screen, say DUMP srtd; (see the sketch below).
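A minimal sketch of those two alternatives (the 'rich-people-csv' output directory name is just an assumption for illustration):

STORE srtd INTO 'rich-people-csv' USING PigStorage(',');  -- write comma-separated output instead of the default tabs
DUMP srtd;                                                -- or print the bag to the screen instead of storing it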

Let's look at an example: a program to calculate the maximum recorded temperature by year for the weather dataset, written in Pig Latin. The complete program is only a few lines long:

-- max_temp.pig: finds the maximum temperature by year

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>     AS (year:chararray, temperature:int, quality:int);

grunt> filtered_records = FILTER records BY temperature != 9999 AND
>>     (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);

grunt> grouped_records = GROUP filtered_records BY year;

grunt> max_temp = FOREACH grouped_records GENERATE group,
>>     MAX(filtered_records.temperature);

grunt> DUMP max_temp;

The program assumes that the input is tab-delimited text, with each line having just the year, temperature, and quality fields.

A tuple is just like a row of data in a database table, with multiple fields in a particular order.

The LOAD function produces a set of (year, temperature, quality) tuples, one for each line present in the input file.

We write a relation with one tuple per line, where tuples are represented as comma-separated items in parentheses:

(1950, 0, 1)

(1950, 22, 1)

(1950, -11, 1)

(1949, 111, 1)

(1949, 78, 1)

Relations are given names, or aliases, so they can be referred to; this relation is given the records alias. We can examine the contents of an alias using the DUMP operator:

grunt> DUMP records;

(1950, 0, 1)

(1950, 22, 1)

(1950, -11, 1)

(1949, 111, 1)

(1949, 78, 1)

We can also see the structure of a relation (the relation's schema) using the DESCRIBE operator on the relation's alias:

grunt> DESCRIBE records;

records: {year: chararray, temperature: int, quality: int}

The second statement removes records that have a missing temperature (indicated by a value of 9999) or an unsatisfactory quality reading. For this small dataset, no records are filtered out:

grunt> filtered_records = FILTER records BY temperature != 9999 AND
>>     (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);

grunt> DUMP filtered_records;

(1950, 0, 1)

(1950, 22, 1)

(1950, -11, 1)

(1949, 111, 1)

(1949, 78, 1)

The third statement uses the GROUP function to group the filtered_records relation by the year field. Let's use DUMP to see what it produces:

grunt> grouped_records = GROUP filtered_records BY year;

grunt> DUMP grouped_records;

(1949,{(1949,111,1),(1949,78,1)})

(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})

Now we have two rows, or tuples: one for each year in the input data. The first field in each tuple is the field being grouped by (the year), and the second field is a bag of tuples for that year.

A bag is just an unordered collection of tuples, which in Pig Latin is represented using curly braces.

By grouping the data in this way, we have created a row per year, so now all that remains is to find the maximum temperature for the tuples in each bag. Before we do this, let's understand the structure of the grouped_records relation:

grunt> DESCRIBE grouped_records;

grouped_records: {group: chararray, filtered_records: {year: chararray, temperature: int, quality: int}}

grunt> max_temp = FOREACH grouped_records GENERATE group,
>>     MAX(filtered_records.temperature);

FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to define the fields in each derived row. In this example, the first field is group, which is just the year; the second field is a little more complex.

The filtered_records.temperature reference is to the temperature field of the filtered_records bag in the grouped_records relation. MAX is a built-in function for calculating the maximum value of a field in a bag.

grunt> DUMP max_temp;

(1949,111)

(1950,22)

grunt> ILLUSTRATE max_temp;
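Put together, the same statements can be saved in a script file (max_temp.pig, as the comment at the start of this example suggests) and run in batch mode instead of interactively:

-- max_temp.pig: finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;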

Pig Latin relational operators:

Category: Loading and Storing
  LOAD (HDFS to Pig)   – Loads data from the file system or other storage into a relation
  STORE (Pig to HDFS)  – Saves a relation to the file system or other storage
  DUMP                 – Prints a relation to the console

Category: Filtering
  FILTER               – Removes unwanted rows from a relation
  DISTINCT             – Removes duplicate rows from a relation
  FOREACH...GENERATE   – Adds or removes fields from a relation
  MAPREDUCE            – Runs a MapReduce job using a relation as input
  STREAM               – Transforms a relation using an external program
  SAMPLE               – Selects a random sample of a relation

Category: Grouping and Joining
  JOIN                 – Joins two or more relations
  COGROUP              – Groups the data in two or more relations
  GROUP                – Groups the data in a single relation
  CROSS                – Creates the cross-product of two or more relations

Category: Sorting
  ORDER                – Sorts a relation by one or more fields
  LIMIT                – Limits the size of a relation to a maximum number of tuples

Category: Combining and Splitting
  UNION                – Combines two or more relations into one
  SPLIT                – Splits a relation into two or more relations
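As a minimal sketch (reusing the filtered_records alias from the weather example above), SPLIT and UNION work like this:

SPLIT filtered_records INTO recs_1949 IF year == '1949', recs_1950 IF year == '1950';
combined = UNION recs_1949, recs_1950;   -- puts the two relations back together into one
DUMP combined;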

Pig Latin diagnostic operators:

  DESCRIBE    – Prints a relation's schema
  EXPLAIN     – Prints the logical and physical plans
  ILLUSTRATE  – Shows a sample execution of the logical plan, using a generated subset of the input
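For example, applied to the max_temp relation from the weather example:

grunt> EXPLAIN max_temp;      -- prints the plans Pig would use to compute max_temp
grunt> ILLUSTRATE max_temp;   -- runs those plans on a small, generated sample of the input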

Installing Pig/Hive:

Step 1: Copy the software from Windows to Linux (e.g. /home/sajith).

Step 2: Give permissions to the downloaded file with chmod (for both Pig and Hive).

Step 3: Extract it using tar -xzvf filename.

Pig (Linux side):

  $ gedit .bashrc        (add the Pig environment settings)
  $ source .bashrc
  Connection between Pig and HDFS: in pig-0.9.9/conf, edit pig.properties and set the JobTracker (JT) and NameNode (NN) addresses.
  From the Linux terminal: $ pig, which gives the grunt> prompt.

Hive (HDFS side):

  Create two directories in HDFS: (a) tmp and (b) warehouse (the Derby metastore area).
  chmod g+w tmp ; chmod g+w warehouse
  $ gedit .bashrc        (add the Hive environment settings)
  $ source .bashrc
  Connection between Hive and Hadoop: in hadoop-1.0.3/conf, edit hive-site.xml and set the JobTracker (JT) and NameNode (NN) addresses.
  From hadoop-1.0.3/bin: $ hive, which gives the hive> prompt.

Objects created here (Pig data) are kept in Pig's temporary area and are deleted when the Pig session is closed; Pig stores its data in HDFS.

Pig vs. Hive:

  - Pig was developed by Yahoo!; Hive was developed by Facebook.
  - Both avoid bulky hand-written MapReduce coding.
  - Pig uses the Pig Latin language; Hive uses the Hive Query Language (HiveQL).
  - Pig has only temporary storage; Hive has temporary storage (tmp) plus permanent storage in the metastore (e.g. Derby).
  - Pig stores data in object (relation) form; Hive stores data in table form.

 

Used in weblog processing:

  → structured data
  → unstructured data
  → semi-structured data

Drawback:

You have to remember the objects (aliases) you have created.

Pig/Hive are important in India:

Hive – banking, healthcare, retail, etc.

 

 

 

Commands/operators for Pig:

FOREACH – applies an expression to each record (something needs to be applied to every row)

FILTER – like WHERE; removes unwanted rows

GROUP / COGROUP – grouping, like GROUP BY

JOIN – joins two or more tables (relations)

ORDER – sorting, like ORDER BY in an RDBMS

DISTINCT – removes duplicate values

UNION – combines two or more files (relations) into one

SPLIT – splits one relation into two or more relations

STREAM – pipes records through an external (binary) command

DUMP – retrieves (prints) the data

LIMIT – shows only a limited number of records
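A minimal combined sketch of a few of these, reusing the filtered_records alias from the weather example:

uniq    = DISTINCT filtered_records;         -- remove duplicate rows
sorted  = ORDER uniq BY temperature DESC;    -- sort by temperature, highest first
top_ten = LIMIT sorted 10;                   -- keep at most 10 tuples
DUMP top_ten;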

 

FLATTEN operator:

Refer to the 'Hadoop in Action' book for more detail.
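A minimal sketch of FLATTEN, again reusing the grouped weather data: FLATTEN un-nests a bag (or tuple) field so its items become top-level fields, one output row per item.

grouped   = GROUP filtered_records BY year;
flattened = FOREACH grouped GENERATE group, FLATTEN(filtered_records.temperature);
DUMP flattened;   -- produces one (year, temperature) row per original record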

EXAMPLE FOR PIG

Step 1: Create a file in Linux.

Step 2: Send it to HDFS (here the file name is hwiki).

Step 3: grunt> wiki_ip = LOAD 'hwiki' USING PigStorage('.') AS (region:chararray, url:chararray, clicks:int, size:int);

Step 4: grunt> DUMP wiki_ip;

Step 5: grunt> wiki_fil_ip = FILTER wiki_ip BY region == 'england';

Step 6: grunt> wiki_grp_url = GROUP wiki_fil_ip BY url;

        grunt> wiki_sum_url = FOREACH wiki_grp_url GENERATE group, SUM(wiki_fil_ip.clicks);
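As a final step (not shown above), the per-URL click totals could be printed or saved to HDFS (the 'wiki-clicks' output directory name is just an assumption for illustration):

grunt> DUMP wiki_sum_url;                      -- print the per-URL click totals
grunt> STORE wiki_sum_url INTO 'wiki-clicks';  -- or save them to a new HDFS directory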