MAP REDUCE:

MapReduce is a programming model for processing data. It is a method for distributing a task across multiple nodes.

Each node processes data stored locally on that node.

Hadoop can run MapReduce programs written in various languages.

MapReduce consists of two phases:

  1. Map
  2. Reduce

Features of MapReduce:

  1. Automatic parallelization and distribution
  2. Fault tolerance, status and monitoring tools
  3. A clean abstraction for programmers
  4. MapReduce programs are usually written in Java, but they can be written in other languages using the Hadoop Streaming API
  5. Developers can concentrate on writing the map and reduce functions.

 

Job Tracker:

 

MapReduce jobs are controlled by a software daemon known as the JobTracker.

The JobTracker resides on a "master node".

  • Clients submit MapReduce jobs to the JobTracker
  • The JobTracker assigns map and reduce tasks to other nodes on the cluster
  • These nodes each run a software daemon known as the TaskTracker
  • The TaskTracker is responsible for actually instantiating the map or reduce tasks and reporting progress back to the JobTracker
  • If a task attempt fails, another will be started by the JobTracker

The mapper reads data in the form of (key, value) pairs.

It outputs zero or more (key, value) pairs.

map(input_key, input_value) → (intermediate_key, intermediate_value)

Example Mapper: Uppercase Mapper

Converts the key and value to uppercase letters.

let map(k, v) = emit(k.toUpper(), v.toUpper())

('foo', 'bar') → ('FOO', 'BAR')

('foo', 'other') → ('FOO', 'OTHER')
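As a rough sketch only, this uppercase mapper could be written against Hadoop's old org.apache.hadoop.mapred API roughly as follows. The class name UpperCaseMapper is invented here, and the sketch assumes an input format such as KeyValueTextInputFormat where both the key and the value arrive as Text:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative sketch: uppercase both the key and the value, then emit them.
public class UpperCaseMapper extends MapReduceBase implements Mapper<Text, Text, Text, Text>
{
  public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException
  {
    // emit(k.toUpper(), v.toUpper())
    output.collect(new Text(key.toString().toUpperCase()),
                   new Text(value.toString().toUpperCase()));
  }
}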

Example Mapper: Explode Mapper

Output: each input character emitted separately.

let map(k, v) = foreach character c in v: emit(k, c)

('foo', 'bar') → ('foo', 'b'), ('foo', 'a'), ('foo', 'r')

('bar', 'other') → ('bar', 'o'), ('bar', 't'), ('bar', 'h'), ('bar', 'e'), ('bar', 'r')

Example Mapper: Changing Keyspaces

The keys output by the mapper do not need to be identical to the input keys.

Output: the word length as the key.

let map(k, v) = emit(v.length(), v)

('foo', 'bar') → (3, 'bar')

('baz', 'other') → (5, 'other')
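A hedged sketch of this keyspace-changing mapper in Java (again using the old mapred API; the class name LengthKeyMapper is invented for illustration, and Text keys/values are assumed):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative sketch: the output key is the length of the value, not the input key.
public class LengthKeyMapper extends MapReduceBase implements Mapper<Text, Text, IntWritable, Text>
{
  public void map(Text key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException
  {
    // emit(v.length(), v)
    output.collect(new IntWritable(value.toString().length()), value);
  }
}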

A job is a "full program":

  • A complete execution of mappers and reducers over a data set
  • A task is the execution of a single mapper or reducer over a slice of data
  • A task attempt is a particular instance of an attempt to execute a task
  • If a task attempt fails, another will be started by the JobTracker; starting duplicate attempts for slow-running tasks is known as speculative execution

 

Reducer:

  1. The reduce phase begins after the map phase is over.
  2. All intermediate values for a given intermediate key are combined together into a list; this list is given to the reducer.
  3. A single reducer or multiple reducers can be used; this is specified as part of the job configuration.

Sum Reducer:

let reduce(k, values) =

    sum = 0

    foreach int v in values:

        sum += v

    emit(k, sum)

MapReduce example:

WordCount:

Count the number of occurrences of each word in a large amount of input data.

  • This is the "hello world" of MapReduce programming

map(String input_key, String input_value)

    foreach word w in input_value:

        emit(w, 1)

reduce(String output_key, Iterator<int> intermediate_values)

    count = 0

    foreach v in intermediate_values:

        count += v

    emit(output_key, count)

 

Input to the mapper:

The cat sat on the mat

The aardvark sat on the sofa

Output from the mapper:

(the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1), (the, 1), (aardvark, 1), (sat, 1), (on, 1), (the, 1), (sofa, 1)

Intermediate data sent to the reducer:

aardvark, (1)

cat, (1)

mat, (1)

on, (1,1)

sat, (1,1)

sofa, (1)

the, (1,1,1,1)

 

Final reducer O/P:

(‘aardvark’, 1)

(‘cat’, 1)

(‘mat’, 1)

(‘on’, 2)

(‘sat’, 2)

(‘sofa’, 1)

(‘the’, 4)

 

Box classes in Hadoop:

IntWritable

FloatWritable

LongWritable

DoubleWritable

Text (for Strings)

etc.
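A minimal sketch of how these box classes wrap and unwrap plain Java values (the class name BoxClassDemo is just for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BoxClassDemo
{
  public static void main(String[] args)
  {
    // wrap a plain int in the Hadoop box class...
    IntWritable count = new IntWritable(42);
    // ...and unwrap it again with get()
    int plain = count.get();

    // Text wraps a String
    Text word = new Text("hadoop");
    String s = word.toString();

    System.out.println(plain + " " + s);
  }
}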

Writable:

  • Writable is an interface which makes serialization quick and easy for Hadoop
  • Any value type must implement the Writable interface

 

Writable comparable:

A WritableComparable is an interface which is both Writable and Comparable.

Two WritableComparables can be compared against each other; this determines the sorting order of keys.
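As a hedged sketch, a custom key type would implement the WritableComparable interface roughly like this (the class name WordLengthKey and its field are invented for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Illustrative sketch of a custom key type; the field is hypothetical.
public class WordLengthKey implements WritableComparable<WordLengthKey>
{
  private int length;

  public WordLengthKey() { }                                 // Hadoop needs a no-arg constructor
  public WordLengthKey(int length) { this.length = length; }

  public void write(DataOutput out) throws IOException      // serialization (Writable)
  {
    out.writeInt(length);
  }

  public void readFields(DataInput in) throws IOException   // deserialization (Writable)
  {
    length = in.readInt();
  }

  public int compareTo(WordLengthKey other)                  // defines the sorting order (Comparable)
  {
    return Integer.compare(this.length, other.length);
  }
}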

 

Procedure to write a WordCount job:

Description:

To count the number of occurrences of individual words in a file.

Step 1: Create a file

]$ cat > file

Hi how are you

How is your job

How far you are working for this

What is your experience

I have to attempt this

Step 2:

Put this file into HDFS

]$ hadoop fs -copyFromLocal file file

Step 3:

Write a program to process the file

The driver code:

Introduction:

  1. The driver code runs on the client machine.
  2. It configures the job, then submits it to the cluster.

 

 

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool
{
  public int run(String[] args) throws Exception
  {
    if (args.length != 2)
    {
      System.out.println("Please give I/P and O/P directories");
      return -1;
    }

    // JobConf is used to pass all of our job configuration to Hadoop,
    // on top of the Hadoop configuration files
    JobConf conf = new JobConf(WordCount.class);

    // set the name for the job
    conf.setJobName(this.getClass().getName());

    // set the I/P path
    FileInputFormat.setInputPaths(conf, new Path(args[0]));

    // set the O/P path
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // set the mapper class
    conf.setMapperClass(WordMapper.class);

    // set the reducer class
    conf.setReducerClass(SumReducer.class);

    // set the mapper output key type
    conf.setMapOutputKeyClass(Text.class);

    // set the mapper output value type
    conf.setMapOutputValueClass(IntWritable.class);

    // set the output key type from the reducer
    conf.setOutputKeyClass(Text.class);

    // set the output value type from the reducer
    conf.setOutputValueClass(IntWritable.class);

    // JobClient submits the job to the JobTracker
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception
  {
    // ToolRunner takes the implementation class of the Tool interface
    // along with the command-line arguments (I/P and O/P directories)
    int exitCode = ToolRunner.run(new WordCount(), args);
    System.exit(exitCode);
  }
}

The Driver: Import statement:

import java.io.IOException;

import java.util.Iterator;

import java.util.StringTokenizer;

 

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.FileInputFormat;

import org.apache.hadoop.mapred.FileOutputFormat;

import org.apache.hadoop.mapred.JobClient;

import org.apache.hadoop.mapred.JobConf;

import org.apache.hadoop.conf.Configured;

import org.apache.hadoop.util.Tool;

import org.apache.hadoop.util.ToolRunner;

 

//you will typically import these classes into every mapreduce job you write

//the Driver Class: Tool and ToolRunner

public class WordCount extends Configured implements Tool

{

// your driver class extends the Configured class and implements the Tool interface;
// this allows the user to specify configuration settings on the command line

}

public static void main(String[] args) throws Exception

{

    int exitCode = ToolRunner.run(new WordCount(), args);

    System.exit(exitCode);

}

// the main method simply calls ToolRunner.run(), passing in the driver class and the
// command-line arguments; the job is then configured and submitted in the run() method

The job's invocation:

public int run(String[] args) throws Exception

{

    if (args.length != 2)

    {

        System.out.println("Please give I/P & O/P directories");

        return -1;

    }

The first step is to ensure that we have been given two command-line arguments; if not, a help message is printed and the program exits.

Configuring the job with JobConf:

JobConf conf = new JobConf(WordCount.class);

// to configure the job, create a new JobConf object and specify the class which will be called to run the job

 

Creating a new JobConf Object:

The JobConf class allows you to set configuration options for your MapReduce job:

  • The classes to be used for your mapper and reducer
  • The I/P and O/P directories
  • Many more options (a short sketch of a few follows this list)
  • Any options not explicitly set in your driver code will be read from your Hadoop configuration files
  • Usually located in /etc/hadoop/conf
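For example, a couple of additional JobConf settings might look like this (a sketch only; it reuses the conf object from the driver above, and option availability depends on your Hadoop version):

// sketch: further options on the same JobConf object (old mapred API)
conf.setNumReduceTasks(2);                    // use two reducers instead of the default one
conf.setInputFormat(TextInputFormat.class);   // set the input format explicitly
conf.setOutputFormat(TextOutputFormat.class); // set the output format explicitly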

Naming the Job:

conf.setJobName(this.getClass().getName());

[give a meaningful name to the job]

Specifying I/P & O/P Directories:

FileInputFormat.setInputPaths(conf, new Path(args[0]));

FileOutputFormat.setOutputPath(conf, new Path(args[1]));

// specify the input directories from which data will be read and the output directory to which the final output will be written

Specifying the InputFormat:

The default InputFormat is TextInputFormat; it will be used unless you specify the InputFormat class explicitly. To use a different InputFormat:

conf.setInputFormat(KeyValueTextInputFormat.class);

or

conf.setInputFormat(SequenceFileInputFormat.class);

 

Specifying the classes for the Mapper and Reducer:

conf.setMapperClass(WordMapper.class);

conf.setReducerClass(SumReducer.class);

// tell the JobConf object which classes are to be instantiated as the mapper and the reducer

Specifying the intermediate data types:

conf.setMapOutputKeyClass(Text.class);

conf.setMapOutputValueClass(IntWritable.class);

// specify the types of the intermediate output key and value produced by the mapper

Specifying the final output data types:

conf.setOutputKeyClass(Text.class);

conf.setOutputValueClass(IntWritable.class);

// specify the reducer's output key and value types

Running the job:

JobClient.runJob(conf);

// finally, run the job using the runJob() method of JobClient

There are two ways to run your MapReduce job:

  • JobClient.runJob(conf);

    Blocks (waits for the job to complete before continuing)

  • JobClient.submitJob(conf);

    Does not block (the driver code continues while the job is running); see the sketch after this section

JobClient determines the proper division of the input data into InputSplits.

JobClient then sends the job information to the JobTracker daemon on the cluster.
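A hedged sketch of the non-blocking variant, assuming it is placed inside the run() method of the driver above (JobClient and RunningJob are from the old org.apache.hadoop.mapred API; run() declares throws Exception, so the checked exceptions are allowed to propagate):

// ...inside run(), after the JobConf has been configured:
JobClient client = new JobClient(conf);       // connects to the JobTracker
RunningJob running = client.submitJob(conf);  // returns immediately; the job runs on the cluster
while (!running.isComplete())
{
  // the driver could do other work here while the job is running
  Thread.sleep(5000);                         // poll every five seconds
}
if (running.isSuccessful())
{
  System.out.println("Job completed successfully");
}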

The mapper code:

 

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
  {
    String line = value.toString();

    // split the line on non-word characters and emit (word, 1) for each word
    for (String word : line.split("\\W+"))
    {
      if (word.length() > 0)
      {
        output.collect(new Text(word), new IntWritable(1));
      }
    }
  }
}

The Reducer: complete code:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
  {
    int wordCount = 0;

    // retrieving values from the iterator
    while (values.hasNext())
    {
      IntWritable value = values.next();
      // converting the value from the IntWritable type to int and adding it to the total
      wordCount += value.get();
    }

    output.collect(key, new IntWritable(wordCount));
  }
}

 

Step 4: Compile the program

]$ javac -classpath $HADOOP_HOME/hadoop-core.jar *.java

Step 5:

Create a jar file from all the class files generated in the above step

]$ jar cvf wc.jar *.class

Step 6:

Run the jar file created above (the driver expects an input path and an output directory)

]$ hadoop jar wc.jar WordCount file output

Step 7:

To see the output

]$ hadoop fs -cat output/part-00000

Output:

(Hi, 1)
(How, 2)
(I, 1)
(What, 1)
(are, 2)
(attempt, 1)
(experience, 1)
(far, 1)
(for, 1)
(have, 1)
(how, 1)
(is, 2)
(job, 1)
(this, 2)
(to, 1)
(working, 1)
(you, 2)
(your, 2)
