What is MapReduce in Hadoop?


Assume my job is a WordCount job, and let us walk through the MapReduce flow for it. Let me take the example of one small file named file.txt.

Just assume that the size of file.txt is 200 MB and that the following lines are written inside it:

hi how are you?
how is your job?
how is your family?
how is your sister?
how is your brother?
what is the time now?
what is the strength of hadoop?

Now, if I try to keep this 200 MB file.txt in HDFS, how many blocks will it be given?
Answer: if the default block size is 64 MB, then 4 blocks will be given (ceil(200 / 64) = 4: three full 64 MB blocks plus one 8 MB block). So let me split this file into 4 input splits.

Now let's see how you run your program. My requirement is to count the number of occurrences of each word in this file named file.txt. So how many times is hi available?
How many times is how available?
How many times is IS available?
How many times is your available?
You can see it in the screenshot below: hi,1 how,5 is,6 and your,4.

Generally I run the program like this —–> $ hadoop jar test.jar DriverCode file.txt testoutput
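To make this command concrete, here is a minimal sketch of what the DriverCode class might look like, assuming the org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer are placeholders for the mapper and reducer sketched later in this post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverCode {
    public static void main(String[] args) throws Exception {
        // args[0] = input file (file.txt), args[1] = output directory (testoutput)
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(DriverCode.class);
        job.setMapperClass(WordCountMapper.class);   // sketched later in this post
        job.setReducerClass(WordCountReducer.class); // sketched later in this post
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}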

Here I am applying test.jar to this 200 MB file named file.txt, and that file will be split into 4 input splits, of which the first will be stored on the first node, the second on the second node, the third on the third node, and the fourth on the fourth node.

However many input splits there are, that many mappers there will be. For more clarity, you can see the screenshot below.

Hadoop can run with map and reduce only. Hadoop stores the data in the form of HDFS (Hadoop Distributed File System).

The Hadoop Distributed File System is just there for storing your data. But if you want to process that data, how do you process it? MapReduce. For that, we have MapReduce.

Hadoop runs your MapReduce on a key,value basis only. Your mapper works on key,value pairs, and your reducer also works on key,value pairs. Your mapper and reducer know nothing except key,value pairs.

So what will these input splits look like? Text messages. Now I have to convert these text messages into key,value pairs and pass those key,value pairs to the mappers.

The mappers then work on them, and after that they also give what? Some other key,value pairs, and those key,value pairs are passed to what? The reducer. So how do we convert a text line (hi how are you?) into a key,value pair? How do we split it? For that, one more interface is there. What is that? A record reader. It's an interface, and for every input split and mapper there will be one record reader.

What does this record reader do? Generally, in Hadoop terminology, we must call them what? Records. In Hadoop terminology we should not call them what? Lines. We have to call them records.

So in the 1st input split, how many lines are there? 2

In the 2nd input split, how many lines are there? 2

In the 3rd input split, how many lines are there? 2

In the last, or 4th, input split, how many lines are there? 1

Now, how do we convert a line into a key,value pair, and how do we pass that key,value pair to the mappers? Why are you converting it into key,value pairs? Because your mappers can understand only key,value pairs and can run only on key,value pairs, and likewise for the reducer. So if I want to turn the lines of these input splits into key,value pairs, whose help do we have to take? The record reader's. You need not write any extra code for the record reader; it is taken care of by the Hadoop framework itself by default. Sir, on what basis does this record reader convert lines into key,value pairs? That depends on which of the four formats your file is in.

This record reader knows only one thing: how to read one line at a time from the corresponding input split. For that reason, what do we call it? A record reader. So we do not call them lines in Hadoop terminology; we call them records. Your file may be in any of those four formats. Ultimately the record reader is reading only one line, but when it converts that line into a key,value pair, it considers what format the file is in.

So what are those four formats?

First: text input format (TextInputFormat — these are all predefined classes)
Second: key,value text input format (KeyValueTextInputFormat)
Third: sequence file input format (SequenceFileInputFormat)
Fourth: sequence file as text input format (SequenceFileAsTextInputFormat)

So your file should be in one of these four formats. Basically, your file may be in any of these four formats, but your record reader reads only one line at a time from its corresponding input split, and when it converts that line into a key,value pair it considers which of the four formats the file is in. If you do not specify any file format, then by default it is text input format. Suppose your file is in text input format: your record reader reads the first line and converts that line into a key,value pair. So how is the line converted into a key,value pair? As (byte offset, entire line).
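For instance, the driver can declare the input format explicitly; a minimal sketch, assuming the org.apache.hadoop.mapreduce API (configureInputFormat is a hypothetical helper, not part of Hadoop):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
    // Hedged sketch: choosing the input format on a Job before submitting it.
    static void configureInputFormat(Job job, boolean keyValueLines) {
        if (keyValueLines) {
            // each line is split into key<TAB>value
            job.setInputFormatClass(KeyValueTextInputFormat.class);
        } else {
            // the default: key = byte offset, value = entire line
            job.setInputFormatClass(TextInputFormat.class);
        }
    }
}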

What is the byte offset here? The address of the line. As this is the first line, the address of the line is what? ZERO (0). And what is the entire line read by the record reader? (hi how are you?)

(byte offset, entire line)
(0, hi how are you?)   (Note: check the image below)

This means the record reader reads the input split's first line, converts it into that key,value pair, and passes that key,value pair to the mapper.

Now the mapper takes this key,value pair and starts working on it.

Next, the record reader moves on and reads the second line. What is that second line? HOW IS YOUR JOB? It converts that line into a key,value pair and passes that key,value pair to the mapper.
How will this second line be converted into a key,value pair? Again into what? (byte offset, entire line). So what will the byte offset and entire line be for the second line? Well, how many characters are there in the first line? "hi how are you?" has 15 characters, and the newline (\n) at the end of the line is considered what? One more special character. So how many bytes are there in total? 16.

So 16 will be given as what? The byte offset of the second line, and the entire line will be: how is your job?

(byte offset, entire line)
(16, how is your job?)
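A tiny standalone illustration of this arithmetic (plain Java, not Hadoop code; the two lines are taken from our file.txt):

public class ByteOffsetDemo {
    public static void main(String[] args) {
        // How TextInputFormat-style byte offsets advance, line by line.
        String[] lines = { "hi how are you?", "how is your job?" };
        long offset = 0;
        for (String line : lines) {
            System.out.println("(" + offset + ", " + line + ")");
            offset += line.length() + 1; // +1 for the newline character
        }
        // prints: (0, hi how are you?) then (16, how is your job?)
    }
}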

Meanwhile, it gives that second key,value pair to the mapper, which should still be working on what? The first key,value pair. However many lines there are in an input split, that many times the mapper will run. Now, how many lines are there in the 1st split? 2 lines. So how many key,value pairs will be generated? 2 key,value pairs. For each key,value pair, the mapper runs once. If I have ten lines, then ten key,value pairs will be generated by the record reader, so the mapper will run ten times. For each and every key,value pair, the mapper runs once.

Now see this. You might have seen the Map interface in the Java collections framework. A map works on what? Key,value pairs. And your entire collections framework works on object types, not on primitive types.

What are the primitive types in Java? int, long, float, double, char (plus byte, short, and boolean). Note that String is not a primitive; it is already a class.

Now, what are the corresponding wrapper class types for all these primitive types? The intention of introducing these wrapper classes is the collections framework. Your collections framework works only on object types, not on primitive types. What does "object types" mean? There should be some class, objects should be created from that class, and only those objects will work with the collections framework. So for each primitive type you have a corresponding wrapper class. What are these wrapper classes?

Primitive Type    Wrapper Class
int               Integer
long              Long
float             Float
double            Double
char              Character

(String needs no wrapper; it is already a class.)
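A small sketch of the boxing round trip that the collections framework relies on (the class name WrapperDemo is just for illustration):

import java.util.ArrayList;
import java.util.List;

public class WrapperDemo {
    public static void main(String[] args) {
        int n = 42;
        Integer boxed = Integer.valueOf(n); // primitive -> wrapper (boxing)
        int back = boxed.intValue();        // wrapper -> primitive (unboxing)

        // Collections can only hold objects, so they need the wrapper type:
        List<Integer> list = new ArrayList<>();
        list.add(n); // n is autoboxed to Integer here
        System.out.println(boxed + " " + back + " " + list);
    }
}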

===================
Just as you have primitive types and wrapper classes in Java, Hadoop has also introduced box classes. So what are the box classes?

Primitive Type    Wrapper Class      Box Class
int               Integer            IntWritable
long              Long               LongWritable
float             Float              FloatWritable
double            Double             DoubleWritable
String            (a class itself)   Text
char              Character          Text

If I want to convert an int to the IntWritable type, I have to write new IntWritable(int).
If I want to convert an IntWritable back to an int, I have to use the get() method.
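A minimal sketch of that round trip (the class name BoxClassDemo is just for illustration):

import org.apache.hadoop.io.IntWritable;

public class BoxClassDemo {
    public static void main(String[] args) {
        int n = 5;
        IntWritable w = new IntWritable(n); // int -> IntWritable
        int back = w.get();                 // IntWritable -> int
        System.out.println(w + " " + back);
    }
}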

So why are we discussing all these wrapper classes, primitive types, and box classes? Because this mapper is getting what? Key,value pairs, and these key,value pairs must be of some object types.
All your keys and values in any map or reduce function should be of some object types, so for these lines you must specify some box class.

What box class is the key taken as, can you say? int? No, it's LongWritable, because the byte offset is a long number.

And what is your value taken as? The text message, so the box class Text.
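Putting these types together, here is a minimal sketch of what the WordCount mapper could look like, again assuming the org.apache.hadoop.mapreduce API. The class name WordCountMapper is the placeholder used in the driver sketch above, and punctuation handling is ignored for simplicity.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the entire line
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE); // emit (word, 1)
        }
    }
}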

Now let's see how the mappers will give their output:

(hi,1) (how,1) (are,1) (you,1)
(how,1)(is,1)(your,1)(job,1)

(how,1)(is,1)(your,1)(family,1)
(how,1)(is,1)(your,1)(sister,1)
(how,1) (is,1) (your,1) (brother,1)
(what,1) (is,1) (the,1) (time,1) (now,1)

(what,1) (is,1) (the,1) (strength,1) (of,1) (hadoop,1)

You write only one mapper code, and that code is shared with all the data nodes, which means the same program runs on all the mappers, and each gives its output as key,value pairs.
When can we say that this is the total output of the input file (file.txt)? Only when we combine all of it in the reducer.


So what will this reducer do? It will try to combine all your map output key,value pairs. All the key,value pairs below are called intermediate data: data generated between the mapper and the reducer
is called intermediate data.
(hi,1) (how,1) (are,1) (you,1)
(how,1)(is,1)(your,1)(job,1)

(how,1)(is,1)(your,1)(family,1)
(how,1)(is,1)(your,1)(sister,1)
(how,1) (is,1) (your,1) (brother,1)
(what,1) (is,1) (the,1) (time,1) (now,1)

(what,1) (is,1) (the,1) (strength,1) (of,1) (hadoop,1)

Within this intermediate data you can see that (how,1) is repeated, (is,1) is repeated, (your,1) is repeated, and so on. So do we have duplicate keys or not? In a map, keys must not be duplicated,
but values can be duplicated; yet if you look at the output above, the keys clearly repeat.

So, in order to get rid of the duplicate keys, this intermediate data goes through two more phases: SHUFFLING and SORTING.

Shuffling: it tries to combine all the values associated with a single identical key. For example, how is repeated 5 times, so shuffling combines them as (how, (1,1,1,1,1)): all five hows end up under one identical key.
One more example: is. How many times is it repeated? Six times: (is, (1,1,1,1,1,1)).

All your box classes implement the WritableComparable interface, by which all the keys are automatically compared with each other (this is what the sort phase relies on).
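A tiny standalone illustration (the class name CompareDemo is just for this sketch):

import org.apache.hadoop.io.Text;

public class CompareDemo {
    public static void main(String[] args) {
        Text a = new Text("how");
        Text b = new Text("is");
        // Text implements WritableComparable, so the framework can sort keys:
        System.out.println(a.compareTo(b)); // negative: "how" sorts before "is"
    }
}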

Now the reducer works on these shuffled key,value pairs, and finally it gives output like (how,5) and (is,6): for each key it just sums the grouped values, as sketched below.
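A minimal sketch of that reducer, matching the placeholder name used in the driver sketch above:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // e.g. (how, (1,1,1,1,1)) -> 5
        }
        context.write(key, new IntWritable(sum)); // emit (how, 5)
    }
}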

This output is handed to the record writer. Just as you had a record reader on the input side, here you have what? A record writer. This record writer knows only one thing: how to write those key,value pairs.


One key,value pair at a time, to what? The output file. What is this output file? Generally you get it inside the output directory you named on the command line (testoutput in our case). In this output directory Hadoop gives two files and one directory:

one file is _SUCCESS : just a marker that the job completed successfully

another one is _logs : a directory where you have some history about your job

and the last one is part-00000 : the final output file
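Once the job finishes, you can inspect that directory with the standard HDFS shell (the exact listing may vary slightly by Hadoop version):

$ hadoop fs -ls testoutput
$ hadoop fs -cat testoutput/part-00000

The cat command prints one word and its count per line, for example how 5 and is 6.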
