HDFS Dump Questions: Model 1

HDFS:
1. Name the most common InputFormats defined in Hadoop. Which one is the default?

1a) The most common InputFormats are TextInputFormat, KeyValueTextInputFormat, and SequenceFileInputFormat. TextInputFormat is the Hadoop default.
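
A quick sketch of how a non-default InputFormat is chosen (the Configuration object and job name here are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

    // Illustrative: TextInputFormat is assumed when nothing is set;
    // any other format must be registered explicitly.
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "example-job");
    job.setInputFormatClass(KeyValueTextInputFormat.class);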

2. Consider this case scenario: in an M/R system, the HDFS block size is 64 MB, the input format is FileInputFormat, and we have three files of size 64 KB, 65 MB, and 127 MB. How many input splits will the Hadoop framework create?

2a) Five splits: one for the 64 KB file, two for the 65 MB file (64 MB + 1 MB), and two for the 127 MB file (64 MB + 63 MB).

3. What are some typical functions of the JobTracker?

3a) The JobTracker performs several core functions:
1. It accepts jobs from clients.
2. It communicates with the NameNode to determine the location of the data.
3. It locates TaskTracker nodes with free slots at or near the data.
4. It submits the work to the chosen TaskTracker nodes and monitors the progress of each task through the heartbeat signals the TaskTrackers send back.

4. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?

4a) It will restart the task on some other TaskTracker. Only if the task fails more than four times (the default setting, which can be changed) will Hadoop kill the job.
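
A sketch of how that retry limit can be raised (property names are the classic MRv1 ones; MRv2 renamed them to mapreduce.map.maxattempts and mapreduce.reduce.maxattempts):

    import org.apache.hadoop.conf.Configuration;

    // Illustrative: allow up to 8 attempts per task instead of the default 4.
    Configuration conf = new Configuration();
    conf.setInt("mapred.map.max.attempts", 8);
    conf.setInt("mapred.reduce.max.attempts", 8);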

5. What is Hadoop Streaming?

5a) Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.
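
The canonical invocation from the Hadoop Streaming documentation looks like this (the jar path and directories vary by installation):

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper /bin/cat \
        -reducer /usr/bin/wc
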
6. What is Distributed Cache in Hadoop?

6a) Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars, etc.) needed by applications.

Applications specify the files to be cached via URLs (hdfs:// or http://) through the JobConf. The DistributedCache assumes that the files specified via URLs are already present on the FileSystem at the paths the URLs point to and are accessible by every machine in the cluster.
The framework copies the necessary files to a slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are copied only once per job, and from its ability to cache archives, which are unarchived on the slaves.

Reference – https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html
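
A minimal sketch of the API referenced above (the HDFS path is hypothetical, and the file must already exist there; newer releases fold this into Job.addCacheFile):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // Illustrative: register a lookup file that every task can read locally.
    JobConf conf = new JobConf();
    DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup"), conf);

    // Inside a task, the locally cached copies can then be enumerated:
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
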
7. Is it possible to have Hadoop job output in multiple directories? If yes, how?

7a) Yes, by using the MultipleOutputs class.
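
A hedged sketch with the new-API MultipleOutputs (the reducer shape and key scheme are illustrative; the job's output format is assumed to be a FileOutputFormat):

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class RoutingReducer
            extends Reducer<Text, Text, Text, NullWritable> {
        private MultipleOutputs<Text, NullWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // The third argument is a base output path relative to the
                // job's output directory, so records land in per-key subdirs.
                mos.write(value, NullWritable.get(), key.toString() + "/part");
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close();
        }
    }
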
8. What will a Hadoop job do if you try to run it with an output directory that is already present? Will it overwrite the directory?

8a) No. If the output directory already exists, an exception is thrown and the job exits before any tasks run; Hadoop will not overwrite existing output.
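
A common workaround, sketched here with a hypothetical output path, is to delete the stale directory before submitting the job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative: clear a leftover output directory so the job's
    // pre-flight output check does not abort the run.
    Configuration conf = new Configuration();
    Path output = new Path("/user/demo/output");  // hypothetical path
    FileSystem fs = FileSystem.get(conf);
    if (fs.exists(output)) {
        fs.delete(output, true);                  // recursive delete
    }
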
9. How will you write a custom partitioner for a Hadoop job?
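
9a) Extend the Partitioner class, override getPartition(), and register the class on the job with setPartitionerClass(). A minimal sketch (the "field1|field2" key scheme is illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative: partition on the first field of a "field1|field2" key
    // so that all records sharing field1 reach the same reducer.
    public class FirstFieldPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String firstField = key.toString().split("\\|", 2)[0];
            // Mask the sign bit so the modulo result is never negative.
            return (firstField.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

It is then enabled at job setup time with job.setPartitionerClass(FirstFieldPartitioner.class).
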
10. How did you debug your Hadoop code?

10a) There can be several ways of doing this, but the most common are:
1. Using counters.
2. Using the web interface provided by the Hadoop framework.
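
A quick sketch of the counter approach (the group and counter names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative: tally malformed records instead of crashing on them;
    // the totals show up in the web UI and in the client's job summary.
    public class ParsingMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 2) {
                context.getCounter("Debug", "MalformedRecords").increment(1);
                return;  // skip the bad record, keep a count for inspection
            }
            context.write(new Text(fields[0]), new LongWritable(1));
        }
    }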
