Hadoop Questions

What are Ensemble Methods in Hadoop?
Ensemble methods refer to the process of generating multiple models and combining them to solve a specific problem. The process followed in an ensemble method is quite similar to what we follow in day-to-day life: we take opinions from different experts before arriving at a final decision.
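The "opinions from different experts" idea can be sketched in a few lines of Python: several models each make a prediction, and a majority vote combines them. The model outputs and labels below are purely illustrative.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine predictions from several models by taking the most common label."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical models vote on whether a transaction is fraudulent.
model_outputs = ["fraud", "legit", "fraud"]
print(majority_vote(model_outputs))  # fraud
```

Real ensemble techniques such as bagging and boosting build on this same idea of combining weak learners into a stronger one.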

What is sharding in Hadoop?
Database sharding can be defined as a partitioning scheme for large databases distributed across various servers, responsible for new levels of database performance and scalability. It divides a database into smaller parts called “shards” and replicates those across a number of distributed servers.
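A minimal sketch of how a record is routed to a shard: hash the record's key and take the result modulo the number of shards, so every key maps deterministically to one server. The shard count and key names are illustrative.

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Map a record key to a shard by hashing, so load spreads evenly."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Each user record lands on a deterministic shard.
for user in ["alice", "bob", "carol"]:
    print(user, "-> shard", shard_for(user))
```

Production systems often use consistent hashing instead of plain modulo so that adding a shard does not reshuffle every key.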

What is polyglot persistence in Hadoop?
Polyglot persistence refers to Big Data applications that use several core database technologies. A polyglot approach is often used to solve a complex problem by breaking it into fragments and applying different database modelling techniques to each fragment.

What is the RecordReader in Hadoop?
The input split defines a unit of work in a MapReduce program, but it does not describe how to access that unit of work. The RecordReader class loads the required data from its source and converts it into key/value pairs that can be read by the mapper.

What is the Visualization Layer in Hadoop?
The visualization layer handles the task of interpreting and visualizing Big Data. Visualization can be described as viewing a piece of information from different perspectives and interpreting it in different ways.

Can you give us some more details about SSH communication between the masters and the slaves?
SSH enables password-less, secure communication: data packets are sent encrypted between the master and the slaves using key-based authentication. SSH is used not only between masters and slaves but also between any two hosts.

What is the CAP Theorem in Hadoop?
For distributed databases, the three important aspects of the CAP theorem are Consistency (C), Availability (A), and Partition tolerance (P). In quorum terms, the first relates to the number of nodes that should respond to a read request before it is considered a successful operation. The second relates to the number of nodes that should acknowledge a write request before it is considered a successful operation. The third relates to the number of nodes to which the data is replicated, or copied.
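The quorum reading of these three numbers can be sketched in a few lines: with N replicas, R read acknowledgements, and W write acknowledgements, every read quorum overlaps the latest write quorum whenever R + W > N. The function name and settings below are illustrative, not from any particular database.

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """A read is guaranteed to overlap the latest write when R + W > N,
    so at least one replica in every read quorum is up to date."""
    return r + w > n

# Hypothetical settings for a 3-replica cluster.
print(is_strongly_consistent(n=3, r=2, w=2))  # True: quorum reads and writes
print(is_strongly_consistent(n=3, r=1, w=1))  # False: fast, but only eventually consistent
```

Systems such as Cassandra expose exactly this trade-off through tunable consistency levels.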

Differentiate between FileSink and FileRollSink?
The major difference between the Flume HDFS File Sink and File Roll Sink is that the HDFS File Sink writes the events into the Hadoop Distributed File System (HDFS), whereas the File Roll Sink stores the events on the local file system.

What is BloomMapFile used for?
BloomMapFile is a class that extends MapFile. It uses dynamic Bloom filters to provide a quick membership test for keys. It is used in the HBase table format.
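The idea behind the membership test can be illustrated with a tiny generic Bloom filter in Python; this is a conceptual sketch, not BloomMapFile's actual implementation, and the sizes and hash scheme are illustrative.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from the key with salted hashes.
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-key-42")
print(bf.might_contain("row-key-42"))   # True
print(bf.might_contain("missing-key"))  # almost certainly False
```

Because "definitely absent" answers are cheap, a reader can skip a whole file without touching disk when the filter says a key is not there.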

How often do you need to reformat the namenode?
Never. The namenode needs to be formatted only once, at the beginning. Reformatting the namenode will lead to the loss of all data on the entire file system.
The namenode is the only system that needs to be formatted, and only once. Formatting creates the directory structure for the file system metadata and generates a namespaceID for the entire file system.

What is the Hive DDL in Hadoop?
Data Definition Language (DDL) is used to describe the data and data structures of a database. Hive has its own DDL, similar to SQL DDL, which is used for creating, altering, and dropping databases, tables, and other objects in a database.

What is the role of “Zookeeper” in a Hadoop cluster?
Ans: The purpose of Zookeeper is cluster management. Zookeeper helps you achieve coordination between Hadoop nodes. Zookeeper also helps to:
a. Manage configuration across nodes
b. Implement reliable messaging
c. Implement redundant services
d. Synchronize process execution

What is an Oozie Activity in Hadoop?
An Oozie activity is any possible entity that can be tracked in Oozie functional subsystems and Hadoop jobs. The Oozie SLA defines and tracks the desired SLA information for any Oozie activity.

What is the Oozie SLA in Hadoop?
The Oozie SLA specifies the quality of an Oozie application in measurable terms. The SLA can be determined after taking the business requirements and the nature of the software into consideration.

What is the checkpoint in Hadoop?
With checkpoint functionality, the Backup node maintains the current state of all the HDFS block metadata in memory, just like the namenode. If you are using the Backup node, you cannot run the Checkpoint node, and there is no need to, because the checkpointing process is already being taken care of.
In short, the Checkpoint node is the replacement for the secondary namenode.

Execution of Asynchronous Actions in Oozie?
Any asynchronous action in the Hadoop cluster can be executed in the form of Hadoop MapReduce jobs; this makes Oozie scalable. When you use Hadoop to perform processing/computation tasks triggered by a workflow action, workflow jobs must wait until these tasks complete before moving to the next node in the workflow.

What are the Oozie recovery capabilities in Hadoop?
Oozie can recover workflow jobs in two ways. First, when an action starts successfully, Oozie applies the MapReduce retry mechanisms for recovery. On the other hand, if an action fails to start, Oozie uses other recovery techniques depending on the nature of the failure.

What is the Oozie Bundle in Hadoop?
The Oozie bundle is a top-level abstraction: a bundle, or set, of coordinator applications. The Oozie bundle enables a user to start, stop, suspend, resume, or rerun jobs at the bundle level, providing better operational control over the set of coordinator applications.

What are Split and Flatten in Pig?
The SPLIT operator partitions a given relation into two or more relations.
The FLATTEN operator is used for un-nesting nested tuples and bags.
Syntactically, FLATTEN looks similar to a user-defined function call.
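The behavior of the two Pig operators above can be sketched in plain Python; this is not Pig Latin, just an illustration of the semantics on small in-memory "relations", with made-up records and predicates.

```python
# SPLIT: partition one relation into several by predicates.
records = [("a", 1), ("b", 5), ("c", 9)]
small = [r for r in records if r[1] < 5]
large = [r for r in records if r[1] >= 5]

# FLATTEN: un-nest a field holding a tuple into top-level fields.
nested = [("a", (1, 2)), ("b", (3, 4))]
flattened = [(name, x, y) for name, (x, y) in nested]

print(small)      # [('a', 1)]
print(large)      # [('b', 5), ('c', 9)]
print(flattened)  # [('a', 1, 2), ('b', 3, 4)]
```

In Pig itself, the equivalent statements would be `SPLIT records INTO small IF f < 5, large IF f >= 5;` and a `FOREACH ... GENERATE name, FLATTEN(pair);`.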

What is the Oozie Coordinator in Hadoop?
The Oozie coordinator is used to specify the conditions for a workflow in the form of predicates. It triggers the execution of the workflow at a specified time, at regular intervals, or on the basis of data availability.

What is the difference between static and dynamic partitions in Hive?
Static partition: the name of the partition is hard-coded in the insert statement.
Dynamic partition: Hive automatically determines the partition based on the value of the partition field.
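How dynamic partitioning derives partition directories can be sketched in Python; this is not Hive code, just an illustration of Hive's real convention of storing each partition under a `column=value` directory. The row data and field names are made up.

```python
from collections import defaultdict

def dynamic_partition(rows, partition_field):
    """Group rows under partition directory names the way Hive's dynamic
    partitioning derives them from each row's partition-column value."""
    partitions = defaultdict(list)
    for row in rows:
        value = row[partition_field]
        partitions[f"{partition_field}={value}"].append(row)
    return dict(partitions)

rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "IN"},
    {"id": 3, "country": "US"},
]
print(dynamic_partition(rows, "country"))
# {'country=US': [...2 rows...], 'country=IN': [...1 row...]}
```

With a static partition the directory name would be fixed in the INSERT statement instead of being computed per row.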

How do we include native libraries in YARN jobs?
Ans: By using the -Djava.library.path option on the command line, or by setting LD_LIBRARY_PATH in the .bashrc file.

How does Hadoop’s CLASSPATH play a vital role in starting or stopping Hadoop daemons?
Ans: The classpath contains the list of directories with the jar files required to start/stop the daemons.
Ex: HADOOP_HOME/share/hadoop/common/lib contains all the common utility jar files.

How can you set an arbitrary number of reducers to be created for a job in Hadoop?

Ans: You can either do it programmatically, by calling the setNumReduceTasks method on the JobConf class, or set it as a configuration property.

How do you overwrite the replication factor?

Ans: There are a few ways to do this. See the illustrations below.

hadoop fs -setrep -w 5 -R hadoop-test

hadoop fs -Ddfs.replication=5 -cp hadoop-test/test.csv hadoop-test/test_with_rep5.csv

MapReduce jobs are failing on a cluster that was just restarted. They worked before the restart. What could be wrong?

Ans: The cluster is in safe mode. The administrator needs to wait for the namenode to exit safe mode before restarting the jobs.
When there is no secondary namenode on the cluster and the cluster has not been restarted in a long time, the namenode will go into safe mode and merge the edit log into the current file system image, which can take a while.

The Web UI shows that half of the datanodes are in decommissioning mode. What does that mean? Is it safe to remove those nodes from the network?

Ans: Decommissioning means the namenode is retrieving data from those datanodes by moving their replicas to the remaining datanodes. Data can be lost if the administrator removes those datanodes before decommissioning has finished: because of the replication strategy, removing datanodes en masse prior to completing the decommissioning process risks losing every copy of some blocks.

MapReduce jobs take too long. What can be done to improve the performance of the cluster?

Ans: One of the most common reasons for performance problems on a Hadoop cluster is uneven distribution of the tasks. The number of tasks has to match the number of available slots on the cluster.
Hadoop is not a hardware-aware system; it is the responsibility of the developers and the administrators to make sure that resource supply and demand match.

How is the distance between two nodes defined in Hadoop?
Ans: Measuring bandwidth is difficult in Hadoop, so the network is represented as a tree. The distance between two nodes in the tree plays a vital role in forming a Hadoop cluster and is defined by the network topology and the Java interface DNSToSwitchMapping. The distance is the sum of the distances from each node to their closest common ancestor. The method getDistance(Node node1, Node node2) calculates this distance, with the assumption that the distance from a node to its parent node is always 1.
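The getDistance computation described above can be sketched in Python over topology paths; the `/rack/node` path format below is an illustrative simplification of Hadoop's network locations.

```python
def get_distance(path1, path2):
    """Distance between two nodes in the topology tree, assuming each
    hop from a node to its parent costs 1."""
    a = path1.strip("/").split("/")
    b = path2.strip("/").split("/")
    # Length of the common prefix = depth of the closest common ancestor.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Sum of each node's hops up to that ancestor.
    return (len(a) - common) + (len(b) - common)

print(get_distance("/rack1/node1", "/rack1/node1"))  # 0: same node
print(get_distance("/rack1/node1", "/rack1/node2"))  # 2: same rack
print(get_distance("/rack1/node1", "/rack2/node3"))  # 4: different racks
```

These distances are what let HDFS prefer reading a replica on the same node, then the same rack, before going off-rack.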

What is Backward Compatibility with YARN?
YARN is backward compatible, meaning that code developed using MapReduce can run on YARN (Hadoop 2) with no or only minor changes. This is a very important feature, as applications developed using MapReduce usually cater to a large user base and run on widespread distributed systems.

After increasing the replication level, I still see that data is under-replicated. What could be wrong?
Ans: Data replication takes time for large quantities of data; the Hadoop administrator should allow sufficient time for it.
Depending on the data size, the cluster still needs to copy data around, and if the data size is big enough it is not uncommon for replication to take from a few minutes to a few hours.

How would a Hadoop administrator deploy the various components of Hadoop in production?

Ans: Deploy the namenode and job tracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes.
Only one namenode and one job tracker are needed on the system. The number of datanodes depends on the available hardware.

What are the hardware requirements for a Hadoop cluster (primary and secondary namenodes and datanodes)?

Ans: There are no special requirements for datanodes. The namenodes, however, require a specified amount of RAM to store the filesystem image in memory: by design, both the primary and secondary namenodes hold the entire filesystem information in memory, so each needs enough memory to contain the entire filesystem image.

What are Operationalized Analytics and Monetized Analytics?
Operationalized analytics means making analytics an important part of the business process. For instance, an insurance company can use a model to predict the probability of a claim being fraudulent.

Monetized analytics helps businesses make important and better decisions, and helps earn revenue.

What is Monetized Analytics in Hadoop?
Monetized analytics helps businesses make important and better decisions, and helps earn revenue. However, Big Data analytics is also used to derive revenue beyond the insights it provides: you might be able to produce a unique data set that is valuable to other companies.

What is the Sqoop merge tool?
The Sqoop merge tool works hand in hand with incremental import in lastmodified mode. Each import creates a new file, so if you want to keep the table data together in one data set, you use the merge tool.
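The merge semantics can be sketched in Python: records from the newer import overwrite records with the same key from the older data set. This is a conceptual sketch, not Sqoop code, and the record fields are made up.

```python
def merge(old_records, new_records, key="id"):
    """Sketch of `sqoop merge` semantics: records from the latest
    incremental import replace older records with the same key."""
    merged = {r[key]: r for r in old_records}
    merged.update({r[key]: r for r in new_records})
    return list(merged.values())

old = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
new = [{"id": 2, "name": "robert"}, {"id": 3, "name": "carol"}]
print(merge(old, new))
# id 1 unchanged, id 2 updated from the new import, id 3 added
```

In Sqoop itself, the merge key is passed with `--merge-key` when invoking the merge tool.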

