
Real Time Hadoop Interview Questions From Different Readers


Hive Interview Questions and Answers by sudheerhadoopdev.wordpress.com

Here I'm updating the list of interview questions that have been asked during several interviews on Hive.

Q. What is SerDe in Hive?

The SerDe interface allows you to instruct Hive as to how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.

Commonly available SerDes in Hive include LazySimpleSerDe (the default), OpenCSVSerde, RegexSerDe, JsonSerDe, and the Avro, ORC and Parquet SerDes.
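As a minimal sketch (the table, columns and HDFS path here are made up for illustration), a SerDe is attached to a table through the ROW FORMAT SERDE clause, for example using the built-in OpenCSVSerde:

CREATE EXTERNAL TABLE csv_events (id STRING, event STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
STORED AS TEXTFILE
LOCATION '/data/csv_events';   -- hypothetical HDFS path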

Q. Can you access an HBase table from Hive?

https://www.confusedcoders.com/bigdata/hive/hbase-hive-integration-querying-hbase-via-hive
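Yes. Roughly, the integration works through the HBase storage handler. A hedged sketch (the HBase table name, column family and columns below are hypothetical):

CREATE EXTERNAL TABLE hbase_users (rowkey STRING, name STRING, city STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:city")
TBLPROPERTIES ("hbase.table.name" = "users");

-- Hive queries such as SELECT name, city FROM hbase_users now read directly from HBase.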

Q. Can we do INSERT, UPDATE and DELETE in Hive, and from which version onwards are they supported? (Ans: 0.14 onwards)

INSERT … VALUES, UPDATE, and DELETE SQL statements are supported in Apache Hive 0.14 and later. The INSERT … VALUES statement enables users to write data to Apache Hive from values provided in SQL statements. The UPDATE and DELETE statements enable users to modify and delete values already written to Hive. All three statements support auto-commit, which means that each statement is a separate transaction that is automatically committed after the SQL statement is executed.

The INSERT … VALUES, UPDATE, and DELETE statements require the following property values in the hive-site.xml configuration file:

Table properties:

If a table is to be used in ACID writes (insert, update, delete) then the table property "transactional"="true" must be set on that table, starting with Hive 0.14.0. Note that once a table has been defined as an ACID table via TBLPROPERTIES ("transactional"="true"), it cannot be converted back to a non-ACID table, i.e., changing TBLPROPERTIES ("transactional"="false") is not allowed. Also, hive.txn.manager must be set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager either in hive-site.xml or at the beginning of the session before any query is run. Without those settings, inserts will be done in the old style, and updates and deletes will be prohibited.

Configuration Property                    Required Value
hive.enforce.bucketing                    true
hive.exec.dynamic.partition.mode          nonstrict
hive.support.concurrency                  true
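A minimal sketch of an ACID-enabled table and the DML it allows (the table and columns are hypothetical; the properties above, plus hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager, are assumed to be set):

CREATE TABLE employees_acid (id INT, name STRING, salary DOUBLE)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional"="true");

INSERT INTO TABLE employees_acid VALUES (1, 'Anil', 50000.0);
UPDATE employees_acid SET salary = 55000.0 WHERE id = 1;
DELETE FROM employees_acid WHERE id = 1;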

Q. What is bucketing in Hive?

https://acadgild.com/blog/bucketing-in-hive/

To enable dynamic partitioning along with bucketing, we need to set the below properties:

hive> set hive.exec.dynamic.partition=true;

hive> set hive.exec.dynamic.partition.mode=nonstrict;

hive> set hive.exec.max.dynamic.partitions=20000;

hive> set hive.exec.max.dynamic.partitions.pernode=20000;

hive> set hive.enforce.bucketing=true;

Advantages of Bucketing:

Bucketed tables allow much more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a section of the data for testing and debugging purposes when the original data sets are very huge. Here, the user can fix the size of the buckets according to the need.

The bucketing concept also provides the flexibility to keep the records in each bucket sorted by one or more columns. Since the data files are split into roughly equally sized parts, map-side joins will be faster on bucketed tables.
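For example, sampling a bucketed table can read just one bucket instead of the full data set (user_logs and user_id below are hypothetical names):

SELECT * FROM user_logs TABLESAMPLE (BUCKET 1 OUT OF 16 ON user_id);   -- reads roughly 1/16 of the data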

Q. What are the different file formats in Hive?

Answer: TextFile, SequenceFile, RCFile, ORC, Parquet, Avro.

Q. Can we access data in HDFS using external tables in Hive?

Yes

Q. What is the Hive file structure hierarchy in HDFS?

/user/hive/warehouse

Q. If we create a table with buckets (without partitions), what is the file storage hierarchy in HDFS?

/user/hive/warehouse/<table_name>/ containing the bucket part* files

Q. Can we set the number of mappers and reducers in Hive? If yes, how?

SET mapred.map.tasks=XX;        -- Hadoop 1.x
SET mapreduce.job.maps=XX;      -- Hadoop 2.x (YARN)

SET mapred.reduce.tasks=XX;     -- Hadoop 1.x
SET mapreduce.job.reduces=XX;   -- Hadoop 2.x (YARN)

Or we can set them in the hive-site.xml file as well.

Q. What are the configuration properties in Hive related to partitions and buckets?

hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
hive.exec.max.dynamic.partitions=20000
hive.exec.max.dynamic.partitions.pernode=20000
hive.enforce.bucketing=true

Q. What is TRANSFORM in Hive?

http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html

http://maheshwaranm.blogspot.in/2013/07/hive-with-python-example.html
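In short, TRANSFORM streams the rows of a query through an external script, similar to Hadoop streaming. A rough sketch (the script name, table and columns are hypothetical):

ADD FILE /tmp/clean_names.py;    -- hypothetical Python script reading tab-separated rows on stdin

SELECT TRANSFORM (id, name)
USING 'python clean_names.py'
AS (id, cleaned_name)
FROM users;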

Q. What is the syntax to create a table with partitions and buckets in Hive?

CREATE TABLE tablename (a INT, b STRING)
PARTITIONED BY (c INT)
CLUSTERED BY (b) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Q. Why do we use partitions and buckets in Hive?

Partitions are used mainly for improving query performance.

Bucketing is used mainly for sampling, testing and debugging.

Combining partitioning with bucketing, we can retrieve results somewhat faster.

Q. Can we create buckets without partitions?

Yes

Q. What is the difference between static and dynamic partitions in Hive?

Static Partition in Hive

Inserting input data files individually into a partitioned table, specifying the partition value yourself, is a static partition.

Usually, when loading (big) files into Hive tables, static partitions are preferred.

A static partition saves time in loading data compared to a dynamic partition.

You "statically" add a partition to the table and move the file into that partition of the table.

We can add or alter partitions explicitly with static partitioning.

Dynamic Partition in Hive

A single insert into a partitioned table, where Hive derives the partition values from the data, is known as a dynamic partition.

Usually, dynamic partitioning loads data from a non-partitioned (staging) table.

Dynamic partitioning takes more time in loading data compared to static partitioning.

When you have large data stored in a table, dynamic partitioning is suitable.

If you want to partition on a column but you don't know its values in advance, dynamic partitioning is also suitable.
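A short sketch of the difference (the sales table, staging table and country column are hypothetical):

-- Static partition: the partition value is given explicitly in the statement
LOAD DATA INPATH '/staging/sales_in.csv' INTO TABLE sales PARTITION (country='IN');

-- Dynamic partition: partition values are derived from the data at insert time
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sales PARTITION (country)
SELECT id, amount, country FROM sales_staging;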

Q. How will you process JSON data in Hive and Pig?

http://help.mortardata.com/technologies/pig/json

First, ensure that your files are formatted with exactly one JSON object per line. The simplest way to use the JsonLoader is with no schema definition:


json_objects = LOAD 's3n://my-s3-bucket/path/to/json/file'

USING org.apache.pig.piggybank.storage.JsonLoader();

This loads each JSON document into a field called object with a Pig data type of map[]. All of the top-level keys in your JSON object will appear in that map.

Providing a Schema

Once you know which fields you need from your data, you can provide JsonLoader with a schema. You pass a schema to JsonLoader as a string parameter, like this:

json_objects = LOAD 's3n://my-s3-bucket/path/to/json/file'

USING org.apache.pig.piggybank.storage.JsonLoader(

'a: int, b: tuple(i: int, j: int), c: bag{t: tuple(i: int, j: int)}'

);

Note that this schema is passed as a parameter, not put into the LOAD statement's AS clause.

JsonLoader will extract these fields from your data. If a field does not parse correctly, an error will be logged and the field will be read as null. Any fields that can’t be found in the data will also be read as null.

Storing JSON data with JsonStorage:

You can store the result of a Pig script in JSON format using JsonStorage. It will store each tuple from a Pig relation as a JSON object on a single line. The schema of the relation to be stored must be fully specified.

-- Pig 0.9

STORE result INTO 's3n://my-s3-bucket/path/to/json/output'

USING org.apache.pig.piggybank.storage.JsonStorage();

-- Pig 0.12

STORE result INTO 's3n://my-s3-bucket/path/to/json/output'

USING org.apache.pig.builtin.JsonStorage();

Hive with JSON:

CREATE EXTERNAL TABLE json_serde_table (

id string,

person struct<email:string, first_name:string, last_name:string, location:struct<address:string, city:string, state:string, zipcode:string>, text:string, url:string>)

ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'

LOCATION '/tmp/json/';

 

SELECT id, person.first_name, person.last_name, person.email,

person.location.address, person.location.city, person.location.state,

person.location.zipcode, person.text, person.url

FROM json_serde_table LIMIT 5;

OpenX JSON SerDe (alternative):

https://github.com/sheetaldolas/Hive-JSON-Serde/tree/master

http://engineering.skybettingandgaming.com/2015/01/20/parsing-json-in-hive/ (examples)

Q. Where does Hive data reside?

If we specify a LOCATION during table creation, the data will be placed in that HDFS directory; otherwise Hive puts the data under the /user/hive/warehouse directory.

Q. What is the difference between SORT BY and ORDER BY in Hive?

ORDER BY uses a single reducer and orders all the data by the specified column(s).

SORT BY uses multiple reducers and sorts the data within each reducer's output, so we cannot guarantee a global ordering.
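For example (the employees table is hypothetical):

SELECT * FROM employees ORDER BY salary DESC;   -- total ordering, funnelled through a single reducer
SELECT * FROM employees SORT BY salary DESC;    -- each reducer's output is sorted, no global ordering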

Q. What is an ORC file and how does it work?

ORC offers a number of features not available in RC files:

* Better encoding of data. Integer values are run-length encoded. Strings and dates are stored in a dictionary (and the resulting pointers then run-length encoded).

* Internal indexes and statistics on the data. This allows for more efficient reading of the data as well as skipping of sections of the data not relevant to a given query. These indexes can also be used by the Hive optimizer to help plan query execution.

* Predicate push-down for some predicates. For example, in the query "select * from user where state = 'ca'", ORC could look at a collection of rows and use the indexes to see that no rows in that group have that value, and thus skip the group altogether.

* Tight integration with Hive's vectorized execution, which produces much faster processing of rows.

* Support for new ACID features in Hive (transactional insert, update, and delete).

* It has a much faster read time than RCFile and compresses much more efficiently.

 

Whether ORC is the best format for what you're doing depends on the data you're storing and how you are querying it. If you are storing data where you know the schema and you are doing analytic-type queries, it's the best choice (in fairness, some would dispute this and choose Parquet, though much of what I said above about ORC vs RC applies to Parquet as well).

If you are doing queries that select the whole row each time, columnar formats like ORC won't be your friend. Also, if you are storing self-describing data such as JSON or Avro, you may find text or Avro storage to be a better format.
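A minimal sketch of creating and loading an ORC table (table names and the compression codec are illustrative):

CREATE TABLE page_views_orc (user_id BIGINT, url STRING, view_time TIMESTAMP)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");

INSERT INTO TABLE page_views_orc
SELECT user_id, url, view_time FROM page_views_text;   -- hypothetical text-format staging table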

 

Q. What are the types of tables in Hive, and what is the difference between them?

Use EXTERNAL tables when:

  • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
  • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
  • Hive should not own the data and control settings, directories, etc.; you may have another program or process that will do those things.
  • You are not creating a table based on an existing table (AS SELECT).

Use INTERNAL (managed) tables when:

  • The data is temporary.
  • You want Hive to completely manage the life-cycle of the table and data.
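A short illustration (paths and table names are hypothetical): dropping the external table leaves the underlying files in place, while dropping the managed table also deletes its data:

CREATE EXTERNAL TABLE logs_ext (line STRING)
LOCATION '/data/raw/logs';                 -- DROP TABLE logs_ext keeps the files in /data/raw/logs

CREATE TABLE logs_managed (line STRING);   -- data lives under /user/hive/warehouse/logs_managed
-- DROP TABLE logs_managed would remove the data files as well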

Q. How do you do performance tuning in Hive, in detail (partitioning, bucketing, indexing)?

Partitioning – creating a directory for every value of the partition column(s) and placing the data under it.

Bucketing – creating a fixed number of files and storing rows with the same hash value of the bucketing column in the same file; the number of buckets must be specified.

Hive Partitioning:

Partitioning divides a large amount of data into multiple slices based on the value of one or more table columns.

Assume that you are storing information about people in the entire world, spread across 196+ countries and spanning around 500 crores of entries. If you want to query people from a particular country (Vatican City), in the absence of partitioning you have to scan all 500 crores of entries even to fetch the thousand entries of one country. If you partition the table based on country, you can fine-tune the querying process by just checking the data for only one country partition. A Hive partition creates a separate directory for each value of the partition column(s).

Pros:

Distribute execution load horizontally

Faster execution of queries in case of a partition with a low volume of data, e.g. getting the population of "Vatican City" returns very fast instead of searching the entire population of the world.

Cons:

Possibility of too many small partition creations – too many directories.

Effective for low-volume data in a given partition. But some queries, like GROUP BY on a high volume of data, still take a long time to execute; e.g. grouping the population of China will take a long time compared to grouping the population of Vatican City. Partitioning does not solve the responsiveness problem when data is skewed towards a particular partition value.

Hive Bucketing:

Bucketing decomposes data into more manageable or equal parts.

With partitioning, there is a possibility that you create many small partitions based on column values. If you go for bucketing, you are restricting the number of buckets used to store the data; this number is defined in the table creation script.

Pros

Due to roughly equal volumes of data in each bucket, map-side joins will be quicker.

Faster query response, similar to partitioning.

Cons

You can define the number of buckets during table creation, but loading an equal volume of data into each bucket has to be managed by the programmer.

Hive Indexing :

The goal of Hive indexing is to improve the speed of query lookup on certain columns of a table. Without an index, queries with predicates like 'WHERE tab1.col1 = 10' load the entire table or partition and process all the rows. But if an index exists for col1, then only a portion of the file needs to be loaded and processed.

The improvement in query speed that an index can provide comes at the cost of additional processing to create the index and disk space to store the index.
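A hedged sketch of creating and building a compact index (table and column names follow the example above; note that built-in indexing was removed in Hive 3.0, so this applies to older releases):

CREATE INDEX idx_col1 ON TABLE tab1 (col1) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX idx_col1 ON tab1 REBUILD;   -- builds the index data so queries filtering on col1 can use it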

Q. How do you process XML in Hive?

XML:

[hdfs@sandbox ~]$ hdfs dfs -mkdir -p /test/sequences

[hdfs@sandbox ~]$

[hdfs@sandbox ~]$ vi sample.xml

<sequences period="5">

<moment game-clock="300.00" time="1433732215737" game-event-id="" shot-clock="24.00" locations="-1,-1,96.95182,1.98648,5.75987;9,173004,45.54661,17.35545,0;9,338365,24.04722,25.67399,0;9,457611,46.95292,27.93478,0;9,468895,48.59834,33.96586,0;9,552336,33.73381,24.05929,0;5,214152,59.26872,24.12006,0;5,253997,45.71551,17.41071,0;5,457186,48.59834,33.96586,0;5,531447,78.09629,34.24688,0;5,552806,47.79678,22.8155,0"/>

<moment game-clock="300.00" time="1433732215794" game-event-id="" shot-clock="24.00" locations="-1,-1,97.79683,.89407,3.67626;9,173004,45.62283,17.34854,0;9,338365,24.04248,25.6784,0;9,457611,46.84978,27.8463,0;9,468895,48.52017,33.89189,0;9,552336,33.74064,24.03493,0;5,214152,59.27282,24.07895,0;5,253997,45.87101,17.38532,0;5,457186,48.52017,33.89189,0;5,531447,78.06394,34.2155,0;5,552806,47.8269,22.81393,0"/>

<moment game-clock="300.00" time="1433732215829" game-event-id="" shot-clock="24.00" locations="-1,-1,97.78946,.91006,3.68332;9,173004,45.61995,17.35703,0;9,338365,24.03815,25.68402,0;9,457611,46.71077,27.71191,0;9,468895,48.37095,33.77756,0;9,552336,33.74769,24.00829,0;5,214152,59.27627,24.06055,0;5,253997,46.00077,17.36555,0;5,457186,48.37095,33.77756,0;5,531447,78.0439,34.20521,0;5,552806,47.84297,22.83292,0"/>

<moment game-clock="300.00" time="1433732215856" game-event-id="" shot-clock="24.00" locations="-1,-1,97.73786,1.02206,3.73271;9,173004,45.57851,17.34979,0;9,338365,24.04207,25.61049,0;9,457611,46.63871,27.56226,0;9,468895,48.2033,33.7142,0;9,552336,33.75497,23.97935,0;5,214152,59.27906,24.06485,0;5,253997,46.10481,17.35141,0;5,457186,48.29748,33.63262,0;5,531447,78.03618,34.216,0;5,552806,47.84498,22.87247,0"/>

<moment game-clock="300.00" time="1433732215905" game-event-id="" shot-clock="24.00" locations="-1,-1,97.59781,1.32606,3.8668;9,173004,45.57865,17.34643,0;9,338365,24.04224,25.61058,0;9,457611,46.56615,27.44014,0;9,468895,48.01722,33.7018,0;9,552336,33.76247,23.94813,0;5,214152,59.27976,24.07223,0;5,253997,46.26668,17.38672,0;5,457186,48.29974,33.45708,0;5,531447,78.02931,34.2208,0;5,552806,47.86752,22.85019,0"/>

</sequences>

[hdfs@sandbox ~]$

hdfs dfs -put sample.xml /test/sequences/

[hdfs@sandbox ~]$

[hdfs@sandbox ~]$ wget http://search.maven.org/remotecontent?filepath=com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.5.3/hivexmlserde-1.0.5.3.jar

[hdfs@sandbox ~]$ mv remotecontent?filepath=com%2Fibm%2Fspss%2Fhive%2Fserde2%2Fxml%2Fhivexmlserde%2F1.0.5.3%2Fhivexmlserde-1.0.5.3.jar hivexmlserde-1.0.5.3.jar

[hdfs@sandbox ~]$ pwd

/home/hdfs

[hdfs@sandbox ~]$ mv hivexmlserde-1.0.5.3.jar /tmp

hive

hive> add jar /tmp/hivexmlserde-1.0.5.3.jar;

Added [/tmp/hivexmlserde-1.0.5.3.jar] to class path

Added resources: [/tmp/hivexmlserde-1.0.5.3.jar]

hive> drop table sequences;

OK

Time taken: 3.717 seconds

CREATE EXTERNAL TABLE sequences(

gameclock double,

time bigint,

gameeventid string,

shotclock double,

locations string

)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'

WITH SERDEPROPERTIES (

"column.xpath.gameclock"="/moment/@game-clock",

"column.xpath.time"="/moment/@time",

"column.xpath.gameeventid"="/moment/@game-event-id",

"column.xpath.shotclock"="/moment/@shot-clock",

"column.xpath.locations"="/moment/@locations"

)

STORED AS

INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'

OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'

LOCATION '/test/sequences'

TBLPROPERTIES (

"xmlinput.start"="<moment ",

"xmlinput.end"="/"

)

;

OK

Time taken: 2.871 seconds

hive>

hive> select * from sequences;

OK

300.0  1433732215737    24.0  -1,-1,96.95182,1.98648,5.75987;9,173004,45.54661,17.35545,0;9,338365,24.04722,25.67399,0;9,457611,46.95292,27.93478,0;9,468895,48.59834,33.96586,0;9,552336,33.73381,24.05929,0;5,214152,59.26872,24.12006,0;5,253997,45.71551,17.41071,0;5,457186,48.59834,33.96586,0;5,531447,78.09629,34.24688,0;5,552806,47.79678,22.8155,0

300.0  1433732215794    24.0  -1,-1,97.79683,.89407,3.67626;9,173004,45.62283,17.34854,0;9,338365,24.04248,25.6784,0;9,457611,46.84978,27.8463,0;9,468895,48.52017,33.89189,0;9,552336,33.74064,24.03493,0;5,214152,59.27282,24.07895,0;5,253997,45.87101,17.38532,0;5,457186,48.52017,33.89189,0;5,531447,78.06394,34.2155,0;5,552806,47.8269,22.81393,0

300.0  1433732215829    24.0  -1,-1,97.78946,.91006,3.68332;9,173004,45.61995,17.35703,0;9,338365,24.03815,25.68402,0;9,457611,46.71077,27.71191,0;9,468895,48.37095,33.77756,0;9,552336,33.74769,24.00829,0;5,214152,59.27627,24.06055,0;5,253997,46.00077,17.36555,0;5,457186,48.37095,33.77756,0;5,531447,78.0439,34.20521,0;5,552806,47.84297,22.83292,0

300.0  1433732215856    24.0  -1,-1,97.73786,1.02206,3.73271;9,173004,45.57851,17.34979,0;9,338365,24.04207,25.61049,0;9,457611,46.63871,27.56226,0;9,468895,48.2033,33.7142,0;9,552336,33.75497,23.97935,0;5,214152,59.27906,24.06485,0;5,253997,46.10481,17.35141,0;5,457186,48.29748,33.63262,0;5,531447,78.03618,34.216,0;5,552806,47.84498,22.87247,0

300.0  1433732215905    24.0  -1,-1,97.59781,1.32606,3.8668;9,173004,45.57865,17.34643,0;9,338365,24.04224,25.61058,0;9,457611,46.56615,27.44014,0;9,468895,48.01722,33.7018,0;9,552336,33.76247,23.94813,0;5,214152,59.27976,24.07223,0;5,253997,46.26668,17.38672,0;5,457186,48.29974,33.45708,0;5,531447,78.02931,34.2208,0;5,552806,47.86752,22.85019,0

Time taken: 2.8 seconds, Fetched: 5 row(s)

hive>

> select gameclock, time, gameeventid, shotclock, concat(substr(locations,1,10), ' …') as sequence_truncated_for_email_to_anoop from sequences;

OK

300.0  1433732215737    24.0  -1,-1,96.9 …

300.0  1433732215794    24.0  -1,-1,97.7 …

300.0  1433732215829    24.0  -1,-1,97.7 …

300.0  1433732215856    24.0  -1,-1,97.7 …

300.0  1433732215905    24.0  -1,-1,97.5 …

Time taken: 0.898 seconds, Fetched: 5 row(s)

hive>

Hadoop Written test FAQ by Sudheer

The below list contains commonly asked questions in written tests for Hadoop developer positions. These questions are from Merilytics, Hyderabad.

  • Responsibilities of the jobtracker?
  • Namenode, jobtracker and tasktracker ports?
  • Difference between NFS, namenode and journalnode?
  • Difference between Hive and HBase?
  • Write a custom partitioner class.
  • Write the differences between the sort and shuffle phases; explain with examples.
  • What is ZooKeeper?
  • If the output directory already exists, what will happen in HDFS?
  • If there are files of size 64 KB, 65 MB and 175 MB, how many splits will happen in HDFS?
  • Write a combiner for the average value of:

Mapper1 : (1950,4) (1950,7)

Mapper2: (1950 , 8)

  • What is the purpose of the tasktracker?
  • Incremental import in Sqoop?
  • Which command do we use to list all files with their blocks in HDFS?
  • Row deletion process in HBase?
  • Commonly used InputFormats?
  • How do you debug your code?
  • How do you add native libraries in YARN jobs?
  • What is serialization and deserialization? How do you achieve these in Hadoop?