Hadoop Questions & Answers

Hive Questions & Answers

1. What is Hive?
2. What is the Hive metastore?
3. What kind of data warehouse application is suitable for Hive?
4. How can the columns of a table in Hive be written to a file?
5. CONCAT function in Hive with example?
6. REPEAT function in Hive with example?
7. TRIM function in Hive with example?
8. REVERSE function in Hive with example?
9. LOWER or LCASE function in Hive with example?
10. UPPER or UCASE function in Hive with example?
11. DOUBLE type in Hive – important points?
12. Rename a table in Hive – how to do it?
13. How to change a column data type in Hive?
14. Difference between ORDER BY and SORT BY in Hive?
15. RLIKE in Hive?
16. Difference between external table and internal table in Hive?

Answers –
1. Hive is a data warehousing package built on top of Hadoop, used for data analysis and targeted at users comfortable with SQL. Hive was developed at Facebook and is used to process structured data.

2. The metastore is a service and a database that can be configured in different ways. In the default Hive configuration, the Hive driver, the metastore interface, and the database (Derby) all run in the same JVM. The metastore holds all table schemas and partition details.
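As a sketch of one non-default setup (assuming a MySQL-backed metastore; the host, database name, and credentials below are placeholders), the metastore connection is configured in hive-site.xml:

```xml
<!-- hive-site.xml: point the metastore at an external MySQL database
     (hypothetical host/credentials; embedded Derby is the default if unset) -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```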

3. A data warehouse application that supports web and JDBC clients, and that uses a query language similar to SQL, is suitable for Hive.
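Question 4 above is left unanswered; one common approach (a sketch, assuming the Hive CLI is available, with a hypothetical table name and output path) is to run DESCRIBE in silent mode and redirect the first column of its output to a file:

```shell
# Write the column names of a Hive table to a local file
# ("employee" and the output path are placeholders)
hive -S -e "DESCRIBE employee;" | awk '{print $1}' > /tmp/employee_columns.txt
```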

5. CONCAT( string str1, string str2… )

The CONCAT function concatenates all of its string arguments.
Example: CONCAT('hadoop','-','hive') returns 'hadoop-hive'

6. REPEAT( string str, int n )

The REPEAT function repeats the specified string n times.
Example: REPEAT(‘hive’,2) returns ‘hivehive’

7. TRIM( string str )

The TRIM function removes both leading and trailing spaces from the string.
Example: TRIM(' hive ') returns 'hive'

8. REVERSE( string str )

The REVERSE function returns the reversed string.
Example: REVERSE(‘hive’) returns ‘evih’

9. LOWER( string str ), LCASE( string str )

The LOWER or LCASE function converts the string into lower case letters.
Example: LOWER(‘HiVe’) returns ‘hive’

10. UPPER( string str ), UCASE( string str )

The UPPER or UCASE function converts the string into upper case letters.
Example: UPPER(‘HiVe’) returns ‘HIVE’
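The string functions from questions 5–10 can be tried together in one query; a quick way to do so without any table is a constant SELECT, sketched here in HiveQL:

```sql
-- Exercise the string functions in a single statement (no table needed)
SELECT CONCAT('hadoop', '-', 'hive'),  -- 'hadoop-hive'
       REPEAT('hive', 2),              -- 'hivehive'
       TRIM('  hive  '),               -- 'hive'
       REVERSE('hive'),                -- 'evih'
       LOWER('HiVe'),                  -- 'hive'
       UPPER('HiVe');                  -- 'HIVE'
```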

11. DOUBLE data type in Hive – an 8-byte (double-precision) floating-point data type used in CREATE TABLE and ALTER TABLE statements.
Range: 4.94065645841246544e-324d .. 1.79769313486231570e+308d, positive or negative. The data type REAL is an alias for DOUBLE.
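As a sketch (the table and column names below are made up), DOUBLE is declared like any other column type:

```sql
-- DOUBLE column in CREATE TABLE (hypothetical table)
CREATE TABLE sensor_readings (
  sensor_id INT,
  reading   DOUBLE
);

-- and in ALTER TABLE
ALTER TABLE sensor_readings ADD COLUMNS (calibration DOUBLE);
```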

12. Renaming a table in Hive –
Syntax: ALTER TABLE old_name RENAME TO new_name;
For example, to rename a table from employee to emp:
hive> ALTER TABLE employee RENAME TO emp;

13. To change a column data type in Hive –
Syntax:
ALTER TABLE name CHANGE column_name new_name new_type;
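For instance (hypothetical table and column), to rename a salary column and widen it from INT to DOUBLE in one statement:

```sql
-- Rename the column and change its type at the same time
ALTER TABLE emp CHANGE salary monthly_salary DOUBLE;
```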

14. Order by

The ORDER BY clause sorts the complete result set by one or more columns, in ascending or descending order.
SELECT [ALL | DISTINCT] select_expr, select_expr, …
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number]

Sort by

Hive uses the columns in SORT BY to sort rows before feeding them to a reducer, so ordering is only guaranteed within each reducer, not globally. The sort order depends on the column type: if the column is numeric, the order is numeric; if the column is a string, the order is lexicographic.

SELECT key, value FROM src SORT BY key ASC, value DESC
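A sketch of the contrast, assuming the same src table and more than one reducer:

```sql
-- ORDER BY: a single total order, enforced by one final reducer
SELECT key, value FROM src ORDER BY key ASC;

-- SORT BY: each reducer's output is sorted, but the overall
-- result is not globally ordered
SELECT key, value FROM src SORT BY key ASC;
```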

15. RLIKE is an operator in Hive that evaluates to true if any substring of string A matches string B, where B is interpreted as a Java regular expression.
Unlike LIKE, users don't need the % wildcard for a simple substring match with RLIKE.
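For example (hypothetical table and column), matching every name that contains 'hive' anywhere:

```sql
-- RLIKE does a Java-regex match; no % wildcards needed
SELECT name FROM projects WHERE name RLIKE 'hive';
-- the equivalent LIKE form would be: name LIKE '%hive%'
```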


16. The main difference is that when you drop an external table, the underlying data files stay intact. This is because
the user is expected to manage the data files and directories. With a managed table, the underlying directories and
data get wiped out when the table is dropped.
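A minimal sketch of the difference (table names and the LOCATION path are placeholders):

```sql
-- External table: Hive tracks metadata only; the files survive DROP
CREATE EXTERNAL TABLE logs (line STRING)
LOCATION '/data/raw/logs';

DROP TABLE logs;          -- files under /data/raw/logs are left intact

-- Managed (internal) table: DROP also deletes the data
CREATE TABLE logs_managed (line STRING);
DROP TABLE logs_managed;  -- the table's warehouse directory is removed
```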



Pig Questions & Answers

1. Why Pig?

a) Ease of programming
b) Optimization opportunities

2. Advantages of using Pig?

i) Pig can be treated as a higher-level language
a) Increases programming productivity
b) Decreases duplication of effort
c) Opens the M/R programming system to more users

ii) Pig insulates against Hadoop complexity
a) Hadoop version upgrades
b) Job configuration tuning
3. Pig features?

a) Data-flow language: the user specifies a sequence of steps, where each step performs a single high-level data transformation
b) User-defined functions
c) Debugging environment
d) Nested data model
4. Difference between Pig and SQL?

Pig                                        SQL
1. Data-flow language                      1. Structured query language, used for OLTP
2. Can handle petabytes/terabytes of data  2. Can't handle petabyte/terabyte scale
3. A scripting language                    3. Uses procedures, triggers, functions
5. What are the scalar data types in Pig?

Pig's scalar types are int, long, float, double, chararray, and bytearray.

6. What are the complex data types in Pig?

Pig has three complex types: maps, tuples and bags.
Map: a mapping from chararray keys to data-element values, expressed as key-value pairs.
Tuple: a fixed-length, ordered collection of Pig data elements.
Bag: an unordered collection of tuples.
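A sketch of all three types in a LOAD schema (the file path, relation, and field names are made up):

```pig
-- Load records whose schema uses a tuple, a bag, and a map
players = LOAD 'players.txt' AS (
    name:chararray,
    position:tuple(x:int, y:int),        -- tuple: fixed-length, ordered
    teams:bag{t:tuple(team:chararray)},  -- bag: unordered collection of tuples
    stats:map[int]                       -- map: chararray keys -> int values
);
```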

8. What is the purpose of the 'dump' keyword in Pig?

DUMP displays the output of a relation on the screen:
dump processed;
9. What are relational operations in Pig Latin?

Relational operators are the main tools Pig Latin provides to operate on your data. They allow you to transform it by sorting, grouping, joining, projecting, and filtering.
10. How to use the 'foreach' operation in Pig scripts?

FOREACH takes a set of expressions and applies them to every record in the data pipeline.
11. How to write a 'foreach' statement for the map data type in Pig scripts?

For a map, project a value by its key with the hash operator ('#').

12. How to write a 'foreach' statement for the tuple data type in Pig scripts?

For a tuple, project a field with the dot operator ('.').

13. How to write a 'foreach' statement for the bag data type in Pig scripts?

When you project fields in a bag, you create a new bag containing only those fields.
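The three projections above can be sketched together (the relation, its fields, and their schema are hypothetical):

```pig
-- Assume users has schema:
--   info:map[], loc:tuple(x:int, y:int), hits:bag{t:tuple(url:chararray, ts:long)}
A = FOREACH users GENERATE info#'name';   -- map: project by key with '#'
B = FOREACH users GENERATE loc.x, loc.y;  -- tuple: project fields with '.'
C = FOREACH users GENERATE hits.url;      -- bag: new bag with only 'url'
```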
14. Why should we use 'filters' in Pig scripts?

Filters are similar to the WHERE clause in SQL. A filter contains a predicate; if the predicate evaluates to true for a given record, that record is passed down the pipeline, otherwise it is not. Predicates can use operators such as ==, !=, >, >=, <, and <=; == and != can also be applied to maps and tuples.
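For example (hypothetical relation and field):

```pig
-- Keep only records whose score meets the threshold
high = FILTER scores BY score >= 90;
DUMP high;
```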