Manish Barnwal

...just another human

Understanding HIVE for data science people

I have been working as a statistical analyst for the last 1.5 years, and fortunately I got to work on Hadoop in one of my initial projects. Hadoop sounds scary to a lot of people, and I was no exception. In this post, I will attempt to explain HIVE, the data warehouse for the Hadoop ecosystem.

What is HIVE? And why should I, as an analyst, care about it?

HIVE is the part of the Hadoop infrastructure where data gets stored, and you can write queries to fetch data from HIVE tables. When I started working on Hadoop, I knew a little SQL and a tad of data science. I had taken a few courses on statistics during college, but then who really studies in college! Having been allocated to a project that required an understanding of Hadoop, I planned how to understand this system so that I could contribute my best to the project.

The first thing you need to build any model is data, and that is where HIVE comes into the picture. As I said, I knew SQL, and as I started working on HIVE I realized that if one knows SQL then one knows HIVE; one is just not aware of it yet.

Why do I say so? Because HIVE was built with the sole purpose that database people who are comfortable working with SQL should not have to learn a completely new language to fetch and interact with data in the Hadoop ecosystem. And I believe the creators of HIVE have achieved this successfully.

HIVE hides the map-reduce processing from the user, and all you need to worry about is writing your SQL query. You don't need to worry at all about mappers and reducers. All of this is hidden from the user, and if you want to know more about it, there are ways to do that as well (for advanced users of HIVE).
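To give you a sense of how familiar this feels, here is a minimal sketch of a HIVE query. The table and column names (survey_responses, WorkStatus, Gender, Score) are made up for illustration; the point is that any SQL user will recognize the syntax, while HIVE quietly turns it into map-reduce jobs behind the scenes.

-- average score by gender from a hypothetical survey table
select Gender, avg(Score) as avg_score
from survey_responses
where WorkStatus = 'Employed'
group by Gender;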

I will cover the most common tasks we deal with when working as analysts on Hadoop: creating a table (external/internal), getting information about a table, common HIVE settings, and a few others.

Creating a table

A table in HIVE can be of two types: internal or external. Stay with me for the next few lines and I will explain the difference. By default, the table we create in most cases is an internal table. External tables, however, come into the picture when you want to build a table over some file.

The difference between an external and an internal table lies in what happens when we drop the table.

External table:
- If you drop the table, the table and its metadata are dropped, but not the data
- The data is located in hdfs (and not in the local file system), and since this data may be accessed by many tables, we don't want it to get dropped

Just add the keyword external to specify that we want to build an external table:

CREATE EXTERNAL TABLE external_table_name

Internal table:
- If you drop the table, the table, the metadata, and even the data are dropped
- If the data for the table resides in the local file system, you should go for an internal table

There is no separate keyword for an internal table; it is the default, so just CREATE TABLE internal_table does the job.

Once the table gets created, we want to get the data from it for our analysis. Before we dive into getting the data from the table, it is a good idea to get a sense of the overall structure of the table: column names, column types, whether it is external or internal, and the owner of the table.

All of this information can be retrieved using:

describe formatted table_name;

This gives you a lot of information in a formatted manner. If you are in a hurry and just want to see the column names and no other details, use:

desc table_name;
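To tie these ideas together, here is a minimal sketch. The table name happiness_scores and its columns are placeholders for illustration; the point is that a plain CREATE TABLE gives you an internal (managed) table, and describe formatted then reports the kind of table you ended up with under Table Type (MANAGED_TABLE vs EXTERNAL_TABLE).

-- a plain CREATE TABLE gives an internal (managed) table by default
create table if not exists happiness_scores (
    ID string,
    Score int
)
row format delimited fields terminated by '\t'
stored as textfile;

describe formatted happiness_scores;   -- Table Type shows MANAGED_TABLE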

External tables are created over some file. This file should be located in hdfs, the file system for Hadoop. A few command lines will easily move the file to hdfs.

How to move a file to hdfs?

Below are the steps:
- Create a directory where you want to move the file: hadoop fs -mkdir directory_name
- Check that the directory got created: hadoop fs -ls ; you should see your directory name here. The directory is empty right now, which you can verify with hadoop fs -ls directory_name
- Move the file from local to hdfs: hadoop fs -copyFromLocal 'path where file is stored' directory_name
- Check that the file has been copied to the directory: hadoop fs -ls directory_name ; you should see the file now.
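Putting the steps together with sample values (the directory name data1 and the local path /home/mbarnwa/survey.tsv are placeholders for illustration; a relative directory name like data1 typically resolves under your hdfs home directory, e.g. /user/mbarnwa/data1, which is the location used in the create-table script further below):

hadoop fs -mkdir data1
hadoop fs -ls
hadoop fs -copyFromLocal /home/mbarnwa/survey.tsv data1
hadoop fs -ls data1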

Open the vim editor and write the code below in a file. I will name my file create_table.hql:

vim create_table.hql

The code for creating the table is below.

drop table if exists db_name.table_name;

create external table if not exists db_name.table_name (
    ID string,
    WorkStatus string,
    Score int,
    Residence_Region string,
    income string,
    Gender smallint,
    Alcohol_Consumption string,
    Happy string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/mbarnwa/data1';
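To actually run this script, you can point the Hive command line at the file (assuming the hive command is available on the node you are working from):

hive -f create_table.hql

The -f option runs the statements in the file and exits; for quick interactive checks, you can instead open the hive shell and paste the statements directly.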

Dropping a database

Many a time, you may want to drop a database that you don't need anymore. Say the database to be dropped is userdb; then you could do this:

drop database userdb;

The above command will work only if your database is empty, i.e. userdb has no tables in it. But that is rarely the case, so you can either go ahead and delete each table in the database with drop table table_name; or the addition of a simple keyword, CASCADE, will solve your purpose. So the final command to drop a database (even if it has tables) is:

drop database userdb cascade;
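If you want to see what you are about to delete before running the cascade, a quick check looks like this (userdb is the same placeholder name as above):

show tables in userdb;
drop database userdb cascade;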

