Apache Hive is designed to give data engineers and data scientist a SQL like access to the big data available in Hadoop cluster, so we can think of it as a normal RDBMS, in normal RDBMS we have database, and tables, in Hive we have the same except in Hive we have two kinds of tables, and in this article we will help you choose the best type for your use case.
There are two types of tables in Hive:
- Managed Table (Internal)
- External Table
Managed (Internal) Table
In Managed table Apache Hive is responsible for the table data and metadata, and any action on tables data will affect physical files of data.
External Table (Un-Managed Tables)
In External table, Apache Hive is responsible ONLY for the table metadata, and any action on the table will affect only table metadata, for example, if you deleted a tables partition actually partition metadata will be deleted from Hive warehouse only, but actual data will still be available on its current location. You can write data to external tables, but actions such as drop, delete, and so on will affect only table metadata, not the actual data.
There are some differences between the two types:
- when you drop a table, if it is managed table hive deletes both data and metadata, if it is external table Hive only deletes metadata.
- Internal Table Supports TRUNCATE command but external not support TRUNCATE
- Internal table supports ACID Transactions but external is not
- In internal table query result cashing is work but the external table doesn’t cash the query results
Note: External table is a way to protect data against accidental drop commands.
The syntax for creating Managed table is as highlighted:
CREATE TABLE <table name> (column1 <datatype>, column2 <datatype>, column(n) <datatype>,...) ROW FORMTAT <row format> FIELDS TERMINATED BY <delimiter such as '\'> STORED as <file format>;
The syntax for creating an External table is as highlighted:
CREATE EXTERNAL TABLE <table name> (column1 <datatype> column2 <datatype>, column(n) <datatype>,...) ROW FORMTAT <row format> FIELDS TERMINATED BY <delimiter such as '\'> STORED as <file format> LOCATION <data locations>;
You can create external table using the same structure of Managed Table using the following syntax
CREATE EXTERNAL TABLE [IF NOT EXISTS] [database_name.table name]
[LOCATION <data path>];
Note: The location of data can be specified by the “LOCATION” keyword otherwise it will be considered in the default location (warehouse directory) which default value is set by configuration parameter hive.metastore.warehouse.dir, for example, “hive.metastore.warehouse.dir = /user/hive/warehouse”
Now lets see real example of both types:
We can create managed table using “create table” statement as following:
CREATE TABLE employee_managed (employeeId int, employeeName String, employeeSalary String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
1. The field delimiter in this example is Tab or “/t “
2. The table stored as a text file in HDFS
3. We can use STORED AS for many types of files such as (Sequence files, ORC files, RC files, Parquet files, Avro files)
Also for External Table we can create table using “create external table” statement as following:
CREATE EXTERNAL TABLE employee_external (employeeId int, employeeName String, employeeSalary String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE ;
We can show our stored tables using “show tables” command then we can find our created tables
And you can find the table structure using “describe” command