Apache Hive is designed to give data engineers and data scientists a SQL like access to the big data available in the Hadoop cluster, so we can think of it as a normal RDBMS, in normal RDBMS we have a database, and tables, in Hive we have the same except in Hive we have two kinds of tables, and in this article, we will help you choose the best type for your use case.
There are two types of tables in Hive:
- Managed Table (Internal)
- External Table
Managed (Internal) Table
In the Managed table, Apache Hive is responsible for the table data and metadata, and any action on tables data will affect physical files of data.
External Table (Un-Managed Tables)
In the External table, Apache Hive is responsible ONLY for the table metadata, and any action on the table will affect only table metadata, for example, if you deleted a tables partition actually partition metadata will be deleted from Hive warehouse only, but actual data will still be available on its current location. You can write data to external tables, but actions such as drop, delete, and so on will affect only table metadata, not the actual data.
There are some differences between the two types:
- when you drop a table, if it is managed table hive deletes both data and metadata, if it is external table Hive only deletes metadata.
- Internal Table Supports TRUNCATE command but external not support TRUNCATE
- Internal table supports ACID Transactions but external is not
- In internal table query result cashing is work but the external table doesn’t cash the query results
Note: External table is a way to protect data against accidental drop commands.
The syntax for creating a Managed table is as highlighted:
CREATE TABLE <table name> (column1 <datatype>, column2 <datatype>, column(n) <datatype>,...) ROW FORMTAT <row format> FIELDS TERMINATED BY <delimiter such as ''> STORED as <file format>;
The syntax for creating an External table is as highlighted:
CREATE EXTERNAL TABLE <table name> (column1 <datatype> column2 <datatype>, column(n) <datatype>,...) ROW FORMTAT <row format> FIELDS TERMINATED BY <delimiter such as ''> STORED as <file format> LOCATION <data locations>;
You can create an external table using the same structure of Managed Table using the following syntax
CREATE EXTERNAL TABLE [IF NOT EXISTS] [database_name.table name]
[LOCATION <data path>];
Note: The location of data can be specified by the “LOCATION” keyword otherwise it will be considered in the default location (warehouse directory) which default value is set by configuration parameter hive.metastore.warehouse.dir, for example, “hive.metastore.warehouse.dir = /user/hive/warehouse”
Now let’s see real examples of both types:
We can create managed table using the “create table” statement as follows:
CREATE TABLE employee_managed (employeeId int, employeeName String, employeeSalary String) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE;
1. The field delimiter in this example is Tab or “/t “
2. The table stored as a text file in HDFS
3. We can use STORED AS for many types of files such as (Sequence files, ORC files, RC files, Parquet files, Avro files)
Also for External Table, we can create a table using the “create external table” statement as follows:
CREATE EXTERNAL TABLE employee_external (employeeId int, employeeName String, employeeSalary String) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE ;
We can show our stored tables using the “show tables” command then we can find our created tables
And you can find the table structure using the “describe” command