Hive - External Table With Partitions
A normal Hive table
A normal hive table can be created by executing this script,
1 | CREATE TABLE user ( |
It's empty, util we load some data into it, it's useless.
1 | LOAD INPATH '/user/chris/data/testdata' OVERWRITE INTO TABLE user |
By default, when data file is loaded,
/user/${USER}/warehouse/user
is created automatically.
For me, it's /user/chris/warehouse/user
,
user
is the table name, data files of user
table are all located in this folder.
Now, we can freely to use SQLs to analyze the data.
What if
What if we want to process the data by some ETL programs, and load the result data to hive, but we don't want to load them manually?
What if the data is not only used by hive, but also some other applications, maybe it still need to be MapReduced?
External table comes to save us by creating table using following syntax,
1 | CREATE EXTERNAL TABLE user ( |
The location is where we put the data files in, name of the folder
must be the same as the table name (like normal table). In this example,
table name is user
.
Then, we can put any file which satisfy the pattern declared by
user
table inside user
folder.
All data can be accessed by hive SQLs right away.
Something not cool
When the data files become huge (number and size), we might need to
use Partition
to improved the efficiency of data
processing.
1 | CREATE TABLE user ( |
data String
is moved to PARTITIONED BY
,
when we need to load data into hive, partition must be assigned.
1 | LOAD INPATH '/user/chris/data/testdata' OVERWRITE INTO TABLE user PARTITION (date='2012-02-22') |
After data is loaded, we can see a new folder named
date=2010-02-22
is create inside
/user/chris/warehouse/user/
So, how can we do it using external table?
Same as before, first declare the external table user
,
and assign the location.
1 | CREATE EXTERNAL TABLE user ( |
Then, create the folder date=2010-02-22
inside
/user/chris/datastore/user/
At last, put the data files of date 2010-02-22 into the folder, done.
But,
When we executes select * from user;
nothing
appears.
Why?
I spent a long time searching for an answer.
Finally, solved.
Because when external table is declared, default table path is changed to specified location in hive metadata which contains in metastore, but about partition, nothing is changed, so, we must manually add those metadata.
1 | ALTER TABLE user ADD PARTITION(date='2010-02-22'); |
Every time a new data=...
folder (partition) is created,
we must manually alter the table to add partition information.
That is really not cool!
But luckily, we have Hive JDBC/Thrift, let script do that.