Extra Cookies

Yet Another Programmer's Blog

Hive - External Table With Partitions

A normal Hive table

A normal hive table can be created by executing this script,

1
2
3
4
5
6
7
CREATE TABLE user (
  userId BIGINT,
  type INT,
  level TINYINT,
  date String
)
COMMENT 'User Infomation'

It’s empty, util we load some data into it, it’s useless.

1
LOAD INPATH '/user/chris/data/testdata' OVERWRITE INTO TABLE user

By default, when data file is loaded, /user/${USER}/warehouse/user is created automatically.

For me, it’s /user/chris/warehouse/user, user is the table name, data files of user table are all located in this folder.

Now, we can freely to use SQLs to analyze the data.

What if

What if we want to process the data by some ETL programs, and load the result data to hive, but we don’t want to load them manually?

What if the data is not only used by hive, but also some other applications, maybe it still need to be MapReduced?

External table comes to save us by creating table using following syntax,

1
2
3
4
5
6
7
8
CREATE EXTERNAL TABLE user (
  userId BIGINT,
  type INT,
  level TINYINT,
  date String
)
COMMENT 'User Infomation'
LOCATION '/user/chris/datastore/user/';

The location is where we put the data files in, name of the folder must be the same as the table name (like normal table). In this example, table name is user.

Then, we can put any file which satisfy the pattern declared by user table inside user folder.

All data can be accessed by hive SQLs right away.

Something not cool

When the data files become huge (number and size), we might need to use Partition to improved the efficiency of data processing.

1
2
3
4
5
6
7
CREATE TABLE user (
  userId BIGINT,
  type INT,
  level TINYINT,
)
COMMENT 'User Infomation'
PARTITIONED BY (date String)

data String is moved to PARTITIONED BY, when we need to load data into hive, partition must be assigned.

1
LOAD INPATH '/user/chris/data/testdata' OVERWRITE INTO TABLE user PARTITION (date='2012-02-22')

After data is loaded, we can see a new folder named date=2010-02-22 is create inside /user/chris/warehouse/user/

So, how can we do it using external table?

Same as before, first declare the external table user, and assign the location.

1
2
3
4
5
6
7
8
9
CREATE EXTERNAL TABLE user (
  userId BIGINT,
  type INT,
  level TINYINT,
  date String
)
COMMENT 'User Infomation'
PARTITIONED BY (date String)
LOCATION '/user/chris/datastore/user/';

Then, create the folder date=2010-02-22 inside /user/chris/datastore/user/

At last, put the data files of date 2010-02-22 into the folder, done.

But,

When we executes select * from user; nothing appears.

Why?

I spent a long time searching for an answer.

Finally, solved.

Because when external table is declared, default table path is changed to specified location in hive metadata which contains in metastore, but about partition, nothing is changed, so, we must manually add those metadata.

1
ALTER TABLE user ADD PARTITION(date='2010-02-22');

Every time a new data=... folder (partition) is created, we must manually alter the table to add partition information.

That is really not cool!

But luckily, we have Hive JDBC/Thrift, let script do that.

Comments