Handling Dates in Hadoop Pig

Importing dates into Hive from a CSV with Pig can be tricky. Depending on the format, Pig's default datetime parser likely won't handle them.

No worries though: just import the date field as a standard chararray, then parse the string with the ToDate function and an explicit format string.

For example, say we have a CSV in this format:

"Name","City","State","Close Date"
"My Cool Co","San Francisco","CA","31-Jan-10"
"My Fast Co","Palo Alto","CA","15-Feb-12"
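Note that the header line is just another row as far as a CSV parser is concerned. A quick sanity check with Python's csv module (a sketch for illustration, not part of the pipeline) shows why the Pig script further down has to filter it out:

```python
import csv
import io

# The sample rows from above, exactly as they'd sit in the file.
sample = '''"Name","City","State","Close Date"
"My Cool Co","San Francisco","CA","31-Jan-10"
"My Fast Co","Palo Alto","CA","15-Feb-12"
'''

rows = list(csv.reader(io.StringIO(sample)))
# The header arrives as an ordinary data row, which is why the Pig
# script filters on close_date != 'Close Date'.
print(rows[0])     # ['Name', 'City', 'State', 'Close Date']
print(rows[1][3])  # '31-Jan-10'
```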

Create the table in Hive:

CREATE TABLE financings (
  name STRING,
  city STRING,
  state STRING,
  close_date TIMESTAMP
);

Here's the Pig script; you'll want to run it with pig -useHCatalog:

-- load the piggybank library
REGISTER 'hdfs:///usr/lib/pig/piggybank.jar';
-- and, from the piggybank, the standard CSVLoader
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();

-- import the csvs; KEY is your IAM key, SECRET the associated secret, and bucket-name your bucket
-- we could import csvs from an hdfs path as well, depending on where your data is stored
-- this assumes my csvs are all in s3 in bucket-name/my/cool/path and are named in the format "ABunchOfCsvs-2015-01-30.csv"
raw_financings_csvs = LOAD 's3n://KEY:SECRET@bucket-name/my/cool/path/ABunchOfCsvs*.csv' USING CSVLoader() AS (name:chararray, city:chararray, state:chararray, close_date:chararray);

-- since the CSVLoader doesn't skip the csv headers, filter them out
filtered_raw_financings_csvs = FILTER raw_financings_csvs BY (close_date != 'Close Date');

-- now parse the date string, transforming it into a pig date type
-- we want to return null if close_date is empty, hence the ternary operation
financings_csvs = FOREACH filtered_raw_financings_csvs GENERATE name, city, state, (close_date == '' ? null : ToDate(close_date, 'dd-MMM-yy')) AS close_date;

-- write the results to Hive
STORE financings_csvs INTO 'financings' USING org.apache.hive.hcatalog.pig.HCatStorer();
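The 'dd-MMM-yy' pattern that ToDate takes is a Joda-Time/SimpleDateFormat-style format string. If you want to sanity-check the pattern against your data before running the job, the equivalent in Python's strptime terms is '%d-%b-%y', and the empty-string guard mirrors the ternary in the script above (a verification sketch only, not part of the pipeline):

```python
from datetime import datetime

def parse_close_date(s):
    # Mirror the Pig expression:
    #   (close_date == '' ? null : ToDate(close_date, 'dd-MMM-yy'))
    # '%d-%b-%y' is the strptime equivalent of Joda's 'dd-MMM-yy'.
    return None if s == '' else datetime.strptime(s, '%d-%b-%y')

print(parse_close_date('31-Jan-10'))  # 2010-01-31 00:00:00
print(parse_close_date(''))           # None
```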