Performance test of Pig vs Hive with code examples

Performance testing high level Hadoop query languages with example scripts.

Analysis of NOAA weather data: Western-European weather stations from 1980 to 2014, daily dataset of temperature (tmin and tmax) and precipitation data (prcp).

Dataset is a structured table, non-existent measurement cells are filled with -9999.

Example:
STATION,STATION_NAME,DATE,PRCP,TMAX,TMIN
GHCND:NLE00109300,STAVENISSE NL,19800101,53,-9999,-9999
GHCND:NLE00109300,STAVENISSE NL,19800102,21,-9999,-9999
GHCND:NLE00109300,STAVENISSE NL,19800103,133,-9999,-9999

GHCND:NLE00109202,MARUM NL,20080602,0,-9999,-9999
GHCND:NLE00109202,MARUM NL,20080603,36,-9999,-9999
GHCND:NLE00109202,MARUM NL,20080604,4,-9999,-9999

Data size: 1 Gb / 4 Gb / 8 Gb (the same 333 Mb data file replicated 3 / 12 / 24 times)

HDFS block size: 128 Mb

Platform:

  • HDP 2.1.3
    • Hadoop 2.4.0
    • Pig 0.12.1
    • Hive 0.13.0
    • HBase 0.98.0

Cluster:

  • 4 Dell workstations of
    • Quad core Core I5-2400
    • 4-8 Gb RAM
    • 1 TB SATA HDDs
    • Hadoop
      • Master and Slaves not separated (next task: separate them and redo
        measurement)
      • Machine 1 (8 Gb RAM)
        • KDE, Nagios, Ganglia, NodeManager, Zookeeper, RegionServer, Hue, Datanode
      • Machine 2 (4 Gb RAM)
        • DataNode
      • Machine 3 (8 Gb RAM)
        • HiveServer, NameNode, Oozie, Zookeeper, Datanode
      • Machine 4 (8 Gb RAM)
        • SNameNode, ResourceManager, Datanode
  • Gigabit LAN network (Hadoop traffic only)

Measured running scripts with Bash Shell, time command (e.g. time pig 1.pig)

Weather analysis tasks with Pig, Hive and HBase

Task 1 – example Pig and Hive scripts.

Table ordered by max HighTemperature

Table:(Station, Date, MaxTemp)

Pig script

A = LOAD '/user/hdfs/Weather.csv' using PigStorage(',');
B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin;
C1 = FILTER B BY tmax > -9999;
D1 = ORDER C1 by tmax;
STORE D1 into '/user/hdfs/Weather_Pig1.csv' using PigStorage(',');


Hive script – A variant

CREATE EXTERNAL TABLE weather1 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather1;
CREATE TABLE weather_res1 (station STRING, datefield STRING, tmax INT);
INSERT OVERWRITE TABLE weather_res1
SELECT station, datefield, tmax FROM weather1 WHERE tmax <> -9999 ORDER BY tmax DESC;

 

Hive script – B variant

Data pre-loaded

INSERT OVERWRITE TABLE weather_res1
SELECT station, datefield, tmax FROM weather1 WHERE tmax <> -9999 ORDER BY tmax DESC;

Task 2 – example Pig and Hive scripts

Max temperatures per year per station

Table:(Distinct:Station, Year, MaxHighTemp)

Pig script

A = LOAD '/user/hdfs/Weather/Weather.csv' using PigStorage(',');
B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin;
C1 = FILTER B BY tmax > -9999;
D1 = GROUP C1 by (station, year);
E1 = FOREACH D1 GENERATE group.station as station, (int) group.year, MAX(C1.tmax);
STORE E1 into '/user/hdfs/Weather_Pig2.csv' using PigStorage(',');

 

Hive script – A variant

CREATE EXTERNAL TABLE weather2 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather2;
CREATE TABLE weather_temp2 (station STRING, datefield INT, tmax INT); 
INSERT OVERWRITE TABLE weather_temp2
SELECT station, cast(SUBSTR(datefield,1,4) as INT), tmax FROM weather2 WHERE tmax <> -9999;
CREATE TABLE weather_res2 (station STRING, datefield INT, tmax INT);
INSERT OVERWRITE TABLE weather_res2
SELECT station, datefield, MAX(tmax) FROM weather_temp2 GROUP BY station, datefield;


Hive script – B variant

Data pre-loaded, leaving out temp table creation, running type casting twice

INSERT OVERWRITE TABLE weather_res2
SELECT station, cast(SUBSTR(datefield,1,4) as INT), MAX(tmax) FROM weather2WHERE tmax <> -9999 GROUP BY station, cast(SUBSTR(datefield,1,4) as INT);

 

Task 3 – example Pig and Hive scripts

Average temperatures per year per station

Table:(Distinct:Station, Year, AvgHighTemp)

Pig script

A = LOAD '/user/hdfs/Weather/Weather.csv' using PigStorage(',');
B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin;
C1 = FILTER B BY tmax > -9999;
D1 = GROUP C1 by (station, year);
E1 = FOREACH D1 GENERATE group.station as station, (int) group.year, AVG(C1.tmax);
STORE E1 into '/user/hdfs/Weather_Pig3.csv' using PigStorage(',');

Hive script – A variant

CREATE EXTERNAL TABLE weather3 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather3;
CREATE TABLE weather_temp3 (station STRING, datefield INT, tavg INT);
INSERT OVERWRITE TABLE weather_temp3
SELECT station, cast(SUBSTR(datefield,1,4) as INT), tmax FROM weather3 WHERE tmax <> -9999;
CREATE TABLE weather_res3 (station STRING, datefield INT, tavg INT);
INSERT OVERWRITE TABLE weather_res3
SELECT station, datefield, AVG(tavg) FROM weather_temp3 GROUP BY station, datefield;

 

Hive script – B variant

Data pre-loaded, leaving out temp table creation, running type casting twice

CREATE TABLE weather_res3 (station STRING, datefield INT, tavg INT);
INSERT OVERWRITE TABLE weather_res3
SELECT station, cast(SUBSTR(datefield,1,4) as INT), AVG(tmax) FROM weather3 WHERE tmax <> -9999 GROUP BY station, cast(SUBSTR(datefield,1,4) as INT);

 

Task 4 – example Pig and Hive scripts

Max mean yearly temperature

Table:

(Distinct:Station, MaxYear, MaxMeanTemp)
MeanTemp: AVG((MaxTemp-MinTemp)/2)

Pig script

A = LOAD '/user/hdfs/Weather/Weather.csv' using PigStorage(',');
B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin;
C1 = FILTER B BY tmax > -9999 AND tmin > -9999;
D1 = FOREACH C1 GENERATE station as station, year as year, ((tmax-tmin)/2) as avgt;
DESCRIBE D1;
E1 = GROUP D1 by (station, year);
F1 = FOREACH E1 GENERATE group.station as station, (int) group.year, MAX(D1.avgt) as avgt;
G1 = GROUP F1 by station;
H1 = FOREACH G1 {
	I1 = ORDER F1 BY avgt DESC;
	J1 = LIMIT I1 1;
	GENERATE group, J1.(year, avgt);
};
STORE H1 into '/user/hdfs/Weather_Pig4.csv' using PigStorage(',');

 

Hive script – A variant

CREATE EXTERNAL TABLE weather4 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather4;
CREATE EXTERNAL TABLE weather4_temp (station STRING, year INT, tmax DOUBLE, tmin DOUBLE);
INSERT OVERWRITE TABLE weather4_temp SELECT station, cast(SUBSTR(datefield,1,4) as INT), tmax, tmin FROM weather4 WHERE tmax > -9999 and tmin > -9999; 
CREATE EXTERNAL TABLE weather4_temp2 (station STRING, year INT, meanTemp DOUBLE);
INSERT OVERWRITE TABLE weather4_temp2 SELECT station, year, AVG((tmax-tmin)/2) FROM weather4_temp GROUP BY station, year; 
CREATE EXTERNAL TABLE weather4_temp3 (station STRING, maxMeanTemp DOUBLE);
INSERT OVERWRITE TABLE weather4_temp3 SELECT station, MAX(meanTemp) FROM weather4_temp2 GROUP BY station; 
CREATE TABLE weather_res4 (station STRING, year INT, tavg INT);
INSERT OVERWRITE TABLE weather_res4
SELECT w.station, w.year, ww.maxMeanTemp FROM weather4_temp2 w JOIN weather4_temp3 ww ON w.meanTemp = ww.maxMeanTemp;

Hive script – B variant

Data pre-loaded, leaving out temp table creation, running type casting twice.

CREATE EXTERNAL TABLE weather4_temp (station STRING, year INT, maxMeanTemp DOUBLE);
INSERT OVERWRITE TABLE weather4_temp1 SELECT station, cast(SUBSTR(datefield,1,4) as INT), AVG((tmax-tmin)/2) FROM weather4 WHERE tmax > -9999 and tmin > -9999 GROUP BY station, cast(SUBSTR(datefield,1,4) as INT); 
CREATE EXTERNAL TABLE weather4_temp2 (station STRING, maxMeanTemp DOUBLE);
INSERT OVERWRITE TABLE weather4_temp2 SELECT station, MAX(meanTemp) FROM weather4_temp2 GROUP BY station; 
CREATE TABLE weather_res4 (station STRING, year INT, tavg INT);
INSERT OVERWRITE TABLE weather_res4
SELECT w.station, w.year, ww.maxMeanTemp FROM weather4_temp1 w JOIN weather4_temp2 ww ON w.meanTemp = ww.maxMeanTemp;

 

Task 5 – example Pig and Hive scripts

Average temperatures per year of station STAVENISSE

Select one only weather station and average all time precipitation data.

Table:(STAVENISSE, Year, AvgHighTemp)

 

Pig script

A = LOAD '/user/hdfs/weather/weather.csv' using PigStorage(',');
B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin;
C1 = FILTER B BY tmax > -9999 AND station MATCHES '.*STAVENISSE.*';
D1 = GROUP C1 by station;
E1 = FOREACH D1 GENERATE group as station, AVG(C1.tmax);
STORE E1 into '/user/hdfs/Weather_Pig5.csv' using PigStorage(',');

 

Hive script – A variant

CREATE EXTERNAL TABLE weather5 (stationcode STRING, station STRING, dateield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather5;
CREATE TABLE weather_temp5 (station STRING, datefield INT, tmax INT); 
INSERT OVERWRITE TABLE weather_temp5
SELECT station, cast(SUBSTR(datefield,1,4) as INT), tmax FROM weather2 WHERE tmax <> -9999 AND station LIKE '%STAVENISSE%';
CREATE EXTERNAL TABLE weather_res5 (station STRING, datefield INT, tmax INT);
INSERT OVERWRITE TABLE weather_res5
SELECT station, datefield, AVG(tmax) FROM weather_temp5 GROUP BY station, datefield;


Hive script – B variant

Data pre-loaded, leaving out temp table creation, running type casting twice

INSERT OVERWRITE TABLE weather_res5
SELECT station, cast(SUBSTR(datefield,1,4) as INT), AVG(tmax) FROM weather5 WHERE tmax <> -9999 AND station LIKE '%STAVENISSE%' GROUP BY station, cast(SUBSTR(datefield,1,4) as INT);

 

Performance assessment – results

 

Performance Comparison

Performance Comparison

 

Performance results? –
0.3 Gb file (runtime of script, sec)
Pig 0.12.1 Hive 0.13.0 – A Hive 0.13.0 – B
Task 1 97.003 63.732 59.258
Task 2 33.047 65.317 49.654
Task 3 33.132 64.401 50.756
Task 4 56.667 105.685 42.856
Task 5 27.961 61.276 39.968

 

Performance results? –
1 Gb file (runtime of script, sec)
Pig 0.12.1 Hive 0.13.0 – A Hive 0.13.0 – B
Task 1 117.98 85.177 88.75
Task 2 103.498 108.132 80.482
Task 3 98.341 108.109 84.423
Task 4 121.446 131.528 97.637
Task 5 103.435 80.039 77.061

 

Performance results? –
4 Gb file (runtime of script, sec)
Pig 0.12.1 Hive 0.13.0 – A Hive 0.13.0 – B
Task 1 332.186 317.524 274.322
Task 2 231.333 222.432 200.421
Task 3 235.11 221.321 199.657
Task 4 278.645 248.512 211.236
Task 5 189.288 187.425 184.549

 

Performance results? –
8 Gb file?(runtime of script, sec)
Pig 0.12.1 Hive 0.13.0 – A Hive 0.13.0 – B
Task 1 650.598 621.599 492.499
Task 2 448.898 355.509 321.543
Task 3 444.240 352.421 326.700
Task 4 489.455 432.078 334.770
Task 5 358.91 269.815 287.089