Performance testing high level Hadoop query languages with example scripts.
Analysis of NOAA weather data: Western-European weather stations from 1980 to 2014, daily dataset of temperature (tmin and tmax) and precipitation data (prcp).
Dataset is a structured table, non-existent measurement cells are filled with -9999.
Example:
STATION,STATION_NAME,DATE,PRCP,TMAX,TMIN
GHCND:NLE00109300,STAVENISSE NL,19800101,53,-9999,-9999
GHCND:NLE00109300,STAVENISSE NL,19800102,21,-9999,-9999
GHCND:NLE00109300,STAVENISSE NL,19800103,133,-9999,-9999
…
GHCND:NLE00109202,MARUM NL,20080602,0,-9999,-9999
GHCND:NLE00109202,MARUM NL,20080603,36,-9999,-9999
GHCND:NLE00109202,MARUM NL,20080604,4,-9999,-9999
…
Data size: 1 Gb / 4 Gb / 8 Gb (the same 333 Mb data file replicated 3 / 12 / 24 times)
HDFS block size: 128 Mb
Platform:
- HDP 2.1.3
- Hadoop 2.4.0
- Pig 0.12.1
- Hive 0.13.0
- HBase 0.98.0
Cluster:
- 4 Dell workstations of
- Quad core Core I5-2400
- 4-8 Gb RAM
- 1 TB SATA HDDs
- Hadoop
- Master and Slaves not separated (next task: separate them and redo
measurement) - Machine 1 (8 Gb RAM)
- KDE, Nagios, Ganglia, NodeManager, Zookeeper, RegionServer, Hue, Datanode
- Machine 2 (4 Gb RAM)
- DataNode
- Machine 3 (8 Gb RAM)
- HiveServer, NameNode, Oozie, Zookeeper, Datanode
- Machine 4 (8 Gb RAM)
- SNameNode, ResourceManager, Datanode
- Master and Slaves not separated (next task: separate them and redo
- Gigabit LAN network (Hadoop traffic only)
Measured running scripts with Bash Shell, time command (e.g. time pig 1.pig)
Weather analysis tasks with Pig, Hive and HBase
Task 1 – example Pig and Hive scripts.
Table ordered by max HighTemperature
Table:(Station, Date, MaxTemp)
Pig script
A = LOAD '/user/hdfs/Weather.csv' using PigStorage(','); B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin; C1 = FILTER B BY tmax > -9999; D1 = ORDER C1 by tmax; STORE D1 into '/user/hdfs/Weather_Pig1.csv' using PigStorage(',');
Hive script – A variant
CREATE EXTERNAL TABLE weather1 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather1; CREATE TABLE weather_res1 (station STRING, datefield STRING, tmax INT); INSERT OVERWRITE TABLE weather_res1 SELECT station, datefield, tmax FROM weather1 WHERE tmax <> -9999 ORDER BY tmax DESC;
Hive script – B variant
Data pre-loaded
INSERT OVERWRITE TABLE weather_res1 SELECT station, datefield, tmax FROM weather1 WHERE tmax <> -9999 ORDER BY tmax DESC;
Task 2 – example Pig and Hive scripts
Max temperatures per year per station
Table:(Distinct:Station, Year, MaxHighTemp)
Pig script
A = LOAD '/user/hdfs/Weather/Weather.csv' using PigStorage(','); B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin; C1 = FILTER B BY tmax > -9999; D1 = GROUP C1 by (station, year); E1 = FOREACH D1 GENERATE group.station as station, (int) group.year, MAX(C1.tmax); STORE E1 into '/user/hdfs/Weather_Pig2.csv' using PigStorage(',');
Hive script – A variant
CREATE EXTERNAL TABLE weather2 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather2; CREATE TABLE weather_temp2 (station STRING, datefield INT, tmax INT); INSERT OVERWRITE TABLE weather_temp2 SELECT station, cast(SUBSTR(datefield,1,4) as INT), tmax FROM weather2 WHERE tmax <> -9999; CREATE TABLE weather_res2 (station STRING, datefield INT, tmax INT); INSERT OVERWRITE TABLE weather_res2 SELECT station, datefield, MAX(tmax) FROM weather_temp2 GROUP BY station, datefield;
Hive script – B variant
Data pre-loaded, leaving out temp table creation, running type casting twice
INSERT OVERWRITE TABLE weather_res2 SELECT station, cast(SUBSTR(datefield,1,4) as INT), MAX(tmax) FROM weather2WHERE tmax <> -9999 GROUP BY station, cast(SUBSTR(datefield,1,4) as INT);
Task 3 – example Pig and Hive scripts
Average temperatures per year per station
Table:(Distinct:Station, Year, AvgHighTemp)
Pig script
A = LOAD '/user/hdfs/Weather/Weather.csv' using PigStorage(','); B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin; C1 = FILTER B BY tmax > -9999; D1 = GROUP C1 by (station, year); E1 = FOREACH D1 GENERATE group.station as station, (int) group.year, AVG(C1.tmax); STORE E1 into '/user/hdfs/Weather_Pig3.csv' using PigStorage(',');
Hive script – A variant
CREATE EXTERNAL TABLE weather3 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather3; CREATE TABLE weather_temp3 (station STRING, datefield INT, tavg INT); INSERT OVERWRITE TABLE weather_temp3 SELECT station, cast(SUBSTR(datefield,1,4) as INT), tmax FROM weather3 WHERE tmax <> -9999; CREATE TABLE weather_res3 (station STRING, datefield INT, tavg INT); INSERT OVERWRITE TABLE weather_res3 SELECT station, datefield, AVG(tavg) FROM weather_temp3 GROUP BY station, datefield;
Hive script – B variant
Data pre-loaded, leaving out temp table creation, running type casting twice
CREATE TABLE weather_res3 (station STRING, datefield INT, tavg INT); INSERT OVERWRITE TABLE weather_res3 SELECT station, cast(SUBSTR(datefield,1,4) as INT), AVG(tmax) FROM weather3 WHERE tmax <> -9999 GROUP BY station, cast(SUBSTR(datefield,1,4) as INT);
Task 4 – example Pig and Hive scripts
Max mean yearly temperature
Table:
(Distinct:Station, MaxYear, MaxMeanTemp)
MeanTemp: AVG((MaxTemp-MinTemp)/2)
Pig script
A = LOAD '/user/hdfs/Weather/Weather.csv' using PigStorage(','); B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin; C1 = FILTER B BY tmax > -9999 AND tmin > -9999; D1 = FOREACH C1 GENERATE station as station, year as year, ((tmax-tmin)/2) as avgt; DESCRIBE D1; E1 = GROUP D1 by (station, year); F1 = FOREACH E1 GENERATE group.station as station, (int) group.year, MAX(D1.avgt) as avgt; G1 = GROUP F1 by station; H1 = FOREACH G1 { I1 = ORDER F1 BY avgt DESC; J1 = LIMIT I1 1; GENERATE group, J1.(year, avgt); }; STORE H1 into '/user/hdfs/Weather_Pig4.csv' using PigStorage(',');
Hive script – A variant
CREATE EXTERNAL TABLE weather4 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather4; CREATE EXTERNAL TABLE weather4_temp (station STRING, year INT, tmax DOUBLE, tmin DOUBLE); INSERT OVERWRITE TABLE weather4_temp SELECT station, cast(SUBSTR(datefield,1,4) as INT), tmax, tmin FROM weather4 WHERE tmax > -9999 and tmin > -9999; CREATE EXTERNAL TABLE weather4_temp2 (station STRING, year INT, meanTemp DOUBLE); INSERT OVERWRITE TABLE weather4_temp2 SELECT station, year, AVG((tmax-tmin)/2) FROM weather4_temp GROUP BY station, year; CREATE EXTERNAL TABLE weather4_temp3 (station STRING, maxMeanTemp DOUBLE); INSERT OVERWRITE TABLE weather4_temp3 SELECT station, MAX(meanTemp) FROM weather4_temp2 GROUP BY station; CREATE TABLE weather_res4 (station STRING, year INT, tavg INT); INSERT OVERWRITE TABLE weather_res4 SELECT w.station, w.year, ww.maxMeanTemp FROM weather4_temp2 w JOIN weather4_temp3 ww ON w.meanTemp = ww.maxMeanTemp;
Hive script – B variant
Data pre-loaded, leaving out temp table creation, running type casting twice.
CREATE EXTERNAL TABLE weather4_temp (station STRING, year INT, maxMeanTemp DOUBLE); INSERT OVERWRITE TABLE weather4_temp1 SELECT station, cast(SUBSTR(datefield,1,4) as INT), AVG((tmax-tmin)/2) FROM weather4 WHERE tmax > -9999 and tmin > -9999 GROUP BY station, cast(SUBSTR(datefield,1,4) as INT); CREATE EXTERNAL TABLE weather4_temp2 (station STRING, maxMeanTemp DOUBLE); INSERT OVERWRITE TABLE weather4_temp2 SELECT station, MAX(meanTemp) FROM weather4_temp2 GROUP BY station; CREATE TABLE weather_res4 (station STRING, year INT, tavg INT); INSERT OVERWRITE TABLE weather_res4 SELECT w.station, w.year, ww.maxMeanTemp FROM weather4_temp1 w JOIN weather4_temp2 ww ON w.meanTemp = ww.maxMeanTemp;
Task 5 – example Pig and Hive scripts
Average temperatures per year of station STAVENISSE
Select one only weather station and average all time precipitation data.
Table:(STAVENISSE, Year, AvgHighTemp)
Pig script
A = LOAD '/user/hdfs/weather/weather.csv' using PigStorage(','); B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin; C1 = FILTER B BY tmax > -9999 AND station MATCHES '.*STAVENISSE.*'; D1 = GROUP C1 by station; E1 = FOREACH D1 GENERATE group as station, AVG(C1.tmax); STORE E1 into '/user/hdfs/Weather_Pig5.csv' using PigStorage(',');
Hive script – A variant
CREATE EXTERNAL TABLE weather5 (stationcode STRING, station STRING, dateield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather5; CREATE TABLE weather_temp5 (station STRING, datefield INT, tmax INT); INSERT OVERWRITE TABLE weather_temp5 SELECT station, cast(SUBSTR(datefield,1,4) as INT), tmax FROM weather2 WHERE tmax <> -9999 AND station LIKE '%STAVENISSE%'; CREATE EXTERNAL TABLE weather_res5 (station STRING, datefield INT, tmax INT); INSERT OVERWRITE TABLE weather_res5 SELECT station, datefield, AVG(tmax) FROM weather_temp5 GROUP BY station, datefield;
Hive script – B variant
Data pre-loaded, leaving out temp table creation, running type casting twice
INSERT OVERWRITE TABLE weather_res5 SELECT station, cast(SUBSTR(datefield,1,4) as INT), AVG(tmax) FROM weather5 WHERE tmax <> -9999 AND station LIKE '%STAVENISSE%' GROUP BY station, cast(SUBSTR(datefield,1,4) as INT);
Performance assessment – results
Performance results? – 0.3 Gb file (runtime of script, sec) |
Pig 0.12.1 | Hive 0.13.0 – A | Hive 0.13.0 – B |
Task 1 | 97.003 | 63.732 | 59.258 |
Task 2 | 33.047 | 65.317 | 49.654 |
Task 3 | 33.132 | 64.401 | 50.756 |
Task 4 | 56.667 | 105.685 | 42.856 |
Task 5 | 27.961 | 61.276 | 39.968 |
Performance results? – 1 Gb file (runtime of script, sec) |
Pig 0.12.1 | Hive 0.13.0 – A | Hive 0.13.0 – B |
Task 1 | 117.98 | 85.177 | 88.75 |
Task 2 | 103.498 | 108.132 | 80.482 |
Task 3 | 98.341 | 108.109 | 84.423 |
Task 4 | 121.446 | 131.528 | 97.637 |
Task 5 | 103.435 | 80.039 | 77.061 |
Performance results? – 4 Gb file (runtime of script, sec) |
Pig 0.12.1 | Hive 0.13.0 – A | Hive 0.13.0 – B |
Task 1 | 332.186 | 317.524 | 274.322 |
Task 2 | 231.333 | 222.432 | 200.421 |
Task 3 | 235.11 | 221.321 | 199.657 |
Task 4 | 278.645 | 248.512 | 211.236 |
Task 5 | 189.288 | 187.425 | 184.549 |
Performance results? – 8 Gb file?(runtime of script, sec) |
Pig 0.12.1 | Hive 0.13.0 – A | Hive 0.13.0 – B |
Task 1 | 650.598 | 621.599 | 492.499 |
Task 2 | 448.898 | 355.509 | 321.543 |
Task 3 | 444.240 | 352.421 | 326.700 |
Task 4 | 489.455 | 432.078 | 334.770 |
Task 5 | 358.91 | 269.815 | 287.089 |