Performance testing high level Hadoop query languages with example scripts. Analysis of NOAA weather data: Western-European weather stations from 1980 to 2014, daily dataset of temperature (tmin and tmax) and precipitation data (prcp). Dataset is a structured table, non-existent measurement cells are filled with -9999. Example: STATION,STATION_NAME,DATE,PRCP,TMAX,TMIN GHCND:NLE00109300,STAVENISSE NL,19800101,53,-9999,-9999 GHCND:NLE00109300,STAVENISSE NL,19800102,21,-9999,-9999 GHCND:NLE00109300,STAVENISSE NL,19800103,133,-9999,-9999 … GHCND:NLE00109202,MARUM NL,20080602,0,-9999,-9999 GHCND:NLE00109202,MARUM NL,20080603,36,-9999,-9999 GHCND:NLE00109202,MARUM NL,20080604,4,-9999,-9999 … Data size: 1 Gb / 4 Gb / 8 Gb (the same 333 Mb data file replicated 3 / 12 / 24 times) HDFS block size: 128 Mb Platform:
- HDP 2.1.3
- Hadoop 2.4.0
- Pig 0.12.1
- Hive 0.13.0
- HBase 0.98.0
Cluster:
- 4 Dell workstations of
- Quad core Core I5-2400
- 4-8 Gb RAM
- 1 TB SATA HDDs
- Hadoop
- Master and Slaves not separated (next task: separate them and redo measurement)
- Machine 1 (8 Gb RAM)
- KDE, Nagios, Ganglia, NodeManager, Zookeeper, RegionServer, Hue, Datanode
- Machine 2 (4 Gb RAM)
- DataNode
- Machine 3 (8 Gb RAM)
- HiveServer, NameNode, Oozie, Zookeeper, Datanode
- Machine 4 (8 Gb RAM)
- SNameNode, ResourceManager, Datanode
- Gigabit LAN network (Hadoop traffic only)
Measured running scripts with Bash Shell, time command (e.g. time pig 1.pig)
Weather analysis tasks with Pig, Hive and HBase
Task 1 - example Pig and Hive scripts.
Table ordered by max HighTemperature Table:(Station, Date, MaxTemp) Pig script
A = LOAD '/user/hdfs/Weather.csv' using PigStorage(',');
B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin;
C1 = FILTER B BY tmax > -9999;
D1 = ORDER C1 by tmax;
STORE D1 into '/user/hdfs/Weather_Pig1.csv' using PigStorage(',');
__ Hive script - A variant
CREATE EXTERNAL TABLE weather1 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather1;
CREATE TABLE weather_res1 (station STRING, datefield STRING, tmax INT);
INSERT OVERWRITE TABLE weather_res1
SELECT station, datefield, tmax FROM weather1 WHERE tmax <> -9999 ORDER BY tmax DESC;
Hive script - B variant Data pre-loaded __
INSERT OVERWRITE TABLE weather_res1
SELECT station, datefield, tmax FROM weather1 WHERE tmax <> -9999 ORDER BY tmax DESC;
Task 2 - example Pig and Hive scripts
Max temperatures per year per station Table:(Distinct:Station, Year, MaxHighTemp) Pig script
A = LOAD '/user/hdfs/Weather/Weather.csv' using PigStorage(',');
B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin;
C1 = FILTER B BY tmax > -9999;
D1 = GROUP C1 by (station, year);
E1 = FOREACH D1 GENERATE group.station as station, (int) group.year, MAX(C1.tmax);
STORE E1 into '/user/hdfs/Weather_Pig2.csv' using PigStorage(',');
Hive script - A variant
CREATE EXTERNAL TABLE weather2 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather2;
CREATE TABLE weather_temp2 (station STRING, datefield INT, tmax INT);
INSERT OVERWRITE TABLE weather_temp2
SELECT station, cast(SUBSTR(datefield,1,4) as INT), tmax FROM weather2 WHERE tmax <> -9999;
CREATE TABLE weather_res2 (station STRING, datefield INT, tmax INT);
INSERT OVERWRITE TABLE weather_res2
SELECT station, datefield, MAX(tmax) FROM weather_temp2 GROUP BY station, datefield;
__ Hive script - B variant Data pre-loaded, leaving out temp table creation, running type casting twice
INSERT OVERWRITE TABLE weather_res2
SELECT station, cast(SUBSTR(datefield,1,4) as INT), MAX(tmax) FROM weather2WHERE tmax <> -9999 GROUP BY station, cast(SUBSTR(datefield,1,4) as INT);
Task 3 - example Pig and Hive scripts
Average temperatures per year per station Table:(Distinct:Station, Year, AvgHighTemp) Pig script
A = LOAD '/user/hdfs/Weather/Weather.csv' using PigStorage(',');
B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin;
C1 = FILTER B BY tmax > -9999;
D1 = GROUP C1 by (station, year);
E1 = FOREACH D1 GENERATE group.station as station, (int) group.year, AVG(C1.tmax);
STORE E1 into '/user/hdfs/Weather_Pig3.csv' using PigStorage(',');
Hive script - A variant
CREATE EXTERNAL TABLE weather3 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather3;
CREATE TABLE weather_temp3 (station STRING, datefield INT, tavg INT);
INSERT OVERWRITE TABLE weather_temp3
SELECT station, cast(SUBSTR(datefield,1,4) as INT), tmax FROM weather3 WHERE tmax <> -9999;
CREATE TABLE weather_res3 (station STRING, datefield INT, tavg INT);
INSERT OVERWRITE TABLE weather_res3
SELECT station, datefield, AVG(tavg) FROM weather_temp3 GROUP BY station, datefield;
Hive script - B variant Data pre-loaded, leaving out temp table creation, running type casting twice
CREATE TABLE weather_res3 (station STRING, datefield INT, tavg INT);
INSERT OVERWRITE TABLE weather_res3
SELECT station, cast(SUBSTR(datefield,1,4) as INT), AVG(tmax) FROM weather3 WHERE tmax <> -9999 GROUP BY station, cast(SUBSTR(datefield,1,4) as INT);
Task 4 - example Pig and Hive scripts
Max mean yearly temperature Table: (Distinct:Station, MaxYear, MaxMeanTemp) MeanTemp: AVG((MaxTemp-MinTemp)/2) Pig script
A = LOAD '/user/hdfs/Weather/Weather.csv' using PigStorage(',');
B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin;
C1 = FILTER B BY tmax > -9999 AND tmin > -9999;
D1 = FOREACH C1 GENERATE station as station, year as year, ((tmax-tmin)/2) as avgt;
DESCRIBE D1;
E1 = GROUP D1 by (station, year);
F1 = FOREACH E1 GENERATE group.station as station, (int) group.year, MAX(D1.avgt) as avgt;
G1 = GROUP F1 by station;
H1 = FOREACH G1 {
I1 = ORDER F1 BY avgt DESC;
J1 = LIMIT I1 1;
GENERATE group, J1.(year, avgt);
};
STORE H1 into '/user/hdfs/Weather_Pig4.csv' using PigStorage(',');
Hive script - A variant
CREATE EXTERNAL TABLE weather4 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather4;
CREATE EXTERNAL TABLE weather4_temp (station STRING, year INT, tmax DOUBLE, tmin DOUBLE);
INSERT OVERWRITE TABLE weather4_temp SELECT station, cast(SUBSTR(datefield,1,4) as INT), tmax, tmin FROM weather4 WHERE tmax > -9999 and tmin > -9999;
CREATE EXTERNAL TABLE weather4_temp2 (station STRING, year INT, meanTemp DOUBLE);
INSERT OVERWRITE TABLE weather4_temp2 SELECT station, year, AVG((tmax-tmin)/2) FROM weather4_temp GROUP BY station, year;
CREATE EXTERNAL TABLE weather4_temp3 (station STRING, maxMeanTemp DOUBLE);
INSERT OVERWRITE TABLE weather4_temp3 SELECT station, MAX(meanTemp) FROM weather4_temp2 GROUP BY station;
CREATE TABLE weather_res4 (station STRING, year INT, tavg INT);
INSERT OVERWRITE TABLE weather_res4
SELECT w.station, w.year, ww.maxMeanTemp FROM weather4_temp2 w JOIN weather4_temp3 ww ON w.meanTemp = ww.maxMeanTemp;
Hive script - B variant Data pre-loaded, leaving out temp table creation, running type casting twice.
CREATE EXTERNAL TABLE weather4_temp (station STRING, year INT, maxMeanTemp DOUBLE);
INSERT OVERWRITE TABLE weather4_temp1 SELECT station, cast(SUBSTR(datefield,1,4) as INT), AVG((tmax-tmin)/2) FROM weather4 WHERE tmax > -9999 and tmin > -9999 GROUP BY station, cast(SUBSTR(datefield,1,4) as INT);
CREATE EXTERNAL TABLE weather4_temp2 (station STRING, maxMeanTemp DOUBLE);
INSERT OVERWRITE TABLE weather4_temp2 SELECT station, MAX(meanTemp) FROM weather4_temp2 GROUP BY station;
CREATE TABLE weather_res4 (station STRING, year INT, tavg INT);
INSERT OVERWRITE TABLE weather_res4
SELECT w.station, w.year, ww.maxMeanTemp FROM weather4_temp1 w JOIN weather4_temp2 ww ON w.meanTemp = ww.maxMeanTemp;
Task 5 - example Pig and Hive scripts
Average temperatures per year of station STAVENISSE Select one only weather station and average all time precipitation data. Table:(STAVENISSE, Year, AvgHighTemp) Pig script
A = LOAD '/user/hdfs/weather/weather.csv' using PigStorage(',');
B = FOREACH A GENERATE $1 as station, (int) SUBSTRING($2, 0, 4) as year, (int) $4 as tmax, (int) $5 as tmin;
C1 = FILTER B BY tmax > -9999 AND station MATCHES '.*STAVENISSE.*';
D1 = GROUP C1 by station;
E1 = FOREACH D1 GENERATE group as station, AVG(C1.tmax);
STORE E1 into '/user/hdfs/Weather_Pig5.csv' using PigStorage(',');
Hive script - A variant
CREATE EXTERNAL TABLE weather5 (stationcode STRING, station STRING, dateield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/hdfs/Weather.csv' OVERWRITE INTO TABLE weather5;
CREATE TABLE weather_temp5 (station STRING, datefield INT, tmax INT);
INSERT OVERWRITE TABLE weather_temp5
SELECT station, cast(SUBSTR(datefield,1,4) as INT), tmax FROM weather2 WHERE tmax <> -9999 AND station LIKE '%STAVENISSE%';
CREATE EXTERNAL TABLE weather_res5 (station STRING, datefield INT, tmax INT);
INSERT OVERWRITE TABLE weather_res5
SELECT station, datefield, AVG(tmax) FROM weather_temp5 GROUP BY station, datefield;
Hive script - B variant Data pre-loaded, leaving out temp table creation, running type casting twice __
INSERT OVERWRITE TABLE weather_res5
SELECT station, cast(SUBSTR(datefield,1,4) as INT), AVG(tmax) FROM weather5 WHERE tmax <> -9999 AND station LIKE '%STAVENISSE%' GROUP BY station, cast(SUBSTR(datefield,1,4) as INT);
Performance assessment - results
Performance Comparison Performance results? - 0.3 Gb file(runtime of script, sec) | Pig 0.12.1 | Hive 0.13.0 - A | Hive 0.13.0 - B |
|---|---|---|---|
| Task 1 | 97.003 | 63.732 | 59.258 |
| Task 2 | 33.047 | 65.317 | 49.654 |
| Task 3 | 33.132 | 64.401 | 50.756 |
| Task 4 | 56.667 | 105.685 | 42.856 |
| Task 5 | 27.961 | 61.276 | 39.968 |
| Performance results? - 1 Gb file(runtime of script, sec) | Pig 0.12.1 | Hive 0.13.0 - A | Hive 0.13.0 - B |
| — | — | — | — |
| Task 1 | 117.98 | 85.177 | 88.75 |
| Task 2 | 103.498 | 108.132 | 80.482 |
| Task 3 | 98.341 | 108.109 | 84.423 |
| Task 4 | 121.446 | 131.528 | 97.637 |
| Task 5 | 103.435 | 80.039 | 77.061 |
| Performance results? - 4 Gb file(runtime of script, sec) | Pig 0.12.1 | Hive 0.13.0 - A | Hive 0.13.0 - B |
| — | — | — | — |
| Task 1 | 332.186 | 317.524 | 274.322 |
| Task 2 | 231.333 | 222.432 | 200.421 |
| Task 3 | 235.11 | 221.321 | 199.657 |
| Task 4 | 278.645 | 248.512 | 211.236 |
| Task 5 | 189.288 | 187.425 | 184.549 |
| Performance results? - 8 Gb file?(runtime of script, sec) | Pig 0.12.1 | Hive 0.13.0 - A | Hive 0.13.0 - B |
| — | — | — | — |
| Task 1 | 650.598 | 621.599 | 492.499 |
| Task 2 | 448.898 | 355.509 | 321.543 |
| Task 3 | 444.240 | 352.421 | 326.700 |
| Task 4 | 489.455 | 432.078 | 334.770 |
| Task 5 | 358.91 | 269.815 | 287.089 |
Performance Comparison Performance results? - 0.3 Gb file(runtime of script, sec)