Tutorial big data analysis: Weather changes in the Carpathian Basin from 1900 to 2014 – Part 4/9
Analysis with Hadoop and Pig
Igniting the Hortonworks Hadoop environment
- Starting the virtual machine
- Logging in from the host machine's browser: 127.0.0.1:8888
- Log in with the credentials supplied on the login screen: hue / 1111
- A straightforward menu is provided to manage stored files (HDFS), Pig or Hive analysis scripts, and to check the jobs that are running
- Upload the data as a zip file to the Data folder under the home folder
- The system automatically unzips it to Weather.csv and Weather_Mini.csv (the latter has the same data format but only a minimal set of rows, for experimenting)
- Write and run your analyzer scripts, then download the results they store; a quick smoke test is sketched below
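Before writing the full analyzer, it is worth checking that the upload succeeded. A minimal Pig smoke test (a sketch; the path assumes the Data folder described above):

A = LOAD 'Data/Weather_Mini.csv' USING PigStorage(',');
-- Keep only a handful of rows and print them to the console to verify the upload
small = LIMIT A 5;
DUMP small;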
Writing the analyzer PIG script
Pig is an SQL-like query language for Hadoop MapReduce.
- Easy to learn and use
- Converts queries to map-reduce programs (see the EXPLAIN sketch after this list)
- May be less performant than a well-written Java map-reduce program
The Pig language reference can be found here.
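To see the conversion to map-reduce for yourself, Pig's EXPLAIN command prints the logical, physical, and map-reduce plans a query is compiled into, without running the job. A minimal sketch on the mini dataset:

A = LOAD 'Data/Weather_Mini.csv' USING PigStorage(',');
-- $1 is the station name column (see the schema in the main script below)
B = FOREACH A GENERATE $1 as station;
C = GROUP B BY station;
D = FOREACH C GENERATE group, COUNT(B) as observations;
-- Prints the compiled plans instead of executing the query
EXPLAIN D;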
I built my analysis scripts around the one showcased here, which analyzes precipitation data. Unfortunately, only the precipitation and temperature analyses yielded a large enough dataset; the snow data columns contained so few values that I could not get useful information out of them. (A sketch of the temperature variant closes this section.)
The analyzer PIG example script with comments:
- -- is the comment mark for a single line; /* is the comment mark for multiple lines */
- After each transformation, DESCRIBE is used to write the new schema to stdout (sample output is sketched after the script)
-- Load the file, delimited with ,
A = LOAD 'Data/Weather.csv' USING PigStorage(',');

-- Generate a schema; $0 is the first "cell", $1 the next one after the "," delimiter
-- Note that I only use the years from the dates, via the SUBSTRING command
B = FOREACH A GENERATE
    (chararray) $0 as stationcode,
    $1 as station,
    $2 as elevation,
    $3 as lat,
    $4 as lon,
    (int) SUBSTRING($5, 0, 4) as year,
    (int) $6 as prcp,
    (int) $7 as snwd,
    (int) $8 as snow,
    (int) $9 as tmax,
    (int) $10 as tmin,
    (int) $11 as wesd;

-- DESCRIBE gives us the schema of the transformed dataset
DESCRIBE B;

-- Keep only valid PRCP data
C1 = FILTER B BY prcp > -9999;
DESCRIBE C1;

-- Group the station data
D1 = GROUP C1 BY (station, year, lat, lon);
DESCRIBE D1;

-- Compute the yearly totals per group; where no whole year is available, scale the daily
-- mean to 365 days (values are stored in tenths of a millimetre, hence the division by 10)
E1 = FOREACH D1 GENERATE
    group.station as station,
    group.lat as lat,
    group.lon as lon,
    ((int) group.year,
     (COUNT(C1.prcp) > 364 ? SUM(C1.prcp)/10 : SUM(C1.prcp)/COUNT(C1.prcp)/10*365)) as data;
DESCRIBE E1;

-- Group all station data by station and location
F1 = GROUP E1 BY (station, lat, lon);
DESCRIBE F1;

-- Reduce the data to station, location and yearly averages
G1 = FOREACH F1 GENERATE group, E1.data;
DESCRIBE G1;
DUMP G1;

-- Store to file
STORE G1 INTO 'out_prcp';

/* All commented out: I tried to use this part to sort the dataset of yearly average PRCP
data. As map-reduce is a parallel process, the data is not processed as an ordered list of
years, even though we ordered it before. Since sorting of complex data types is not
available in Pig, the code below results in an error (I would have to sort pairs of tuple
data, which could even be converted to maps, by their map keys, which are the years). I had
to find a workaround, and the easiest was to use Python, instead of giving up on Pig for a
Java map-reduce job. For details, see the next section.
H1 = FOREACH G1 {
    sorted = ORDER BY data.year;
    GENERATE group.station, lat, lon, sorted;
}
DESCRIBE H1;
DUMP H1;
*/
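As an illustration of the DESCRIBE statements above, the schema printed for B should look roughly like this (a sketch: fields loaded without an explicit cast default to Pig's bytearray type):

B: {stationcode: chararray,station: bytearray,elevation: bytearray,lat: bytearray,lon: bytearray,year: int,prcp: int,snwd: int,snow: int,tmax: int,tmin: int,wesd: int}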
Output file out_prcp:
((ARAD RO,46.1331,21.35),{((1882,742)),((1883,680)),((1884,656)),((1885,656)),((1886,770)),((1887,718)),((1888,467)),((1889,893)),((1890,570)), …})
((DEVA RO,45.8667,22.9), …
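The temperature analysis mentioned earlier was built around the same skeleton. Below is a minimal sketch of a TMAX variant, assuming the same column layout and the same tenths-based units; the relation names (C2 through G2) and the output folder out_tmax are illustrative, not the exact script I ran:

-- Keep only valid TMAX data (B is the relation defined in the main script above)
C2 = FILTER B BY tmax > -9999;
D2 = GROUP C2 BY (station, year, lat, lon);
-- Yearly mean of the daily maximum temperatures, converted from tenths of a degree
E2 = FOREACH D2 GENERATE
    group.station as station,
    group.lat as lat,
    group.lon as lon,
    ((int) group.year, AVG(C2.tmax)/10) as data;
F2 = GROUP E2 BY (station, lat, lon);
G2 = FOREACH F2 GENERATE group, E2.data;
STORE G2 INTO 'out_tmax';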