Big data analysis tutorial: Weather changes in the Carpathian Basin from 1900 to 2014 – Part 3/9
Preparation – Analysis Environment
As the data to analyze is small enough to be processed on a single machine, I spared myself the time of setting up a new Hadoop cluster. (I do have administrative access to a small cluster of regular PCs chained into a Hadoop cluster, but it was reserved at the time I ran the experiment.)
Anyway, analyzing a small dataset with big data tools takes the same development effort as analyzing petabytes of data on a cluster of thousands of machines; it just needs less CPU time. One can still learn the basics on small datasets.
Setting up the environment, choosing the tools
OS: Ubuntu 13.10
I chose Linux because the power of the shell is great for data manipulation. The “magic” toolset needed is the following:
- Bash shell and AWK for easy text file processing (a short example follows after this list)
- Python for data gathering and manipulation
- A web server to play with for the blog post
- Tools for a GIS playground: a geographical information system for map-based data visualization
Linux is simply more suitable and easier to handle for development tasks like these.
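To give a feel for the shell-and-AWK part of the toolset, here is a minimal sketch of the kind of one-liner I mean. The input file weather.csv and its column layout (station id, date, temperature) are hypothetical, purely for illustration:

```bash
# Average temperature per station from a hypothetical CSV file
# with columns: station_id, date, temperature
awk -F',' '$3 != "" { sum[$1] += $3; count[$1]++ }
           END { for (s in sum) printf "%s %.2f\n", s, sum[s] / count[s] }' \
    weather.csv | sort > station_averages.txt
```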
I installed Python and Kartograph: since the weather stations are geographically distributed, I wanted to make some graphs based on a map.
The open-source Kartograph GIS framework provided me an easy-to-use alternative for map creation and web-based visualization. It is more or less well documented and has some nice tutorials.
Download it and install it using this guide.
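For reference, the installation on Ubuntu boiled down to a few shell commands. The sketch below is based on my reading of the Kartograph documentation; the apt package names are assumptions for Ubuntu 13.10, so prefer the guide above if anything differs:

```bash
# Native libraries that Kartograph's Python dependencies build against
sudo apt-get install python-pip libgeos-dev libgdal-dev

# Fetch and install Kartograph.py itself
git clone https://github.com/kartograph/kartograph.py.git
cd kartograph.py
sudo pip install -r requirements.txt   # shapely, GDAL bindings, etc.
sudo python setup.py install
```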
I used a virtual machine to run Hortonworks' Hadoop Sandbox: it is a pre-configured environment with a handy web-based UI. Hadoop, Pig, and Hive come pre-installed: all the goodies needed for an analysis. It is open source and free, and it spares you a lot of hassle by eliminating the need for a lengthy sysadmin session on Hadoop.
I installed VirtualBox on Ubuntu and downloaded the Hortonworks Hadoop 2 Sandbox bundle for VirtualBox. The analysis was done on a single virtual Hadoop machine.
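The VirtualBox side is only a few commands. A minimal sketch follows; the .ova file name and the VM name are hypothetical and depend on the exact sandbox version you download, so adjust them to match your bundle:

```bash
# Install VirtualBox from the Ubuntu repositories
sudo apt-get install virtualbox

# Import the downloaded sandbox appliance (file name is hypothetical)
VBoxManage import Hortonworks_Sandbox_2.0_VirtualBox.ova

# Start the VM without a GUI window (the VM name comes from the appliance)
VBoxManage startvm "Hortonworks Sandbox with HDP 2.0" --type headless
```

Once the VM boots, the sandbox's web UI is reachable from the host browser on the port the appliance forwards (the VM console banner shows the exact address).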