This article covers the basic tools and technologies to use when taking the first steps in big data analysis.
- Linux as the base OS
- For basic data processing:
- Bash shell: environment for running multiple command-line Linux tools for data manipulation
- Comes pre-installed with most Linux distributions; see the Bash Reference Manual for usage
- Bash might not be the default shell on your Linux system; see how to switch to it
- Learn Bash by examples
- The most important Linux commands and concepts to master for data manipulation:
- AWK - a small text-processing language for reformatting data with very compact programs
- Python - an easy-to-learn, effective programming language with a large number of libraries available for various tasks. Well suited to data manipulation from the command line.
- A big data analysis framework, chosen based on the type of data being analyzed. For first-step tutorials, our suggestion would be:
- Hadoop, single-node cluster setup (can be downloaded pre-installed as a virtual appliance)
- Java-based MapReduce programs
- Pig - a high-level query language that compiles to MapReduce jobs
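
To give a feel for the MapReduce model before setting up Hadoop, here is a minimal sketch of a word count written in plain Python. It only imitates the map and reduce steps in a single process; it does not use Hadoop or any Hadoop API, and the function names (`map_phase`, `reduce_phase`, `word_count`) and the sample lines are illustrative assumptions, not part of any framework.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the values that share the same key."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the two steps; a real framework would distribute them across nodes."""
    return reduce_phase(map_phase(lines))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for the lines of a data file.
    sample = ["big data tools", "Big Data analysis", "tools for data"]
    for word, count in sorted(word_count(sample).items()):
        print(f"{word}\t{count}")
```

Since `word_count` accepts any iterable of lines, passing it `sys.stdin` instead of the sample list would let the same script sit in a Bash pipeline next to tools such as AWK.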