It’s easy to install Hadoop on a single machine to try it out. The quickest way is to download and run a binary release from an Apache Software Foundation mirror.
Hadoop can run on Windows and on Linux. Linux is the only supported production platform, but other Unix-like systems (including Mac OS X) can also be used to run Hadoop for development. Windows is supported only as a development platform, and it additionally requires Cygwin to run. During the Cygwin installation process, you should include the openssh package if you plan to run Hadoop in pseudodistributed mode.
Hadoop is written in Java, so you need Java installed on your machine, version 6 or later. Sun’s JDK is the one most widely used with Hadoop.
It is easy to run Hadoop on a single machine using your own user account. From the Apache Hadoop releases page, download a stable release, which is packaged as a gzipped tar file, and then unpack it somewhere on your filesystem:
% tar xzf hadoop-x.y.z.tar.gz
Before you run Hadoop, you need to tell it where Java is installed. To check whether Java is available at all, run java -version; if Java has been installed, this displays the version details.
If you have the JAVA_HOME environment variable set to point to a suitable Java installation, that will be used, and you don’t have to configure anything further. (It is often set in a shell startup file, such as ~/.bash_profile or ~/.bashrc.) Otherwise, you can set the Java installation that Hadoop uses by editing conf/hadoop-env.sh and specifying the JAVA_HOME variable. For example, on my Mac, I changed the line to read:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/
This points to the latest 1.6 version of Java. On Ubuntu, the equivalent path depends on which JDK package is installed.
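Because the JDK location varies between systems and packages, one option is to derive it from the java binary that is already on the PATH rather than hard-coding it. The following is a minimal sketch, assuming a Unix-like system with GNU readlink (its -f option is not available on older Mac OS X releases); the function name java_home_from_path is hypothetical and not part of Hadoop.

```shell
# Print the Java installation directory for the java binary on the PATH.
# Assumes GNU readlink; java_home_from_path is an illustrative name.
java_home_from_path() {
  # Fail if java is not on the PATH at all
  java_bin=$(command -v java) || return 1
  # Follow symlinks such as /usr/bin/java to the real binary
  real_bin=$(readlink -f "$java_bin") || return 1
  # Strip the trailing bin/java to get the installation directory
  dirname "$(dirname "$real_bin")"
}
```

The result could then be used in conf/hadoop-env.sh, for example as export JAVA_HOME=$(java_home_from_path). Depending on the package, the directory found this way may be the JDK's jre subdirectory.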
It is convenient to create an environment variable that points to the Hadoop installation directory (HADOOP_INSTALL, say) and to put the Hadoop binary directory on your command-line path. (In Hadoop 2.0, you need to put the sbin directory on the path too.) For example:
% export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
Check that Hadoop runs by typing:
% hadoop version
Compiled by hortonfo on Thu Dec 15 16:36:35 UTC 2011
Each component in Hadoop is configured by using an XML file. Common properties are in core-site.xml, HDFS properties are in hdfs-site.xml, and MapReduce properties are in mapred-site.xml. These files are all placed in the conf subdirectory.
In Hadoop 2.0 and later, MapReduce runs on YARN, and there is an additional configuration file called yarn-site.xml. All the configuration files should go in the etc/hadoop subdirectory.
Hadoop can be run in one of three modes:
• Fully distributed mode
The Hadoop daemons run on a cluster of machines.
• Standalone (or local) mode
No daemons run, and everything runs in a single JVM. This mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
• Pseudodistributed mode
The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.
To run Hadoop in a particular mode, we need to do two things:
– Set the appropriate properties.
– Start the Hadoop daemons.
The figure below shows the minimal set of properties needed to configure each mode. In standalone mode, the local filesystem and the local MapReduce job runner are used, whereas in the distributed modes the HDFS and MapReduce (or YARN) daemons are started.
- Figure: Key configuration properties for different modes
In standalone mode, there is no further action to take: the default properties are set for standalone mode, and there are no daemons to run.
For pseudodistributed mode, create the configuration files with the properties for that mode and place them in the conf directory (although configuration files can be placed in any directory, as long as the daemons are started with the --config option).
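As a rough sketch of what those files might contain, the script below writes a minimal pseudodistributed configuration for Hadoop 1.x. The property values (hdfs://localhost/ as the default filesystem, a replication factor of 1, and localhost:8021 for the jobtracker) are settings commonly used for a single-node setup and are assumptions here, as are the write_pseudo_conf function name and its directory argument.

```shell
# Write minimal pseudodistributed config files into the given directory.
# The values are assumed typical single-node settings, not taken from the article.
write_pseudo_conf() {
  conf_dir=${1:-conf}
  mkdir -p "$conf_dir"

  # Common properties: use HDFS on localhost as the default filesystem
  cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>
EOF

  # HDFS properties: with a single datanode, blocks cannot be replicated further
  cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

  # MapReduce 1 properties: run the jobtracker on localhost
  cat > "$conf_dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
EOF
}
```

Calling write_pseudo_conf with no argument writes into conf under the current directory; passing another directory keeps the files separate from the defaults.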
If you are running YARN, you also need a yarn-site.xml file.
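A comparable sketch for YARN writes a minimal yarn-site.xml. The resource manager address localhost:8032 and the shuffle auxiliary service are assumptions based on common single-node setups. In particular, the auxiliary service value was mapreduce.shuffle in early 2.0 releases and mapreduce_shuffle in later ones, so check the documentation for your release. The function name write_yarn_conf and its arguments are hypothetical.

```shell
# Write a minimal yarn-site.xml into the given directory.
# Values are assumed single-node settings; verify them against your release.
write_yarn_conf() {
  conf_dir=${1:-etc/hadoop}
  # Early 2.0 releases used mapreduce.shuffle instead
  shuffle_service=${2:-mapreduce_shuffle}
  mkdir -p "$conf_dir"

  # Unquoted EOF so that $shuffle_service is expanded in the file body
  cat > "$conf_dir/yarn-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8032</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>$shuffle_service</value>
  </property>
</configuration>
EOF
}
```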
In pseudodistributed mode, we have to start daemons, and to do that, we need to have SSH installed. Hadoop doesn’t actually distinguish between pseudodistributed and fully distributed modes; it merely starts daemons on the set of hosts in the cluster (defined by the slaves file) by SSH-ing to each host and starting a daemon process.
Pseudodistributed mode is a special case of fully distributed mode in which the only host is localhost, so we need to make sure that we can SSH to localhost and log in without having to enter a password. First, make sure that SSH is installed and a server is running. On Ubuntu, for example, this is achieved with:
% sudo apt-get install ssh
Then, to enable password-less login, generate a new SSH key with an empty passphrase:
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Test this with:
% ssh localhost
If successful, you should not have to type in a password.
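For scripted setups, the same check can be done non-interactively. The sketch below relies on the OpenSSH BatchMode option, which makes ssh fail instead of prompting for a password; the function name can_ssh_without_password is hypothetical.

```shell
# Return 0 if we can log in to the host without any password prompt.
# BatchMode=yes tells OpenSSH to fail rather than ask for a password;
# ConnectTimeout keeps the check from hanging if no server is listening.
can_ssh_without_password() {
  host=${1:-localhost}
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}
```

A setup script could call can_ssh_without_password localhost and stop with an error message if it returns a nonzero status.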
Formatting the HDFS filesystem
Before it can be used, a new HDFS installation needs to be formatted. The formatting process creates an empty filesystem by creating the storage directories and the initial versions of the namenode’s persistent data structures. Datanodes are not involved in the initial formatting process, since the namenode manages all of the filesystem’s metadata, and datanodes can join or leave the cluster dynamically. For the same reason, we don’t need to say how large a filesystem to create, since this is determined by the number of datanodes in the cluster, which can be increased as needed, long after the filesystem is formatted.
Formatting HDFS is a fast operation. Just type the following:
% hadoop namenode -format
Starting and stopping the daemons (MapReduce 1)
To start the HDFS and MapReduce daemons, run the start-dfs.sh and start-mapred.sh scripts from the Hadoop bin directory.
The following daemons will be started on our local machine: a namenode, a secondary namenode, a datanode, a jobtracker, and a tasktracker. We can check whether the daemons started successfully by looking at the logfiles in the logs directory (in the Hadoop installation directory) or by looking at the web UIs, at http://localhost:50030/ for the jobtracker and at http://localhost:50070/ for the namenode. We can also use Java’s jps command to see whether they are running.
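The jps check can also be scripted. The sketch below assumes the daemons appear in jps output under their usual class names (NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker); the function name missing_daemons is hypothetical.

```shell
# Print the name of each expected Hadoop 1.x daemon that jps does not list.
missing_daemons() {
  running=$(jps) || return 1
  for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
    # jps prints "<pid> <main class>", so match the class name as the whole
    # second field; this keeps NameNode from matching SecondaryNameNode
    printf '%s\n' "$running" | grep -q " $daemon\$" || echo "$daemon"
  done
}
```

If every daemon is up, the function prints nothing; otherwise, each missing daemon is listed on its own line, which points to the logfile worth checking first.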
Stopping the daemons is done in the obvious way, by running the corresponding stop-dfs.sh and stop-mapred.sh scripts.
Starting and stopping the daemons (MapReduce 2)
To start the HDFS and YARN daemons, run the start-dfs.sh and start-yarn.sh scripts from the sbin directory.
These commands will start the HDFS daemons, and for YARN, a node manager and a resource manager. The resource manager web UI is at http://localhost:8088/.
You can stop the daemons with the stop-dfs.sh and stop-yarn.sh scripts.
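The start and stop steps for both MapReduce versions can be wrapped in one small helper. This is only a sketch: it assumes the standard control scripts (start-dfs.sh, start-mapred.sh, start-yarn.sh, and their stop- counterparts) are on the PATH, and hadoop_daemons with its mr1 and mr2 arguments are hypothetical names.

```shell
# Start or stop the daemons for MapReduce 1 ("mr1") or MapReduce 2 on YARN ("mr2").
# Assumes the standard Hadoop control scripts are on the PATH.
hadoop_daemons() {
  action=$1    # "start" or "stop"
  flavor=$2    # "mr1" or "mr2"

  case "$action" in
    start|stop) ;;
    *) echo "usage: hadoop_daemons start|stop mr1|mr2" >&2; return 2 ;;
  esac

  # Pick the script that goes with HDFS for this flavor
  case "$flavor" in
    mr1) second=mapred ;;
    mr2) second=yarn ;;
    *) echo "usage: hadoop_daemons start|stop mr1|mr2" >&2; return 2 ;;
  esac

  # Run the HDFS script first, then the MapReduce or YARN script
  "$action-dfs.sh" && "$action-$second.sh"
}
```

For example, hadoop_daemons start mr2 would run start-dfs.sh followed by start-yarn.sh, and hadoop_daemons stop mr1 would run stop-dfs.sh followed by stop-mapred.sh.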
Hope this helps you with your Hadoop installation. Do share your experience, and let us know if you took any step differently than explained here.