How to Install Hadoop on Ubuntu

How to Install Hadoop on Ubuntu for Big Data Processing

A Comprehensive Guide How to Install Hadoop on Ubuntu for Big Data Processing. In the realm of big data processing, Apache Hadoop stands as a powerful and widely used framework. Installing Hadoop on Ubuntu is a crucial step for individuals and organizations looking to harness the capabilities of distributed computing. This comprehensive guide will walk you through the step-by-step process of installing Hadoop on Ubuntu, unlocking the potential to process and analyze massive datasets.

Understanding Hadoop

Before delving into the installation process, let’s briefly understand what Hadoop is and its significance in big data processing.

What is Hadoop?
Hadoop is an open-source framework designes for distributed storage and processing of large datasets across clusters of pc. It consists of the Hadoop Distributed File Method (HDFS) for storage and the MapReduce programming model for processing data in parallel.

Prerequisites

Before installing Hadoop, ensure that you have the following prerequisites:

Ubuntu Installation: A machine running Ubuntu 18.04 or later. You can download the latest version of Ubuntu from the official website and follow the installation instructions.

Java Development Kit (JDK): Hadoop requires Java. Install the JDK by running the following commands in the terminal:

bash – Copy code
sudo apt update
sudo apt install default-jdk

SSH Configuration: Set up SSH for passwordless access between nodes if you are working with a multi-node cluster. Generate SSH keys using the following commands:

bash – Copy code
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

 

Installing Hadoop

1. Download and Extract Hadoop:
Visit the official Apache Hadoop website to download the latest stable release: Apache Hadoop Releases.

Alternatively, you can use the following commands in the terminal to download and extract Hadoop:

bash – Copy code
wget https://downloads.apache.org/hadoop/common/hadoop-X.X.X/hadoop-X.X.X.tar.gz
tar -xzvf hadoop-X.X.X.tar.gz
sudo mv hadoop-X.X.X /usr/local/hadoop

Replaces “X.X.X” with the version number you downloaded.

2. Configure Environment Variables:

Edit the ~/.bashrc file to add the Hadoop environment variables:

bash – Copy Code
nano ~/.bashrc

Add the following lines at the end of the file:

bash – Copy code
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME

Save the file and exit the editor. Then, run:

bash – Copy code
source ~/.bashrc
3. Configure Hadoop:

Navigate to the Hadoop configuration directory:

bash – Copy code
cd $HADOOP_HOME/etc/hadoop

Edit the hadoop-env.sh file:

bash – Copy code
nano hadoop-env.sh

Set the Java home by adding the following line:

bash – Copy code
export JAVA_HOME=/usr/lib/jvm/default-java

Save the file and exit the editor.

4. Configure Hadoop XML Files:

Edit the core-site.xml file:

bash – Copy code
nano core-site.xml

Add the following configuration:

xml – Copy code
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Save the file and exit.

Edit the hdfs-site.xml file:

bash – Copy code
nano hdfs-site.xml

Add the following configuration:

xml – Copy code
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/data/datanode</value>
</property>
</configuration>

Save the file and exit.

5. Format HDFS:

Run the following command to format the Hadoop Distributed File System (HDFS):

bash – Copy code
hdfs namenode -format
6. Start Hadoop Services:

Start the Hadoop services:

bash – Copy code
start-dfs.sh
start-yarn.sh
7. Verify Hadoop Installation:

Open a web browser and navigate to http://localhost:9870 to access the Hadoop NameNode web interface. This confirms that Hadoop is successfully running.

Running a Simple MapReduce Job
To validate the Hadoop installation, let’s run a simple MapReduce job.

1. Create Input Directory and Sample Data:

bash – Copy code
hdfs dfs -mkdir /input
echo "Hello Hadoop" | hdfs dfs -put - /input/sample.txt
2. Create and Compile a Java MapReduce Program:

Create a simple Java program for WordCount:

java – Copy code
// WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;public class WordCount {public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{private final static IntWritable one = new IntWritable(1);
private Text word = new Text();public void map(Object key, Text value, Context context) throws IOException, InterruptedException {StringTokenizer itr = new StringTokenizer(value.toString());while (itr.hasMoreTokens()) {word.set(itr.nextToken());context.write(word, one);}}} public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {private IntWritable result = new IntWritable();public void reduce(Text key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {int sum = 0;for (IntWritable val : values) {sum += val.get();}result.set(sum);context.write(key, result);}}public static void main(String[] args) throws Exception {Configuration conf = new Configuration();
Job job = Job.getInstance(conf, “word count”);
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(“/input”));
FileOutputFormat.setOutputPath(job, new Path(“/output”));
System.exit(job.waitForCompletion(true) ? 0 : 1);}}Compile the program:
bash – Copy code
javac -cp $HADOOP_HOME/share/hadoop/common/hadoop-common-X.X.X.jar:$HADOO
3. Create a JAR File:
bash – Copy code
jar cf wc.jar WordCount*.class
4. Run the MapReduce Job:
bash – Copy code
hadoop jar wc.jar WordCount /input /output
5. View Output:
bash – Copy code
hdfs dfs -cat /output/part-r-00000

This should display the word count results.

Conclusion

Congratulations! You’ve successfully installed Hadoop on Ubuntu and run a basic MapReduce job. Hadoop’s scalability and distributed computing capabilities make it a key player in the big data landscape. As you explore further consider customizing configurations and explore additional Hadoop ecosystem components. And integrating Hadoop into your data processing workflows. This installation lays the foundation for leveraging the power of distributed computing to analyze and process large datasets efficiently.

Scroll to Top