A Comprehensive Guide: How to Install Hadoop on Ubuntu for Big Data Processing
In the realm of big data processing, Apache Hadoop stands as a powerful and widely used framework. Installing Hadoop on Ubuntu is a crucial step for individuals and organizations looking to harness the capabilities of distributed computing. This guide walks you through the step-by-step process of installing Hadoop on Ubuntu so that you can process and analyze massive datasets.
Understanding Hadoop
Before delving into the installation process, let’s briefly understand what Hadoop is and its significance in big data processing.
What is Hadoop?
Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers. It consists of the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for processing data in parallel.
Prerequisites
Before installing Hadoop, ensure that you have the following prerequisites:
Ubuntu Installation: A machine running Ubuntu 18.04 or later. You can download the latest version of Ubuntu from the official website and follow the installation instructions.
Java Development Kit (JDK): Hadoop requires Java. Install the JDK by running the following commands in the terminal:
sudo apt update
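sudo apt install default-jdk
SSH Configuration: Set up passwordless SSH. The start-dfs.sh and start-yarn.sh scripts use SSH to launch the daemons, so key-based login to localhost is needed even on a single machine, and between all nodes in a multi-node cluster. Generate SSH keys using the following commands: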
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
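The OpenSSH server typically refuses key-based logins when ~/.ssh or authorized_keys is writable by other users. As a minimal sketch, the helper below tightens those permissions. The name secure_ssh_dir is hypothetical (it is not part of OpenSSH or Hadoop), and the sketch assumes the default key location unless a directory is passed in.

```shell
# secure_ssh_dir is a hypothetical helper, not part of OpenSSH or Hadoop.
# It restricts an SSH directory and its authorized_keys file to the owner.
secure_ssh_dir() {
  dir="${1:-$HOME/.ssh}"
  chmod 700 "$dir" || return 1
  if [ -f "$dir/authorized_keys" ]; then
    chmod 600 "$dir/authorized_keys" || return 1
  fi
}
```

After calling secure_ssh_dir, running ssh -o BatchMode=yes localhost true should return without prompting if key-based login works and an SSH server is running.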
Installing Hadoop
1. Download and Extract Hadoop:
Visit the official Apache Hadoop website to download the latest stable release: Apache Hadoop Releases.
Alternatively, you can use the following commands in the terminal to download and extract Hadoop:
wget https://downloads.apache.org/hadoop/common/hadoop-X.X.X/hadoop-X.X.X.tar.gz
tar -xzvf hadoop-X.X.X.tar.gz
sudo mv hadoop-X.X.X /usr/local/hadoop
Replace “X.X.X” with the version number you downloaded.
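Because the version string appears in both the directory and the file name of the download URL, it can help to derive both from a single value. A minimal sketch, assuming the downloads.apache.org layout used in the wget command above; the function names are illustrative and not part of Hadoop:

```shell
# Illustrative helpers (hypothetical names): derive the tarball name and
# download URL from one version string, following the URL pattern above.
hadoop_tarball() {
  printf 'hadoop-%s.tar.gz\n' "$1"
}

hadoop_url() {
  printf 'https://downloads.apache.org/hadoop/common/hadoop-%s/%s\n' \
    "$1" "$(hadoop_tarball "$1")"
}
```

With these defined, wget "$(hadoop_url X.X.X)" followed by tar -xzf "$(hadoop_tarball X.X.X)" keeps the version in one place.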
2. Configure Environment Variables:
Edit the ~/.bashrc file to add the Hadoop environment variables:
nano ~/.bashrc
Add the following lines at the end of the file:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
Save the file and exit the editor. Then, run:
source ~/.bashrc
3. Configure Hadoop:
Navigate to the Hadoop configuration directory:
cd $HADOOP_HOME/etc/hadoop
Edit the hadoop-env.sh file:
nano hadoop-env.sh
Set the Java home by adding the following line:
export JAVA_HOME=/usr/lib/jvm/default-java
Save the file and exit the editor.
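If /usr/lib/jvm/default-java does not exist on your system, the JDK location can usually be derived from the javac binary itself. A minimal sketch, assuming javac is on the PATH and that readlink -f is available (it is on Ubuntu); detect_java_home is a hypothetical helper name:

```shell
# detect_java_home is a hypothetical helper. It resolves the symlink chain
# behind a javac binary and strips the trailing /bin/javac to get the JDK root.
# An explicit path to javac may be passed; otherwise javac is looked up on PATH.
detect_java_home() {
  javac_path=$(command -v "${1:-javac}") || return 1
  dirname "$(dirname "$(readlink -f "$javac_path")")"
}
```

The printed directory is a candidate value for JAVA_HOME in hadoop-env.sh.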
4. Configure Hadoop XML Files:
Edit the core-site.xml file:
nano core-site.xml
Add the following configuration:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Save the file and exit.
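For scripted setups, the same file can be written without opening an editor. A minimal sketch, assuming it is acceptable to overwrite any existing core-site.xml; write_core_site is a hypothetical helper, and the target directory defaults to $HADOOP_CONF_DIR:

```shell
# write_core_site is a hypothetical helper. It overwrites core-site.xml in
# the given directory (default: $HADOOP_CONF_DIR) with the setting above.
write_core_site() {
  conf_dir="${1:-$HADOOP_CONF_DIR}"
  [ -n "$conf_dir" ] || return 1   # refuse to write when no directory is known
  cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
}
```

Once Hadoop is on the PATH, hdfs getconf -confKey fs.defaultFS prints the value that Hadoop actually picked up, which is a quick way to confirm the file is being read.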
Edit the hdfs-site.xml file:
nano hdfs-site.xml
Add the following configuration:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/data/datanode</value>
</property>
</configuration>
Save the file and exit.
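The two storage paths configured above point at local directories. Hadoop can usually create them itself, but creating them up front makes permission problems visible before any daemon starts. A minimal sketch, assuming the /usr/local/hadoop layout used in this guide; make_hdfs_dirs is a hypothetical helper:

```shell
# make_hdfs_dirs is a hypothetical helper. It creates the NameNode and
# DataNode storage directories under a base path (default: /usr/local/hadoop).
make_hdfs_dirs() {
  base="${1:-/usr/local/hadoop}"
  mkdir -p "$base/data/namenode" "$base/data/datanode"
}
```

If mkdir reports a permission error here, the user that runs Hadoop will most likely be unable to write to these paths as well.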
5. Format HDFS:
Run the following command to format the Hadoop Distributed File System (HDFS):
hdfs namenode -format
6. Start Hadoop Services:
Start the Hadoop services:
start-dfs.sh
start-yarn.sh
7. Verify Hadoop Installation:
Open a web browser and navigate to http://localhost:9870 to access the NameNode web interface (this is the Hadoop 3.x port; Hadoop 2.x releases used 50070). If the page loads, the NameNode is running.
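The jps tool that ships with the JDK lists running Java processes, which gives a quick command-line check alongside the web interface. A minimal sketch, assuming a single-node setup where all five daemons run on the same machine; missing_daemons is a hypothetical helper that reads jps output on standard input:

```shell
# missing_daemons is a hypothetical helper. It reads "PID Name" lines (the
# format jps prints) on stdin and prints each expected daemon that is absent.
missing_daemons() {
  out=$(cat)
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # -w matches whole words, so SecondaryNameNode does not count as NameNode
    printf '%s\n' "$out" | grep -qw "$d" || printf '%s\n' "$d"
  done
}
```

Running jps | missing_daemons prints nothing when every daemon is up; any name it prints is a service whose log is worth checking.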
Running a Simple MapReduce Job
To validate the Hadoop installation, let’s run a simple MapReduce job.
1. Create Input Directory and Sample Data:
hdfs dfs -mkdir /input
echo "Hello Hadoop" | hdfs dfs -put - /input/sample.txt
2. Create and Compile a Java MapReduce Program:
Create a simple Java program for WordCount:
// WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every whitespace-separated token in a line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts for each word; also used as the combiner.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are hardcoded; command-line arguments are ignored.
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Compile the program:
javac -cp $HADOOP_HOME/share/hadoop/common/hadoop-common-X.X.X.jar:$HADOO
3. Create a JAR File:
jar cf wc.jar WordCount*.class
4. Run the MapReduce Job:
hadoop jar wc.jar WordCount /input /output
5. View Output:
hdfs dfs -cat /output/part-r-00000
This should display the word count results.
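Two small helpers can make this section easier to repeat. The first compiles WordCount.java against the JARs of the installed release by asking the hadoop classpath command for the classpath, instead of listing JAR paths by hand. The second reproduces the word count locally with standard Unix tools so that the job output can be cross-checked. Both function names are hypothetical, and the sketch assumes hadoop and javac are on the PATH.

```shell
# compile_wordcount: hypothetical helper. Compiles WordCount.java in the
# current directory against the classpath printed by "hadoop classpath".
compile_wordcount() {
  javac -classpath "$(hadoop classpath)" -d . WordCount.java
}

# local_wordcount: hypothetical helper that mimics the job on stdin.
# Splits on whitespace, counts each token, and prints "word<TAB>count"
# in byte order, the same layout the job writes to part-r-00000.
local_wordcount() {
  tr -s '[:space:]' '\n' | grep -v '^$' | LC_ALL=C sort | uniq -c |
    awk '{ printf "%s\t%s\n", $2, $1 }'
}
```

For the sample file, echo "Hello Hadoop" | local_wordcount should print the same two lines as the job. When rerunning the job, remove the previous results first with hdfs dfs -rm -r /output, because the job fails if the output directory already exists.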
Conclusion
Congratulations! You’ve successfully installed Hadoop on Ubuntu and run a basic MapReduce job. Hadoop’s scalability and distributed computing capabilities make it a key player in the big data landscape. As you explore further, consider customizing configurations, exploring additional Hadoop ecosystem components, and integrating Hadoop into your data processing workflows. This installation lays the foundation for leveraging the power of distributed computing to analyze and process large datasets efficiently.



