Big data essentials: Hadoop pseudo-distributed installation

Preparations before installation

OS preparation

The operating system used in this installation is Ubuntu 20.04.

Update the package list.

sudo apt-get update

Install Java 8+

Install Java 8 with the following command.

sudo apt-get install -y openjdk-8-jdk

Configure the JAVA_HOME environment variable.

vi ~/.bashrc

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Let the environment variable take effect.

source ~/.bashrc
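
As a quick sanity check, confirm that Java is installed and that JAVA_HOME points to the expected location; the path above assumes the default install directory of the openjdk-8-jdk package on Ubuntu 20.04.

# Verify the Java installation and the JAVA_HOME variable
java -version
echo $JAVA_HOME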

Download the Hadoop installation package

Download the installation package from the official Apache Hadoop website.

Or download it directly with the following command.

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

Pseudo-distributed installation

Pseudo-distributed mode runs all of Hadoop's daemon processes on a single node to simulate a cluster.

Configure password-free login

To run a Hadoop pseudo-distributed cluster, you need to configure an SSH key pair so that the node can log in to itself without a password.

  • Create a public-private key pair
$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/wux_labs/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/wux_labs/.ssh/id_rsa
Your public key has been saved in /home/wux_labs/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:rTJMxXd8BoyqSpLN0zS15j+rRKBWiZB9jOcmmWz4TFs wux_labs@wux-labs-vm
The key's randomart image is:
+---[RSA 3072]----+
|  .o o     o.    |
|  ..o.+o. ....   |
|   o.*+.oo. o o  |
|  . BoEo+o . o   |
|   OoB.=S .      |
|  o.Ooo...       |
|   o o+ o.       |
|    .  +  o      |
|        ...o     |
+----[SHA256]-----+
  • Copy the public key to authorized_keys
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
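
SSH is strict about file permissions, and the first connection asks you to accept the host key. The following sketch is not part of the original steps, but it tightens the permissions and verifies that you can log in to the local machine without a password.

# Restrict permissions on the .ssh directory and authorized_keys (SSH refuses overly open permissions)
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# Test password-free login; answer "yes" to the host-key prompt on the first connection
ssh localhost exit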

Extract the installation package

Extract the installation package to the target path.

mkdir -p apps
tar -xzf hadoop-3.3.4.tar.gz -C apps

The bin directory stores common Hadoop commands, such as the hdfs command for operating HDFS, as well as the hadoop and yarn commands.

The etc directory stores the Hadoop configuration files; the configurations for HDFS, MapReduce, YARN, and the cluster node list are all located here.

The sbin directory stores commands related to cluster management, such as commands for starting the cluster, starting HDFS, starting YARN, and stopping the cluster.

The share directory stores Hadoop-related resources, such as documentation and the Jar packages of each module.
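
To see this layout for yourself, list the top level of the extracted directory; it should contain at least the bin, etc, sbin, and share directories described above.

# List the top-level contents of the Hadoop distribution
ls apps/hadoop-3.3.4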

Configure environment variables

Configure environment variables, mainly HADOOP_HOME and PATH.

vi ~/.bashrc

export HADOOP_HOME=/home/wux_labs/apps/hadoop-3.3.4
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Let the environment variable take effect:

source ~/.bashrc
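
With HADOOP_HOME and PATH set, the hadoop command should be available in any new shell; printing the version is a quick check.

# Confirm that the hadoop command is on the PATH and reports version 3.3.4
hadoop version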

Configuration files

In addition to the environment variables, pseudo-distributed mode also requires editing several Hadoop configuration files; a couple of follow-up checks implied by these files are shown after the list.

  • hadoop-env.sh configuration
$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/wux_labs/apps/hadoop-3.3.4
export HADOOP_CONF_DIR=/home/wux_labs/apps/hadoop-3.3.4/etc/hadoop
  • core-site.xml configuration
$ vi $HADOOP_HOME/etc/hadoop/core-site.xml

<configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://wux-labs-vm:8020</value>
    </property>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/wux_labs/data/hadoop/temp</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hadoop.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hadoop.groups</name>
      <value>*</value>
    </property>
</configuration>
  • hdfs-site.xml configuration
$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/home/wux_labs/data/hadoop/hdfs/name</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/home/wux_labs/data/hadoop/hdfs/data</value>
    </property>
</configuration>
  • mapred-site.xml configuration
$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
  • yarn-site.xml configuration
$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

<configuration>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>wux-labs-vm</value>
    </property>
</configuration>
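
The configuration files above make two assumptions about the environment: the hostname wux-labs-vm (used by fs.defaultFS and yarn.resourcemanager.hostname) must resolve to this machine, and the directories under /home/wux_labs/data/hadoop will be used for HDFS storage. A minimal sketch of both preparations follows; the 127.0.0.1 hosts entry is only an assumption for a single-node setup where the hostname does not already resolve, and the mkdir calls are optional since Hadoop creates these directories itself.

# Map the hostname used in the configuration files to this machine
# (skip if wux-labs-vm already resolves, or use the machine's real IP instead)
echo "127.0.0.1 wux-labs-vm" | sudo tee -a /etc/hosts

# Optionally pre-create the directories referenced by hadoop.tmp.dir,
# dfs.namenode.name.dir and dfs.datanode.data.dir
mkdir -p /home/wux_labs/data/hadoop/temp
mkdir -p /home/wux_labs/data/hadoop/hdfs/name
mkdir -p /home/wux_labs/data/hadoop/hdfs/data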

Format NameNode

Before starting the cluster, you need to format the NameNode. The command is as follows:

hdfs namenode -format

Start the cluster

Execute the following command to start the cluster.

start-all.sh
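
One way to confirm that the cluster started is to list the running Java processes with jps; in pseudo-distributed mode you would expect to see the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager daemons.

# List the Java processes started by start-all.sh
jps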

Verify Hadoop

Access HDFS

Upload a file to HDFS.

hdfs dfs -put .bashrc /
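
Listing the HDFS root directory should show the uploaded file.

# Confirm the file was uploaded to HDFS
hdfs dfs -ls /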

Open the HDFS Web UI to view related information; the default port is 9870.

Access YARN

Open the YARN Web UI to view related information; the default port is 8088.

Related commands

HDFS-related commands

The command used to operate HDFS is hdfs, and the command format is:

Usage: hdfs [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]

The supported Client commands mainly include:

    Client Commands:

classpath            prints the class path needed to get the hadoop jar and the required libraries
dfs                  run a filesystem command on the file system
envvars              display computed Hadoop environment variables
fetchdt              fetch a delegation token from the NameNode
getconf              get config values from configuration
groups               get the groups which users belong to
lsSnapshottableDir   list all snapshottable dirs owned by the current user
snapshotDiff         diff two snapshots of a directory or diff the current directory contents with a snapshot
version              print the version
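
Among these, hdfs dfs is the subcommand used most often for day-to-day file operations. A short sketch using hypothetical paths for illustration:

# Create a directory, upload a local file, list it, print its contents, and clean up
hdfs dfs -mkdir -p /tmp/demo
hdfs dfs -put ~/.bashrc /tmp/demo/
hdfs dfs -ls /tmp/demo
hdfs dfs -cat /tmp/demo/.bashrc
hdfs dfs -rm -r /tmp/demo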

YARN-related commands

The command used to operate YARN is yarn, and the command format is:

Usage: yarn [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    yarn [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

The supported Client commands mainly include:

    Client Commands:

applicationattempt   prints applicationattempt(s) report
app|application      prints application(s) report/kill application/manage long running application
classpath            prints the class path needed to get the hadoop jar and the required libraries
cluster              prints cluster information
container            prints container(s) report
envvars              display computed Hadoop environment variables
fs2cs                converts Fair Scheduler configuration to Capacity Scheduler (EXPERIMENTAL)
jar <jar>            run a jar file
logs                 dump container logs
nodeattributes       node attributes cli client
queue                prints queue information
schedulerconf        Updates scheduler configuration
timelinereader       run the timeline reader server
top                  view cluster information
version              print the version
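
For example, the application and node subcommands can be used to inspect the cluster; in a pseudo-distributed setup the node list should contain a single NodeManager.

# List applications known to the ResourceManager and the registered nodes
yarn application -list
yarn node -list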

The yarn jar command can execute a jar file.

  • Verification case 1: count the strings containing "dfs"

Create an input directory.

hdfs dfs -mkdir /input

Copy the Hadoop configuration files to the input directory.

hdfs dfs -put apps/hadoop-3.3.4/etc/hadoop/*.xml /input/

The following command executes a sample program that comes with Hadoop, counts the strings containing "dfs" in the input directory, and writes the results to the output directory.

yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar grep /input /output 'dfs[a-z.]+'

The submitted job can be seen in the YARN Web UI.

The execution result is:

$ hdfs dfs -cat /output/*
1       dfsadmin
1       dfs.replication
1       dfs.namenode.name.dir
1       dfs.datanode.data.dir
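
Note that MapReduce refuses to write into an output directory that already exists, so to re-run the example you first need to remove /output.

# Remove the previous output directory before re-running the job
hdfs dfs -rm -r /output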
  • Verification case 2: calculate pi

Run the pi example that comes with Hadoop to estimate the value of pi.

yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 10 10

The execution result is:

$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 10 10
Number of Maps  = 10
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
... ...
Job Finished in 43.768 seconds
Estimated value of Pi is 3.20000000000000000000

The submitted job can be seen in the YARN Web UI.
