Preparations before installation
OS preparation
The operating system used in this installation is Ubuntu 20.04.
Update the package list.
sudo apt-get update
Install Java 8+
Install Java 8 with the following command.
sudo apt-get install -y openjdk-8-jdk
Configure environment variables.
vi ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Let the environment variable take effect.
source ~/.bashrc
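If the installation and environment variable are set up correctly, a quick sanity check is to print the Java version and JAVA_HOME (the exact build string in the output will vary by environment):
java -version
echo $JAVA_HOME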
Download the Hadoop installation package
Download the installation package from the Apache Hadoop official website.
Or download it directly with the following command.
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
Pseudo-distributed installation
In pseudo-distributed mode, all Hadoop daemons run as separate processes on a single node to simulate a cluster.
Configure password-free login
To run a Hadoop pseudo-distributed cluster, you need to configure an SSH key pair so that the node can log in to itself without a password.
- Create a public-private key pair
$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/wux_labs/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/wux_labs/.ssh/id_rsa
Your public key has been saved in /home/wux_labs/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:rTJMxXd8BoyqSpLN0zS15j+rRKBWiZB9jOcmmWz4TFs wux_labs@wux-labs-vm
The key's randomart image is:
+---[RSA 3072]----+
|      .o o o.    |
|    ..o.+o. .... |
|   o.*+.oo.  o o |
|  . BoEo+o  . o  |
|    OoB.=S .     |
|    o.Ooo...     |
|     o o+ o.     |
|      . + o      |
|       ...o      |
+----[SHA256]-----+
- Copy the public key
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
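Depending on how ~/.ssh was created, the permissions on the key files may need to be tightened; afterwards, a test login to localhost should succeed without prompting for a password (the chmod step is only necessary if the permissions are too open):
chmod 0600 ~/.ssh/authorized_keys
ssh localhost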
Extract the installation package
Extract the installation package to the target path.
mkdir -p apps
tar -xzf hadoop-3.3.4.tar.gz -C apps
The bin directory stores common Hadoop commands, such as hdfs for operating HDFS, as well as hadoop and yarn.
The etc directory stores the Hadoop configuration files; the configurations for HDFS, MapReduce, YARN, and the cluster node list are all located there.
The sbin directory stores cluster-management commands, such as the scripts for starting the cluster, starting HDFS, starting YARN, and stopping the cluster.
The share directory stores Hadoop-related resources, such as the documentation and the JAR packages of each module.
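Listing the extracted directory shows this layout (the exact file list may differ slightly between Hadoop versions):
$ ls apps/hadoop-3.3.4
bin  etc  include  lib  libexec  LICENSE-binary  licenses-binary  LICENSE.txt  NOTICE-binary  NOTICE.txt  README.txt  sbin  share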
Configure environment variables
Configure environment variables, mainly HADOOP_HOME and PATH.
vi ~/.bashrc
export HADOOP_HOME=/home/wux_labs/apps/hadoop-3.3.4
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
Let the environment variable take effect:
source ~/.bashrc
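With HADOOP_HOME and PATH in effect, the hadoop command is available from any directory; checking the version is a simple way to verify (the build details in the output will differ by environment):
hadoop version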
Configuration files
In addition to configuring environment variables, pseudo-distributed mode also needs to configure Hadoop configuration files.
- hadoop-env.sh configuration
$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/wux_labs/apps/hadoop-3.3.4
export HADOOP_CONF_DIR=/home/wux_labs/apps/hadoop-3.3.4/etc/hadoop
- core-site.xml configuration
$ vi $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://wux-labs-vm:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/wux_labs/data/hadoop/temp</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>
</configuration>
- hdfs-site.xml configuration
$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/wux_labs/data/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/wux_labs/data/hadoop/hdfs/data</value>
    </property>
</configuration>
- mapred-site.xml configuration
$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
- yarn-site.xml configuration
$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>wux-labs-vm</value>
    </property>
</configuration>
Format NameNode
Before starting the cluster, you need to format the NameNode. The command is as follows:
hdfs namenode -format
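If the format succeeds, the NameNode metadata directory configured in hdfs-site.xml is created and populated; listing it is a quick way to confirm (the path follows the dfs.namenode.name.dir value above):
ls /home/wux_labs/data/hadoop/hdfs/name/current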
Start the cluster
Execute the following command to start the cluster.
start-all.sh
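After the script finishes, the jps command should list the HDFS and YARN daemon processes: NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (each preceded by its process ID, which will differ on every machine):
jps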
Verify Hadoop
Access HDFS
Upload a file to HDFS.
hdfs dfs -put .bashrc /
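To confirm the upload, list the root directory of HDFS; the .bashrc file should appear, owned by the current user:
hdfs dfs -ls /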
Open the HDFS Web UI to view related information; the default port is 9870.
Visit YARN
Open the YARN Web UI to view related information; the default port is 8088.
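YARN can also be checked from the command line; in a pseudo-distributed cluster a single NodeManager should be reported as RUNNING:
yarn node -list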
Related commands
HDFS related commands
The command used to operate HDFS is hdfs, and the command format is:
Usage: hdfs [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
The supported Client commands mainly include:
Client Commands:
    classpath             prints the class path needed to get the hadoop jar and the required libraries
    dfs                   run a filesystem command on the file system
    envvars               display computed Hadoop environment variables
    fetchdt               fetch a delegation token from the NameNode
    getconf               get config values from configuration
    groups                get the groups which users belong to
    lsSnapshottableDir    list all snapshottable dirs owned by the current user
    snapshotDiff          diff two snapshots of a directory or diff the current directory contents with a snapshot
    version               print the version
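For example, the getconf subcommand reads values from the loaded configuration, which is a convenient way to confirm which NameNode address the client will use (the value reflects fs.defaultFS in core-site.xml above):
hdfs getconf -confKey fs.defaultFS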
YARN-related commands
The command used to operate YARN is yarn, and the command format is:
Usage: yarn [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    yarn [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class
The supported Client commands mainly include:
Client Commands:
    applicationattempt    prints applicationattempt(s) report
    app|application       prints application(s) report/kill application/manage long running application
    classpath             prints the class path needed to get the hadoop jar and the required libraries
    cluster               prints cluster information
    container             prints container(s) report
    envvars               display computed Hadoop environment variables
    fs2cs                 converts Fair Scheduler configuration to Capacity Scheduler (EXPERIMENTAL)
    jar <jar>             run a jar file
    logs                  dump container logs
    nodeattributes        node attributes cli client
    queue                 prints queue information
    schedulerconf         Updates scheduler configuration
    timelinereader        run the timeline reader server
    top                   view cluster information
    version               print the version
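For example, the application subcommand lists the applications known to the ResourceManager, which is useful for checking the state of submitted jobs:
yarn application -list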
yarn jar can execute a jar file.
- Verification case 1: count the strings containing "dfs"
Create an input directory.
hdfs dfs -mkdir /input
Copy the Hadoop configuration file to the input directory.
hdfs dfs -put apps/hadoop-3.3.4/etc/hadoop/*.xml /input/
The following command executes a sample program that comes with Hadoop; it counts the strings containing dfs in the input directory and writes the results to the output directory.
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar grep /input /output 'dfs[a-z.]+'
The submitted Job can be seen on YARN.
The execution result is:
$ hdfs dfs -cat /output/*
1    dfsadmin
1    dfs.replication
1    dfs.namenode.name.dir
1    dfs.datanode.data.dir
- Verification case 2: calculate pi
Run another example that comes with Hadoop, this one calculating pi.
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 10 10
The execution result is:
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 10 10
Number of Maps = 10
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
...
Job Finished in 43.768 seconds
Estimated value of Pi is 3.20000000000000000000
The submitted Job can be seen on YARN.