Brief Introduction
Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the underlying details of the distribution, taking full advantage of the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system (HDFS). HDFS is fault-tolerant and designed to be deployed on low-cost hardware. It also provides high throughput for accessing application data, making it suitable for applications with large data sets. HDFS relaxes some POSIX requirements to allow streaming access to data in the file system. The core of Hadoop's design is HDFS and MapReduce: HDFS provides storage for vast amounts of data, while MapReduce provides computation over vast amounts of data.
Advantages
1) Hadoop is a software framework capable of distributed processing of large amounts of data. It handles data in a reliable, efficient, and scalable manner.
2) Hadoop is reliable because it assumes that computing elements and storage will fail, so it maintains multiple copies of working data and can redistribute work from failed nodes.
3) Hadoop is efficient because it works in parallel, speeding up processing through parallel computation.
4) Hadoop is also scalable and can handle petabyte-scale data.
5) Hadoop is community-supported, so it is inexpensive for anyone to use.
6) Hadoop is a distributed computing platform that allows users to easily build and run applications that handle large amounts of data.
It has the following main advantages:
1. High reliability: Hadoop's bit-by-bit storage and processing of data can be trusted.
2. High scalability: Hadoop distributes data and computing tasks across clusters of available computers, which can easily be extended to thousands of nodes.
3. Efficiency: Hadoop can dynamically move data between nodes and keep each node dynamically balanced, so processing is very fast.
4. High fault tolerance: Hadoop automatically keeps multiple copies of the data and automatically reassigns failed tasks.
5. Low cost: Hadoop is open source, so the software cost of a project is greatly reduced.
6. Hadoop's framework is written in Java, so it is ideal to run on Linux production platforms. Applications on Hadoop can also be written in other languages, such as C++.
Significance of Hadoop for Big Data Processing
Hadoop is widely used in big data processing applications thanks to its natural advantages in data extraction, transformation, and loading (ETL). Hadoop's distributed architecture puts the big data processing engine as close as possible to the storage, which suits batch operations such as ETL, since batch results can go directly into storage. Hadoop's MapReduce function splits a single task into fragments, sends each fragment (Map) to a node for processing, and then reduces the results and loads them as a single dataset into the data warehouse.
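To make the map/reduce split concrete, below is a minimal sketch using Hadoop Streaming, which can be run once the cluster set up in the rest of this guide is running. The HDFS input and output paths are hypothetical, and the streaming jar path assumes the Hadoop 3.1.3 layout used later in this guide.
#Hypothetical example: each mapper emits its input lines unchanged, the reducer aggregates them with wc
[root@node1 ~]# hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.3.jar \
    -input /demo/input \
    -output /demo/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc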
Preparation in Advance
Prepare three CentOS 7 virtual machines as the nodes of the Hadoop cluster: configure IP addresses and hostnames, turn off the firewall and selinux, synchronize time, and configure the mapping between IP addresses and hostnames (a minimal command sketch follows the table below).
hostname | ip |
---|---|
node1 | 192.168.29.143 |
node2 | 192.168.29.142 |
node3 | 192.168.29.144 |
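A minimal sketch of that preparation (run on each node, adjusting the hostname) might look like the following; the hostnames and IP addresses come from the table above, and chronyd is assumed as the time-sync service:
#Set the hostname (node1 here; use node2/node3 on the other machines)
[root@node1 ~]# hostnamectl set-hostname node1
#Turn off the firewall and selinux
[root@node1 ~]# systemctl stop firewalld && systemctl disable firewalld
[root@node1 ~]# setenforce 0
[root@node1 ~]# sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
#Synchronize time (assuming chrony is installed)
[root@node1 ~]# systemctl enable --now chronyd
#Map IP addresses to hostnames on every node
[root@node1 ~]# cat >> /etc/hosts <<EOF
192.168.29.143 node1
192.168.29.142 node2
192.168.29.144 node3
EOF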
Configure Passwordless SSH Login
[root@node1 ~]# ssh-keygen
[root@node2 ~]# ssh-keygen
[root@node3 ~]# ssh-keygen
[root@node1 ~]# ssh-copy-id root@192.168.29.142
[root@node1 ~]# ssh-copy-id root@192.168.29.143
[root@node1 ~]# ssh-copy-id root@192.168.29.144
[root@node2 ~]# ssh-copy-id root@192.168.29.142
[root@node2 ~]# ssh-copy-id root@192.168.29.143
[root@node2 ~]# ssh-copy-id root@192.168.29.144
[root@node3 ~]# ssh-copy-id root@192.168.29.143
[root@node3 ~]# ssh-copy-id root@192.168.29.144
[root@node3 ~]# ssh-copy-id root@192.168.29.142
#Verify passwordless login
[root@node1 ~]# ssh root@ip
[root@node2 ~]# ssh root@ip
[root@node3 ~]# ssh root@ip
Install Java Environment
Download and unpack a JDK package from the official website.
All three nodes need the Java environment installed.
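A minimal sketch of installing the JDK, assuming the JDK 8u241 tarball (jdk-8u241-linux-x64.tar.gz) was downloaded to /root; the archive and directory names are assumptions and should be adjusted to the file actually downloaded:
#Unpack the JDK and place it at /usr/local/java (repeat on node2 and node3)
[root@node1 ~]# tar -zxvf jdk-8u241-linux-x64.tar.gz -C /usr/local/
[root@node1 ~]# mv /usr/local/jdk1.8.0_241 /usr/local/java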
#Add environment variables
[root@node1 ~]# vi /etc/profile
JAVA_HOME=/usr/local/java
CLASS_PATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$PATH:$JAVA_HOME/bin
export PATH JAVA_HOME CLASS_PATH
#Reload environment variables
[root@node1 ~]# source /etc/profile
#Check the Java environment
[root@node1 ~]# java -version
java version "1.8.0_241"
Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)
Install and Deploy Hadoop
Download and unpack the Hadoop binary package from the Hadoop website.
#Unpack so that /usr/local/hadoop is the Hadoop root directory (the paths used below assume this layout)
[root@node1 ~]# tar -zxvf hadoop-3.1.3.tar.gz -C /usr/local/
[root@node1 ~]# mv /usr/local/hadoop-3.1.3 /usr/local/hadoop
#Add environment variables to make it easy to start the Hadoop cluster
[root@node1 ~]# vi /etc/profile
PATH=$PATH:/usr/local/hadoop/bin:$JAVA_HOME/bin:/usr/local/hadoop/sbin
export PATH JAVA_HOME CLASS_PATH
#Reload environment variables
[root@node1 ~]# source /etc/profile
Configure Hadoop on Node1
Configure hadoop-env.sh
#Add the JAVA environment variable
[root@node1 ~]# vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/java
Configure yarn-env.sh
#Add the JAVA environment variable
[root@node1 ~]# vi /usr/local/hadoop/etc/hadoop/yarn-env.sh
export JAVA_HOME=/usr/local/java
Configure core-site.xml
[root@node1 ~]# vi /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.29.143:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- The folder needs to be created ahead of time -->
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>
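The directory referenced by hadoop.tmp.dir does not exist by default; one way to create it ahead of time (it is copied to node2 and node3 later together with the rest of /usr/local/hadoop) is:
#Create the folder used by hadoop.tmp.dir
[root@node1 ~]# mkdir -p /usr/local/hadoop/tmp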
Configure hdfs-site.xml
[root@node1 ~]# vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>192.168.29.143:50070</value>
  </property>
  <property>
    <!-- Run the secondary NameNode on node2 -->
    <name>dfs.namenode.secondary.http-address</name>
    <value>192.168.29.142:50090</value>
  </property>
</configuration>
Configure mapred-site.xml
[root@node1 ~]# vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1500</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>3000</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1200m</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx2600m</value>
  </property>
</configuration>
Configure yarn-site.xml
[root@node1 ~]# vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.29.143</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.application.classpath</name>
    <value>/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/yarn:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>22528</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1500</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>16384</value>
  </property>
</configuration>
Configure workers
Note: in Hadoop 2.x this file is called slaves.
[root@node1 ~]# vi /usr/local/hadoop/etc/hadoop/workers
192.168.29.143
192.168.29.142
192.168.29.144
Transfer the Configured Hadoop Directory to Node2 and Node3
[root@node1 ~]# scp -r /usr/local/hadoop root@192.168.29.142:/usr/local/hadoop
[root@node1 ~]# scp -r /usr/local/hadoop root@192.168.29.144:/usr/local/hadoop
Start Hadoop Cluster
#Start the cluster on the node1 node
[root@node1 ~]# start-all.sh
#Check the cluster startup status
#node1 node
[root@node1 ~]# jps
3154 ResourceManager
3651 Jps
2586 NameNode
3293 NodeManager
2718 DataNode
#node2 node
[root@node2 ~]# jps
2203 NodeManager
2315 Jps
2013 DataNode
2109 SecondaryNameNode
#node3 node
[root@node3 ~]# jps
2126 NodeManager
2015 DataNode
2239 Jps
#The process distribution matches the configuration files, so the cluster deployment is successful
#Stop the cluster
[root@node1 ~]# stop-all.sh
View the Hadoop Cluster in the Web UI
Visit node1:8088 to view the cluster status.
Visit node1:50070 to see an overview of the cluster and the contents stored in HDFS.
Click "Browse the file system" to view the contents of the file system, and to upload files to and download files from HDFS; the same can be done from the command line, as shown below.
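A minimal command-line sketch of uploading and downloading with the hdfs client; the directory and file names below are hypothetical:
#Upload a local file to HDFS
[root@node1 ~]# hdfs dfs -mkdir -p /test
[root@node1 ~]# hdfs dfs -put /root/example.txt /test/
#List the directory and download the file again
[root@node1 ~]# hdfs dfs -ls /test
[root@node1 ~]# hdfs dfs -get /test/example.txt /tmp/example.txt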
Test Validation
Hadoop officially provides example jobs for testing, including one that estimates the value of pi.
#Run the example Hadoop job
[root@node1 ~]# hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar pi 20 50
#View the job result on the command line
Job Finished in 172.314 seconds
2020-06-06 20:13:54,426 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
Estimated value of Pi is 3.14800000000000000000
Web page view of the job execution and its results
At this point, the Hadoop cluster environment has been successfully deployed.