Pseudo-Distributed Environment Setup Steps


环境准备

  • CentOS 7

  • JDK 8.x

  • hadoop-2.7.6.tar.gz

    tar zvxf hadoop-2.7.6.tar.gz
    

Upload it to the /opt/soft directory [and give soft 777 permissions]

sudo chmod -R 777 /opt/soft

JDK Configuration

  • First uninstall the default OpenJDK

    • Locate the installed OpenJDK packages

      -- list all installed Java-related packages
      [success@localhost ~]$ rpm -qa | grep java
      java-1.8.0-openjdk-headless-1.8.0.131-11.b12.el7.x86_64
      javapackages-tools-3.4.1-11.el7.noarch
      tzdata-java-2017b-1.el7.noarch
      java-1.7.0-openjdk-headless-1.7.0.141-2.6.10.5.el7.x86_64
      java-1.7.0-openjdk-1.7.0.141-2.6.10.5.el7.x86_64
      java-1.8.0-openjdk-1.8.0.131-11.b12.el7.x86_64
      python-javapackages-3.4.1-11.el7.noarch
      
    • Uninstall - as the root account

      Everything except the .noarch packages must be uninstalled

      [root@localhost success]# rpm -e --nodeps java-1.8.0-openjdk-headless-1.8.0.131-11.b12.el7.x86_64
      [root@localhost success]# rpm -e --nodeps java-1.7.0-openjdk-headless-1.7.0.141-2.6.10.5.el7.x86_64
      [root@localhost success]# rpm -e --nodeps java-1.7.0-openjdk-1.7.0.141-2.6.10.5.el7.x86_64
      [root@localhost success]# rpm -e --nodeps java-1.8.0-openjdk-1.8.0.131-11.b12.el7.x86_64
      [root@localhost success]# rpm -qa | grep java
      javapackages-tools-3.4.1-11.el7.noarch
      tzdata-java-2017b-1.el7.noarch
      python-javapackages-3.4.1-11.el7.noarch
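
      If there are many such packages, they can be removed in one pass; a minimal sketch (run as root, and check the grep output before piping it):

      rpm -qa | grep openjdk | xargs rpm -e --nodeps
      rpm -qa | grep java    # only the .noarch packages should remain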
      
  • Create a soft directory under /opt/ and change its permissions to 777

    [success@localhost opt]$ mkdir soft
    [success@localhost opt]$ sudo chmod 777 soft
    
  • Use the Xftp tool to upload the JDK archive, then extract it

  • Append settings to the end of /etc/profile. The file is root-owned and read-only for normal users:

    -rw-r--r--.  1 root root     1795 Nov  6 2016 profile
    

    Grant write permission:

    [success@localhost etc]$ sudo chmod 777 profile
    

    Then edit the file, adding:

    export JAVA_HOME=/opt/soft/jdk1.8.0_172
    export PATH=$PATH:${JAVA_HOME}/bin
    

    For example, if you later install MyCat (a MySQL middleware), you simply append another PATH entry:

    export JAVA_HOME=/opt/soft/jdk1.8.0_172
    export PATH=$PATH:${JAVA_HOME}/bin
    export PATH=$PATH:/opt/soft/mycat/bin
    
  


Afterwards you can restore /etc/profile to its original permissions [skipped here].

    [success@localhost etc]$ java -version
    bash: java: command not found...

This shows that the /etc/profile settings have not taken effect yet; source the file:

    [success@localhost etc]$ source profile
    [success@localhost etc]$ java -version
    bash: /opt/soft/jdk1.8.0_172/bin/java: Permission denied

The commands under the JDK's bin directory lack the x [execute] permission:

    -rw-rw-r--. 1 success success   7734 Oct 14 16:19 java
    [success@localhost bin]$ pwd
    /opt/soft/jdk1.8.0_172/bin
    [success@localhost bin]$ sudo chmod +x java
    [sudo] password for success:
    Sorry, try again.
    [sudo] password for success:
    [success@localhost bin]$ java -version
    java version "1.8.0_172"
    Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
    Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)
    [success@localhost bin]$ sudo chmod +x javac
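
Rather than granting execute permission one binary at a time, a minimal sketch that fixes the whole bin directory in one pass (this also covers jps, which hits the same error later on):

    sudo chmod -R a+x /opt/soft/jdk1.8.0_172/bin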
  • Test

    [success@localhost bin]$ java -version
    java version "1.8.0_172"
    Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
    Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)
    

Extract Hadoop

  • Extract into the current directory

    [success@localhost soft]$ tar -zxvf hadoop-2.7.6.tar.gz
    
  • Delete the hadoop-2.7.6/share/doc directory to save disk space. [optional]

  • Modify the following three files

    in the hadoop-2.7.6/etc/hadoop directory:

    [success@localhost hadoop]$ vi hadoop-env.sh  # line 25
    [success@localhost hadoop]$ vi mapred-env.sh  # line 16
    [success@localhost hadoop]$ vi yarn-env.sh    # add at line 24
    
    

    Add to each file:

    export JAVA_HOME=/opt/soft/jdk1.8.0_172
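
    Alternatively, a minimal sketch that appends the export to all three files at once (appending works here because the scripts are read top to bottom, so the last assignment wins):

    for f in hadoop-env.sh mapred-env.sh yarn-env.sh; do
        echo 'export JAVA_HOME=/opt/soft/jdk1.8.0_172' >> $f
    done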
    
  • Hadoop's four core modules correspond to four default configuration files

Configuring the Four Core Files

First map the virtual machine's hostname to its IP address:

vi /etc/hosts
Add the following line:
192.168.2.22 localhost.success
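
A quick sanity check that the mapping resolves:

ping -c 1 localhost.success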
  • core-site.xml

    Official reference:

    https://hadoop.apache.org/docs/r2.7.6/hadoop-project-dist/hadoop-common/core-default.xml

    <!-- Specify the address of the HDFS NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost.success:9000</value>
    </property>
    <!-- Specify the storage directory for files Hadoop generates at runtime -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/soft/hadoop-2.7.6/data/tmp</value>
    </property>
    
  • hdfs-site.xml

    <!-- Set the DFS replication factor; the default is 3 -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    

    Quiz: a 160 MB file, replication factor 2, block size 128 MB - how many blocks, and how much storage is actually used?

    Block count: 160/128 -> 1 block with 32 MB left over -> 2 blocks per copy * 2 replicas = 4 blocks

    Actual storage: the final block is not padded to 128 MB, so 160 MB * 2 replicas = 320 MB

  • Modify the slaves file - it specifies where the slave nodes run

    The master node is the namenode; the slave nodes are the datanodes.

    localhost
    
  • Format the file system - only ever needed once

    If HDFS ever breaks, you must first delete /opt/soft/hadoop-2.7.6/data/tmp and also delete /opt/soft/hadoop-2.7.6/logs, then re-format (see the recovery sketch below).

    [success@localhost hadoop-2.7.6]$ bin/hdfs namenode -format
    

    When INFO output like the following appears, the format succeeded:

    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at localhost.success/192.168.2.22
    ************************************************************/
    

    Afterwards, inspect the data/tmp directory created earlier to see what the format added.
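
    A minimal recovery sketch for a broken HDFS (warning: this wipes all HDFS data):

    [success@localhost hadoop-2.7.6]$ rm -rf data/tmp logs
    [success@localhost hadoop-2.7.6]$ bin/hdfs namenode -format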

  • Start NameNode daemon and DataNode daemon

    [success@localhost hadoop-2.7.6]$ sbin/hadoop-daemon.sh start namenode
    starting namenode, logging to /opt/soft/hadoop-2.7.6/logs/hadoop-success-namenode-localhost.success.out
    [success@localhost hadoop-2.7.6]$ sbin/hadoop-daemon.sh start datanode
    starting datanode, logging to /opt/soft/hadoop-2.7.6/logs/hadoop-success-datanode-localhost.success.out
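
    The matching stop commands, for shutting the daemons down later:

    sbin/hadoop-daemon.sh stop datanode
    sbin/hadoop-daemon.sh stop namenode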
    
  • View the HDFS web UI

    From Windows, use ip:port:

    http://192.168.2.22:50070
    

    From Linux, you can also use hostname:port:

    http://localhost.success:50070
    
  • Check the running processes

    • If you hit the following error:

      [success@localhost hadoop-2.7.6]$ jps
      -bash: /opt/soft/jdk1.8.0_172/bin/jps: Permission denied
      
      [success@localhost jdk1.8.0_172]$ cd bin
      [success@localhost bin]$ ls
      appletviewer  javac           javaws    jinfo       jsadebugd     orbd         serialver
      ControlPanel  javadoc         jcmd      jjs         jstack        pack200      servertool
      extcheck      javafxpackager  jconsole  jmap        jstat         policytool   tnameserv
      idlj          javah           jcontrol  jmc         jstatd        rmic         unpack200
      jar           javap           jdb       jmc.ini     jvisualvm     rmid         wsgen
      jarsigner     javapackager    jdeps     jps         keytool       rmiregistry  wsimport
      java          java-rmi.cgi    jhat      jrunscript  native2ascii  schemagen    xjc
      [success@localhost bin]$ sudo chmod +x jps
      [sudo] password for success:
      
    [success@localhost hadoop-2.7.6]$ jps
    8289 Jps
    3636 NameNode
    3690 DataNode
    

    Then open http://localhost.success:50070 in the Windows browser (the hostname form requires the same mapping in the Windows hosts file).

Testing the HDFS Environment

  • Create directories. HDFS has a user home directory concept, just like Linux.

    When the command below runs, the user home directory /user/success is created automatically in HDFS,

    so success/input ends up inside /user/success:

    [success@localhost hadoop-2.7.6]$ bin/hdfs dfs -mkdir -p success/input
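
    Relative paths resolve against the user home directory, so these two listings are equivalent - a quick check:

    bin/hdfs dfs -ls success/input
    bin/hdfs dfs -ls /user/success/success/input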
    
  • Delete an HDFS directory

    [success@localhost hadoop-2.7.6]$ bin/hdfs dfs -rm -r success/input
    
  • Upload files to HDFS

    [success@localhost hadoop-2.7.6]$ bin/hdfs dfs -put etc/hadoop/core-site.xml etc/hadoop/hdfs-site.xml /
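
    Verify the upload by listing the destination:

    bin/hdfs dfs -ls /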
    

  • Read a file from HDFS

    [success@localhost hadoop-2.7.6]$ bin/hdfs dfs -text /core-site.xml
    
  • Download a file to the local file system

    [success@localhost hadoop-2.7.6]$ bin/hdfs dfs -get /core-site.xml /home/success/get-site.xml
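
    A quick round-trip check - the downloaded copy should be byte-identical to the original:

    diff etc/hadoop/core-site.xml /home/success/get-site.xml && echo identical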
    

    Note:

    • HDFS does not support concurrent writes by multiple users
    • HDFS is not well suited to storing large numbers of small files
  • Delete a file

    [success@localhost hadoop-2.7.6]$ bin/hdfs dfs -rm master/input/student.txt
    

YARN Module Setup

  • Configure parameters as follows in etc/hadoop/mapred-site.xml (if the file does not exist, copy it from mapred-site.xml.template):

    <configuration>
        <!-- Run MapReduce on YARN -->
        <property>
         <name>mapreduce.framework.name</name>
         <value>yarn</value>
        </property> 
    </configuration>
    
  • etc/hadoop/yarn-site.xml:

    <!-- How reducers fetch data:
         the auxiliary service that the NodeManager runs for MapReduce
    -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- The address of the YARN ResourceManager,
         i.e. the machine that acts as the resourcemanager master node
    -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>localhost.success</value>
    </property>
    
  • Start ResourceManager daemon and NodeManager daemon

    [success@localhost hadoop-2.7.6]$ sbin/yarn-daemon.sh start resourcemanager
    starting resourcemanager, logging to /opt/soft/hadoop-2.7.6/logs/yarn-success-resourcemanager-localhost.success.out
    [success@localhost hadoop-2.7.6]$ sbin/yarn-daemon.sh start nodemanager
    starting nodemanager, logging to /opt/soft/hadoop-2.7.6/logs/yarn-success-nodemanager-localhost.success.out
    [success@localhost hadoop-2.7.6]$ jps
    9842 Jps
    3636 NameNode
    3690 DataNode
    9658 ResourceManager
    9743 NodeManager
    
    • Open the YARN web UI in a browser: http://192.168.2.22:8088 (the ResourceManager UI defaults to port 8088)

HelloWorld Example

Test the environment by running a MapReduce job: the wordcount word-count example.

MapReduce runs through five stages:

input->map()->shuffle->reduce()->output

Steps

  • To run MapReduce on YARN, the job needs to be packaged as a jar

  • Create a data file for testing MapReduce, as sketched below
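
    A sketch that creates a file consistent with the wordcount output shown later (the exact contents are illustrative):

    [success@localhost hadoop-2.7.6]$ sudo mkdir -p /opt/data && sudo chmod 777 /opt/data
    [success@localhost hadoop-2.7.6]$ printf 'Hello World\nABCDA\n' > /opt/data/file_text.txt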

  • Upload the data file file_text.txt from the local disk to HDFS

    [success@localhost hadoop-2.7.6]$ bin/hdfs dfs -put /opt/data/file_text.txt /user/success/success/input
    
  • Use the example jar that ships with Hadoop

    /opt/soft/hadoop-2.7.6/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar

    • Run

      Note: the output path /user/success/success/output must not already exist in HDFS

      [success@localhost hadoop-2.7.6]$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /user/success/success/input /user/success/success/output
      

    • View the output files

      [success@localhost hadoop-2.7.6]$ bin/hdfs dfs -text /user/success/success/output/p*
      ABCDA    1
      Hello    1
      World    1
      
    • Delete the output files

      bin/hdfs dfs -rm -r /user/success/success/out*
      

    Analyzing map and reduce [supplementary]

    The input file:

    [success@localhost datas]$ cat wordcount.txt
    python hello java
    java mysql hello
    python java
    
    Map-Reduce Framework
            Map input records=3
            Map output records=8
            Map output bytes=79
            Map output materialized bytes=54
            Input split bytes=133
            Combine input records=8
            Combine output records=4
            Reduce input groups=4
            Reduce shuffle bytes=54
            Reduce input records=4
            Reduce output records=4
            Spilled Records=8
            Shuffled Maps =1
            Failed Shuffles=0
            Merged Map outputs=1
            GC time elapsed (ms)=327
            CPU time spent (ms)=2960
            Physical memory (bytes) snapshot=283295744
            Virtual memory (bytes) snapshot=4159811584
            Total committed heap usage (bytes)=138981376
    
  • The stages a wordcount job passes through

    input->map->shuffle->reduce->output

  • Map input records=3

    How does the data get into map()? map is a method; each input record arrives as a <byte offset, line> pair:

    <0,python hello java>
    <18,java mysql hello>
    <35,python java>
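
    A quick sketch to verify those byte offsets (length($0) excludes the trailing newline, hence the +1):

    [success@localhost datas]$ awk 'BEGIN{o=0} {print o "\t" $0; o += length($0)+1}' wordcount.txt
    0       python hello java
    18      java mysql hello
    35      python java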
    
  • Map output records=8

    What the map stage does - no counting and no sorting happens here.

    After map() processes the input, each word becomes a record;

    the key is the word itself and the value is a literal 1:

    <python,1> <hello,1> <java,1>
    <java,1> <mysql,1> <hello,1>
    <python,1> <java,1>
    
  • The shuffle stage - done automatically by the framework, no programmer code required

    Identical keys are grouped together and sorted by key, so conceptually the reducer sees one list of values per key (the counters show a combiner also pre-summed on the map side, hence Combine output records=4):

    <hello,[1,1]>
    <java,[1,1,1]>
    <mysql,[1]>
    <python,[1,1]>
    
  • Reduce input groups=4

    reduce() does the actual counting - it sums the values for each key:

    hello   2
    java    3
    mysql   1
    python  2
    
  • Reduce output records=4 - one output line per distinct word

What to Master Today

  • the meaning of namenode and datanode
  • every stage a wordcount run passes through [one full MapReduce pass]
