Environment Installation

[TOC]

Version Selection

  • flume-ng-1.6.0-cdh5.8.0.tar.gz

  • hadoop-2.6.0-cdh5.8.0.tar.gz

  • hbase-1.2.0-cdh5.8.0.tar.gz

  • hbase-solr-1.5-cdh5.8.0.tar.gz

  • hive-1.1.0-cdh5.8.0.tar.gz

  • hue-3.9.0-cdh5.8.0.tar.gz

  • oozie-4.1.0-cdh5.8.0.tar.gz

  • pig-0.12.0-cdh5.8.0.tar.gz

  • solr-4.10.3-cdh5.8.0.tar.gz

  • spark-1.6.0-cdh5.8.0.tar.gz

  • sqoop-1.4.6-cdh5.8.0.tar.gz

  • sqoop2-1.99.5-cdh5.8.0.tar.gz

  • zookeeper-3.4.5-cdh5.8.0.tar.gz

Preparation

IP              Hostname                ZK/JournalNode    HDFS                YARN              HBase          Hive          Role
10.19.138.198   thadoop-uelrcx-host1    -                 namenode            resourcemanager   hmaster        hiveserver2   master
10.19.134.88    thadoop-uelrcx-host2    zk, journalnode   namenode/datanode   nodemanager       regionserver   -             worker
10.19.164.182   thadoop-uelrcx-host3    zk, journalnode   datanode            nodemanager       regionserver   -             worker
10.19.78.105    thadoop-uelrcx-host4    zk, journalnode   datanode            nodemanager       regionserver   -             worker
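
Every node must be able to resolve every hostname above. A minimal sketch, assuming resolution is done through /etc/hosts on each machine (a DNS entry works equally well):

# /etc/hosts — identical on all four nodes (illustrative)
10.19.138.198   thadoop-uelrcx-host1
10.19.134.88    thadoop-uelrcx-host2
10.19.164.182   thadoop-uelrcx-host3
10.19.78.105    thadoop-uelrcx-host4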

Set up passwordless SSH login
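
A minimal sketch, assuming the cluster runs as the hadoop user (the user implied by dfs.ha.fencing.ssh.private-key-files below) and that thadoop-uelrcx-host1 drives the other nodes:

# Generate a key pair on thadoop-uelrcx-host1, then copy the public key
# to every node, including host1 itself (adjust the user if yours differs).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in thadoop-uelrcx-host{1,2,3,4}; do
  ssh-copy-id hadoop@$host
done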

ZooKeeper Cluster Installation and Configuration

Rename the sample ZooKeeper configuration file

mv zoo_sample.cfg zoo.cfg

Set the ZooKeeper parameters

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper
clientPort=2181
server.2=thadoop-uelrcx-host2:2888:3888
server.3=thadoop-uelrcx-host3:2888:3888
server.4=thadoop-uelrcx-host4:2888:3888

Copy the ZooKeeper installation to thadoop-uelrcx-host2, thadoop-uelrcx-host3, and thadoop-uelrcx-host4.

On each of the three machines, create a myid file under /tmp/zookeeper containing 2, 3, and 4 respectively, matching the server.N entries above; for example:
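
# on thadoop-uelrcx-host2 (use 3 and 4 on host3 and host4)
mkdir -p /tmp/zookeeper
echo 2 > /tmp/zookeeper/myid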

Start each ZooKeeper instance

bin/zkServer.sh start

Verify that it started successfully

bin/zkServer.sh status
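
On a healthy ensemble, one node reports leader and the others follower, roughly:

$ bin/zkServer.sh status
...
Mode: follower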

Connect to the ZooKeeper service

bin/zkCli.sh -server *********

Hadoop Installation

Edit core-site.xml (properties are listed here as name = value, with the original descriptions in parentheses)

fs.defaultFS = hdfs://thadoopcluster
io.file.buffer.size = 131072
ha.zookeeper.quorum = thadoop-uelrcx-host2:2181,thadoop-uelrcx-host3:2181,thadoop-uelrcx-host4:2181
hadoop.tmp.dir = /data/hadoop/tmp    (A base for other temporary directories.)
hadoop.proxyuser.hadoop.hosts = *
hadoop.proxyuser.hadoop.groups = *

Edit hdfs-site.xml

dfs.nameservices = thadoopcluster
dfs.ha.namenodes.thadoopcluster = nn1,nn2
dfs.namenode.rpc-address.thadoopcluster.nn1 = thadoop-uelrcx-host1:9000
dfs.namenode.rpc-address.thadoopcluster.nn2 = thadoop-uelrcx-host2:9000
dfs.namenode.http-address.thadoopcluster.nn1 = thadoop-uelrcx-host1:50070
dfs.namenode.http-address.thadoopcluster.nn2 = thadoop-uelrcx-host2:50070
dfs.namenode.shared.edits.dir = qjournal://thadoop-uelrcx-host2:8485;thadoop-uelrcx-host3:8485;thadoop-uelrcx-host4:8485/thadoopcluster
dfs.client.failover.proxy.provider.thadoopcluster = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.ha.fencing.methods = sshfence
dfs.ha.fencing.ssh.private-key-files = /home/hadoop/.ssh/id_rsa
dfs.journalnode.edits.dir = /data/hadoop/journal/node/local/data
dfs.ha.automatic-failover.enabled = true
dfs.namenode.name.dir = /data/hadoop/namenode
dfs.datanode.data.dir = /data/hadoop/datanode
dfs.replication = 3
dfs.webhdfs.enabled = true

Edit mapred-site.xml

mapreduce.framework.name = yarn

Edit yarn-site.xml

yarn.resourcemanager.connect.retry-interval.ms = 2000
yarn.resourcemanager.ha.enabled = true
yarn.resourcemanager.ha.rm-ids = rm1,rm2
yarn.resourcemanager.ha.automatic-failover.enabled = true
yarn.resourcemanager.hostname.rm1 = thadoop-uelrcx-host1
yarn.resourcemanager.hostname.rm2 = thadoop-uelrcx-host2
yarn.resourcemanager.ha.id = rm1    (If we want to launch more than one RM in a single node, we need this configuration)
yarn.resourcemanager.recovery.enabled = true
yarn.resourcemanager.zk-state-store.address = thadoop-uelrcx-host2:2181,thadoop-uelrcx-host3:2181,thadoop-uelrcx-host4:2181
yarn.resourcemanager.store.class = org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
yarn.resourcemanager.zk-address = thadoop-uelrcx-host2:2181,thadoop-uelrcx-host3:2181,thadoop-uelrcx-host4:2181
yarn.resourcemanager.cluster-id = thadoopcluster-yarn
yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms = 5000
yarn.resourcemanager.address.rm1 = thadoop-uelrcx-host1:8132
yarn.resourcemanager.scheduler.address.rm1 = thadoop-uelrcx-host1:8130
yarn.resourcemanager.webapp.address.rm1 = thadoop-uelrcx-host1:8188
yarn.resourcemanager.resource-tracker.address.rm1 = thadoop-uelrcx-host1:8131
yarn.resourcemanager.admin.address.rm1 = thadoop-uelrcx-host1:8033
yarn.resourcemanager.ha.admin.address.rm1 = thadoop-uelrcx-host1:23142
yarn.resourcemanager.address.rm2 = thadoop-uelrcx-host2:8132
yarn.resourcemanager.scheduler.address.rm2 = thadoop-uelrcx-host2:8130
yarn.resourcemanager.webapp.address.rm2 = thadoop-uelrcx-host2:8188
yarn.resourcemanager.resource-tracker.address.rm2 = thadoop-uelrcx-host2:8131
yarn.resourcemanager.admin.address.rm2 = thadoop-uelrcx-host2:8033
yarn.resourcemanager.ha.admin.address.rm2 = thadoop-uelrcx-host2:23142
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.nodemanager.aux-services.mapreduce.shuffle.class = org.apache.hadoop.mapred.ShuffleHandler
yarn.nodemanager.local-dirs = /data/hadoop/yarn/local
yarn.nodemanager.log-dirs = /data/hadoop/log
mapreduce.shuffle.port = 23080
yarn.client.failover-proxy-provider = org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider
yarn.resourcemanager.ha.automatic-failover.zk-base-path = /yarn-leader-election    (Optional setting. The default value is /yarn-leader-election)

Edit slaves

thadoop-uelrcx-host2
thadoop-uelrcx-host3
thadoop-uelrcx-host4

Distribute the Hadoop installation files to all four servers, for example as sketched below.
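
A minimal sketch, assuming the installation lives under /data/hadoop on every node (the path used throughout this guide) and that host1 already has it:

for host in thadoop-uelrcx-host2 thadoop-uelrcx-host3 thadoop-uelrcx-host4; do
  rsync -az /data/hadoop/hadoop-2.6.0-cdh5.8.0 $host:/data/hadoop/
done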

Start the journalnodes

Run on thadoop-uelrcx-host1:

sbin/hadoop-daemons.sh start journalnode

Or log in to thadoop-uelrcx-host2, thadoop-uelrcx-host3, and thadoop-uelrcx-host4 individually and run on each:

sbin/hadoop-daemon.sh start journalnode

Check with jps that a JournalNode process is running on each of those nodes.
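
The output should include a JournalNode entry, roughly like this (PIDs are illustrative; QuorumPeerMain is the ZooKeeper process also running on these hosts):

$ jps
3042 JournalNode
2871 QuorumPeerMain
3120 Jps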

Format HDFS

Run on thadoop-uelrcx-host1:

bin/hdfs namenode -format

Start the namenode

sbin/hadoop-daemon.sh start namenode

Run the following on thadoop-uelrcx-host2 to sync the metadata to the standby namenode:

bin/hdfs namenode -bootstrapStandby

Format the HA state in ZooKeeper

bin/hdfs zkfc -formatZK

Start HDFS

Run the following on thadoop-uelrcx-host1 to start DFS:

sbin/start-dfs.sh
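
Once HDFS is up, the HA state can be checked with hdfs haadmin; one namenode should report active and the other standby:

bin/hdfs haadmin -getServiceState nn1
bin/hdfs haadmin -getServiceState nn2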

Start YARN

Run the following on thadoop-uelrcx-host1 to start YARN:

sbin/start-yarn.sh
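
Note that start-yarn.sh only starts a ResourceManager on the local machine; in this HA setup the second ResourceManager typically has to be started on thadoop-uelrcx-host2 by hand, after which yarn rmadmin can confirm the HA state:

# on thadoop-uelrcx-host2
sbin/yarn-daemon.sh start resourcemanager

# from either host: one RM should report active, the other standby
bin/yarn rmadmin -getServiceState rm1
bin/yarn rmadmin -getServiceState rm2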

HDFS on Multi-Homed Networks

Hadoop is now often deployed in multi-homed network environments: the cluster communicates internally over internal IPs, while external clients reach it via external IPs. This arrangement has several advantages:

  • Security: isolating the internal cluster network from the externally facing network keeps the data safe.

  • Performance: the internal network can use high-bandwidth links such as fiber, broadband, or gigabit Ethernet.

  • Failover/Redundancy: with multiple networks, a node can tolerate the failure of a single network adapter.

Hadoop configuration changes for multi-homed environments

Ensuring HDFS Daemons Bind All Interfaces

By default, an HDFS endpoint may be specified as either a hostname or an IP address; in either case the HDFS daemon binds to a single IP, so it is unreachable from the other networks. The fix in a multi-homed environment is to force the daemons to bind to the wildcard address 0.0.0.0 (the port portion of the original address setting is left unchanged).

dfs.namenode.rpc-bind-host = 0.0.0.0
    The actual address the RPC server will bind to. If this optional address is set, it overrides only the hostname portion of dfs.namenode.rpc-address. It can also be specified per name node or name service for HA/Federation. This is useful for making the name node listen on all interfaces by setting it to 0.0.0.0.

dfs.namenode.servicerpc-bind-host = 0.0.0.0
    The actual address the service RPC server will bind to. If this optional address is set, it overrides only the hostname portion of dfs.namenode.servicerpc-address. It can also be specified per name node or name service for HA/Federation. This is useful for making the name node listen on all interfaces by setting it to 0.0.0.0.

dfs.namenode.http-bind-host = 0.0.0.0
    The actual address the HTTP server will bind to. If this optional address is set, it overrides only the hostname portion of dfs.namenode.http-address. It can also be specified per name node or name service for HA/Federation. This is useful for making the name node HTTP server listen on all interfaces by setting it to 0.0.0.0.

dfs.namenode.https-bind-host = 0.0.0.0
    The actual address the HTTPS server will bind to. If this optional address is set, it overrides only the hostname portion of dfs.namenode.https-address. It can also be specified per name node or name service for HA/Federation. This is useful for making the name node HTTPS server listen on all interfaces by setting it to 0.0.0.0.

Clients use Hostnames when connecting to DataNodes

By default, HDFS clients connect to datanodes using the IP addresses handed out by the namenode, but those addresses may not be reachable from the client. The fix is to have clients address datanodes by hostname, resolved via DNS, instead.

dfs.client.use.datanode.hostname = true
    Whether clients should use datanode hostnames when connecting to datanodes.

DataNodes use HostNames when connecting to other DataNodes

In rare cases, a datanode's internal IP may not be reachable from other datanodes; in that case, configure datanodes to address each other by hostname, resolved via DNS.

dfs.datanode.use.datanode.hostname = true
    Whether datanodes should use datanode hostnames when connecting to other datanodes for data transfer.

Ensuring YARN Daemons Bind All Interfaces

yarn.nodemanager.bind-host = 0.0.0.0
yarn.timeline-service.bind-host = 0.0.0.0
yarn.resourcemanager.bind-host = 0.0.0.0

HBase Installation and Configuration

Disable the ZooKeeper bundled with HBase

Edit conf/hbase-env.sh:

export HBASE_MANAGES_ZK=false

Edit hbase-site.xml

hbase.rootdir = hdfs://thadoop-uelrcx-host1:9000/hbase
hbase.cluster.distributed = true
hbase.zookeeper.quorum = thadoop-uelrcx-host2,thadoop-uelrcx-host3,thadoop-uelrcx-host4
hbase.zookeeper.property.dataDir = /data/hadoop/log/hbase/zookeeper

Edit regionservers

thadoop-uelrcx-host2
thadoop-uelrcx-host3
thadoop-uelrcx-host4

Distribute the HBase installation directory to all four machines (e.g. with rsync, as in the Hadoop step above).

Start HBase on thadoop-uelrcx-host1:

bin/start-hbase.sh
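
A quick sanity check from the HBase shell (the exact numbers depend on the cluster):

$ bin/hbase shell
hbase(main):001:0> status
3 servers, 0 dead, ... average load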

Hive Installation and Configuration

Edit hive-env.sh and set HADOOP_HOME:

HADOOP_HOME=/data/hadoop/hadoop-2.6.0-cdh5.8.0

Edit hive-site.xml

javax.jdo.option.ConnectionURL = jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&autoReconnect=true&characterEncoding=UTF-8
    (the URL of the MySQL database)
hive.jobname.length = 30
javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName = hive
javax.jdo.option.ConnectionPassword = hive
datanucleus.autoCreateSchema = false
datanucleus.fixedDatastore = true
datanucleus.autoStartMechanism = SchemaTable
hive.metastore.warehouse.dir = /user/hive/warehouse
hive.support.concurrency = true
hive.zookeeper.quorum = thadoop-uelrcx-host2,thadoop-uelrcx-host3,thadoop-uelrcx-host4
hive.zookeeper.client.port = 2181
hive.server2.thrift.port = 10000
hive.aux.jars.path = file:///data/hadoop/hive-1.1.0-cdh5.8.0/lib/hive-json-serde.jar,file:///data/hadoop/hive-1.1.0-cdh5.8.0/lib/hive-contrib.jar,file:///data/hadoop/hive-1.1.0-cdh5.8.0/lib/hive-serde.jar
hbase.zookeeper.quorum = thadoop-uelrcx-host2,thadoop-uelrcx-host3,thadoop-uelrcx-host4

Add the required jars (typically into $HIVE_HOME/lib; see the sketch after this list):

  • mysql-connector-java-3.1.14-bin.jar

  • hbase-client-1.2.0-cdh5.8.0.jar, hbase-common-1.2.0-cdh5.8.0.jar, hbase-hadoop-compat-1.2.0-cdh5.8.0.jar, hbase-hadoop2-compat-1.2.0-cdh5.8.0.jar

  • netty-all-4.0.23.Final.jar

  • metrics-core-2.2.0.jar
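
A sketch of copying them in, assuming the HBase jars are taken from the HBase distribution already on the machine (the MySQL connector has to be obtained separately):

cp /data/hadoop/hbase-1.2.0-cdh5.8.0/lib/hbase-client-1.2.0-cdh5.8.0.jar \
   /data/hadoop/hbase-1.2.0-cdh5.8.0/lib/hbase-common-1.2.0-cdh5.8.0.jar \
   /data/hadoop/hbase-1.2.0-cdh5.8.0/lib/hbase-hadoop-compat-1.2.0-cdh5.8.0.jar \
   /data/hadoop/hbase-1.2.0-cdh5.8.0/lib/hbase-hadoop2-compat-1.2.0-cdh5.8.0.jar \
   /data/hadoop/hive-1.1.0-cdh5.8.0/lib/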

Start hiveserver2

  1. Initialize the metastore schema with schematool:

    $HIVE_HOME/bin/schematool -dbType mysql -initSchema

  2. Start hiveserver2:

    $HIVE_HOME/bin/hiveserver2

  3. Connect to Hive with beeline:

    $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT
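
For this cluster that would be, for example (hiveserver2 runs on thadoop-uelrcx-host1 and hive.server2.thrift.port is set to 10000 above):

$HIVE_HOME/bin/beeline -u jdbc:hive2://thadoop-uelrcx-host1:10000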

Spark Installation and Configuration

Spark on YARN

Edit spark-env.sh and set:

export JAVA_HOME=/opt/jdk1.7.0_79
export HADOOP_DIR=/data/hadoop/hadoop-2.6.0-cdh5.8.0
export HADOOP_CONF_DIR=/data/hadoop/hadoop-2.6.0-cdh5.8.0/etc/hadoop
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/data/hadoop/hadoop-2.6.0-cdh5.8.0/share/hadoop/common/*:/data/hadoop/hadoop-2.6.0-cdh5.8.0/share/hadoop/common/lib/*:/data/hadoop/hadoop-2.6.0-cdh5.8.0/share/hadoop/yarn/*:/data/hadoop/hadoop-2.6.0-cdh5.8.0/share/hadoop/yarn/lib/*:/data/hadoop/spark-1.6.0-cdh5.8.0/lib/*:/data/hadoop/hive-1.1.0-cdh5.8.0/lib/*:/data/hadoop/hbase-1.2.0-cdh5.8.0/lib/*

Jobs are then submitted directly to YARN, as sketched below.
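
A minimal sketch using the SparkPi example bundled with the distribution (the exact examples jar name varies by build, hence the glob):

bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  lib/spark-examples-*.jar 100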

Spark Standalone

Edit slaves

thadoop-uelrcx-host2
thadoop-uelrcx-host3
thadoop-uelrcx-host4

Start the Spark cluster

sbin/start-all.sh
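
With the default ports, the master web UI is then reachable on port 8080 of the node where start-all.sh was run. A shell can attach to the cluster like this (a sketch, assuming the master runs on thadoop-uelrcx-host1 with the default port 7077):

bin/spark-shell --master spark://thadoop-uelrcx-host1:7077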
