Configuring Hadoop + Hive + Spark

I've recently started playing with Hadoop, Spark and Hive, so I'm archiving the tools' configuration here.

First:

On Ubuntu you need to have ssh and Java installed.
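For the record, a minimal install sketch (assuming OpenJDK 8, which matches the JAVA_HOME used in the .bashrc below):

sudo apt-get update
sudo apt-get install openssh-server openjdk-8-jdk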

Then download the current Spark distribution from spark.apache.org/, the current Hive distribution from hive.apache.org/, and a matching Hadoop from hadoop.apache.org/. And finally Derby: https://db.apache.org/derby/

Watch out where you move them, because that determines whether you'll need root. For more comfortable operation it's better to create your own directory under /home/chajim/ and dump everything into it.
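Roughly like this (I'm assuming the archives landed in ~/Downloads; the tarball names match the versions referenced in the .bashrc below):

mkdir -p /home/chajim/data
cd /home/chajim/data
tar xzf ~/Downloads/hadoop-2.8.0.tar.gz
tar xzf ~/Downloads/apache-hive-2.1.1-bin.tar.gz
tar xzf ~/Downloads/spark-2.1.0-bin-hadoop2.7.tgz
tar xzf ~/Downloads/db-derby-10.13.1.1-bin.tar.gz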

The environment variables need to be set accordingly. Here I'm archiving my .bashrc:


# ~/.bashrc: executed by bash(1) for non-login shells.
# see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
# for examples

# If not running interactively, don't do anything
case $- in
 *i*) ;;
 *) return;;
esac

# don't put duplicate lines or lines starting with space in the history.
# See bash(1) for more options
HISTCONTROL=ignoreboth

# append to the history file, don't overwrite it
shopt -s histappend

# for setting history length see HISTSIZE and HISTFILESIZE in bash(1)
HISTSIZE=1000
HISTFILESIZE=2000

# check the window size after each command and, if necessary,
# update the values of LINES and COLUMNS.
shopt -s checkwinsize

# If set, the pattern "**" used in a pathname expansion context will
# match all files and zero or more directories and subdirectories.
#shopt -s globstar

# make less more friendly for non-text input files, see lesspipe(1)
[ -x /usr/bin/lesspipe ] && eval "$(SHELL=/bin/sh lesspipe)"

# set variable identifying the chroot you work in (used in the prompt below)
if [ -z "${debian_chroot:-}" ] && [ -r /etc/debian_chroot ]; then
 debian_chroot=$(cat /etc/debian_chroot)
fi

# set a fancy prompt (non-color, unless we know we "want" color)
case "$TERM" in
 xterm-color|*-256color) color_prompt=yes;;
esac

# uncomment for a colored prompt, if the terminal has the capability; turned
# off by default to not distract the user: the focus in a terminal window
# should be on the output of commands, not on the prompt
#force_color_prompt=yes

if [ -n "$force_color_prompt" ]; then
 if [ -x /usr/bin/tput ] && tput setaf 1 >&/dev/null; then
 # We have color support; assume it's compliant with Ecma-48
 # (ISO/IEC-6429). (Lack of such support is extremely rare, and such
 # a case would tend to support setf rather than setaf.)
 color_prompt=yes
 else
 color_prompt=
 fi
fi

if [ "$color_prompt" = yes ]; then
 PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
else
 PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w\$ '
fi
unset color_prompt force_color_prompt

# If this is an xterm set the title to user@host:dir
case "$TERM" in
xterm*|rxvt*)
 PS1="\[\e]0;${debian_chroot:+($debian_chroot)}\u@\h: \w\a\]$PS1"
 ;;
*)
 ;;
esac

# enable color support of ls and also add handy aliases
if [ -x /usr/bin/dircolors ]; then
 test -r ~/.dircolors && eval "$(dircolors -b ~/.dircolors)" || eval "$(dircolors -b)"
 alias ls='ls --color=auto'
 #alias dir='dir --color=auto'
 #alias vdir='vdir --color=auto'

 alias grep='grep --color=auto'
 alias fgrep='fgrep --color=auto'
 alias egrep='egrep --color=auto'
fi

# colored GCC warnings and errors
#export GCC_COLORS='error=01;31:warning=01;35:note=01;36:caret=01;32:locus=01:quote=01'

# some more ls aliases
alias ll='ls -alF'
alias la='ls -A'
alias l='ls -CF'

# Add an "alert" alias for long running commands. Use like so:
# sleep 10; alert
alias alert='notify-send --urgency=low -i "$([ $? = 0 ] && echo terminal || echo error)" "$(history|tail -n1|sed -e '\''s/^\s*[0-9]\+\s*//;s/[;&|]\s*alert$//'\'')"'

# Alias definitions.
# You may want to put all your additions into a separate file like
# ~/.bash_aliases, instead of adding them here directly.
# See /usr/share/doc/bash-doc/examples in the bash-doc package.

if [ -f ~/.bash_aliases ]; then
 . ~/.bash_aliases
fi

# enable programmable completion features (you don't need to enable
# this, if it's already enabled in /etc/bash.bashrc and /etc/profile
# sources /etc/bash.bashrc).
if ! shopt -oq posix; then
 if [ -f /usr/share/bash-completion/bash_completion ]; then
 . /usr/share/bash-completion/bash_completion
 elif [ -f /etc/bash_completion ]; then
 . /etc/bash_completion
 fi
fi

# JAVA
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# HADOOP
export HADOOP_HOME=/home/chajim/data/hadoop-2.8.0
export PATH=$HADOOP_HOME/bin:$PATH
export PATH=$HADOOP_HOME/sbin:$PATH

# HIVE
export HIVE_HOME=/home/chajim/data/apache-hive-2.1.1-bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$HIVE_HOME/bin:$PATH
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib/*:.
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:.

# DERBY
export DERBY_HOME=/home/chajim/data/db-derby-10.13.1.1-bin
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar

# SPARK
export SPARK_HOME=/home/chajim/data/spark-2.1.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH

# aliases
alias ..="cd .."
alias ...="cd ../.."
alias hfs="hadoop fs"
alias hls="hfs -ls"
alias hdp="hdfs dfs"
alias starthadoop="start-dfs.sh && start-yarn.sh"
alias stophadoop="stop-dfs.sh && stop-yarn.sh"
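
After editing, reload it so the variables take effect in the current shell:

source ~/.bashrc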

The biggest hell comes only after that. The following needs to be configured:

HADOOP

$HADOOP_HOME/etc/hadoop/core-site.xml


<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
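
Once the .bashrc above is loaded, a quick way to check that Hadoop picked the value up:

hdfs getconf -confKey fs.defaultFS    # should print hdfs://localhost:9000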

$HADOOP_HOME/etc/hadoop/hdfs-site.xml (a replication factor of 1 is enough on a single node):

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Set up passwordless SSH:

ssh localhost    # if this asks for a password, generate a key:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Format HDFS (only once; reformatting wipes the filesystem metadata):

hdfs namenode -format

Then set up YARN:

$HADOOP_HOME/etc/hadoop/mapred-site.xml (Hadoop 2.x ships only mapred-site.xml.template, so copy it to mapred-site.xml first):

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

$HADOOP_HOME/etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Plus I mustn't forget to add JAVA_HOME to hadoop-env.sh; it doesn't pick it up from .bashrc.
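Concretely, one line in $HADOOP_HOME/etc/hadoop/hadoop-env.sh (the same value as in the .bashrc above):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

With all of the above in place, starthadoop (the alias defined earlier) brings everything up, and jps should list the daemons:

starthadoop
jps    # expect NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager

The NameNode web UI then lives at http://localhost:50070 and the ResourceManager at http://localhost:8088 (the default ports in Hadoop 2.x).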

Most of the above comes from here.

Other Hadoop settings are described here: https://wiki.apache.org/hadoop/GettingStartedWithHadoop

Hive is described here: http://doctuts.readthedocs.io/en/latest/hive.html

Alternatively, the whole Hive + Hadoop process is described here: https://www.tutorialspoint.com/hive/hive_installation.htm

With Hive, it's important to start the metastore first: hive --service metastore (it runs in the foreground, so give it its own terminal),

then create metastore_db with: schematool -dbType derby -initSchema
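
If both steps go through, a quick smoke test (hive -e runs a single query and exits):

hive -e 'show databases;'    # should list at least the "default" database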

When it acts up, remove metastore_db with rm and run schematool again; then everything works (see the sketch below).
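With the embedded Derby setup, metastore_db (and derby.log) end up in whatever directory Hive was launched from; assuming javax.jdo.option.ConnectionURL wasn't pointed anywhere else, the reset is:

rm -rf metastore_db derby.log
schematool -dbType derby -initSchema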

For Hive to use Spark as its execution engine, add this to hive-site.xml:

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
  <description>
    Chooses execution engine.
  </description>
</property>
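
The engine can also be switched per session from the Hive CLI, which is handy for testing before committing the change to hive-site.xml:

set hive.execution.engine=spark;
-- and back to the default:
set hive.execution.engine=mr;

Note that Hive on Spark additionally needs the Spark jars visible on Hive's classpath; the "Hive on Spark: Getting Started" page on the Hive wiki describes how to link them.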