Hive+Sqoop淺度學(xué)習(xí)指南

Kahn 發(fā)布于2019-06-25 19:18 / 1257人閱讀

摘要：業(yè)務(wù)需求統(tǒng)計(jì)每小時(shí)的數(shù)數(shù)據(jù)采集數(shù)據(jù)清洗用來描述將數(shù)據(jù)從來源端經(jīng)過抽取轉(zhuǎn)換加載至目的端的過程字段過濾字段補(bǔ)全用戶信息商品信息字段格式化數(shù)據(jù)分析將數(shù)據(jù)導(dǎo)出介紹由開源的用于解決海量結(jié)構(gòu)化日志的數(shù)據(jù)統(tǒng)計(jì)的項(xiàng)目本質(zhì)將轉(zhuǎn)化為程序的其實(shí)時(shí)上的目錄和

業(yè)務(wù)

需求：統(tǒng)計(jì)每小時(shí)的PV數(shù)

數(shù)據(jù)采集

hdfs

hive

數(shù)據(jù)清洗（ETL）

用來描述將數(shù)據(jù)從來源端經(jīng)過抽取（extract）、轉(zhuǎn)換（transform）、加載（load）至目的端的過程

字段過濾

"31/Aug/2015:00:04:37 +0800"

"GET /course/view.php?id=27 HTTP/1.1"

字段補(bǔ)全

用戶信息、商品信息-》RDBMS

字段格式化

2015-08-31 00:04:37 20150831000437

數(shù)據(jù)分析

MapReduce

Hive

Spark

將數(shù)據(jù)導(dǎo)出

hive介紹

由Facebook開源的,用于解決海量結(jié)構(gòu)化日志的數(shù)據(jù)統(tǒng)計(jì)的項(xiàng)目

本質(zhì): 將HQL轉(zhuǎn)化為MapReduce程序

Hive的其實(shí)時(shí)HDFS上的目錄和文件

Hive的安裝模式

嵌入模式

元數(shù)據(jù)信息被保存在自帶的Deybe數(shù)據(jù)庫中

只允許創(chuàng)建一個(gè)連接

多用于Demo

本地模式

元數(shù)據(jù)信息被保存在MySQL數(shù)據(jù)庫

MySQL數(shù)據(jù)庫與Hive運(yùn)行在同一臺(tái)物理機(jī)器上

多用于開發(fā)和測(cè)試

遠(yuǎn)程模式

元數(shù)據(jù)信息被保存在MySQL數(shù)據(jù)庫

MySQL數(shù)據(jù)庫與Hive運(yùn)行在不同臺(tái)物理機(jī)器上

用于實(shí)際生成環(huán)境

Linux下MySQL安裝

? 1) 卸載

? $ rpm -qa | grep mysql

? $ sudo rpm -e mysql-libs-5.1.71-1.el6.x86_64 --nodeps

? 2) 安裝

? 可選擇將緩存替換,然后再安裝 $ sudo cp -r /opt/software/x86_64/ /var/cache/yum/

? $ sudo yum install -y mysql-server mysql mysql-devel

? 3) 啟動(dòng)mysql服務(wù)

? $ sudo service mysqld start

? 4) 設(shè)置密碼

? $ /usr/bin/mysqladmin -u root password "新密碼"

? 5) 開機(jī)啟動(dòng)

? $ sudo chkconfig mysqld on

? 6) 授權(quán)root的權(quán)限及設(shè)置遠(yuǎn)程登錄

? 登錄

? $ mysql -u root -p

? 授權(quán)

mysql> grant all privileges on *.* to "root"@"%" identified by "密碼";
mysql> grant all privileges on *.* to "root"@"linux01" identified by "密碼"; -- 必須有這一句，%包括所有

? all privileges 所有權(quán)限

? . 所有數(shù)據(jù)庫的所有表

? "root"@"%" 在任意主機(jī)以root身份登錄

? "root"@"linux03.ibf.com" 在linux03主機(jī)以root登錄

? by "root" 使用root作為密碼

? 7)刷新授權(quán)

mysql> flush privileges;

? 8)測(cè)試,在windows中是否可以登錄

 mysql -h linux03.ibf.com -u root -p

hive環(huán)境搭建:本地模式

? 必須先安裝HDFS和Yarn

1)安裝:

? $ tar -zxvf /opt/software/hive-0.13.1-bin.tar.gz -C /opt/modules/

? 重命名hive文件夾名字

? $ cd /opt/modules

? $ mv apache-hive-0.13.1-bin/ hive-0.13.1/

2)在HDFS上創(chuàng)建tmp目錄和hive倉庫

? $ bin/hdfs dfs -mkdir -p /user/hive/warehouse

? $ bin/hdfs dfs -mkdir /tmp #已存在

? $ bin/hdfs dfs -chmod g+w /user/hive/warehouse

? $ bin/hdfs dfs -chmod g+w /tmp

3)修改配置

? $ cd hive-0.13.1/

? $ cp conf/hive-default.xml.template conf/hive-site.xml

? $ cp conf/hive-log4j.properties.template conf/hive-log4j.properties

? $ cp conf/hive-env.sh.template conf/hive-env.sh

? 3-1)修改hive-env.sh

JAVA_HOME=/opt/modules/jdk1.7.0_67 #添加

HADOOP_HOME=/opt/modules/hadoop-2.5.0

export HIVE_CONF_DIR=/opt/modules/hive-0.13.1/conf

? 3-2)修改hive.site.xml


    javax.jdo.option.ConnectionURL
    jdbc:mysql://linux01:3306/metastore?createDatabaseIfNotExist=true


    javax.jdo.option.ConnectionDriverName
    com.mysql.jdbc.Driver


    javax.jdo.option.ConnectionUserName
    root


    javax.jdo.option.ConnectionPassword
    123456

? 3-3)修改日志配置hive-log4j.properties

? hive.log.dir=/opt/modules/hive-0.13.1/logs

? 3-4)拷貝jdbc驅(qū)動(dòng)到hive的lib目錄

? $ cp /opt/software/mysql-connector-java-5.1.34-bin.jar /opt/modules/hive-0.13.1/lib/

? 4)確定yarn和hdfs啟動(dòng)

? $ jps

? 6468 ResourceManager

? 6911 Jps

? 6300 RunJar

? 6757 NodeManager

? 2029 NameNode

? 2153 DataNode

? 此時(shí)使用bin/hive 可以進(jìn)入hive

hive啟動(dòng)及基本使用

? 進(jìn)入hive目錄

? $ cd /opt/modules/hive-0.13.1/

進(jìn)入hive

? bin/hive

基本命令

? show databases;

? create database mydb;

? use mydb;

? show tables;

創(chuàng)建表及加載數(shù)據(jù)

create table student (
id int comment "id of student",
name string comment "name of student",
age int comment "age of student",
gender string comment "sex of student",
addr string
)    
comment "this is a demo"
row format delimited fields terminated by "	";

表默認(rèn)創(chuàng)建在/user/hive/warehouse里

通過hive.metastore.warhouse.dir配置

查看表

desc student; 查看表字段

或

desc formatted student; 可以查看元數(shù)據(jù)

此時(shí)mysql的metastore數(shù)據(jù)庫狀況

mysql> select * from TBLS;

+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+
|      1 |  1556132119 |     6 |                0 | chen  |         0 |     1 | student  | MANAGED_TABLE | NULL               | NULL               |
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+
1 row in set (0.00 sec)

mysql> select * from COLUMNS_V2;

+-------+-----------------+-------------+-----------+-------------+
| CD_ID | COMMENT         | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+-----------------+-------------+-----------+-------------+
|     1 | NULL            | addr        | string    |           4 |
|     1 | age of student  | age         | int       |           2 |
|     1 | sex of student  | gender      | string    |           3 |
|     1 | id of student   | id          | int       |           0 |
|     1 | name of student | name        | string    |           1 |
+-------+-----------------+-------------+-----------+-------------+
5 rows in set (0.00 sec)

加載數(shù)據(jù)(在家目錄下創(chuàng)建student.log)

load data local inpath "/home/hadoop/student.log" into table student;

從hdfs上加載( 加載完,hdfs上的student.data到表目錄下)

load data inpath "/input/student.data" into table student;

在命令行內(nèi)設(shè)置配置

? 重啟無效

? set hive.cli.print.header=true; #列名

? set hive.cli.print.current.db=true; #表名

? reset; 重置

? 重啟有效

    

?    hive.cli.print.header

?    true





?    hive.cli.print.current.db

?    true

與Linux交互

!ls

!pwd

與hadoop交互

dfs -ls /

dfs -mkdir /hive

hive的腳本

-e 執(zhí)行sql

-f 執(zhí)行sql文件

-S 靜默執(zhí)行

hive -e

$ bin/hive -e "select *from test_db.emp_p"

hive -f

$ bin/hive -S -f /home/hadoop/emp.sql > ~/result.txt

刪除表

drop table user;

清空表

truncate table user;

表類型 內(nèi)部表（管理表 MANAGED_TABLE）

create table emp(
empId int,
empString string,
job string,
salary float,
deptId int
)
row format delimited fields terminated by "	";

load data inpath "/input/dept.txt" into table dept;

# 或從本地加載 
load data local inpath "/home/hadoop/dept.txt" into table dept;

外部表（EXTERNAL_TABLE ）

create external table emp_ex (
empId int,
empName string,
job string,
salary float,
deptId int
)
row format delimited fields terminated by "	"
location "/hive/table/emp";

把數(shù)據(jù)移動(dòng)到表所在位置

hive (mydb)> dfs -mv /input/emp.txt /hive/table/emp/emp.txt

服務(wù)器加載

hive (mydb)> load data local inpath "/home/hadoop/emp.data" into table emp;

或者直接使用dfs命令移動(dòng)數(shù)據(jù)到hive表目錄下

hive (mydb)> dfs -put /home/hadoop/emp.data  /hello/table/emp;

內(nèi)部表和外部表區(qū)別

創(chuàng)建表

外部表創(chuàng)建表的時(shí)候，需要用external

刪除表

外部表在刪除表的時(shí)候只會(huì)刪除表的元數(shù)據(jù)(metadata)信息不會(huì)刪除表數(shù)據(jù)(data)

內(nèi)部表刪除時(shí)會(huì)將元數(shù)據(jù)信息和表數(shù)據(jù)同時(shí)刪除

內(nèi)部表數(shù)據(jù)由Hive自身管理，外部表數(shù)據(jù)由HDFS管理

內(nèi)部表數(shù)據(jù)存儲(chǔ)的位置是hive.metastore.warehouse.dir（默認(rèn)：/user/hive/warehouse），外部表數(shù)據(jù)的存儲(chǔ)位置由自己制定；

分區(qū)表 創(chuàng)建分區(qū)表

create table emp_part(
empno int,
empname string,
empjob string,
mgrno int,
birthday string,
salary float,
bonus float,
deptno  int
)
partitioned by (province string)
row format delimited fields terminated by "	";

向分區(qū)表加載數(shù)據(jù)

顯式指定分區(qū)值

load data local inpath "/home/user01/emp.txt" into table emp_part partition (province="CHICAGO");

分區(qū)操作 查看分區(qū)

show partitions emp_part;

添加分區(qū)

alter table emp_part add partition (province="shanghai");

刪除分區(qū)

alter table emp_part drop partition (province="shanghai");

向分區(qū)添加數(shù)據(jù)

load data local inpath "本地路徑" into table emp_part partition (province="shanghai");

查詢分區(qū)數(shù)據(jù)

select * from emp_part where province="henan";

二級(jí)分區(qū) 創(chuàng)建二級(jí)分區(qū)

create table emp_second(
id int ,
name string,
job string,
salary float,
dept int
)
partitioned by (day string,hour string)
row format delimited fields terminated by "	";

添加分區(qū)

alter table emp_second add partition (day="20180125",hour="16");

刪除分區(qū)

alter table emp_second drop partition (day="20180125");

添加數(shù)據(jù)的時(shí)候指定分區(qū)(沒有該分區(qū)會(huì)創(chuàng)建)

load data local inpath "/home/hadoop/emp.log" into table emp_second partition (day="20180125",hour="17");

桶表

連接兩個(gè)在相同列上劃分了桶的表，使用map side join 實(shí)現(xiàn)

使sampling更高效

需設(shè)置set hive.enforce.bucketing=true

create table bucketed_users(id int, name string)
clustered by (id) into 4 buckets

某個(gè)數(shù)據(jù)被分到哪個(gè)桶根據(jù)指定列的hash值對(duì)桶數(shù)取余得到

導(dǎo)入方式總結(jié) 本地導(dǎo)入

load data local inpath "本地路徑" into table 表名

bin/hdfs dfs -put 本地路徑 hdfs路徑(hive的表位置)

hdfs上導(dǎo)入

load data inpath "hdfs路徑" into table 表名

覆蓋寫

load data inpath "hdfs路徑" overwrite into table 表名

load data local inpath "本地路徑" overwrite into table 表名

通過insert語句將select的結(jié)果插入到一張表中

insert into table test_tb select * from emp_p;

創(chuàng)建表時(shí)加載數(shù)據(jù)

create external table test_tb (

    id int,

    name string

)

row format delimited fields terminated by "	";

location "/hive/test_tb";

sqoop方式 導(dǎo)出方式 hive 腳本 hive -e

bin/hive -e "use test_db;select * from emp_p" > /home/hadoop/result.txt

hive -f 執(zhí)行sql文件

bin/hive -f 路徑 >> /home/hadoop/result.txt

hive>

導(dǎo)出到本地 (默認(rèn)分隔符是 ASSII 001)

insert overwrite local directory "/home/hadoop/data" select * from emp_p;

insert overwrite local directory "/home/hadoop/data" row format delimited fields terminated by "^" select * from emp_p;

到出HDFS

hive > insert overwrite directory "/data" select * from emp_p;

export 和 import (HDFS)

hive > export table emp_p to "/input/export" ;

hive > import table emp_imp from "hdfs_path" ;

HQL

http://hive.apache.org/

常用語法

通配 *指定字段

select id,name from emp;

where 條件查詢

select * from emp_p where salary > 10000;

between and

select * from emp_p where sal between 10000 and 15000;

is null| is not null

select * from user where email is not null;

in () | not in ()

select * from emp_p where did in (1,2,3);

聚合函數(shù)

count max min sum avg

select count(1) personOfDept from emp_p group by job;

select sum(sal) from emp_p;

distinct

select distinct id from emp_part;

select distinct name, province from emp_part;

子查詢

select eid,ename,salary ,did from emp where emp.did in (select did from dept where dname="人事部");

表連接

emp.eid emp.ename       emp.salary      emp.did
1001    jack    10000.0 1
1002    tom     2000.0  2
1003    lily    20000.0 3
1004    aobama  10000.0 5
1005    yang    10000.0 6

dept.did        dept.dname      dept.dtel
1       人事部  021-456
2       財(cái)務(wù)部  021-234
3       技術(shù)部  021-345
4       BI部    021-31
5       產(chǎn)品部  021-232

積

select * from dept, emp;

select * from emp, dept where emp.did=dept.did;

join

select t1.eid, t1.ename, t1.salary,t2.did ,t2.dname from emp t1 join dept t2 on t1.did=t2.did;

外連查詢

left join

select eid,ename, salary,t2.did, t2.dname from emp t1 left join dept t2 on t1.did = t2.did;

right join

select eid,ename, salary,t2.did, t2.dname from emp t1 right join dept t2 on t1.did = t2.did;

全連接

select eid,ename, salary,t2.did, t2.dname from emp t1 full join dept t2 on t1.did = t2.did;

hql的四種排序 全局排序Order By (對(duì)所有的數(shù)據(jù)進(jìn)行排序)

select * from emp_part order by salary;

設(shè)置reduce個(gè)數(shù)為3,也只有一個(gè)文件

set mapreduce.job.reduces=3;

內(nèi)部排序sort by (每個(gè)reduce內(nèi)部進(jìn)行排序 )

底層時(shí)在reduce函數(shù)之前完成的

設(shè)置reduce個(gè)數(shù)

set mapreduce.job.reduces=2;

insert overwrite local directory "/home/hadoop/result" select * from emp_part sort by salary; # 默認(rèn)reduce個(gè)數(shù)為1, 這種情況下和order by一樣

分區(qū)排序（通過distribute by設(shè)置分區(qū) ，使用 sort by設(shè)置分區(qū)內(nèi)排序）

set mapreduce.job.reduces=3;

這里使用部分分區(qū),薪資排序

insert overwrite local directory "/home/hadoop/result" select * from emp_part distribute by deptno sort by salary;

Cluster By (distribute by 和sort by條件一致時(shí) 使用cluster by) 連接方式 配置

修改hive-site.xml


hive.server2.long.polling.timeout
5000



hive.server2.thrift.port
10000



hive.server2.thrift.bind.host
bigdata.ibf.com

mysql數(shù)據(jù)庫中創(chuàng)建一個(gè)普通用戶

1)創(chuàng)建用戶

 CREATE USER "hadoop"@"centos01.bigdata.com" IDENTIFIED BY "123456";

2)授權(quán)訪問（hive的存儲(chǔ)元數(shù)據(jù)的數(shù)據(jù)庫）

GRANT ALL ON metastore.* TO "hadoop"@"centos01.bigdata.com" IDENTIFIED BY "123456";  

GRANT ALL ON metastore.* TO "hadoop"@"%" IDENTIFIED BY "123456";

3)刷新授權(quán)

flush privileges;

beeline

啟動(dòng)服務(wù)

$ bin/hiveserver2 &
或
$bin/hive --service hiveserver2 &

連接

$ bin/beeline
beeline>!connect jdbc:hive2://bigdata.ibf.com:10000
輸入mysql的用戶名
輸入mysql密碼

sqoop的介紹及安裝部署

功能：用于HDFS與RDBMS之間數(shù)據(jù)的導(dǎo)入導(dǎo)出

所有的導(dǎo)入導(dǎo)出都是基于HDFS而言

數(shù)據(jù)分析流程

數(shù)據(jù)采集
                     日志;    
                     RDBMS; 使用sqoop,將需要分析的數(shù)據(jù)采集到HDFS
數(shù)據(jù)清洗
            字段過濾
           字段補(bǔ)全            -》將需要分析的字段導(dǎo)入到HDFS
           字段格式化               
數(shù)據(jù)分析
            將分析后的數(shù)據(jù)存儲(chǔ)在HDFS
            將結(jié)果數(shù)據(jù)從HDFS導(dǎo)出到MySQL
數(shù)據(jù)展示
                     從RDBMS中讀取數(shù)據(jù)

sqoop支持：HDFS，hive，hbase

sqoop的底層

    -》使用sqoop命令，通過不同的參數(shù)，實(shí)現(xiàn)不同的需求

    -》sqoop根據(jù)不同的參數(shù)，解析后傳遞給底層的MapReduce模板

    -》將封裝好的MapReduce打成jar包，提交給yarn執(zhí)行

    -》這個(gè)MapReduce只有maptask，沒有reducetask

版本

    -》sqoop1

    -》sqoop2：

        -》多了server端

        -》添加了安全機(jī)制

安裝部署

下載解壓

tar -zxvf /opt/software/sqoop-1.4.5-cdh5.3.6.tar.gz  -C /opt/cdh-5.3.6/

修改配置文件

$ pwd
/opt/cdh-5.3.6/sqoop-1.4.5-cdh5.3.6
$ cp conf/sqoop-env-template.sh conf/sqoop-env.sh

修改sqoop-env.sh

 export HADOOP_COMMON_HOME=/opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6
 #Set path to where hadoop-*-core.jar is available
 export HADOOP_MAPRED_HOME=/opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6
 #Set the path to where bin/hive is available
 export HIVE_HOME=/opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6

將MySQL連接驅(qū)動(dòng)放入sqoop的lib目錄

$ cp /opt/software/mysql-connector-java-5.1.34-bin.jar  /opt/cdh-5.3.6/sqoop-1.4.5-cdh5.3.6/lib/

使用測(cè)試

查看命令信息

[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop help

查看數(shù)據(jù)庫

[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop list-databases 
--connect jdbc:mysql://linux03.ibf.com:3306 
--username root 
--password 123456

解決sqoop1.4.6-cdh-5.14.2的報(bào)錯(cuò)問題

在sqoop-1.4.6中,需要添加java-json包

$ cp /opt/software/java-json.jar /opt/cdh5.14.2/sqoop-1.4.6-cdh5.14.2/lib/

解決找不到hive倉庫的問題

$ cp ${HIVE_HOME}/conf/hive-site.xml ${SQOOP_HOME}/conf/

在HADOOP_CLASSPATH中追加hive的依賴

$ sudo vi /etc/profile
#HADOOP_CLASSPATH
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/cdh5.14.2/hive-1.1.0-cdh5.14.2/lib/*
source /etc/profile

sqoop的導(dǎo)入

bin/sqoop import --help #查看命令提示

導(dǎo)入到HDFS

源：MySQL的一張表

目標(biāo)：HDFS一個(gè)路徑

在MySQL中創(chuàng)建測(cè)試表

在mysql中添加數(shù)據(jù)

use test_db;
create table user(
id int primary key,
name varchar(20) not null,
salary float
)charset=utf8;
insert into user values(1,"張三",9000);
insert into user values(2,"李四",10000);
insert into user values(3,"王五",6000);

案例一：mysql-> hdfs (導(dǎo)入到默認(rèn)路徑)

把mysql 中test_db.user 導(dǎo)入到HDFS上, 默認(rèn)在hdfs://linux01:8020/user/hadoop/

[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import 
> --connect jdbc:mysql://linux03.ibf.com:3306/mydb 
> --username root 
> --password 123456 
> --table user

當(dāng)沒有reduce時(shí), 有幾個(gè)map就有幾個(gè)輸出文件

案例二：mysql-> hdfs制定路徑及map個(gè)數(shù)

? -》指定hdfs輸出目錄：--target-dir
? -》指定map的個(gè)數(shù)：-m

[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import 
> --connect jdbc:mysql://linux03.ibf.com:3306/test_db 
> --username root 
> --password root 
> --table user 
> --target-dir /toHdfs 
> -m 1

案例三：

? -》修改導(dǎo)出分隔符 --fields-terminated-by
? -》--direct 導(dǎo)入更快
? -》提前刪除輸出目錄

[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import 
> --connect jdbc:mysql://linux03.ibf.com:3306/test_db 
> --username root 
> --password root 
> --table toHdfs 
> --target-dir /toHdfs 
> --direct 
> --delete-target-dir 
> --fields-terminated-by "	" 
> -m 1

案例四：導(dǎo)入指定的列：--columns

[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import 
> --connect jdbc:mysql://linux03.ibf.com:3306/mydb 
> --username root 
> --password 123456 
> --table user 
> --columns name,salary 
> --fields-terminated-by "-" 
> --target-dir /sqoop 
> --delete-target-dir 
> --direct 
> -m 1

將SQL語句執(zhí)行的結(jié)果進(jìn)行導(dǎo)入-e,--query

bin/sqoop import 
--connect jdbc:mysql://bigdata01.com:3306/test 
--username root 
--password 123456 
-e "select * from user where salary>9000 and $CONDITIONS" 
--target-dir /toHdfs 
--delete-target-dir 
-m 1

在上面的-e的查詢語句中必須包含where $CONDITIONS ,

    如果想用where語句
     where salary>9000 and $CONDITIONS"

可以設(shè)置密碼文件(把--password 改為 --password-file)

sqoop會(huì)讀取整個(gè)password-file，包括空格和回車，可以使用echo -n命令生成密碼文件，如：echo -n "secret" > password.file

$ echo -n "root" > /home/hadoop/mysqlpasswd && chmod 400 /home/hadoop/mysqlpasswd
bin/sqoop import 
--connect jdbc:mysql://bigdata01.com:3306/test 
--username root 
--password-file  file:///home/hadoop/mysqlpasswd 
-e "select * from toHdfs where $CONDITIONS" 
--target-dir /sqoop 
--delete-target-dir 
-m 1

導(dǎo)入到HIVE

hive,指定數(shù)據(jù)庫中沒有該表, 就會(huì)創(chuàng)建該表

bin/sqoop import 
--connect jdbc:mysql://linux03.ibf.com:3306/mydb 
--username root 
-P 
--table user 
--fields-terminated-by "	" 
--delete-target-dir 
-m 1 
--hive-import 
--hive-database test_db 
--hive-table user

增量導(dǎo)入

  過程
            MapReduce將數(shù)據(jù)導(dǎo)入到hdfs用戶的家目錄
            從家目錄將數(shù)據(jù)導(dǎo)入到hive表
        增量導(dǎo)入
            追加：根據(jù)某一列上一次導(dǎo)入的最后一個(gè)值，來判斷追加的數(shù)據(jù)
            時(shí)間戳：根據(jù)數(shù)據(jù)記錄修改的時(shí)間戳來進(jìn)行導(dǎo)入
      --check-column         Source column to check for incremental   change
      --incremental     Define an incremental import of type   "append" or "lastmodified"
      --last-value           Last imported value in the incremental  check column

如果HDFS上沒有該文件會(huì)創(chuàng)建該文件

bin/sqoop import 
--connect jdbc:mysql://linux03.ibf.com:3306/mydb 
--username root 
--password 123456 
--table user 
--fields-terminated-by "	" 
--target-dir /sqoop/incremental 
-m 1 
--direct 
--check-column id 
--incremental append 
--last-value 3

sqoop job

創(chuàng)建sqoop job,自動(dòng)創(chuàng)建增量 (報(bào)錯(cuò))

Sqoop job相關(guān)的命令有兩個(gè)：

bin/sqoop job

bin/sqoop-job

使用這兩個(gè)都可以

創(chuàng)建job：--create

刪除job：--delete

執(zhí)行job：--exec

顯示job：--show

列出job：--list

創(chuàng)建job

bin/sqoop-job 
--create your-sync-job 
-- import 
--connect jdbc:mysql://linux03.ibf.com:3306/mydb 
--username root 
-P 
--table user 
-m 1 
--target-dir /hive/incremental 
--incremental append 
--check-column id 
--last-value 1

查看job

bin/sqoop-job --show your-sync-job

bin/sqoop job --show your-sync-job
bin/sqoop job --exec your-sync-job

bin/sqoop job --list
bin/sqoop job --delete my-sync-job

sqoop的導(dǎo)出

將數(shù)據(jù)從hive(HDFS上的文件與目錄),HDFS導(dǎo)出到MySQL

use mydb
create table user_export(
id int primary key,
name varchar(20) not null,
salary float
);

需要現(xiàn)在數(shù)據(jù)庫中建立表

bin/sqoop export 
--connect jdbc:mysql://linux03.ibf.com:3306/mydb 
--username root 
-P 
--table user_export 
--export-dir /hive/incremental 
--input-fields-terminated-by "," 
-m 1

使用sqoop --options-file

創(chuàng)建文件 vi sqoopScript ( 所有參數(shù)一行一個(gè))

編輯文件sqoopScript

export
--connect 
jdbc:mysql://linux03.ibf.com:3306/test_db
--username
root
-P
--table
emp
-m
1
--export-dir
/input/export
--fields-terminated-by
"	"

執(zhí)行導(dǎo)出()

bin/sqoop --options-file ~/sqoopScript

內(nèi)容

Hive簡(jiǎn)單案例需求分析及結(jié)果的導(dǎo)出

動(dòng)態(tài)分區(qū)的介紹及使用

使用腳本動(dòng)態(tài)加載到hive表中

hive函數(shù)

簡(jiǎn)單日志流量案例

1 需求及分析

需求

分析統(tǒng)計(jì)每天每小時(shí)的PV數(shù)和UV數(shù)

分析

創(chuàng)建數(shù)據(jù)源表

? 創(chuàng)建分區(qū)表(天,小時(shí))/ 加載數(shù)據(jù)

數(shù)據(jù)清洗

? 創(chuàng)建hive表

? 字段過濾

? id url guid 字段補(bǔ)全(無) 字段格式化(無)

數(shù)據(jù)分析

? pv：count(url) uv：count(distinct guid)

保存結(jié)果

? 日期(天) 小時(shí) PV UV

導(dǎo)出結(jié)果

? 導(dǎo)出到MySQL

2 具體實(shí)現(xiàn)

數(shù)據(jù)原表

1) 創(chuàng)建原表

create database if not exists hive_db;

user hive_db;

create table tracklogs(

id string,

url string,

referer string,

keyword string,

type string,

guid string,

pageId string,

moduleId string,

linkId string,

attachedInfo string,

sessionId string,

trackerU string,

trackerType string,

ip string,

trackerSrc string,

cookie string,

orderCode string,

trackTime string,

endUserId string,

firstLink string,

sessionViewNo string,

productId string,

curMerchantId string,

provinceId string,

cityId string,

fee string,

edmActivity string,

edmEmail string,

edmJobId string,

ieVersion string,

platform string,

internalKeyword string,

resultSum string,

currentPage string,

linkPosition string,

buttonPosition string

)

partitioned by (date string,hour string)

row format delimited fields terminated by "t";

2) 加載數(shù)據(jù)

load data local inpath "/opt/datas/2015082818" into table tracklogs partition(date="20150828",hour="18");

load data local inpath "/opt/datas/2015082819" into table tracklogs partition(date="20150828",hour="19");

分析

1) 建立數(shù)據(jù)分析表

create table clear (

id string,

url string,

guid string

)

partitioned by (date string, hour string)

row format delimited fields terminated by "t";

2) 過濾數(shù)據(jù)

insert into table clear partition(date="20150828",hour="18") select id,url,guid from tracklogs where date="20150828" and hour="18";

insert into table clear partition(date="20150828",hour="19") select id,url,guid from tracklogs where date="20150828" and hour="19";

3) 指標(biāo)分析

pv : select date,hour,count(url) as pv from clear group by date,hour;

uv: select date,hour, count(distinct guid) as uv from clear group by date,hour;

保存結(jié)果到result

create table result as select date,hour, count(url) pv, count(distinct guid) as uv from clear group by date,hour;

創(chuàng)建表時(shí)沒指定分隔符則默認(rèn)分隔符為 001
導(dǎo)出結(jié)果到mysql

# 創(chuàng)建表

create table result(

day varchar(30),

hour varchar(30),

pv varchar(30) not null,

uv varchar(30) not null,

primary key(day,hour)

);

# 導(dǎo)出數(shù)據(jù)

[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop export

--connect jdbc:mysql://linux03.ibf.com:3306/mydb

--username root

--password root

--table result

--export-dir /user/hive/warehouse/hive_db.db/result

--input-fields-terminated-by "001"

-m 1

動(dòng)態(tài)分區(qū)表

開啟動(dòng)態(tài)分區(qū)

set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode=nonstrict;

打開動(dòng)態(tài)分區(qū)后，動(dòng)態(tài)分區(qū)的模式，有 strict和 nonstrict 兩個(gè)值可選，strict 要求至少包含一個(gè)靜態(tài)分區(qū)列，nonstrict則無此要求。

創(chuàng)建表

create table clear_dynamic (

id string,

url string,

guid string

)

partitioned by (date string, hour string)

row format delimited fields terminated by "t";

動(dòng)態(tài)加載數(shù)據(jù)

直接加載20180129的所有hour的數(shù)據(jù)

insert into table clear_dynamic partition(date="20180129",hour) select id,url,guid,hour from tracklogs where date="20180129";

根據(jù)hour自動(dòng)分區(qū)

以前是這樣寫

insert into table clear partition(date="20150828",hour="18") select id,url,guid from tracklogs where date="20150828" and hour="18";

insert into table clear partition(date="20150828",hour="19") select id,url,guid from tracklogs where date="20150828" and hour="19";

使用腳本動(dòng)態(tài)加載到hive表中

20180129/

? 2018012900

? 2018012901

? 2018012902

? 2018012903

? 2018012904

? 2018012905

1) 編寫shell_腳本(bin/hive -e "" )

2) 測(cè)試腳本

show partitions tracklogs; #查看分區(qū)

alter table tracklogs drop partition(date="20150828",hour="18"); 刪除分區(qū)

alter table tracklogs drop partition(date="20150828",hour="19");

select count(1) from tracklogs; #查看記錄數(shù)

3) 使用shell腳本使用(bin/hive -f )

4)測(cè)試

show partitions tracklogs; #查看分區(qū)

alter table tracklogs drop partition(date="20150828",hour="18"); 刪除分區(qū)

alter table tracklogs drop partition(date="20150828",hour="19");

select count(1) from tracklogs; #查看記錄數(shù)

Hive函數(shù)

用戶自定義函數(shù),用于實(shí)現(xiàn)hive中不能實(shí)現(xiàn)的業(yè)務(wù)邏輯處理

類型:

? UDF: 一進(jìn)一出

? UDAF: 多進(jìn)一出 sum,count等

? UDTF: 一進(jìn)多出行列轉(zhuǎn)換

編寫UDF:

? 編寫UDF必須繼承UDF

? 必須至少實(shí)現(xiàn)一個(gè)evaluale方法

? 必須要有返回類型,可以是null

? 建議使用hadoop序列化類型

需求:日期轉(zhuǎn)換

? 31/Aug/2015:00:04:37 +0800 --> 2015-08-31 00:04:37

實(shí)現(xiàn)步驟

? 1) 自定義類實(shí)現(xiàn)UDF類

? 2) 打包不要指定主類

? 3) 添加到hive中

maven中導(dǎo)入hadoop的包和hive的包


    org.apache.hive
    hive-exec
    1.2.2

具體實(shí)現(xiàn)范例

package com.myudf;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class DateFormate extends UDF {
    SimpleDateFormat inputDate = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss",Locale.ENGLISH);
    SimpleDateFormat outDate = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    
    //   31/Aug/2015:00:04:37 +0800 -->  2015-08-31 00:04:37
    public Text evaluate(Text str) {
        if(str == null) {
            return null;
        }
        if(StringUtils.isBlank(str.toString())) {
            return null;
        }
        Date date = null;
        String val = null;
        try {
            date = inputDate.parse(str.toString());
            val = outDate.format(date);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return new Text(val);
    }
    
    public static void main(String[] args) {
        Text val = new DateFormate().evaluate(new Text("31/Aug/2015:00:04:37 +0800"));
        System.out.println(val);
    }
}

hive (test_db)>add jar /home/hadoop/DDD.jar;

hive (test_db)> CREATE TEMPORARY FUNCTION removequote as "com.myudf.date.RemoveQuoteUDF";

hive (test_db)> show functions;

hive壓縮格式 壓縮格式

bzip2, gzip, lzo, snappy等

壓縮比：bzip2>gzip>lzo bzip2

壓縮解壓速度：lzo>gzip>bzip2 lzo

hadoop支持的壓縮格式

bin/hadoop checknative  -a

http://google.github.io/snappy/

配置壓縮

編譯hadoop源碼：

mvn package -Pdist,native,docs -DskipTests -Dtar  -Drequire.snappy

替換$HADOOP_HOME/lib/native

關(guān)閉hadoop相關(guān)進(jìn)程

解壓cdh5.xxx-snappy-lib-native.tar.gz 到$HADOOP_HOME/lib

$ tar -zxvf native-hadoop-cdh5.14.2.tar.gz -C /opt/modules/hadoop-2.6.0-cdh5.14.2/lib

在次檢查支持解壓

 可以觀察到已經(jīng)支持
 $ bin/hadoop checknative -a

配置hadoop (jobhistory可以查看所有配置信息)

mapred-site.xml 配置


    mapreduce.map.output.compress
    true



    mapreduce.map.output.compress.codec
    org.apache.hadoop.io.compress.SnappyCodec

測(cè)試

運(yùn)行pi程序: $ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar pi 1 2

通過主機(jī):19888觀察該任務(wù)的configuration中壓縮配置

通過hive配置壓縮格式

shuffle階段啟用壓縮

set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

?reduce輸出的結(jié)果文件進(jìn)行壓縮

set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

hive中的文件存儲(chǔ)格式 格式

create table (
...
)
row format delimited fields terminated by ""
STORED AS file_format

文件格式如下

TEXTFILE

RCFILE

ORC

PARQUET

AVRO

INPUTFORMAT

常用的文件格式（默認(rèn)是textfile ）

| ORC -- (Note: Available in Hive 0.11.0 and later)

| PARQUET --Parquet就是基于Dremel的數(shù)據(jù)模型和算法實(shí)現(xiàn)的。這個(gè)比較常見

數(shù)據(jù)存儲(chǔ)類型 按行存儲(chǔ)

寫的快

按列存儲(chǔ)

讀得快

可以跳過不符合條件的數(shù)據(jù)，只讀取需要的數(shù)據(jù)，降低IO數(shù)據(jù)量。

壓縮編碼可以降低磁盤存儲(chǔ)空間。由于同一列的數(shù)據(jù)類型是一樣的，可以使用更高效的壓縮編碼進(jìn)一步節(jié)約存儲(chǔ)空間。

只讀取需要的列，支持向量運(yùn)算，能夠獲取更好的掃描性能。

驗(yàn)證存儲(chǔ)格式及壓縮

使用給定的日志文件（18.1MB）

使用不同的存儲(chǔ)格式，存儲(chǔ)相同的數(shù)據(jù)，判斷文件大小

在MapReduce的shuffle階段啟用壓縮(對(duì)中間數(shù)據(jù)進(jìn)行壓縮可以減少map和reduce task間的數(shù)據(jù)傳輸量。對(duì)于IO型作業(yè)，可以加快速度。)

set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

對(duì)輸出結(jié)果壓縮

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

創(chuàng)建表file_text ，并加載數(shù)據(jù)

create table if not exists file_text(
t_time string,
t_url string,
t_uuid string,
t_refered_url string,
t_ip string,
t_user string,
t_city string
)
row format delimited fields terminated by "	"
stored  as  textfile;
load data local inpath "/home/hadoop/page_views.data" into table file_text;

對(duì)比默認(rèn)格式和file_orc_snappy 數(shù)據(jù)大小比較

create table if not exists file_orc_snappy(
t_time string,
t_url string,
t_uuid string,
t_refered_url string,
t_ip string,
t_user string,
t_city string
)
row format delimited fields terminated by "	"
stored as  ORC
tblproperties("orc.compression"="Snappy");

insert into table file_orc_snappy select *  from file_text;
-- 不能通過load來加載，因?yàn)閘oad本質(zhì)是hdfs的put，這樣不能壓縮，必須要insert這樣走M(jìn)apReduce才能讓壓縮發(fā)揮作用

對(duì)比默認(rèn)格式和parquet格式數(shù)據(jù)大小比較

create table if not exists file_parquet(
t_time string,
t_url string,
t_uuid string,
t_refered_url string,
t_ip string,
t_user string,
t_city string
)
row format delimited fields terminated by "	"
stored as parquet;

insert into table file_parquet select * from file_text;

對(duì)比默認(rèn)格式和parquet格式，snappy壓縮數(shù)據(jù)大小比較

create table if not exists file_parquet_snappy(
t_time string,
t_url string,
t_uuid string,
t_refered_url string,
t_ip string,
t_user string,
t_city string
)
row format delimited fields terminated by "	"
stored as parquet
tblproperties("parquet.compression"="Snappy");

insert into table file_parquet_snappy select * from file_text;

hive (mydb)> dfs -du -s -h /user/hive/warehouse/mydb.db/file_parquet_snappy;
hive (mydb)> dfs -du -s -h /user/hive/warehouse/mydb.db/file_parquet;

hive中使用正則加載數(shù)據(jù)

通過正則匹配，加載復(fù)雜格式日志文件

1 正則

2 根據(jù)日志加載數(shù)據(jù)

? 日志

"27.38.5.159" "-" "31/Aug/2015:00:04:53 +0800" "GET /course/view.php?id=27 HTTP/1.1" "200" "7877" - "http://www.ibf.com/user.php?act=mycourse&testsession=1637" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "-" "learn.ibf.com"

? 創(chuàng)建表

CREATE TABLE apachelog (

remote_addr string,

remote_user string,

time_local string,

request string,

status string,

body_bytes_set string,

request_body string,

http_referer string,

http_user_agent string,

http_x_forwarded_for string,

host string

)

ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.RegexSerDe"

WITH SERDEPROPERTIES (

"input.regex" = "("[^ ]*") ("-|[^ ]*") ("[^]]*") ("[^]]*") ("[0-9]*") ("[0-9]*") (-|[^ ]*) ("[^ ]*") ("[^"]*") ("-|[^ ]*") ("[^ ]*")"

)

STORED AS TEXTFILE;

load data local inpath "/home/hadoop/moodle.ibf.access.log" into table apachelog;

Hive優(yōu)化 大表拆分為小表 外部表+分區(qū)表 存儲(chǔ)格式&數(shù)據(jù)壓縮 SQL優(yōu)化 并行執(zhí)行

//Whether to execute jobs in parallel

set hive.exec.parallel=true;

//How many jobs at most can be executed in parallel

set hive.exec.parallel.thread.number=8;#可以調(diào)大,提高并行效率

mapreduce

Reduce數(shù)目

set mapreduce.job.reduces=1

JVM重用

mapreduce.job.jvm.numtasks=1 默認(rèn)1個(gè)

推測(cè)執(zhí)行

hive配置，默認(rèn)為true

set hive.mapred.reduce.tasks.speculative.execution=true;

hadoop

mapreduce.map.speculative true

mapreduce.reduce.speculative true

設(shè)置輸出文件合并

Size of merged files at the end of the job

將小文件合并避免降低hdfs存儲(chǔ)大量小文件而降低性能

set hive.merge.size.per.task=256000000;

嚴(yán)格模式

set hive.mapred.mode=strict; nonstrict默認(rèn)

? 嚴(yán)格模式下，

? 分區(qū)表，必須加分區(qū)字段過濾條件

? 對(duì)order by, 必須使用limit

? 限制笛卡爾積的查詢（join 的時(shí)候不使用on，而使用where）

hive join

map join

如果關(guān)聯(lián)查詢兩張表中有一張小表默認(rèn)map join，將小表加入內(nèi)存

hive.mapjoin.smalltable.filesize=25000000 默認(rèn)大小

hive.auto.convert.join=true 默認(rèn)開啟

如果沒有開啟使用mapjoin，使用語句制定小表使用mapjoin

select /+ MAPJOIN(time_dim) / count(1) from

store_sales join time_dim on (ss_sold_time_sk = t_time_sk)

reduce join

對(duì)兩張大表join

對(duì)關(guān)聯(lián)的key進(jìn)行分組

smb join

Sort-Merge-Bucket join

解決大表與大表join速度慢問題

通過分桶字段的的hash值對(duì)桶的個(gè)數(shù)取余進(jìn)行分桶

set hive.enforce.bucketing=true;

create table 表名 (

字段

)

clustered by(分桶字段) into 分桶數(shù)量 buckets;

如

create table student(

id int,

age int,

name string

)

clustered by (id) into 4 bucket

row format delimited fields terminated by ",";

小文件處理

重建表，建表時(shí)減少reduce的數(shù)量

通過參數(shù)調(diào)節(jié)，設(shè)置map/reduce的數(shù)量

//每個(gè)Map最大輸入大小(這個(gè)值決定了合并后文件的數(shù)量)

set mapred.max.split.size=256000000;

//一個(gè)節(jié)點(diǎn)上split的至少的大小(這個(gè)值決定了多個(gè)DataNode上的文件是否需要合并)

set mapred.min.split.size.per.node=100000000;

//一個(gè)交換機(jī)下split的至少的大小(這個(gè)值決定了多個(gè)交換機(jī)上的文件是否需要合并)

set mapred.min.split.size.per.rack=100000000;

//執(zhí)行Map前進(jìn)行小文件合并

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

//設(shè)置map端輸出進(jìn)行合并，默認(rèn)為true

set hive.merge.mapfiles = true

//設(shè)置reduce端輸出進(jìn)行合并，默認(rèn)為false

set hive.merge.mapredfiles = true

//設(shè)置合并文件的大小

set hive.merge.size.per.task = 256000000

//當(dāng)輸出文件平均大小小于設(shè)定值時(shí)，啟動(dòng)合并操作。這一設(shè)定只有當(dāng)hive.merge.mapfiles或hive.merge.mapredfiles設(shè)定為true時(shí)，才會(huì)對(duì)相應(yīng)的操作有效。

set hive.merge.smallfiles.avgsize=16000000

數(shù)據(jù)傾斜

本質(zhì)原因：key的分布不均導(dǎo)致的

Map 端部分聚合，相當(dāng)于Combiner

hive.map.aggr=true

有數(shù)據(jù)傾斜的時(shí)候進(jìn)行負(fù)載均衡

hive.groupby.skewindata=true

當(dāng)選項(xiàng)設(shè)定為 true，生成的查詢計(jì)劃會(huì)有兩個(gè) MR Job。第一個(gè) MR Job 中，Map 的輸出結(jié)果集合會(huì)隨機(jī)分布到 Reduce 中，每個(gè) Reduce 做部分聚合操作，并輸出結(jié)果，這樣處理的結(jié)果是相同的 Group By Key 有可能被分發(fā)到不同的 Reduce 中，從而達(dá)到負(fù)載均衡的目的；第二個(gè) MR Job 再根據(jù)預(yù)處理的數(shù)據(jù)結(jié)果按照 Group By Key 分布到 Reduce 中（這個(gè)過程可以保證相同的 Group By Key 被分布到同一個(gè) Reduce 中），最后完成最終的聚合操作。

hive案例：日志分析

名詞
1) UV： count(distinct guid)
訪問您網(wǎng)站的一臺(tái)電腦客戶端為一個(gè)訪客。00:00-24:00內(nèi)相同的客戶端只被計(jì)算一次。
2) PV：Page View--- count(url)
即頁面瀏覽量或點(diǎn)擊量，用戶每次刷新即被計(jì)算一次。
3) 登錄人數(shù)：
登錄網(wǎng)站訪問的人數(shù)[會(huì)員]，endUserId有值的數(shù)量
4) 游客數(shù)：
沒有登錄訪問的人數(shù)，endUserId為空的數(shù)量
5) 平均訪問時(shí)長(zhǎng)：

訪客平均在網(wǎng)站停留的時(shí)間
trackTime  --> max - min

6) 二跳率： pv>1的訪問量/總訪問量

平均瀏覽2個(gè)頁面及以上(pv>1)的用戶數(shù) /  用戶總數(shù)(discont guid) 點(diǎn)擊1次

二跳率的概念是當(dāng)網(wǎng)站頁面展開后，用戶在頁面上產(chǎn)生的首次點(diǎn)擊被稱為“二跳”，二跳的次數(shù)即為“二跳量”。二跳量與瀏覽量的比值稱為頁面的二跳率。

count(case when pv >=2 then guid else null end ) / discont (guid)

7) 獨(dú)立IP：---count(distinct ip)
獨(dú)立IP表示，擁有特定唯一IP地址的計(jì)算機(jī)訪問您的網(wǎng)站的次數(shù)，因?yàn)檫@種統(tǒng)計(jì)方式比較容易實(shí)現(xiàn)，具有較高的真實(shí)性，所以成為大多數(shù)機(jī)構(gòu)衡量網(wǎng)站流量的重要指標(biāo)。比如你是ADSL撥號(hào)上網(wǎng)的，你撥一次號(hào)都自動(dòng)分配一個(gè)ip，這樣你進(jìn)入了本站，那就算一個(gè)ip，當(dāng)你斷線了而沒清理cookies，之后又撥了一次號(hào)，又自動(dòng)分配到一個(gè)ip，你再進(jìn)來了本站，那么又統(tǒng)計(jì)到一個(gè)ip，但是UV（獨(dú)立訪客)沒有變，因?yàn)?次都是你進(jìn)入了本站。

日期	uv	pv	登錄人數(shù)	游客人數(shù)	平均訪問時(shí)間	二跳率	獨(dú)立IP數(shù)

窗口函數(shù)分析函數(shù) over語句

準(zhǔn)備測(cè)試數(shù)據(jù)

hive (db_analogs)> create database ts;
hive (db_analogs)> use ts;
hive (ts)> create table testscore(gender string,satscore int, idnum int) row format delimited fields terminated by "	";
hive (ts)> load data local inpath "/opt/datas/TESTSCORES.csv" into table testscore;

OVER with standard aggregates: COUNT、SUM、MIN/MAX、 AVG

需求1：

按照性別分組，satscore分?jǐn)?shù)排序（降序），最后一列顯示所在分組中的最高分

Female  1000    37070397        1590
Female  970     60714297        1590
Female  910     30834797        1590
Male    1600    39196697        1600
Male    1360    44327297        1600
Male    1340    55983497        1600

答案sql:

hive (ts)>  select gender,satscore,idnum,max(satscore) over(partition by gender order by satscore desc) maxs  from testscore;

注意：

partition by 是分組用的

分析函數(shù)

要求 topN

? 按照性別分組，satscore排序（降序），最后一列顯示在分組中的名次

需求1：

分?jǐn)?shù)相同名次不同，名次后面根據(jù)行數(shù)增長(zhǎng)

Female  1590    23573597        1
Female  1520    40177297        2
Female  1520    73461797        3
Female  1490    9589297 4
Female  1390    99108497        5
Female  1380    23048597        6  # 分?jǐn)?shù)相同
Female  1380    81994397        7  # 分?jǐn)?shù)相同

需求2：

分?jǐn)?shù)相同名次相同，名次后面根據(jù)行數(shù)增長(zhǎng)

Female  1590    23573597        1
Female  1520    40177297        2
Female  1520    73461797        2
Female  1490    9589297 4
Female  1390    99108497        5
Female  1380    23048597        6  #分?jǐn)?shù)相同
Female  1380    81994397        6  # 分?jǐn)?shù)相同

需求3：

分?jǐn)?shù)相同名次相同，名次連續(xù)增長(zhǎng)

Female  1590    23573597        1
Female  1520    40177297        2
Female  1520    73461797        2
Female  1490    9589297 3
Female  1390    99108497        4
Female  1380    23048597        5
Female  1380    81994397        5

SQL

sql1
hive (ts)> select gender,satscore,idnum,row_number() over(partition by gender order by satscore desc) maxs  from testscore;
-- ROW_NUMBER() 從1開始，按照順序，生成分組內(nèi)記錄的序列

sql2
select gender,satscore,idnum,rank() over(partition by gender order by satscore desc) maxs  from testscore;
-- RANK() 生成數(shù)據(jù)項(xiàng)在分組中的排名，排名相等會(huì)在名次中留下空位 

sql3
select gender,satscore,idnum,dense_rank() over(partition by gender order by satscore desc) maxs  from testscore;
-- DENSE_RANK() 生成數(shù)據(jù)項(xiàng)在分組中的排名，排名相等會(huì)在名次中不會(huì)留下空位

窗口函數(shù)

# 當(dāng)有order by，而沒有指定窗口子句時(shí)，窗口子句默認(rèn)為RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(從起點(diǎn)到當(dāng)前行的范圍)

# 當(dāng)order by和窗口子句都沒有時(shí)，窗口子句默認(rèn)ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING（從起點(diǎn)到后面的終點(diǎn)）

UNBOUNDED PRECEDING

UNBOUNDED FOLLOWING

1 PRECEDING

1 FOLLOWING

CURRENT ROW

窗口對(duì)比

select gender,satscore,idnum,sum(satscore) over(partition by gender order by satscore desc RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) sums  from testscore;

select gender,satscore,idnum,sum(satscore) over(partition by gender order by satscore desc RANGE BETWEEN UNBOUNDED PRECEDING AND unbounded following) sums  from testscore;

select gender,satscore,idnum,sum(satscore) over(partition by gender order by satscore desc RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) sums  from testscore;

當(dāng)前行數(shù)據(jù)幅度+1后范圍內(nèi)

LAG

落后值（上n個(gè)值），在不指定落后個(gè)數(shù)的情況下，默認(rèn)為落后一個(gè)值（數(shù)據(jù)從上向下顯示，落后即當(dāng)前值之前顯示的值）

場(chǎng)景: 分析用戶頁面瀏覽順序

sql

hive (ts)> select gender,satscore,idnum, lag(satscore) over(partition by gender order by satscore desc) as lastvalue from testscore;

要求

gender  satscore        idnum   lastvalue
Female  1590    23573597        NULL  # 此處為null，可以為其指定默認(rèn)值
Female  1520    40177297        1590  # 顯示當(dāng)前satscore的上一條記錄的值
Female  1520    73461797        1520  # 顯示當(dāng)前satscore的上一條記錄的值
Female  1490    9589297 1520
Female  1390    99108497        1490

LEAD

與LAG相反（下n擱置），用法同理，前面的值（領(lǐng)先值），默認(rèn)為領(lǐng)先一個(gè)值（數(shù)據(jù)從上向下顯示，領(lǐng)先即當(dāng)前值之后顯示的值）

sql

hive (ts)> select gender,satscore,idnum, lead(satscore, 1, 0) over(partition by gender order by satscore desc) as nextvalue from testscore;

結(jié)果

gender  satscore        idnum   nextvalue
...
Female  1060    59149297        1060
Female  1060    46028397        1000
Female  1000    37070397        970
Female  970     60714297        910
Female  910     30834797        0

GPU云服務(wù)器云服務(wù)器 sqoop hive oracle Sqoop apache sqoop sqoop導(dǎo)入hbase

文章版權(quán)歸作者所有，未經(jīng)允許請(qǐng)勿轉(zhuǎn)載,若此文章存在違規(guī)行為，您可以聯(lián)系管理員刪除。

轉(zhuǎn)載請(qǐng)注明本文地址：http://systransis.cn/yun/18029.html

發(fā)表評(píng)論

登陸后可評(píng)論

0條評(píng)論

Kahn

男|高級(jí)講師

我要關(guān)注我要私信

TA的文章

極客時(shí)間云原生訓(xùn)練營(yíng)

閱讀 5130·2021-11-25 09:43
恒創(chuàng)科技11.11超值購：海外云服務(wù)器264元/年起,獨(dú)立服務(wù)器5折再加送10M帶寬

閱讀 1704·2021-10-27 14:18
虛擬主機(jī)是干什么的-使用虛擬云主機(jī)可以干什么？

閱讀 1069·2021-09-22 16:03
清除浮動(dòng)（clearfix）的常見方法

閱讀 1366·2019-08-30 13:19
879本IT技術(shù)書籍(都看完你就可以升仙兒了)

閱讀 1588·2019-08-30 11:15
可能是你見過最完善的微前端解決方案

閱讀 1664·2019-08-26 14:04
Centos7環(huán)境下源碼部署ThingsBoard

閱讀 3138·2019-08-23 18:40
#思考# 移動(dòng)端絕對(duì)定位在不同設(shè)備適配問題

閱讀 1179·2019-08-23 18:17

成人国产在线小视频_日韩寡妇人妻调教在线播放_色成人www永久在线观看_2018国产精品久久_亚洲欧美高清在线30p_亚洲少妇综合一区_黄色在线播放国产_亚洲另类技巧小说校园_国产主播xx日韩_a级毛片在线免费

資訊專欄INFORMATION COLUMN

上云采購季！| 2核2G4M爆款云服務(wù)器低至59元/年，更有多臺(tái)、長(zhǎng)期優(yōu)惠，快來選購！

Hive+Sqoop淺度學(xué)習(xí)指南

相關(guān)文章

**數(shù)據(jù)集成工具的使用（一）---Sqoop 從理論學(xué)習(xí)到熟練使用**

如何將其他RDBMS的數(shù)據(jù)到遷移到Trafodion

發(fā)表評(píng)論

0條評(píng)論

Kahn

男|高級(jí)講師

TA的文章

極客時(shí)間云原生訓(xùn)練營(yíng)

恒創(chuàng)科技11.11超值購：海外云服務(wù)器264元/年起,獨(dú)立服務(wù)器5折再加送10M帶寬

虛擬主機(jī)是干什么的-使用虛擬云主機(jī)可以干什么？

清除浮動(dòng)（clearfix）的常見方法

879本IT技術(shù)書籍(都看完你就可以升仙兒了)

可能是你見過最完善的微前端解決方案

Centos7環(huán)境下源碼部署ThingsBoard

#思考# 移動(dòng)端絕對(duì)定位在不同設(shè)備適配問題

最新活動(dòng)

資訊專欄INFORMATION COLUMN

上云采購季！| 2核2G4M爆款云服務(wù)器低至59元/年，更有多臺(tái)、長(zhǎng)期優(yōu)惠，快來選購！

Hive+Sqoop淺度學(xué)習(xí)指南

相關(guān)文章

發(fā)表評(píng)論

0條評(píng)論

男|高級(jí)講師

TA的文章

最新活動(dòng)

上云采購季！| 2核2G4M爆款云服務(wù)器低至59元/年，更有多臺(tái)、長(zhǎng)期優(yōu)惠，快來選購！