
Basic Hive Operations on Hadoop

Date: 2023-11-29

Hive Database Operations

Creating a database

By default, Hive databases are stored under the /user/hive/warehouse directory.

create database myhive;
create database if not exists myhive;

Creating a database at a specified path; the LOCATION clause specifies the storage directory for the database.

create database myhive location '/myhive';

Listing databases

show databases;
-- filter with LIKE
show databases like 'my*';

Viewing database details

desc database myhive;

More detail

desc database extended myhive;

Switching databases

use myhive;

Dropping a database

Drop an empty database; this raises an error if the database still contains tables.

drop database myhive;
drop database if exists myhive;

Force-drop a database, deleting the tables inside it as well.

drop database myhive cascade;
drop database if exists myhive cascade;

Modifying database properties

Use ALTER DATABASE to set key-value pairs in a database's DBPROPERTIES, describing the database.

This modifies the database's properties, but its metadata (the database name and location) cannot be changed.

alter database myhive2 set dbproperties('createtime'='20220202');

Hive Data Types

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

Primitive data types

Type | Java type | Description
TINYINT | byte | 8-bit signed integer; range -128 to 127
SMALLINT | short | 16-bit signed integer; range -32768 to 32767
INT | int | 32-bit signed integer; range -2^31 to 2^31-1
BIGINT | long | 64-bit signed integer; range -2^63+1 to 2^63-1
BINARY | | binary data; currently limited to 8 MB
FLOAT | float | 32-bit binary floating point
DOUBLE | double | 64-bit binary floating point
DECIMAL(precision, scale) | | exact decimal type; precision is the total number of digits (1 <= precision <= 38), scale is the number of fractional digits (0 <= scale <= 38); defaults to decimal(10,0) if unspecified
VARCHAR(n) | | variable-length character type of length n; range of n: 1 to 65535
CHAR(n) | | fixed-length character type of length n; maximum 255; short values are padded with spaces, which do not participate in comparisons
STRING | string | string type; currently limited to 8 MB
DATE | | date in yyyy-mm-dd format; range 0000-01-01 to 9999-12-31
DATETIME | | date-time type; range 0000-01-01 00:00:00.000 to 9999-12-31 23:59:59.999, millisecond precision
TIMESTAMP | | timezone-independent timestamp; range 0000-01-01 00:00:00.000000000 to 9999-12-31 23:59:59.999999999, nanosecond precision
BOOLEAN | boolean | boolean type; values True and False

Complex data types

Type | Declaration | Constructor
ARRAY | array<int>, array<struct<a:int, b:string>> | array(1, 2, 3), array(array(1, 2), array(3, 4))
MAP | map<string, string>, map<smallint, array<string>> | map('k1', 'v1', 'k2', 'v2'), map(1S, array('a', 'b'), 2S, array('x', 'y'))
STRUCT | struct<x:int, y:int>, struct<field1:bigint, field2:array<int>, field3:map<int, int>> | named_struct('x', 1, 'y', 2), named_struct('field1', 100L, 'field2', array(1, 2), 'field3', map(1, 100, 2, 200))

Hive Table Operations

Database encoding issues

Creating a table may fail with this exception:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:An exception was thrown while adding/validating class(es) : Column length too big for column 'PARAM_VALUE' (max = 21845); use BLOB or TEXT instead
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Column length too big for column 'PARAM_VALUE' (max = 21845); use BLOB or TEXT instead

Fix: the metastore database usually defaults to utf8 or utf8mb4 encoding; change the hive metastore database's encoding to latin1.

Syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

CREATE TABLE

Creates a table with the given name; unless IF NOT EXISTS is given, an error is raised if a table with that name already exists.

EXTERNAL

Creates an external table, typically together with LOCATION pointing at existing data. Dropping an external table removes only the metadata, not the data.

COMMENT

Adds a description to the table or to a column.

PARTITIONED BY

Declares partition columns; they must not duplicate the table's own columns.

CLUSTERED BY

Distributes rows into buckets by the given columns.

SORTED BY

Sorts the rows within each bucket by the given columns.

ROW FORMAT

Specifies the row format, e.g. DELIMITED FIELDS TERMINATED BY ',' or a custom SerDe.

STORED AS

Specifies the storage format, e.g. TEXTFILE, SEQUENCEFILE, ORC, PARQUET.

LOCATION

Specifies the HDFS path where the table's data is stored.

LIKE

Copies an existing table's definition without copying its data.

Creating tables

create table if not exists tb_test(id int, name string);

create table if not exists tb_test1(id int, name string)
row format delimited fields terminated by ',';

A table with complex types (illustrative schema):

create table tb_complex(
  name string,
  friends array<string>,
  children map<string, int>,
  address struct<street:string, city:string>
)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

Listing tables

show tables;

Checking a table's type

desc formatted tb_test;

Viewing a table's schema

desc tb_test;

Inserting data

insert into tb_test values (1, 'Tom'), (2, 'Jerry');

Querying data

select * from tb_test;

Altering tables

Renaming a table

alter table tb_test1 rename to tb_test1_new;

Adding columns

alter table tb_test add columns (age int, address string);

Updating a column

alter table tb_test change column address addr string;

Dropping a table

drop table tb_test;

Internal vs. external tables

When a table is dropped, an internal (managed) table's metadata and data are deleted together; for an external table only the metadata is deleted and the data is kept.

An external table maps data at a given HDFS path into the table; Hive does not take exclusive ownership of that data, so when the table is dropped the files remain in HDFS.

Creating an internal table

Create an internal table, specifying the field delimiter:

create table if not exists tb_test(id int, name string)
row format delimited fields terminated by '\t';

Create a table from a query result (copies both schema and data):

create table tb_test2 as select * from tb_test;

Create a table from an existing table's schema only:

create table tb_test3 like tb_test;

Creating an external table

Create an external table, specifying the field delimiter:

create external table if not exists tb_test_ext(id int, name string)
row format delimited fields terminated by '\t';

Loading and Exporting Data

Loading data

load data [local] inpath 'datapath' [overwrite] into table tb_name [partition (partcol1=val1,…)];

local: load from the local file system; without it, the path is an HDFS path. overwrite: replace the table's existing data. partition: load into the specified partition.

1. Loading local (Linux) data

Load data (copies the source file into the table's HDFS directory):

load data local inpath '/var/hive/tb_test.txt' into table tb_test;

Load data, overwriting existing data:

load data local inpath '/var/hive/tb_test.txt' overwrite into table tb_test;

2. Loading HDFS data

Load data from HDFS into the table (moves the file into the table's HDFS directory):

load data inpath '/lzj/tb_test.txt' into table tb_test;

3. Loading via insert-select

Insert a query result into one table:

insert overwrite table tb_test1 select * from tb_test;

Insert a query result into multiple tables at once:

from tb_test
insert overwrite table tb_test1 select id, name
insert overwrite table tb_test2 select id, name;

Exporting data

1. Exporting to the local file system

Create a directory for the data:

mkdir -p /var/hive/tb_test

Write the query result to the local directory:

insert overwrite local directory '/var/hive/tb_test' select * from tb_test;

Write the output to the local directory with an explicit format:

insert overwrite local directory '/var/hive/tb_test'
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n'
select * from tb_test;

2. Exporting to HDFS

Write the query result to HDFS:

# create a directory for the data
hdfs dfs -mkdir -p /lzj/copy
# export the query result
insert overwrite directory '/hive/back/tb_test'
row format delimited fields terminated by ','
select * from tb_test;

Copy the files to another directory with HDFS commands:

hdfs dfs -cp /hive/back/tb_test/* /lzj/copy

Back up both the table schema and its data:

# export the data
export table tb_test to '/hive/back/tb_test';
# drop the table
drop table tb_test;
# restore the schema and data
import from '/hive/back/tb_test';

Partitioned Tables

Partitioning splits a large data set into smaller files by some rule (year, month, day, hour), so that queries can operate on the small files.
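The layout can be pictured with a short sketch: each distinct partition value becomes its own subdirectory under the table's directory, so a query that filters on the partition column only reads that directory. A minimal Python model (the helper and paths are illustrative, not Hive code):

```python
def partition_paths(table_dir, rows, part_col):
    """Group rows into Hive-style partition directories (sketch only)."""
    layout = {}
    for row in rows:
        # Hive names partition directories as <column>=<value>
        path = f"{table_dir}/{part_col}={row[part_col]}"
        layout.setdefault(path, []).append(row)
    return layout

rows = [
    {"id": 1, "name": "Tom", "month": "202206"},
    {"id": 2, "name": "Jerry", "month": "202206"},
    {"id": 3, "name": "Spike", "month": "202207"},
]
layout = partition_paths("/user/hive/warehouse/tb_score", rows, "month")
# two partition directories: month=202206 (2 rows) and month=202207 (1 row)
```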

Creating a partitioned table

Create a table with one partition column; the partition column must not duplicate a table column:

create table tb_score(id int, name string, score double)
partitioned by (month string)
row format delimited fields terminated by '\t';

Create a table with multiple partition columns; consecutive partition columns form a parent-child (nested directory) relationship:

create table tb_score2(id int, name string, score double)
partitioned by (year string, month string, day string)
row format delimited fields terminated by '\t';

Loading data into a partitioned table

load data local inpath '/var/hive/score.txt' into table tb_score partition (month='202206');

load data local inpath '/var/hive/score.txt' into table tb_score2 partition (year='2022', month='06', day='01');

Querying partitioned tables

Querying across partitions:

select * from tb_score where month='202206'
union all
select * from tb_score where month='202207';

Inserting data directly:

insert into table tb_score partition (month='202207') values (1, 'Tom', 90);

Inserting data via a query

Using load:

load data local inpath '/var/hive/score.txt' overwrite into table tb_score partition (month='202208');

Using select:

insert overwrite table tb_score partition (month='202208')
select id, name, score from tb_score where month='202206';

Viewing partitions:

show partitions tb_score;

Adding a partition:

alter table tb_score add partition (month='202209');

Dropping a partition:

alter table tb_score drop partition (month='202209');

Dynamic partitions

The operations above use static partitions.

The main difference: static partition values are specified by hand, while dynamic partition values are derived from the data.

Static partition columns are decided at compile time from values the user passes in; dynamic partitions can only be decided while the SQL executes.

To use dynamic partitioning, first enable it in the Hive session:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Other parameters:

set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;
set hive.exec.max.created.files=100000;

Creating the partitioned target table:

create table tb_part(id int, name string)
partitioned by (dt string)
row format delimited fields terminated by '\t';

Creating the external source table:

create external table tb_src(id int, name string, dt string)
row format delimited fields terminated by '\t'
location '/hive/src';

With static partitioning, an insert must specify the partition value; several distinct partition values require several inserts:

insert overwrite table tb_part partition (dt='2022-06-01')
select id, name from tb_src where dt='2022-06-01';

With dynamic partitioning, Hive decides from the select result which partition each row should be loaded into:

insert overwrite table tb_part partition (dt)
select id, name, dt from tb_src;

Bucketed Tables

Bucketing divides data by a specified column into multiple buckets (multiple files).

Hive hashes the column value and takes it modulo the number of buckets to decide which bucket each row is stored in.
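The hash-and-modulo rule can be sketched in a few lines of Python (a toy model; for integer columns Hive's hash is the value itself, which small Python ints mirror):

```python
def bucket_id(key: int, num_buckets: int) -> int:
    # hash the bucketing column's value, then take it modulo the bucket count
    return hash(key) % num_buckets  # hash(n) == n for small Python ints

ids = [1, 2, 3, 4, 5, 6]
buckets = {b: [] for b in range(3)}
for i in ids:
    buckets[bucket_id(i, 3)].append(i)
# bucket 0 -> [3, 6], bucket 1 -> [1, 4], bucket 2 -> [2, 5]
```

Rows with the same key always land in the same bucket, which is what makes bucket-level sampling and bucket map joins possible.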

Enabling bucketing:

set hive.enforce.bucketing=true;

Setting the number of reducers:

set mapreduce.job.reduces=3;

Creating a bucketed table:

create table tb_bucket(id int, name string)
clustered by (id) into 3 buckets
row format delimited fields terminated by '\t';

Creating a staging table

Create an ordinary table; its data is then loaded into the bucketed table with insert overwrite plus a query:

create table tb_common(id int, name string)
row format delimited fields terminated by '\t';

Loading data into the staging table:

load data local inpath '/var/hive/tb_common.txt' into table tb_common;

Loading data into the bucketed table

Load the bucketed table via insert overwrite:

insert overwrite table tb_bucket select * from tb_common;

Sampling

Block sampling: Hive randomly takes N rows, a percentage of the total data, or N bytes:

select * from tb_test tablesample(10 percent);
select * from tb_test tablesample(5 rows);
select * from tb_test tablesample(1M);

Bucket sampling:

select * from tb_bucket tablesample(bucket 1 out of 3 on id);

Random sampling:

select * from tb_test distribute by rand() sort by rand() limit 10;
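The idea behind the random-sampling query (order rows by a random key, then take the first n) can be sketched in Python; the helper and seed are illustrative:

```python
import random

def random_sample(rows, n, seed=42):
    """Shuffle rows by a random sort key, then keep the first n (sketch)."""
    rng = random.Random(seed)  # seeded only so this sketch is reproducible
    shuffled = sorted(rows, key=lambda _: rng.random())
    return shuffled[:n]

rows = list(range(100))
sample = random_sample(rows, 10)
# 10 distinct rows drawn in random order
```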

Hive Queries

Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]

order by

Global ordering over the whole result, handled by a single reducer.

sort by

Orders rows within each reducer; the output is sorted per reducer, not globally.

distribute by

Controls which reducer each row is sent to (hash partitioning); usually combined with sort by.

Cluster by

Shorthand for distribute by plus sort by on the same column; only ascending order is supported.

Query examples

select * from tb_test;
select id, name from tb_test;

Column aliases

select id as stu_id, name as stu_name from tb_test;

LIMIT

The LIMIT clause restricts the number of rows returned:

select * from tb_test limit 5;

WHERE

The WHERE clause filters rows:

select * from tb_test where id > 1;

LIKE and RLIKE

LIKE works as it does in MySQL.

RLIKE is a Hive extension that matches using Java regular expressions:

select * from tb_test where name like 'T%';

select * from tb_test where name rlike '[Tt]om';

Common functions

Row count (count):

select count(*) from tb_test;

Maximum (max):

select max(id) from tb_test;

Minimum (min):

select min(id) from tb_test;

Sum (sum):

select sum(id) from tb_test;

Average (avg):

select avg(id) from tb_test;

Comparison operators

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

select * from tb_test where id = 1;
select * from tb_test where id between 1 and 3;
select * from tb_test where name is not null;
select * from tb_test where id in (1, 2);

Logical operators

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

select * from tb_test where id > 1 and name = 'Tom';

select * from tb_test where id > 3 or id < 2;

Grouping

GROUP BY

select name, count(*) from tb_test group by name;

select name, max(id) from tb_test group by name;

HAVING

HAVING filters groups after aggregation; unlike WHERE, it can use aggregate functions:

select name, avg(id) as avg_id from tb_test group by name having avg_id > 2;

JOIN

Hive supports the usual SQL JOIN syntax, but only equi-joins; non-equi-joins are not supported.
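The reason for the equi-join restriction shows up in how such joins execute: build a hash table on one side's key, then probe it with the other side's rows, which only works for equality conditions. A toy Python sketch (the table contents are illustrative):

```python
def hash_join(left, right, key):
    """Toy hash equi-join: index the left side by key, probe with the right."""
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    out = []
    for row in right:
        for match in index.get(row[key], []):
            out.append({**match, **row})  # merge the matching rows
    return out

tb_a = [{"id": 1, "name": "Tom"}, {"id": 2, "name": "Jerry"}]
tb_b = [{"id": 1, "score": 90}, {"id": 3, "score": 70}]
joined = hash_join(tb_a, tb_b, "id")
# only id=1 exists in both tables
```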

Equi-join:

select a.id, a.name, b.score from tb_a a join tb_b b on a.id = b.id;

Inner join: returns the rows from both tables that match the join condition:

select * from tb_a a inner join tb_b b on a.id = b.id;

Left outer join: returns all rows from the left table that satisfy the WHERE clause:

select * from tb_a a left outer join tb_b b on a.id = b.id;

Right outer join: returns all rows from the right table that satisfy the WHERE clause:

select * from tb_a a right outer join tb_b b on a.id = b.id;

A join of n tables needs at least n-1 join conditions.

In most cases Hive starts one MapReduce job per pair of joined tables.

For example: a first MapReduce job joins tb_a and tb_b, then a second job joins that output with tb_c:

select * from tb_a a
join tb_b b on a.id = b.id
join tb_c c on b.id = c.id;

Ordering

Global ordering

ORDER BY performs a global sort with a single reducer, so it can take a long time on large inputs.

order by goes at the end of the select statement:

select * from tb_test order by id;

select * from tb_test order by id desc;

Ordering by an alias:

select id, name, id * 2 as double_id from tb_test order by double_id;

Ordering by multiple columns:

select * from tb_test order by id, name;

Sort By (local ordering)

Sort By sorts within each reducer, so each reducer's output is sorted before it is written; the result is not globally ordered:

set mapreduce.job.reduces=3;
select * from tb_test sort by id desc;

Partitioned ordering (DISTRIBUTE BY): distribute by (column) sends rows to different reducers based on the column, using hash partitioning.

It must be combined with sort by, and distribute by must come before sort by:

select * from tb_test distribute by id sort by name;
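The distribute by plus sort by combination can be modeled in Python: hash-partition rows to reducers by one column, then sort each reducer's rows, giving per-reducer rather than global order. A toy sketch:

```python
def distribute_and_sort(rows, dist_key, sort_key, num_reducers):
    """Hash rows to reducers by dist_key, then sort each reducer by sort_key."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[hash(row[dist_key]) % num_reducers].append(row)
    return [sorted(r, key=lambda x: x[sort_key]) for r in reducers]

rows = [{"id": i, "name": n} for i, n in [(3, "c"), (1, "a"), (2, "b"), (4, "d")]]
outputs = distribute_and_sort(rows, "id", "name", 2)
# each reducer's list is sorted by name; the concatenation is not globally sorted
```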

Distribute and sort: CLUSTER BY

select * from tb_test cluster by id;

-- equivalent to:
select * from tb_test distribute by id sort by id;

Hive Functions

Built-in functions

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inFunctions

List the built-in functions:

show functions;

Show a function's usage:

desc function upper;

Show a function's usage in detail:

desc function extended upper;

User-defined functions

https://cwiki.apache.org/confluence/display/Hive/HivePlugins

When the built-in functions cannot meet business needs, Hive can be conveniently extended with user-defined functions (UDFs).

Custom functions fall into three categories:

UDF: one row in, one row out (e.g. upper)
UDAF: aggregate function, many rows in, one row out (e.g. count, max)
UDTF: table-generating function, one row in, many rows out (e.g. explode)

Add the dependency (the version should match your Hive installation):

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>3.1.2</version>
</dependency>

Write the function (an illustrative UDF that upper-cases its input):

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUpper extends UDF {
    public String evaluate(String input) {
        if (input == null) {
            return null;
        }
        return input.toUpperCase();
    }
}

Package the project and upload the jar to the Hive server:

mvn clean package
# then copy the jar to the server, e.g. /var/hive/udf.jar

Add the jar in the Hive client:

add jar /var/hive/udf.jar;

Register the custom function (use your class's fully qualified name):

create temporary function my_upper as 'com.example.MyUpper';

Use the custom function:

select my_upper(name) from tb_test;
Original: https://juejin.cn/post/7098144957932044325

