`

用正则表达式匹配w3c日志,导入hive

    博客分类:
  • hive
阅读更多

用正则表达式匹配w3c日志,导入hive

w3c格式建表语句:

 

CREATE EXTERNAL TABLE w3c(
  host STRING,
  ident STRING,
  username STRING,
  log_date STRING,
  method STRING,
  uri STRING,
  version STRING,
  status STRING,
  flux STRING,
  referer STRING,
  user_agent STRING
  )
PARTITIONED BY(dt STRING, userid STRING, channelid STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([\\d|.]+)\\s+([^ ]+)\\s+([^ ]+)\\s+\\[(.+)\\]\\s+\"([^ ]+)\\s(.+)\\s([^ ]+)\"\\s+([^ ]+)\\s+([^ ]+)\\s+\"(.+)\"\\s+\"(.+)\"?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %5$s %6$s %7$s %8$s %9$s" );
 

 

导入数据方式有两种:

1、将指定路径下的文件导入,但是不移动文件

 

alter table w3c add partition (dt='1', userid='1',channelid='1') location 'path';
 

 

2、导入指定文件,并将文件移动到/user/hive/warehouse下

 

load data inpath 'path' overwrite into table w3c partition (dt='1',userid='1',channelid='1');
 

 

删除表的语句

 

drop table tablename;
 

 

显示表内的partition

 

show partitions tablename;

 

显示建表信息

 

describe tablename;显示表内字段及其类型
describe extended tablename;建表的其他配置方面的信息,partition, table type 等等
 

 

将表查询出来导入文件

 

insert overwrite directory 'path' select a.* from w3c a where a.dt='1' and a.userid='1' and channelid='1';
 

 

另外有几点要注意的:

1、导入日志支持.gz等格式
2、在查询表之前要先执行
     hive> add jar /usr/lib/hive/lib/hive-contrib-0.7.1-cdh3u4.jar;
     或者设置hive/conf/hive-default.conf  添加
     <property>
          <name>hive.aux.jars.path</name>
          <value>file:///usr/local/hadoop/hive/lib/hive-contrib-0.7.0-cdh3u0.jar</value>
     </property>
     保存配置

————————————————————————————————————————————————————

下面是学习hive是从https://cwiki.apache.org 摘抄下来的东西:

 

DDL Operations

create

	hive> CREATE TABLE pokes (foo INT, bar STRING);
	hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
		The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into.

show tables

 

	hive> SHOW TABLES;
	hive> SHOW TABLES '.*s';
		可以匹配正则表达式
 

shows the list of columns

 

	hive> DESCRIBE invites;
 

ALTER TABLE

 

	hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
  	hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
  	hive> ALTER TABLE events RENAME TO 3koobecaf;
 

 

Dropping tables

 

	hive> DROP TABLE pokes;
 

 

 

 

DML Operations

Loading data from flat files into Hive:

 

	hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
		If 'local' is omitted then it looks for the file in HDFS.s
		The keyword 'overwrite' signifies that existing data in the table is deleted.If the 'overwrite' keyword is omitted, data files are appended to existing data sets.
	hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
 

 

 

SQL Operations

SELECTS and FILTERS

 

	hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
		displayed on the console.
	hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
		into an HDFS directory. 
	hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
		into a local directory

	hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
  	hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
  	hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
 	hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
 	hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
 	hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
	hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
 

 

GROUP BY

 

	hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
	hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
		Note that for versions of Hive which don't include HIVE-287 , you'll need to use COUNT(1) in place of COUNT(*).

	hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
 

 

 

MULTITABLE INSERT

 

	FROM src
	INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
	INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
	INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
	INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
 

 

STREAMING

 

	FROM invites a 
	INSERT OVERWRITE TABLE events 
	SELECT 
		TRANSFORM(a.foo, a.bar) 
		AS (oof, rab) 
		USING '/bin/cat' 
	WHERE a.ds > '2008-08-09';
 

 

 

Joins

 

	INSERT OVERWRITE TABLE pv_users
    SELECT pv.*, u.gender, u.age
    FROM user u JOIN page_view pv ON (pv.userid = u.id)
    WHERE pv.date = '2008-03-03';

    LEFT OUTER, RIGHT OUTER or FULL OUTER
    Note that Hive only supports equi-joins.
 

 

 

Aggregations

 

	Multiple aggregations can be done at the same time, however, no two aggregations can have different DISTINCT columns .e.g while the following is possible
	    INSERT OVERWRITE TABLE pv_gender_agg
	    SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
	    FROM pv_users
	    GROUP BY pv_users.gender;
    however, the following query is not allowed
	    INSERT OVERWRITE TABLE pv_gender_agg
	    SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
	    FROM pv_users
	    GROUP BY pv_users.gender;
 

——————————————————————————————————————————————————————

下面是从其他网站看的,觉得有用就记录下来

 

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type
    [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
  [SORTED BY (col_name [ASC|DESC], ...)]
  INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
 

 

   [ROW FORMAT DELIMITED]关键字,是用来设置创建的表在加载数据的时候,支持的列分隔符;

[STORED AS file_format]关键字是用来设置加载数据的数据类型。Hive本身支持的文件格式只有:Text File,Sequence File。如果文件数据是纯文本,可以使用 [STORED AS TEXTFILE]。如果数据需要压缩,使用 [STORED AS SEQUENCE] 。通常情况,只要不需要保存序列化的对象,我们默认采用[STORED AS TEXTFILE]。

 

hive支持的字段类型,并不多,可以简单的理解为数字类型和字符串类型,详细列表如下:

TINYINT

SMALLINT

INT

BIGINT

BOOLEAN

FLOAT

DOUBLE

STRING


 外部表


    EXTERNAL 关键字可以让用户创建一个外部表,在建表的同时指定一个指向实际数据的路径(LOCATION),Hive 创建内部表时,会将数据移动到数据仓库指向的路径;若创建外部表,仅记录数据所在的路径,不对数据的位置做任何改变。在删除表的时候,内部表的元数据和数据会被一起删除,而外部表只删除元数据,不删除数据。具体sql如下:

    CREATE EXTERNAL TABLE test_1(id INT, name STRING, city STRING) SORTED BY TEXTFILE ROW FORMAT DELIMITED‘\t’ LOCATION ‘hdfs://../../..’



桶的概念,主要是为性能考虑,可以理解为对分区内列,进行再次划分,提高性能。在底层,一个桶其实是一个文件。如果桶划分过多,会导致文件数量暴增,一旦达到系统文件数量的上限,就杯具了。

 

分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics