
Synthetic Data with TPC-DS & TPC-H

To benchmark Kudu's performance, I looked into how SREs at big companies generate synthetic test data.
This post links to the original write-ups; it only records the problems I ran into while generating the data and where my process differs from those articles.

Fayson's blog posts:

How to compile and use TPC-DS to generate test data

How to compile and use hive-testbench to generate Hive benchmark data

Impala TPC-DS benchmark

I. Problems encountered

1. The source would not compile

After downloading the source and running the build, the required components simply would not download.

I found a workaround via Google:

download the packages manually first, put them into the corresponding folder, and then compile.

2. Problems during installation

Something that looks like a configuration conflict; I don't know why.

I tried answering both yes and no, but neither is right: after the configuration completes, execution still fails.

My initial suspicion is a version problem, so I will download an older version and try it.

TPC download page

I was using v2.11 before; now I'll download v2.10.1 and try again.

After running it, the first error:

insufficient permissions, so I switched to the hdfs user to create the directories.

But the hdfs user cannot git clone,

so I cloned as root and then ran

chown -R hdfs:hdfs hive-testbench/
chmod -R 777 hive-testbench/

to open up the permissions.

The remaining steps were done as the hdfs user.

After waiting a while, the MapReduce jobs ran fine, but once they finished it still errored out, and I don't know why.

I made many other attempts in between; the failed ones are not recorded here.

The key finding: in the error logs I noticed that HiveServer2 had some problems.

After some Googling it turned out to be a configuration issue: the configuration options used previously have been deprecated in Java 8. Changing them fixed the HiveServer2 problem,

but the TPC-DS problem itself was still not solved. Exasperating.

Commands to verify that HiveServer2 has started correctly:

$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000> SHOW TABLES;
show tables;
+-----------+
| tab_name |
+-----------+
+-----------+
No rows selected (0.238 seconds)
0: jdbc:hive2://localhost:10000>

At this point my working theory is that something is still wrong in the CDH configuration,

but TPC-H clearly can generate data, which makes it all the more frustrating.

3. Moving data from Hive into Kudu is far too slow

From Friday evening until Monday morning, only 210 of 1000 tasks completed. Extremely slow.


II. Solutions

1. Summarizing the problem

Thinking through the errors I hit, a few points stand out. First, TPC-H can generate data. Second, while compiling the TPC-DS source I kept getting strange messages; I have long suspected that my build went wrong somewhere, but after recompiling several times I never found a fix.

I then found a solution on another technical blog: compile locally and upload the result to the server. That blog, however, uses the official upstream hive-testbench, which has some errors, so I directly downloaded the version others were using, hive14.zip (available on TIM), then downloaded TPCDS_Tools.zip, renamed it tpcds_kit.zip, and put it into the corresponding tpcds folder. With that, the build finally succeeded.

Once the build completes, the data sits in Hive, in two generated databases: one ORC and one TEXT. Impala cannot use the ORC files, so just migrate the TEXT database. The migration code can be found in the GitHub repo below. One thing to watch is the final packaging configuration: because the project is Scala, the packaging setup differs from a typical Java project:

<build>
    <finalName>anlogSparkSQL</finalName>
    <plugins>
        <!-- set the Java compile source/target version -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <!-- compile the Scala sources to classes -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>kuduimport.hiveToKudu</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

When calling the Spark program from Oozie in HUE, if you want the run to appear in the Spark history server, add the following to the spark-submit options:

--conf spark.shuffle.memoryFraction=0.3
--conf spark.yarn.historyServer.address=http://datanode127:18089
--conf spark.eventLog.dir=hdfs://master126:8020/user/spark/spark2ApplicationHistory
--conf spark.eventLog.enabled=true

github/kudu-learning
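
The full import job lives in the GitHub repo linked above. As a rough idea of its core, here is a minimal Scala sketch of a Hive-to-Kudu Spark job using the standard kudu-spark KuduContext API; the master address, database names, and table list below are placeholder assumptions, not the exact code from the repo.

import org.apache.spark.sql.SparkSession
import org.apache.kudu.spark.kudu.KuduContext

object HiveToKuduSketch {
  def main(args: Array[String]): Unit = {
    // Hive support is required so spark.table() can read the TPC-DS text database.
    val spark = SparkSession.builder()
      .appName("hiveToKudu-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder Kudu master address; use the real master list of the cluster.
    val kuduContext = new KuduContext("master126:7051", spark.sparkContext)

    val hiveDb = "tpcds_text_1000"        // source Hive text database (assumed name)
    val kuduDb = "kudu_spark_tpcds_1000"  // target database created by the DDL script below

    // Two tables as an example; the real job loops over all TPC-DS tables.
    Seq("call_center", "catalog_page").foreach { t =>
      val df = spark.table(s"$hiveDb.$t")
      // Tables created through impala-shell are registered in Kudu as "impala::db.table".
      kuduContext.insertRows(df, s"impala::$kuduDb.$t")
    }

    spark.stop()
  }
}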


2. Script to create the Kudu tables automatically

The script:

create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists call_center;

create table call_center(
cc_call_center_sk bigint
, cc_call_center_id string
, cc_rec_start_date string
, cc_rec_end_date string
, cc_closed_date_sk bigint
, cc_open_date_sk bigint
, cc_name string
, cc_class string
, cc_employees int
, cc_sq_ft int
, cc_hours string
, cc_manager string
, cc_mkt_id int
, cc_mkt_class string
, cc_mkt_desc string
, cc_market_manager string
, cc_division int
, cc_division_name string
, cc_company int
, cc_company_name string
, cc_street_number string
, cc_street_name string
, cc_street_type string
, cc_suite_number string
, cc_city string
, cc_county string
, cc_state string
, cc_zip string
, cc_country string
, cc_gmt_offset double
, cc_tax_percentage double
, PRIMARY KEY(cc_call_center_sk)
)
PARTITION BY HASH PARTITIONS 2
STORED AS KUDU;

create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists catalog_page;

create table catalog_page(
cp_catalog_page_sk bigint
, cp_catalog_page_id string
, cp_start_date_sk bigint
, cp_end_date_sk bigint
, cp_department string
, cp_catalog_number int
, cp_catalog_page_number int
, cp_description string
, cp_type string
, PRIMARY KEY(cp_catalog_page_sk)
)
PARTITION BY HASH PARTITIONS 2
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists catalog_returns;

create table catalog_returns
(
cr_item_sk bigint,
cr_order_number bigint,
cr_returned_date_sk bigint,
cr_returned_time_sk bigint,
cr_refunded_customer_sk bigint,
cr_refunded_cdemo_sk bigint,
cr_refunded_hdemo_sk bigint,
cr_refunded_addr_sk bigint,
cr_returning_customer_sk bigint,
cr_returning_cdemo_sk bigint,
cr_returning_hdemo_sk bigint,
cr_returning_addr_sk bigint,
cr_call_center_sk bigint,
cr_catalog_page_sk bigint,
cr_ship_mode_sk bigint,
cr_warehouse_sk bigint,
cr_reason_sk bigint,
cr_return_quantity int,
cr_return_amount double,
cr_return_tax double,
cr_return_amt_inc_tax double,
cr_fee double,
cr_return_ship_cost double,
cr_refunded_cash double,
cr_reversed_charge double,
cr_store_credit double,
cr_net_loss double
, PRIMARY KEY(cr_item_sk,cr_order_number)
)
PARTITION BY HASH (cr_item_sk) PARTITIONS 16
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;

use kudu_spark_tpcds_1000;

drop table if exists catalog_sales;

create table catalog_sales
(
cs_item_sk bigint,
cs_order_number bigint,
cs_sold_date_sk bigint,
cs_sold_time_sk bigint,
cs_ship_date_sk bigint,
cs_bill_customer_sk bigint,
cs_bill_cdemo_sk bigint,
cs_bill_hdemo_sk bigint,
cs_bill_addr_sk bigint,
cs_ship_customer_sk bigint,
cs_ship_cdemo_sk bigint,
cs_ship_hdemo_sk bigint,
cs_ship_addr_sk bigint,
cs_call_center_sk bigint,
cs_catalog_page_sk bigint,
cs_ship_mode_sk bigint,
cs_warehouse_sk bigint,
cs_promo_sk bigint,
cs_quantity int,
cs_wholesale_cost double,
cs_list_price double,
cs_sales_price double,
cs_ext_discount_amt double,
cs_ext_sales_price double,
cs_ext_wholesale_cost double,
cs_ext_list_price double,
cs_ext_tax double,
cs_coupon_amt double,
cs_ext_ship_cost double,
cs_net_paid double,
cs_net_paid_inc_tax double,
cs_net_paid_inc_ship double,
cs_net_paid_inc_ship_tax double,
cs_net_profit double
, PRIMARY KEY(cs_item_sk,cs_order_number)
)
PARTITION BY HASH (cs_item_sk) PARTITIONS 64
STORED AS KUDU;



create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists customer_address;

create table customer_address
(
ca_address_sk bigint,
ca_address_id string,
ca_street_number string,
ca_street_name string,
ca_street_type string,
ca_suite_number string,
ca_city string,
ca_county string,
ca_state string,
ca_zip string,
ca_country string,
ca_gmt_offset double,
ca_location_type string
, PRIMARY KEY(ca_address_sk)
)
PARTITION BY HASH (ca_address_sk) PARTITIONS 6
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists customer_demographics;

create table customer_demographics
(
cd_demo_sk bigint,
cd_gender string,
cd_marital_status string,
cd_education_status string,
cd_purchase_estimate int,
cd_credit_rating string,
cd_dep_count int,
cd_dep_employed_count int,
cd_dep_college_count int
, PRIMARY KEY(cd_demo_sk)
)
PARTITION BY HASH (cd_demo_sk) PARTITIONS 2
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists customer;

create table customer
(
c_customer_sk bigint,
c_customer_id string,
c_current_cdemo_sk bigint,
c_current_hdemo_sk bigint,
c_current_addr_sk bigint,
c_first_shipto_date_sk bigint,
c_first_sales_date_sk bigint,
c_salutation string,
c_first_name string,
c_last_name string,
c_preferred_cust_flag string,
c_birth_day int,
c_birth_month int,
c_birth_year int,
c_birth_country string,
c_login string,
c_email_address string,
c_last_review_date string
, PRIMARY KEY(c_customer_sk)
)
PARTITION BY HASH (c_customer_sk) PARTITIONS 8
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists date_dim;

create table date_dim
(
d_date_sk bigint,
d_date_id string,
d_date string,
d_month_seq int,
d_week_seq int,
d_quarter_seq int,
d_year int,
d_dow int,
d_moy int,
d_dom int,
d_qoy int,
d_fy_year int,
d_fy_quarter_seq int,
d_fy_week_seq int,
d_day_name string,
d_quarter_name string,
d_holiday string,
d_weekend string,
d_following_holiday string,
d_first_dom int,
d_last_dom int,
d_same_day_ly int,
d_same_day_lq int,
d_current_day string,
d_current_week string,
d_current_month string,
d_current_quarter string,
d_current_year string
, PRIMARY KEY(d_date_sk)
)
PARTITION BY HASH (d_date_sk) PARTITIONS 2
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists household_demographics;

create table household_demographics
(
hd_demo_sk bigint,
hd_income_band_sk bigint,
hd_buy_potential string,
hd_dep_count int,
hd_vehicle_count int
, PRIMARY KEY(hd_demo_sk)
)
PARTITION BY HASH (hd_demo_sk) PARTITIONS 2
STORED AS KUDU;



create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists income_band;

create table income_band(
ib_income_band_sk bigint
, ib_lower_bound int
, ib_upper_bound int
, PRIMARY KEY(ib_income_band_sk)
)
PARTITION BY HASH (ib_income_band_sk) PARTITIONS 2
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists inventory;

create table inventory
(
inv_date_sk bigint,
inv_item_sk bigint,
inv_warehouse_sk bigint,
inv_quantity_on_hand int
, PRIMARY KEY(inv_date_sk)
)
PARTITION BY HASH (inv_date_sk) PARTITIONS 12
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists item;

create table item
(
i_item_sk bigint,
i_item_id string,
i_rec_start_date string,
i_rec_end_date string,
i_item_desc string,
i_current_price double,
i_wholesale_cost double,
i_brand_id int,
i_brand string,
i_class_id int,
i_class string,
i_category_id int,
i_category string,
i_manufact_id int,
i_manufact string,
i_size string,
i_formulation string,
i_color string,
i_units string,
i_container string,
i_manager_id int,
i_product_name string
, PRIMARY KEY(i_item_sk)
)
PARTITION BY HASH (i_item_sk) PARTITIONS 4
STORED AS KUDU;



create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists promotion;

create table promotion
(
p_promo_sk bigint,
p_promo_id string,
p_start_date_sk bigint,
p_end_date_sk bigint,
p_item_sk bigint,
p_cost double,
p_response_target int,
p_promo_name string,
p_channel_dmail string,
p_channel_email string,
p_channel_catalog string,
p_channel_tv string,
p_channel_radio string,
p_channel_press string,
p_channel_event string,
p_channel_demo string,
p_channel_details string,
p_purpose string,
p_discount_active string
, PRIMARY KEY(p_promo_sk)
)
PARTITION BY HASH (p_promo_sk) PARTITIONS 2
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists reason;

create table reason(
r_reason_sk bigint
, r_reason_id string
, r_reason_desc string
, PRIMARY KEY(r_reason_sk)
)
PARTITION BY HASH (r_reason_sk) PARTITIONS 2
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists ship_mode;

create table ship_mode(
sm_ship_mode_sk bigint
, sm_ship_mode_id string
, sm_type string
, sm_code string
, sm_carrier string
, sm_contract string
, PRIMARY KEY(sm_ship_mode_sk)
)
PARTITION BY HASH (sm_ship_mode_sk) PARTITIONS 2
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists store_returns;

create table store_returns
(
sr_item_sk bigint,
sr_returned_date_sk bigint,
sr_return_time_sk bigint,
sr_customer_sk bigint,
sr_cdemo_sk bigint,
sr_hdemo_sk bigint,
sr_addr_sk bigint,
sr_store_sk bigint,
sr_reason_sk bigint,
sr_ticket_number bigint,
sr_return_quantity int,
sr_return_amt double,
sr_return_tax double,
sr_return_amt_inc_tax double,
sr_fee double,
sr_return_ship_cost double,
sr_refunded_cash double,
sr_reversed_charge double,
sr_store_credit double,
sr_net_loss double,
PRIMARY KEY(sr_item_sk)
)
PARTITION BY HASH PARTITIONS 32
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists store_sales;

create table store_sales
(
ss_item_sk bigint,
ss_sold_date_sk bigint,
ss_sold_time_sk bigint,
ss_customer_sk bigint,
ss_cdemo_sk bigint,
ss_hdemo_sk bigint,
ss_addr_sk bigint,
ss_store_sk bigint,
ss_promo_sk bigint,
ss_ticket_number bigint,
ss_quantity int,
ss_wholesale_cost double,
ss_list_price double,
ss_sales_price double,
ss_ext_discount_amt double,
ss_ext_sales_price double,
ss_ext_wholesale_cost double,
ss_ext_list_price double,
ss_ext_tax double,
ss_coupon_amt double,
ss_net_paid double,
ss_net_paid_inc_tax double,
ss_net_profit double
, PRIMARY KEY(ss_item_sk)
)
PARTITION BY HASH (ss_item_sk) PARTITIONS 96
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists store;

create table store
(
s_store_sk bigint,
s_store_id string,
s_rec_start_date string,
s_rec_end_date string,
s_closed_date_sk bigint,
s_store_name string,
s_number_employees int,
s_floor_space int,
s_hours string,
s_manager string,
s_market_id int,
s_geography_class string,
s_market_desc string,
s_market_manager string,
s_division_id int,
s_division_name string,
s_company_id int,
s_company_name string,
s_street_number string,
s_street_name string,
s_street_type string,
s_suite_number string,
s_city string,
s_county string,
s_state string,
s_zip string,
s_country string,
s_gmt_offset double,
s_tax_precentage double
, PRIMARY KEY(s_store_sk)
)
PARTITION BY HASH (s_store_sk) PARTITIONS 2
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists time_dim;

create table time_dim
(
t_time_sk bigint,
t_time_id string,
t_time int,
t_hour int,
t_minute int,
t_second int,
t_am_pm string,
t_shift string,
t_sub_shift string,
t_meal_time string
, PRIMARY KEY(t_time_sk)
)
PARTITION BY HASH (t_time_sk) PARTITIONS 2
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists warehouse;

create table warehouse(
w_warehouse_sk bigint
, w_warehouse_id string
, w_warehouse_name string
, w_warehouse_sq_ft int
, w_street_number string
, w_street_name string
, w_street_type string
, w_suite_number string
, w_city string
, w_county string
, w_state string
, w_zip string
, w_country string
, w_gmt_offset double
, PRIMARY KEY(w_warehouse_sk)
)
PARTITION BY HASH (w_warehouse_sk) PARTITIONS 2
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists web_page;

create table web_page(
wp_web_page_sk bigint
, wp_web_page_id string
, wp_rec_start_date string
, wp_rec_end_date string
, wp_creation_date_sk bigint
, wp_access_date_sk bigint
, wp_autogen_flag string
, wp_customer_sk bigint
, wp_url string
, wp_type string
, wp_char_count int
, wp_link_count int
, wp_image_count int
, wp_max_ad_count int
, PRIMARY KEY(wp_web_page_sk)
)
PARTITION BY HASH (wp_web_page_sk) PARTITIONS 2
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists web_returns;

create table web_returns
(
wr_item_sk bigint,
wr_returned_date_sk bigint,
wr_returned_time_sk bigint,
wr_refunded_customer_sk bigint,
wr_refunded_cdemo_sk bigint,
wr_refunded_hdemo_sk bigint,
wr_refunded_addr_sk bigint,
wr_returning_customer_sk bigint,
wr_returning_cdemo_sk bigint,
wr_returning_hdemo_sk bigint,
wr_returning_addr_sk bigint,
wr_web_page_sk bigint,
wr_reason_sk bigint,
wr_order_number bigint,
wr_return_quantity int,
wr_return_amt double,
wr_return_tax double,
wr_return_amt_inc_tax double,
wr_fee double,
wr_return_ship_cost double,
wr_refunded_cash double,
wr_reversed_charge double,
wr_account_credit double,
wr_net_loss double
, PRIMARY KEY(wr_item_sk)
)
PARTITION BY HASH (wr_item_sk) PARTITIONS 8
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists web_sales;

create table web_sales
(
ws_item_sk bigint,
ws_sold_date_sk bigint,
ws_sold_time_sk bigint,
ws_ship_date_sk bigint,
ws_bill_customer_sk bigint,
ws_bill_cdemo_sk bigint,
ws_bill_hdemo_sk bigint,
ws_bill_addr_sk bigint,
ws_ship_customer_sk bigint,
ws_ship_cdemo_sk bigint,
ws_ship_hdemo_sk bigint,
ws_ship_addr_sk bigint,
ws_web_page_sk bigint,
ws_web_site_sk bigint,
ws_ship_mode_sk bigint,
ws_warehouse_sk bigint,
ws_promo_sk bigint,
ws_order_number bigint,
ws_quantity int,
ws_wholesale_cost double,
ws_list_price double,
ws_sales_price double,
ws_ext_discount_amt double,
ws_ext_sales_price double,
ws_ext_wholesale_cost double,
ws_ext_list_price double,
ws_ext_tax double,
ws_coupon_amt double,
ws_ext_ship_cost double,
ws_net_paid double,
ws_net_paid_inc_tax double,
ws_net_paid_inc_ship double,
ws_net_paid_inc_ship_tax double,
ws_net_profit double
, PRIMARY KEY(ws_item_sk)
)
PARTITION BY HASH (ws_item_sk) PARTITIONS 64
STORED AS KUDU;


create database if not exists kudu_spark_tpcds_1000;
use kudu_spark_tpcds_1000;

drop table if exists web_site;

create table web_site
(
web_site_sk bigint,
web_site_id string,
web_rec_start_date string,
web_rec_end_date string,
web_name string,
web_open_date_sk bigint,
web_close_date_sk bigint,
web_class string,
web_manager string,
web_mkt_id int,
web_mkt_class string,
web_mkt_desc string,
web_market_manager string,
web_company_id int,
web_company_name string,
web_street_number string,
web_street_name string,
web_street_type string,
web_suite_number string,
web_city string,
web_county string,
web_state string,
web_zip string,
web_country string,
web_gmt_offset double,
web_tax_percentage double
, PRIMARY KEY(web_site_sk)
)
PARTITION BY HASH (web_site_sk) PARTITIONS 2
STORED AS KUDU;

Command:

impala-shell -f impala-shell

Tip: you may hit an error like this:

ERROR: ImpalaRuntimeException: Error creating Kudu table 'impala::kudu_spark_tpcds_2.catalog_sales'
CAUSED BY: NonRecoverableException: The requested number of tablets is over the maximum permitted at creation time (60). Additional tablets may be added by adding range partitions to the table post-creation.

Cause:

Kudu's default configuration caps the number of tablets a table may have at creation time, so the limit needs to be raised.

In the corresponding configuration field (shown in a screenshot in the original post), set:

--max_create_tablets_per_ts=30

After the query logs have been generated, collect the last line of each:

tail q* -n 1 >> kudu_time_2.log

For some reason, Impala + Kudu seems to have memory-management issues: even with plenty of physical memory available, it keeps dipping into swap.

Test performance dropped, so I cleared swap and tried again:

# clear swap (disable, then re-enable):
swapoff -a
swapon -a

Parquet table generation script, alltables_parquet.sql:

drop database if exists ${VAR:DB} cascade;
create database ${VAR:DB};
use ${VAR:DB};
set parquet_file_size=512M;
set COMPRESSION_CODEC=snappy;
drop table if exists call_center;
create table ${VAR:DB}.call_center
stored as parquet
as select * from ${VAR:HIVE_DB}.call_center;
drop table if exists catalog_page;
create table ${VAR:DB}.catalog_page
stored as parquet
as select * from ${VAR:HIVE_DB}.catalog_page;
drop table if exists catalog_returns;
create table ${VAR:DB}.catalog_returns
stored as parquet
as select * from ${VAR:HIVE_DB}.catalog_returns;
drop table if exists catalog_sales;
create table ${VAR:DB}.catalog_sales
stored as parquet
as select * from ${VAR:HIVE_DB}.catalog_sales;
drop table if exists customer_address;
create table ${VAR:DB}.customer_address
stored as parquet
as select * from ${VAR:HIVE_DB}.customer_address;
drop table if exists customer_demographics;
create table ${VAR:DB}.customer_demographics
stored as parquet
as select * from ${VAR:HIVE_DB}.customer_demographics;
drop table if exists customer;
create table ${VAR:DB}.customer
stored as parquet
as select * from ${VAR:HIVE_DB}.customer;
drop table if exists date_dim;
create table ${VAR:DB}.date_dim
stored as parquet
as select * from ${VAR:HIVE_DB}.date_dim;
drop table if exists household_demographics;
create table ${VAR:DB}.household_demographics
stored as parquet
as select * from ${VAR:HIVE_DB}.household_demographics;
drop table if exists income_band;
create table ${VAR:DB}.income_band
stored as parquet
as select * from ${VAR:HIVE_DB}.income_band;
drop table if exists inventory;
create table ${VAR:DB}.inventory
stored as parquet
as select * from ${VAR:HIVE_DB}.inventory;
drop table if exists item;
create table ${VAR:DB}.item
stored as parquet
as select * from ${VAR:HIVE_DB}.item;
drop table if exists promotion;
create table ${VAR:DB}.promotion
stored as parquet
as select * from ${VAR:HIVE_DB}.promotion;
drop table if exists reason;
create table ${VAR:DB}.reason
stored as parquet
as select * from ${VAR:HIVE_DB}.reason;
drop table if exists ship_mode;
create table ${VAR:DB}.ship_mode
stored as parquet
as select * from ${VAR:HIVE_DB}.ship_mode;
drop table if exists store_returns;
create table ${VAR:DB}.store_returns
stored as parquet
as select * from ${VAR:HIVE_DB}.store_returns;
drop table if exists store_sales;
create table ${VAR:DB}.store_sales
stored as parquet
as select * from ${VAR:HIVE_DB}.store_sales;
drop table if exists store;
create table ${VAR:DB}.store
stored as parquet
as select * from ${VAR:HIVE_DB}.store;
drop table if exists time_dim;
create table ${VAR:DB}.time_dim
stored as parquet
as select * from ${VAR:HIVE_DB}.time_dim;
drop table if exists warehouse;
create table ${VAR:DB}.warehouse
stored as parquet
as select * from ${VAR:HIVE_DB}.warehouse;
drop table if exists web_page;
create table ${VAR:DB}.web_page
stored as parquet
as select * from ${VAR:HIVE_DB}.web_page;
drop table if exists web_returns;
create table ${VAR:DB}.web_returns
stored as parquet
as select * from ${VAR:HIVE_DB}.web_returns;
drop table if exists web_sales;
create table ${VAR:DB}.web_sales
stored as parquet
as select * from ${VAR:HIVE_DB}.web_sales;
drop table if exists web_site;
create table ${VAR:DB}.web_site
stored as parquet
as select * from ${VAR:HIVE_DB}.web_site;

Then run:

impala-shell -i datanode127 --var=DB=tpcds_parquet_500 --var=HIVE_DB=tpcds_text_500 -f alltables_parquet.sql

In theory you could generate the Kudu tables the same way, using the statements below.

drop database if exists ${VAR:DB} cascade;
create database ${VAR:DB};
use ${VAR:DB};

drop table if exists call_center;
create table ${VAR:DB}.call_center
PRIMARY KEY (cc_call_center_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.call_center;

drop table if exists catalog_page;
create table ${VAR:DB}.catalog_page
PRIMARY KEY (cp_catalog_page_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.catalog_page;

drop table if exists catalog_returns;
create table ${VAR:DB}.catalog_returns
PRIMARY KEY (cr_returned_date_sk,cr_returned_time_sk,cr_item_sk,cr_refunded_customer_sk)
PARTITION BY HASH(cr_returned_date_sk,cr_returned_time_sk,cr_item_sk,cr_refunded_customer_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.catalog_returns;

drop table if exists catalog_sales;
create table ${VAR:DB}.catalog_sales
PRIMARY KEY (cs_sold_date_sk,cs_sold_time_sk,cs_ship_date_sk,cs_bill_customer_sk)
PARTITION BY HASH(cs_sold_date_sk,cs_sold_time_sk,cs_ship_date_sk,cs_bill_customer_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.catalog_sales;

drop table if exists customer_address;
create table ${VAR:DB}.customer_address
PRIMARY KEY (ca_address_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.customer_address;

drop table if exists customer_demographics;
create table ${VAR:DB}.customer_demographics
PRIMARY KEY (cd_demo_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.customer_demographics;

drop table if exists customer;
create table ${VAR:DB}.customer
PRIMARY KEY (c_customer_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.customer;

drop table if exists date_dim;
create table ${VAR:DB}.date_dim
PRIMARY KEY (d_date_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.date_dim;

drop table if exists household_demographics;
create table ${VAR:DB}.household_demographics
PRIMARY KEY (hd_demo_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.household_demographics;

drop table if exists income_band;
create table ${VAR:DB}.income_band
PRIMARY KEY (ib_income_band_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.income_band;

drop table if exists inventory;
create table ${VAR:DB}.inventory
PRIMARY KEY (inv_date_sk,inv_item_sk,inv_warehouse_sk)
PARTITION BY HASH(inv_date_sk,inv_item_sk,inv_warehouse_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.inventory;

drop table if exists item;
create table ${VAR:DB}.item
PRIMARY KEY (i_item_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.item;

drop table if exists promotion;
create table ${VAR:DB}.promotion
PRIMARY KEY (p_promo_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.promotion;

drop table if exists reason;
create table ${VAR:DB}.reason
PRIMARY KEY (r_reason_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.reason;

drop table if exists ship_mode;
create table ${VAR:DB}.ship_mode
PRIMARY KEY (sm_ship_mode_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.ship_mode;

drop table if exists store_returns;
create table ${VAR:DB}.store_returns
PRIMARY KEY (sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk)
PARTITION BY HASH(sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.store_returns;

drop table if exists store_sales;
create table ${VAR:DB}.store_sales
PRIMARY KEY (ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk)
PARTITION BY HASH(ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.store_sales;

drop table if exists store;
create table ${VAR:DB}.store
PRIMARY KEY (s_store_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.store;

drop table if exists time_dim;
create table ${VAR:DB}.time_dim
PRIMARY KEY (t_time_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.time_dim;

drop table if exists warehouse;
create table ${VAR:DB}.warehouse
PRIMARY KEY (w_warehouse_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.warehouse;

drop table if exists web_page;
create table ${VAR:DB}.web_page
PRIMARY KEY (wp_web_page_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_page;

drop table if exists web_returns;
create table ${VAR:DB}.web_returns
PRIMARY KEY (wr_returned_date_sk,wr_returned_time_sk,wr_item_sk,wr_refunded_customer_sk)
PARTITION BY HASH(wr_returned_date_sk,wr_returned_time_sk,wr_item_sk,wr_refunded_customer_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_returns;

drop table if exists web_sales;
create table ${VAR:DB}.web_sales
PRIMARY KEY (ws_sold_date_sk,ws_sold_time_sk,ws_ship_date_sk,ws_item_sk,ws_bill_customer_sk)
PARTITION BY HASH(ws_sold_date_sk,ws_sold_time_sk,ws_ship_date_sk,ws_item_sk,ws_bill_customer_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_sales;

drop table if exists web_site;
create table ${VAR:DB}.web_site
PRIMARY KEY (web_site_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_site;

The problem is that Kudu tables require a primary key and the key columns must come first, but the tables generated by TPC-DS do not by default put the intended key columns first. Tables created this way therefore end up with a primary key made of many columns, so I stuck with the approach above.
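
If a smaller, hand-picked primary key is still wanted, one possible workaround (a sketch only, not what I ended up using) is to create the table from Spark instead of via Impala CTAS: reorder the columns in the DataFrame so the chosen key columns come first (Kudu requires key columns to lead the schema), then create and load the table with KuduContext. The table, database, key, and partition choices below are illustrative assumptions.

import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession
import org.apache.kudu.client.CreateTableOptions
import org.apache.kudu.spark.kudu.KuduContext

object CreateKuduWithChosenKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("create-kudu-with-key-sketch")
      .enableHiveSupport()
      .getOrCreate()
    // Placeholder Kudu master address.
    val kuduContext = new KuduContext("master126:7051", spark.sparkContext)

    // Source Hive table and the key we want (illustrative choice).
    val src  = spark.table("tpcds_text_1000.store_sales")
    val keys = Seq("ss_item_sk", "ss_ticket_number")

    // Put the key columns first, keep the remaining columns in their original order.
    val ordered = src.select((keys ++ src.columns.filterNot(keys.contains)).map(src.col): _*)

    val tableName = "spark_store_sales"
    if (!kuduContext.tableExists(tableName)) {
      kuduContext.createTable(
        tableName,
        ordered.schema,
        keys,
        new CreateTableOptions()
          .addHashPartitions(keys.asJava, 32)  // hash-partition on the key columns
          .setNumReplicas(3))
    }
    kuduContext.insertRows(ordered, tableName)
    spark.stop()
  }
}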


Kudu performance tuning