
Apache Kylin : Analytical Data Warehouse for Big Data


Reference issue: KYLIN-5069 (ASF JIRA: https://issues.apache.org/jira/browse/KYLIN-5069)



01 Background

At present, Kylin 4.0 still needs to obtain Hive metadata through HiveClient in order to load Hive tables. When loading a Hive table, the hive dependencies under $HIVE_HOME/lib must be loaded into the Kylin environment. User feedback shows that, because the Hive versions in use differ from user to user, the hive dependencies differ as well, so class conflicts frequently occur when loading Hive tables.

In addition, Kylin loads every class on the hadoop classpath into its environment and then, at runtime, filters all of the dependencies in the Kylin environment through SparkClassLoader; the filtered classpath is used as the environment for starting Sparder. This process can be simplified by loading only the required classes into the Kylin environment and removing the SparkClassLoader class-loading step.

To solve these problems, we plan to manage the dependency-loading process uniformly through Spark:

  1. Remove the hive dependency from Kylin 4.0 and use SparkSession to obtain Hive metadata (see the sketch after this list).
  2. Sort out the hadoop classpath, load only the Hadoop-related jars that Kylin 4.0 really needs into the Kylin 4.0 environment, and remove SparkClassLoader.
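A minimal sketch of item 1, assuming a Hive-enabled Spark build. The database and table names (default, kylin_sales) are illustrative placeholders, and this is not Kylin's actual implementation, only a demonstration of reading Hive metadata through SparkSession instead of HiveClient:

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalog.Column;
import org.apache.spark.sql.catalog.Table;

public class HiveMetaViaSpark {
    public static void main(String[] args) throws Exception {
        // A Hive-enabled SparkSession replaces HiveClient, so no jars from
        // $HIVE_HOME/lib have to be loaded into the Kylin environment.
        SparkSession spark = SparkSession.builder()
                .appName("load-hive-table-demo")
                .enableHiveSupport()
                .getOrCreate();

        // List the tables of a database through the Spark catalog,
        // which is backed by the Hive metastore.
        for (Table t : spark.catalog().listTables("default").collectAsList()) {
            System.out.println(t.database() + "." + t.name());
        }

        // Column-level metadata for a single table.
        for (Column c : spark.catalog().listColumns("default", "kylin_sales").collectAsList()) {
            System.out.println(c.name() + " : " + c.dataType());
        }

        spark.stop();
    }
}
```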

02 Dev Design

The work to be done is as follows:

1. Remove the loading of the hive dependency from the Kylin startup script kylin.sh;


2. To avoid kylin.sh adding every jar under the hadoop lib directory to the classpath, sort and filter the jars under hadoop lib and copy only the required jars to the ${SPARK_HOME}/jars directory (only when the ${SPARK_HOME} path is ${KYLIN_HOME}/spark);


3. Modify the classpath loaded by Kylin in kylin.sh. The previous classpath included the Kylin server classpath, ${KYLIN_HOME}/conf, ${KYLIN_HOME}/lib/*, ${KYLIN_HOME}/ext/*, the hadoop classpath and the hive classpath. The modified classpath only includes the Kylin server classpath, ${KYLIN_HOME}/conf, ${KYLIN_HOME}/lib/*, ${KYLIN_HOME}/ext/*, ${KYLIN_HOME}/hadoop_conf/* and ${SPARK_HOME}/jars/*. The previous hadoop classpath and hive classpath are replaced by ${SPARK_HOME}/jars/*;


4. Implement the SparkHiveClient class by inheriting the IHiveClient interface, and use SparkSession to implement its methods (see the sketch below);


5. Replace every place in Kylin 4.0 that used the original CLIHiveClient/BeelineHiveClient classes with the SparkHiveClient class;


6. Clean up the related unused code.
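A hedged sketch of items 4 and 5. The real IHiveClient interface lives in the Kylin code base and has more methods (table metadata, row counts, HQL execution, and so on); the two methods below are an assumed, simplified subset, shown only to illustrate how SparkSession's catalog can back the implementation in place of CLIHiveClient/BeelineHiveClient:

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalog.Database;
import org.apache.spark.sql.catalog.Table;

// Simplified stand-in for Kylin's IHiveClient; the actual interface
// in the Kylin code base declares more methods than shown here.
interface IHiveClient {
    List<String> getHiveDbNames() throws Exception;
    List<String> getHiveTableNames(String database) throws Exception;
}

public class SparkHiveClient implements IHiveClient {
    private final SparkSession spark;

    public SparkHiveClient() {
        // A Hive-enabled SparkSession replaces CLIHiveClient/BeelineHiveClient,
        // so neither hive jars nor a HiveServer2 connection are required.
        this.spark = SparkSession.builder().enableHiveSupport().getOrCreate();
    }

    @Override
    public List<String> getHiveDbNames() throws Exception {
        return spark.catalog().listDatabases().collectAsList().stream()
                .map(Database::name)
                .collect(Collectors.toList());
    }

    @Override
    public List<String> getHiveTableNames(String database) throws Exception {
        return spark.catalog().listTables(database).collectAsList().stream()
                .map(Table::name)
                .collect(Collectors.toList());
    }
}
```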

03 Configuration Change

kylin.source.hive.client: the original default value is cli, and it can be configured as cli or beeline. After this change, the default value is spark_catalog. Users who previously used cli or beeline now use spark_catalog to access Hive metadata.
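For illustration, the relevant kylin.properties entry would look like the following (an assumed snippet, not a complete configuration):

```
# kylin.properties (illustrative snippet)
# Before this change the default was: kylin.source.hive.client=cli  (or beeline)
# After this change the default is:
kylin.source.hive.client=spark_catalog
```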

04 Test

After the code was completed, compatibility tests were carried out in the various environments supported by Kylin 4, mainly covering cube build, query and loading Hive tables. The changes passed the tests in the following environments:

| Hadoop Distribution | Spark | Hadoop | Hive | Cluster Manager | Distributed Filesystem | Verified? | Comment |
|---|---|---|---|---|---|---|---|
| CDH 5.7 | 2.4.7/3.1.1 | 2.6.0-cdh5.7.6 | 1.1.0-cdh5.7.6 | YARN | HDFS | verified | No extra steps required |
| HDP 2.4 | 2.4.7/3.1.1 | 2.7.1.2.4.0.0-16 | 1.2.1000.2.4.0.0-16 | YARN | HDFS | verified | No extra steps required |
| AWS EMR 5.33.0 | 2.4.7/3.1.1 | 2.10.1-amzn-1 | 2.3.7-amzn-4 | YARN | HDFS/S3 | verified | No extra steps required |
| CDH 6.2.0 | 2.4.7/3.1.1 | 3.0.0-cdh6.2.0 | 2.1.1-cdh6.2.0 | YARN | HDFS | verified | You need to prepare the jars and put them in the specified directory: see "Deploy Kylin 4 on CDH 6" |
| AWS EMR 6.3.0 | 3.1.1 | 3.2.1-amzn-3 | 3.1.2-amzn-4 | YARN | HDFS/S3 | verified | No extra steps required |
| Apache | 3.1.1 | 3.2.0 | 2.3.9 | YARN, Standalone | S3 | verified | http://kylin.apache.org/docs40/install/deploy_without_hadoop.html |

...