Apache Kylin : Analytical Data Warehouse for Big Data
Background
Kylin generates temporary files in HDFS during cube building. In addition, when cubes are purged, dropped, or merged, some Parquet files may be left in HDFS even though they will never be queried again. Kylin performs some automated garbage collection, but it might not cover all cases, so you can run an offline storage cleanup periodically.
Directory tree structure under the Kylin 4.0 working dir
Working Dir (ROOT)
- {PROJECT_NAME} [managed by tool]
  - parquet
    - {CUBE_NAME} [managed by tool]
      - {SEGMENT_NAME} [managed by tool]
        - {CUBOID_ID}
          - parquet files
  - spark_log
    - driver
      - {JOB_ID}
        - driver logs of the cubing job
    - executor
      - {JOB_ID}
        - executor logs of the cubing job
  - dict/global_dict [managed by tool]
    - {CUBE_NAME}
      - {COLUMN_NAME}
        - dict files
  - table_snapshot [managed by tool]
    - {SCHEMA_NAME.TABLE_NAME}
      - {JOB_ID}
        - parquet files
  - job_tmp [managed by tool]
    - {JOB_ID}
      - TBD
- cube_statistics
  - {CUBE_NAME}
    - {JOB_ID}
      - seq file of the cuboids' HLL
- _sparder_log
  - {DATE}
    - executor logs of query jobs
- resources-jdbc
  - TBD
Summary
In the directory tree above, a directory marked "managed by tool" is one in which StorageCleanupJob will check for and delete unused files.
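Before running a cleanup, it can help to see how much space the tool-managed directories of a project occupy. The following is a minimal sketch, assuming that parquet, dict/global_dict, table_snapshot and job_tmp sit directly under the project directory as in the tree above. The function name `managed_dirs`, the working dir path and the project name are hypothetical placeholders, not part of Kylin.

```shell
#!/usr/bin/env bash
# Print the directories marked "managed by tool" for one project.
# Arguments: <working dir root> <project name>. Both are placeholders here.
managed_dirs() {
  local root="${1%/}" project="$2"   # strip one trailing slash from the root
  local sub
  for sub in parquet dict/global_dict table_snapshot job_tmp; do
    printf '%s/%s/%s\n' "$root" "$project" "$sub"
  done
}

# Example (needs an HDFS client on the PATH, so it is left commented out):
#   managed_dirs /kylin/kylin_metadata my_project | xargs -n 1 hdfs dfs -du -s -h
```

The function only prints paths, so it is safe to run anywhere; piping the result into `hdfs dfs -du -s -h` reports the size of each directory.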
How to use
Option Table
Option | Data Type | Default Value | Comment |
---|---|---|---|
delete | Boolean | false | Whether to actually delete files. The default, false, performs a dry run. |
cleanupTableSnapshot | Boolean | true | Whether to delete unreferenced snapshot files. |
cleanupGlobalDict | Boolean | true | Whether to delete unreferenced global dict files. |
cleanupJobTmp | Boolean | false | Whether to delete job tmp files. |
cleanupThreshold | Integer | 168 | Minimum age in hours. Only unreferenced storage that has not been modified for at least this many hours is deleted; more recent files are protected. |
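The effect of cleanupThreshold can be illustrated with a small age check. This is a sketch of the semantics described in the table, not Kylin's implementation: the function name `is_old_enough` is hypothetical, and treating a file exactly at the threshold as eligible is an assumption.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) if a file last modified at <mtime> has been untouched for at
# least <threshold> hours, i.e. it is old enough to be considered for cleanup.
# Arguments: <mtime epoch seconds> <now epoch seconds> [threshold hours, default 168]
is_old_enough() {
  local mtime="$1" now="$2" threshold_hours="${3:-168}"
  local cutoff=$(( now - threshold_hours * 3600 ))
  # Assumption: a file modified exactly at the cutoff counts as eligible.
  [ "$mtime" -le "$cutoff" ]
}

now=1700000000   # fixed timestamp so the example is reproducible
if is_old_enough $(( now - 200 * 3600 )) "$now"; then echo "200h old: eligible"; fi
if ! is_old_enough $(( now - 24 * 3600 )) "$now"; then echo "24h old: protected"; fi
```

With the default of 168 hours, a file untouched for 200 hours is eligible while one modified 24 hours ago is protected; lowering the threshold to 24 makes the second file eligible as well.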
List help information
[root@cdh-master apache-kylin-4.0.0-SNAPSHOT-bin]# bin/kylin.sh org.apache.kylin.tool.StorageCleanupJob -help
Retrieving hive dependency...
Retrieving hadoop conf dir...
Retrieving Spark dependency...
...
Running org.apache.kylin.rest.job.StorageCleanupJob -help
usage: org.apache.kylin.rest.job.StorageCleanupJob
 -cleanupGlobalDict <cleanupGlobalDict>         Boolean, whether or not to delete unreferenced global dict files. Default value is true .
 -cleanupJobTmp <cleanupJobTmp>                 Boolean, whether or not to delete job tmp files. Default value is false .
 -cleanupTableSnapshot <cleanupTableSnapshot>   Boolean, whether or not to delete unreferenced snapshot files. Default value is true .
 -cleanupThreshold <cleanupThreshold>           Integer, used to specific delete unreferenced storage that have not been modified before how many hours (recent files are protected). Default value is 168 hours.
 -delete <delete>                               Boolean, whether or not to do real delete operation. Default value is false, means a dry run.
List the directories to be deleted (dry run)
bin/kylin.sh org.apache.kylin.tool.StorageCleanupJob
Delete them after confirming the list
bin/kylin.sh org.apache.kylin.tool.StorageCleanupJob --delete true
Only delete stale job_tmp and stale cuboid files
bin/kylin.sh org.apache.kylin.tool.StorageCleanupJob --delete true \
    --cleanupJobTmp true -cleanupTableSnapshot false \
    -cleanupGlobalDict false --cleanupThreshold 24
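Because a mistyped Boolean value is easy to miss on a long command line, a small wrapper can validate the arguments before anything runs. The following is a sketch only: `cleanup_cmd` is a hypothetical helper that prints the command rather than executing it, and it uses only the `delete` and `cleanupThreshold` options listed in the table above.

```shell
#!/usr/bin/env bash
# Build a StorageCleanupJob command line and print it without running it.
# Arguments: <delete true|false> [threshold hours, default 168]
cleanup_cmd() {
  local delete="$1" threshold="${2:-168}"
  case "$delete" in
    true|false) ;;
    *) echo "delete must be 'true' or 'false', got '$delete'" >&2; return 1 ;;
  esac
  printf 'bin/kylin.sh org.apache.kylin.tool.StorageCleanupJob --delete %s --cleanupThreshold %s\n' \
    "$delete" "$threshold"
}

cleanup_cmd false      # dry run with the default threshold
cleanup_cmd true 24    # real delete with a 24-hour threshold
```

A typical routine would be to run the printed dry-run command first, review the listed directories, and only then run the command built with `true`.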