How to clean up storage

Directory tree structure in Kylin 4.0

Root-dir
- PROJECT_NAME
  - parquet
    - {CUBE_NAME}
      - segment_name
  - spark_log
    - executors' log of cubing job
  - dict/global_dict
    - {CUBE_NAME}
      - {COLUMN_NAME}
  - table_snapshot
    - table_name
      - job_id
  - job_tmp
- cube_statistics
  - {CUBE_NAME}
    - {JOB_ID}
      - seq file of cuboids' HLL
- _sparder_log
  - {DATE}
    - executors' log of query job
- resources-jdbc

Kylin generates temporary files in HDFS while building cubes. In addition, when cubes are purged, dropped, or merged, some parquet files may be left behind in HDFS that will never be queried again. Kylin performs some automated garbage collection, but it may not cover all cases, so you can run an offline storage cleanup periodically.

What will be deleted:

Temporary job files

hdfs:///kylin/${metadata_url}/${project}/job_tmp

Cuboid files of unused segments

hdfs:///kylin/${metadata_url}/${project}/${cube_name}/${non_used_segment}
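
Before running a cleanup, it can help to see how much space a project's temporary job directory occupies. The following is a minimal sketch, assuming the `hdfs` client is on the PATH; the function names and the example values `kylin_metadata` and `learn_kylin` are placeholders for illustration, not values taken from your deployment.

```shell
#!/usr/bin/env bash
# Sketch: build the job_tmp path for a project and report its size.
# The function names below are hypothetical helpers, not part of Kylin.

# Print the HDFS path of a project's temporary job directory.
job_tmp_path() {
  local metadata_url="$1" project="$2"
  printf 'hdfs:///kylin/%s/%s/job_tmp\n' "$metadata_url" "$project"
}

# Report the space used by that directory if the hdfs client is available;
# otherwise only print the path that would be inspected.
report_job_tmp_usage() {
  local path
  path="$(job_tmp_path "$1" "$2")"
  if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -du -s -h "$path"
  else
    echo "hdfs client not found; would inspect: $path"
  fi
}

# Example (placeholder metadata_url and project name):
#   report_job_tmp_usage kylin_metadata learn_kylin
```

This only reads from HDFS; it does not delete anything.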

Usage:

1. Check which resources can be cleaned up; this will not remove anything:

export KYLIN_HOME=/path/to/kylin_home
${KYLIN_HOME}/bin/kylin.sh org.apache.kylin.tool.StorageCleanupJob --delete false

2. Pick one or two of the listed resources and verify that they are no longer referenced; then add the "--delete true" option to start the cleanup:

${KYLIN_HOME}/bin/kylin.sh org.apache.kylin.tool.StorageCleanupJob --delete true
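
Since the dry run is meant to be reviewed before anything is deleted, one option is to save its output to a file first. The sketch below wraps the two commands above. The function names, the `LOG_DIR` variable, and the log file naming are illustrative assumptions, not part of Kylin.

```shell
#!/usr/bin/env bash
# Sketch: keep a dated record of the dry run before deleting anything.
# LOG_DIR and the log file name are hypothetical choices for this example.

# Run StorageCleanupJob with the given --delete value ("false" or "true").
storage_cleanup() {
  "${KYLIN_HOME}/bin/kylin.sh" org.apache.kylin.tool.StorageCleanupJob --delete "$1"
}

# Dry run only: save everything the job prints, then print the log file path.
cleanup_dry_run() {
  local log_dir="${LOG_DIR:-/tmp}"
  local log_file
  log_file="${log_dir}/storage_cleanup_dry_run_$(date +%Y%m%d_%H%M%S).log"
  storage_cleanup false > "$log_file" 2>&1 || return 1
  echo "$log_file"
}

# Typical flow:
#   log="$(cleanup_dry_run)"   # review the file at "$log"
#   storage_cleanup true       # run only after the review
```

Keeping the dry-run log also gives you a record of what the job reported as removable, in case a question comes up after the real cleanup.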
