
RFC-27: Data skipping index to improve query performance

Proposers

...

Approvers

  • TBD

Status

Current State

  Under Discussion   (tick)
  In Progress
  Abandoned
  Completed
  Inactive


Discussion thread: here


JIRA: here

...

Example CombinedMetadataRecords generated:


partition  filePath       c1       c1_min  c1_max  c2           c2_min  c2_max
p1         f1-c1.parquet  city_id  20      30      commit_time  “a”     “g”
p1         f2-c1.parquet  city_id  25      100     commit_time  “b”     “g”
p2         f3-c1.parquet  city_id  40      60      commit_time  “i”     “w”
p3         f4-c1.parquet  city_id  300     400     commit_time  “x”     “z”
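
To illustrate how records like the ones above could drive data skipping, here is a minimal Python sketch. It is not the RFC's implementation: the `ColumnRangeRecord` class and the `files_to_scan` helper are hypothetical names introduced for this example, and the sketch assumes a simple closed-range predicate (`lo <= column <= hi`) on a single column. A file is kept only if its [min, max] range for that column overlaps the queried range; files without statistics for the column are kept, since they cannot be skipped safely.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class ColumnRangeRecord:
    """Hypothetical stand-in for one row of the example table."""
    partition: str
    file_path: str
    # column name -> (min, max) values observed in the file
    ranges: Dict[str, Tuple[Any, Any]]


# The four example rows from the table above.
RECORDS: List[ColumnRangeRecord] = [
    ColumnRangeRecord("p1", "f1-c1.parquet",
                      {"city_id": (20, 30), "commit_time": ("a", "g")}),
    ColumnRangeRecord("p1", "f2-c1.parquet",
                      {"city_id": (25, 100), "commit_time": ("b", "g")}),
    ColumnRangeRecord("p2", "f3-c1.parquet",
                      {"city_id": (40, 60), "commit_time": ("i", "w")}),
    ColumnRangeRecord("p3", "f4-c1.parquet",
                      {"city_id": (300, 400), "commit_time": ("x", "z")}),
]


def files_to_scan(records: List[ColumnRangeRecord],
                  column: str, lo: Any, hi: Any) -> List[str]:
    """Return paths of files whose range for `column` overlaps [lo, hi]."""
    selected = []
    for rec in records:
        if column not in rec.ranges:
            # No statistics for this column: the file cannot be skipped.
            selected.append(rec.file_path)
            continue
        col_min, col_max = rec.ranges[column]
        # Two closed ranges overlap iff each one starts before the other ends.
        if col_max >= lo and col_min <= hi:
            selected.append(rec.file_path)
    return selected


if __name__ == "__main__":
    # city_id BETWEEN 35 AND 39: only f2 (25..100) can contain matches.
    print(files_to_scan(RECORDS, "city_id", 35, 39))
    # city_id = 350: only f4 (300..400) can contain matches.
    print(files_to_scan(RECORDS, "city_id", 350, 350))
```

With the example rows, a query on `city_id` between 35 and 39 needs to read only `f2-c1.parquet`; the other three files are skipped without being opened.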

A few notes (TBD: organize this better):

...

  • # of columns with metadata: 2078

Results:


Format             Time to scan full file  Time to query 10 rows  Time to query a large range (5K rows)  Storage space
HFile              15 seconds              51 ms                  17 seconds                             100MB
Parquet            6.1 seconds             1.9 seconds            2.1 seconds                            43MB
Parquet-spark sql  7 seconds               440 ms                 1.5 seconds                            43MB

Index integrations with query engines

...