Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

We generate a single Hive split with the corresponding Druid query for timeseries and groupBy, from which we generate the records. Thus, the degree of parallelism is 1 in these cases. However, for simple select queries without limit (although they might still contain filters or projections), we partition the original query into x queries and generate one split for each of them, thus incrementing the degree of parallelism for these queries, which usually return a large number of results, to x. We explain more details later on. 

 

Consider that depending on the query, it might not be possible to push any computation to Druid. However, our contract is that the query should always be executed. Thus, in those cases, Hive will send a select query to Druid, which basically will read all the segments from Druid, generate records, and then execute the rest of Hive operations on those records. This is also the approach that will be followed if the cost optimizer is disabled (not recommended).

...