Similarity Search: Scenarios, SQL and Result Format

I hope to let IoTDB support similar research by adding the indexing mechanism. Similarity search is one of the most important directions in the field of time series, and some users have shown their requirements for similarity search. Different from the existing query conditions in IoTDB, similarity search takes a sequence as input, and aim to find a list of similar sequences from the database efficiently. The similarity between the two sequences depends on their Euclidean distance or other distance functions. The similarity index techniques are widely used for speeding up the similarity search.

Firstly and concretely, I'd like to demonstrate two industrial scenarios as well as the SQL and the query result format. If it makes sense, we could add SQL formats to the IoTDB as the first step of the IoTDB index mechanism.

Case 1: Erythromycin Fermentation

The figure above shows the IoTDB schema for the Erythromycin Fermentation. Some irrelevant information has been omitted. The pharmaceutical factory has many fermentors (Ferm 01, Ferm 02, etc.), each of which produces erythromycin batch by batch.

For each batch (e.g., batch-20191017), the factory will monitor some measurements in the fermentation process, like glucose feeding rate (Glu), carbon dioxide exit rate (CER), and pH value. Researchers have found that the glucose feeding rate is critical to the final erythromycin output. Therefore, analysts want to build a similarity index (e.g. RTree+PAA index) on Glu sequences for all fermenters and all batches. After building the index, the analyst inputs a Glu sequence, finds batches with similar Glu curves, and makes further analysis.

Index Creation

The SQL format of index creation statement is as follows:

 CREATE INDEX ON seriesPath
 WITH INDEX=indexName (, key=value)*

In addition to the index name and target series paths, users are allowed to pass in any index parameters in format of "key(string):value(string)", such as "PAA-dimension:8" or "leaf-size: 10000".

In this scenario, the SQL of index creation is:

 CREATE INDEX ON root.Ery.*.Glu 
 WITH INDEX=RTreePAA, PAA_dim=8

Index Query

The analyst hopes to find the two Glu sequences closest to the given query Glu sequence in all fermenters and batches. The index query statement is as follows:

 SELECT TOP 2 Glu 
 FROM root.Ery.* 
 WHERE Glu LIKE (0, 120, 20, 80, ..., 120, 100, 80, 0)

Recognizing the keyword LIKE, IoTDB assigns this query statement to the indexing mechanism. For consistency with other IoTDB query results, retrieved sequences are listed as separate columns and aligned according to timestamps.

As shown in the above figure, two sequences closest to the given query sequence are Ferm01-batch-20191018 and Ferm02-batch-20191024.

Note that after a fermentation batch is completed, the series has been fixed, and will not be appended or renewed. Because of the regular productivity, we can suppose that all series among batches and fermenters have similar lengths.

Index Drop

The index drop statement is as follow:

 DROP INDEX indexName ON seriesPath

In this case, the drop SQL is:

 DRRP INDEX RTreePAA ON root.Ery.*.Glu

Case 2: Extreme Operating Gust

The above figure shows the IoTDB schema of a wind turbine manufacturer. The manufacturer has many wind turbines (AZQ01, AZQ02, etc.), each of which continuously monitors the state of the turbine itself and the surrounding environment, such as wind speed, wind direction, generator power, etc. The analyst hopes to build an index (e.g. ELB index) for the speed series of AZQ02 (root.Wind.AZQ02.Speed). After building the index, the analyst inputs a EOG pattern (Extreme Operating Gust), finds all results in speed series for further process control, fault diagnosis and predictive maintenance.

Index Creation

The index creation statement is as follows, similar to the above case:

 CREATE INDEX ON root.Wind.AZQ02.Speed
 WITH INDEX=ELBIndex, Block_Size=5

Index Query

The query condition and the result format are the main differences from the above scenario.

The analyst input a pattern of extreme operating gust and wishes to find all similar subsequence on root.Wind.AZQ02.Speed. The query statement is as follows:

 SELECT Speed.* FROM root.Wind.AZQ02
 where Speed 
 CONTAIN (15, 14, 12, ..., 12, 12, 11) WITH TOLERANCE 1 
 CONCAT  (10, 20, 25, ..., 24, 14, 8) WITH TOLERANCE 2 
 CONCAT  (8, 9, 10, ..., 14, 15, 15) WITH TOLERANCE 1

The pattern contains three sub-patterns with different searching thresholds of which the second sub-pattern is 2 and the other two are 1.

In this case, the result is the subsequence of long series, which does not exist in the original IoTDB schema. For consistency with other IoTDB query results, each retrieved subsequence is listed as individual columns, and the column name is the concatenation of the series pathname and the start time of the subsequence. As shown in the above figure, there are two retrieved subsequences, which started from 2019-10-18 12: 30: 00.000 and 2019-10-18 12: 30: 10.000 respectively.

Since the monitoring process is continuous, the new-coming data will be inserted into the index continuously.

Index Drop

Similar to the above case.

Space shortcuts

Page tree

Case 1: Erythromycin Fermentation

Index Creation

Index Query

Index Drop

Case 2: Extreme Operating Gust

Index Creation

Index Query

Index Drop

1 Comment

Rong Kang