TsFile's Hive connector
TsFile's Hive connector implements support for reading the external TsFile file format through Hive, enabling users to operate on TsFiles via Hive. With it, users can:
Load a single TsFile into Hive, whether the file is stored on the local file system or in HDFS
Load all TsFiles in a specific directory into Hive, whether the files are stored on the local file system or in HDFS
Query TsFile with HQL
At present, write operations are not supported by the Hive connector, so INSERT operations in HQL are not allowed
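For illustration, a table backed by a TsFile could be declared roughly as follows. This is only a sketch: the table name, columns, location, and the TBLPROPERTIES value are hypothetical, and the output-format class name is an assumption; only the SerDe and input-format class names come from this document.

```sql
-- Illustrative only: an external Hive table backed by the TsFile format.
-- Column names, the location, and the device_id value are hypothetical;
-- TSFHiveOutputFormat is assumed here (writes are not supported anyway).
CREATE EXTERNAL TABLE IF NOT EXISTS only_sensor_1(
  time_stamp TIMESTAMP,
  sensor_1 BIGINT)
ROW FORMAT SERDE 'org.apache.iotdb.hive.TsFileSerDe'
STORED AS
  INPUTFORMAT 'org.apache.iotdb.hive.TSFHiveInputFormat'
  OUTPUTFORMAT 'org.apache.iotdb.hive.TSFHiveOutputFormat'
LOCATION '/data/tsfiles/'
TBLPROPERTIES ('device_id' = 'root.device.sensor_group');

-- Read-only HQL then works as usual:
SELECT count(*) FROM only_sensor_1;
```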
Design principle
The Hive connector needs to be able to parse the TsFile file format and convert it into a row-by-row format that Hive can recognize. It also needs to format the output according to the user-defined table schema. Therefore, the implementation of the Hive connector is divided into four main parts
Slicing the entire TsFile file
Read data from slices and convert it into a data type that Hive can recognize
Parse user-defined Table
Deserialize data into Hive's output format
Concrete implementation class
Each of the four functional modules above has a corresponding implementation class. The four implementation classes are introduced below.
org.apache.iotdb.hive.TSFHiveInputFormat
This class is mainly responsible for formatting the input TsFile files. It inherits the FileInputFormat<NullWritable, MapWritable> class, in which some general formatting operations are already implemented. This class overrides the getSplits(JobConf, int) method to customize the slicing method for TsFile files, and the getRecordReader(InputSplit, JobConf, Reporter) method to generate a TSFHiveRecordReader that actually reads data from a slice.
org.apache.iotdb.hive.TSFHiveRecordReader
This class is mainly responsible for reading TsFile data from a slice.
It implements the IReaderSet interface, a set of setters for internal properties of the class, extracted mainly to remove the code duplicated between TSFRecordReader and TSFHiveRecordReader.
public interface IReaderSet {
void setReader(TsFileSequenceReader reader);
void setMeasurementIds(List<String> measurementIds);
void setReadDeviceId(boolean isReadDeviceId);
void setReadTime(boolean isReadTime);
}
Let's first introduce some important fields of this class:
private List<QueryDataSet> dataSetList = new ArrayList<>();
All QueryDataSets generated from this slice
private List<String> deviceIdList = new ArrayList<>();
The list of device names; its order is consistent with that of dataSetList, i.e., deviceIdList[i] is the device name for dataSetList[i].
private int currentIndex = 0;
The index of the QueryDataSet currently being processed
In its constructor, this class calls the initialize(TSFInputSplit, Configuration, IReaderSet, List<QueryDataSet>, List<String>) method of TSFRecordReader to initialize the class fields mentioned above. It overrides the next() method of RecordReader to return the data read from the TsFile.
next(NullWritable, MapWritable)
We can see that after the data is read from the TsFile, it is returned in the form of a MapWritable. Here MapWritable is essentially a Map, except that its keys and values have been specially adapted for serialization and deserialization. The reading process is as follows:
First, determine whether the QueryDataSet at the current position of dataSetList still has values. If it does not, increase currentIndex by 1 until the first QueryDataSet with values is found.
Then, call the next() method of that QueryDataSet to get a RowRecord.
Finally, call the getCurrentValue() method of TSFRecordReader, which puts the values of the RowRecord into the MapWritable.
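The three steps above can be sketched in plain Java, with simple collections standing in for the Hadoop and TsFile types: each QueryDataSet is modeled as an iterator of rows, and all class and field names here are illustrative stand-ins, not the real connector API.

```java
import java.util.*;

// Simplified sketch of the TSFHiveRecordReader reading loop: walk a list of
// datasets (one per device) and emit one row per call. All names are
// illustrative; the real class works on QueryDataSet and MapWritable.
public class RecordReaderSketch {
    private final List<Iterator<Map<String, Object>>> dataSetList;
    private final List<String> deviceIdList; // deviceIdList[i] pairs with dataSetList[i]
    private int currentIndex = 0;

    public RecordReaderSketch(List<Iterator<Map<String, Object>>> dataSets,
                              List<String> deviceIds) {
        this.dataSetList = dataSets;
        this.deviceIdList = deviceIds;
    }

    // Returns the next row as a map, or null when every dataset is exhausted.
    public Map<String, Object> next() {
        // Step 1: advance currentIndex until a dataset with remaining rows is found.
        while (currentIndex < dataSetList.size()
                && !dataSetList.get(currentIndex).hasNext()) {
            currentIndex++;
        }
        if (currentIndex == dataSetList.size()) {
            return null; // no more data in this slice
        }
        // Step 2: take the next row from the current dataset.
        Map<String, Object> row = new HashMap<>(dataSetList.get(currentIndex).next());
        // Step 3: attach the matching device name, as getCurrentValue() would.
        row.put("device_id", deviceIdList.get(currentIndex));
        return row;
    }

    public static void main(String[] args) {
        Iterator<Map<String, Object>> d1 =
            List.<Map<String, Object>>of(Map.of("sensor_1", 42L)).iterator();
        Iterator<Map<String, Object>> d2 = Collections.emptyIterator(); // skipped
        RecordReaderSketch reader =
            new RecordReaderSketch(new ArrayList<>(List.of(d1, d2)),
                                   List.of("root.device.a", "root.device.b"));
        System.out.println(reader.next()); // one row from root.device.a
        System.out.println(reader.next()); // null: both datasets exhausted
    }
}
```

Note that in the actual Hadoop `RecordReader` contract, `next(key, value)` fills a caller-supplied value object and returns a boolean rather than returning the row itself; the null return here is only a simplification.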
org.apache.iotdb.hive.TsFileSerDe
This class inherits AbstractSerDe and is also necessary for Hive to read data from our custom input format.
It overrides the initialize() method of AbstractSerDe. In this method, the device name, the sensor names, and the corresponding sensor types are parsed from the user's CREATE TABLE statement. It also constructs an ObjectInspector object, which is mainly responsible for data type conversion. Since TsFile only supports primitive data types, an exception must be thrown when any other data type appears. The specific construction process can be seen in the createObjectInspectorWorker() method.
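The primitive-type restriction can be sketched as follows. This is an assumption-laden illustration: the supported-type list and all names are made up for the example, and the real check lives in createObjectInspectorWorker() and operates on Hive TypeInfo objects rather than strings.

```java
import java.util.*;

// Illustrative sketch of the type check done while building the
// ObjectInspector: TsFile supports only primitive column types, so anything
// else is rejected. The SUPPORTED set is hypothetical, not the real list.
public class ObjectInspectorSketch {
    private static final Set<String> SUPPORTED =
        Set.of("boolean", "int", "bigint", "float", "double", "string", "timestamp");

    // Returns the column types unchanged, or throws for a non-primitive type.
    public static List<String> validate(List<String> columnTypes) {
        for (String t : columnTypes) {
            if (!SUPPORTED.contains(t)) {
                throw new IllegalArgumentException("Unsupported column type: " + t);
            }
        }
        return columnTypes;
    }

    public static void main(String[] args) {
        System.out.println(validate(List.of("timestamp", "bigint"))); // accepted
        try {
            validate(List.of("array<int>")); // complex type: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```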
The main responsibility of this class is to serialize and deserialize data of different file formats. Since our Hive connector only supports read operations for now and does not support insertion, only the deserialization process is needed; therefore only the deserialize(Writable) method is overridden, which in turn calls the deserialize() method of TsFileDeserializer.
org.apache.iotdb.hive.TsFileDeserializer
This class deserializes the data into Hive's output format. It has only one method, deserialize().
public Object deserialize(List<String>, List<TypeInfo>, Writable, String)
The Writable parameter of this method is the MapWritable generated by the next() method of TSFHiveRecordReader.
First, determine whether the Writable parameter is of type MapWritable; if not, throw an exception.
Then take out the sensor values of the device from the MapWritable
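These two steps can be sketched with plain Java types standing in for the Hadoop Writable classes. All names are illustrative; the real method also receives the column names and TypeInfos parsed by TsFileSerDe and produces Hive-internal objects.

```java
import java.util.*;

// Simplified sketch of the deserialization step: reject anything that is not
// the map produced by the record reader, then pull out the requested column
// values in declaration order. Plain Map stands in for MapWritable.
public class DeserializerSketch {
    public static List<Object> deserialize(List<String> columnNames, Object blob) {
        if (!(blob instanceof Map)) { // real code checks "instanceof MapWritable"
            throw new IllegalArgumentException(
                "Expected a MapWritable, got: " + blob.getClass().getName());
        }
        Map<?, ?> row = (Map<?, ?>) blob;
        List<Object> result = new ArrayList<>();
        for (String col : columnNames) {
            result.add(row.get(col)); // a missing sensor value becomes null
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> row = Map.of("time_stamp", 1L, "sensor_1", 42L);
        System.out.println(deserialize(List.of("time_stamp", "sensor_1"), row));
    }
}
```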