Hudi supports record-level deletes as of the 0.5.1 release. This post is a how-to guide for deleting records in Hudi. Deletes can be issued in three ways: through the Hudi client's RDD APIs, through the Spark datasource, and through DeltaStreamer.
Delete using RDD Level APIs
If your application embeds HoodieWriteClient, deletion is as simple as passing a JavaRDD<HoodieKey> to the delete API.
```java
// Fetch the list of HoodieKeys that need to be deleted from elsewhere,
// converting it to a JavaRDD if required.
JavaRDD<HoodieKey> toBeDeletedKeys = ...;
// Issue the delete for those keys under the given commit time.
JavaRDD<WriteStatus> statuses = writeClient.delete(toBeDeletedKeys, commitTime);
```
Deletion with Datasource
Next, we walk through an example of performing deletes on a sample dataset using the Datasource API. The Quick Start guide contains the same example, so feel free to check it out there as well.
...
Step 2: Import the required classes and set up the table name, base path, and data generator for the sample dataset
```scala
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_cow_table"
val basePath = "file:///tmp/hudi_cow_table"
val dataGen = new DataGenerator
```
...
Deletion with HoodieDeltaStreamer takes the same path as upsert and so it relies on a specific field called "_hoodie_is_deleted" of type boolean in each record.
- If a record has the field value set to false, or the field is not present, then it is considered a regular upsert.
- If not (that is, if the value is set to true), then it is considered to be a deleted record.
This essentially means that the source schema has to change to add this field, and all incoming records are expected to have it set. We will be working to relax this requirement in future releases, but for now this is how it works.
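The rule described above for "_hoodie_is_deleted" can be sketched in plain Java. This is only an illustration of the decision, not Hudi's actual implementation: records are modeled as simple maps, and the class name `DeleteMarkerSketch`, the `classify` helper, the `Operation` enum, and the `key` field are all hypothetical names introduced for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class DeleteMarkerSketch {

    // Field name taken from the text above.
    static final String DELETE_MARKER = "_hoodie_is_deleted";

    // Hypothetical enum naming the two outcomes described in the text.
    enum Operation { UPSERT, DELETE }

    // Hypothetical helper: a record (modeled here as a plain map) is treated
    // as a delete only when the marker field is present and set to true.
    // A missing marker or a value of false means a regular upsert.
    static Operation classify(Map<String, Object> record) {
        Object marker = record.get(DELETE_MARKER);
        return Boolean.TRUE.equals(marker) ? Operation.DELETE : Operation.UPSERT;
    }

    public static void main(String[] args) {
        // "key" is an illustrative field, not part of any real schema.
        Map<String, Object> noMarker = new HashMap<>();
        noMarker.put("key", "id-1");

        Map<String, Object> markedFalse = new HashMap<>(noMarker);
        markedFalse.put(DELETE_MARKER, false);

        Map<String, Object> markedTrue = new HashMap<>(noMarker);
        markedTrue.put(DELETE_MARKER, true);

        System.out.println("no marker    -> " + classify(noMarker));
        System.out.println("marker false -> " + classify(markedFalse));
        System.out.println("marker true  -> " + classify(markedTrue));
    }
}
```

Only the third record is classified as a delete. The first two follow the regular upsert path.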
Let's say the original schema is
...