Introduction
This document proposes the addition of DEFAULT clause to Hive. DEFAULT clause is a domain constraint which lets user specify a value for domain i.e. column to be used in absence of user specified value i.e. in absence of column reference. Note that this doesn’t not propose to implement DEFAULT ON NULL like ORACLE which lets user specify DEFAULT value for explicit NULLs.
Background
Hive currently let users declare following constraints:
PRIMARY KEY
FOREIGN KEY
UNIQUE
NOT NULL
DEFAULT will be fifth addition to this list. Note that unlike existing constraints DEFAULT will only support ENABLE/ENFORCED or DISABLE keyword. VALIDATE/NOVALIDATE and RELY/NORELY will not be supported since ENABLING DEFAULT will not change existing data and optimizer will not make use of RELY keyword.
Proposed Changes
Create Table
CREATE TABLE will be updated to let user specify DEFAULT if following ways
With column definition
CREATE TABLE <tableName> (<columnName> <dataType> DEFAULT <defaultValue>)
With constraint specification
CREATE TABLE <tableName> (<columnName> <dataType>, …, CONSTRAINT <constraintName> DEFAULT <defaultValue> (<columnName>)
To be compliant with SQL standard Hive will only permit default values which falls in one of following category:
LITERAL
DATE TIME VALUE FUNCTION i.e. CURRENT_TIME, CURRENT_DATE
CURRENT_USER()
CURRENT_DATABASE()
NULL
INSERT
Anytime user doesn’t specify a value explicitly for a column its default value will be used if defined e.g
INSERT INTO <tableName>(co1, col3) values(<val1> , <val2>)
Above statement doesn’t specify value for col2 so system will use the default value for col2 if it is defined.
On the other hand if user specify explicit value including NULL e.g.
INSERT INTO <tableName>(col1, col2, col3) values (<val1>, <val2>, <val3>)
Default value will not be used.
Above are also valid for all following type of DMLs:
INSERT INTO
INSERT SELECT
MERGE
PARTITION COLUMNS
Since in Hive query can not be written without referencing partition columns there could not be situation where value for partition column isn’t explicit. Therefore having DEFAULT for partition columns will not make sense and we propose to not add it.
EXTERNAL TABLE
We plan to disallow DEFAULT for external table since the data isn’t managed by Hive.
ACID/MM TABLE
DEFAULT constraint will be allowed and behavior will be same as non-acid tables.
METASTORE SCHEMA
We propose to add column DEFAULT_VALUE to KEY_CONSTRAINTS table in metastore schema to store DEFAULT VALUE.
OTHER DDL
Following DDLs will be updated to accommodate DEFAULT:
ALTER TABLE CHANGE COLUMN
ALTER TABLE DROP CONSTRAINT
Proposed Design
Currently if INSERT query is missing a value in INSERT i.e. if user hasn’t specified a value for column Hive uses ‘NULL’ as default. Hive compiler detects this missing column value at compile time and inserts a NULL. We propose to update this logic to check for default and use that instead of NULL value.
Along with this logic change we foresee following changes:
Metastore code will need to be updated to support DEFAULT constraint
We propose to store/serialize default value as string after it is evaluated and constant folded.
DEFAULT_VALUE will be added to KEY_CONSTRAINTS table in metastore schema.
Hive Parser will need to be updated to allow new DEFAULT keyword with default value
Error handling/Validation logic needs to be added to make sure DEFAULT value conforms to allowed categories during CREATE TABLE.
Type check to make sure DEFAULT VALUE type is compatible with corresponding column type.