Status

Current state: Committed

Discussion thread: https://lists.apache.org/thread/qsmxsymozymy6dy9tp5xw9gn5fhz9nt4

JIRA: Unable to render Jira issues macro, execution error.

Motivation

Many users have the need of masking sensitive data, such as contact info, age, gender, credit card numbers, etc. Dynamic data masking (DDM) allows to obscure sensitive information while still allowing access to the masked columns, and without changing the stored data. It is a feature available in multiple databases, but not in Apache Cassandra. This CEP aims to provide DDM for Apache Cassandra using both native and user-defined CQL functions.

DDM will be applied to specific columns once they are read from the storage engine, so unprivileged users won’t be able to see the real data that is stored in the database. Instead, they will see a redacted version of the data, such as “XXXX”. However, the stored data will remain unmasked.

There will be different types of masking, including partial masking. For example, a masking function can redact all the digits of a credit card number except the two last digits. Users will be able to provide their own masking functions if they need something more elaborate or personalized than the standard masking functions.

DDM aims to provide some degree of protection against accidental data exposure. However, it’s not a replacement for other security techniques such as column-level permissions or data encryption.

Audience

Cassandra developers and operators.

Goals

Add a set of native CQL functions providing generic masking for all, or almost all, CQL data types.
Permit attaching the native masking functions to specific table columns, so any query over those columns will return the masked data.
Allow the use of UDFs as masking functions attached to specific table columns.
Add a table-level UNMASK permission allowing certain users to both see and query the unmasked columns, assuming that they also have the SELECT permission on the table.
Add a table-level SELECT_MASKED permission allowing certain users to query but not see the unmasked columns, assuming that they also have the SELECT permission on the table.
Prevent users without UNMASK nor SELECT_MASKED permissions to indirectly guess the real value of the masked value by trying values on the WHERE clause of SELECT queries.

Non-Goals

DDM won’t mask, encrypt nor transform in any way the data that is stored in the database. The data will be dynamically transformed after reading it from the storage engine, and before presenting it to the unprivileged user.
DDM won’t provide any kind of column-level access control nor finer granularity than what the current authorization system allows. The new UNMASK and SELECT_MASKED permissions will be analogous to the current SELECT permission.
DDM won’t provide any additional protection against direct access to the sstable files.
DDM masking functions will always be scalar functions applied column value per column value. We won’t support, for now, more complex strategies such as shuffling column values. However, the proposed CQL syntax is not incompatible with that type of masking if we wanted to add it in the future.

Timeline

Dynamic data masking will target 4.2, with development happening mostly on trunk. It will be developed in the following incremental steps:

Add basic native CQL functions for masking data.
Allow to attach native masking functions to table columns with CREATE/ALTER TABLE.
Add a new UNMASK permission allowing users with that permission to see and query unmasked data. Steps 2 and 3 might be done together or in a short-lived feature branch.
Add a new SELECT_MASKED permission allowing users with that permission to query but not see unmasked data.
Allow to attach user-defined functions (UDF) as masking functions with CREATE/ALTER TABLE.
Maybe add more sophisticated and/or specific masking functions.

Proposed Changes

Masking functions

There will be a set of native CQL functions providing different types of masking for column values, such as replacing the real value of a column by a fixed value. Here is a more detailed list of possible functions:

default	The column value will be replaced by a fixed value of the same data type. Text values will be replaced by “XXXX”, date values will be replaced by “1970-01-01”, numeric values will be replaced by zero, etc.
replace by text	The column value will be replaced by a fixed text value, independently of the original data type. The replacement text will be an argument of the function.
partial	Text values will be replaced by “XXXX”, but they will expose the first m and last n characters of the real value. The lengths of the exposed prefix and suffix will be arguments of the function.
null	The column value will be replaced by null. This probably won’t be applicable to primary key columns.
hash	The column value will be replaced by its hash. We can have different functions for different hash algorithms.

This list is not exhaustive and of course we can add more functions if we think they are useful.

It would be possible to use the masking functions on SELECT queries, like any other native CQL function. For example:

> CREATE TABLE patients (
      id timeuuid PRIMARY KEY,
      name text,
      birth date);
> INSERT INTO patients(id, name, birth) VALUES (now(), 'alice', '1982-01-02');

> SELECT name, mask_with_default(name), mask_with_partial(name, 2, 1) FROM patients;

 name  | mask_with_default(name) | mask_with_partial(name, 2, 1)
-------+-------------------------+-------------------------------
 alice | XXXX                    | alXXXXe

> SELECT birth, mask_with_default(birth), mask_with_null(birth) FROM patients;

 birth      | mask_with_default(birth) | mask_with_null(birth)
------------+--------------------------+-----------------------
 1982-01-02 | 1900-01-01               | null

These native masking functions have a value on their own, even if we don't attach them to the schema as we’ll see in the next section. So adding them would probably be the first step of the implementation. Once we add a few of them and establish the common superclasses, testing framework etc., adding more functions should be easy. Adding extra functions could be a good source of easy and satisfactory tickets for newcomers.

Masked table columns

It will be possible to attach the masking functions to schema columns. By doing so, all SELECT queries involving those columns would produce masked values, without the need to specify the masking function on the SELECT query. For example:

> CREATE TABLE patients (
      id timeuuid PRIMARY KEY,
      name text MASKED WITH partial(2, 1),
      birth date MASKED WITH default()
      );
> INSERT INTO patients(id, name, birth) VALUES (now(), 'alice', '1982-12-21);

> SELECT name, birth FROM patients;

 name    | birth
---------+------------
 alXXXXe | 1900-01-01

In addition to masking native CQL functions, it will also be allowed to use regular UDFs to mask a column. That way users will be able to provide their own masking functions.

It will be possible to change the masking of a column at any moment with the ALTER TABLE command:

> ALTER TABLE patients ALTER name MASKED WITH partial(2, 1);
> ALTER TABLE patients ALTER name WITHOUT MASK;

It won't be possible to use masked columns in the WHERE and IF clauses of SELECT and UPDATE statements without the proper permissions. This is done to prevent malicious users from inferring the underlying data:

> UPDATE patients SET name='alice' WHERE id=ed976860-1e3b-11ed-814b-e5a3bd61806e IF name='alice';  

Unauthorized: Error from server: code=2100 [Unauthorized] message="User has no UNMASK nor SELECT_UNMASK permission on <table k.patients>"

> SELECT name, birth FROM patients WHERE name='alice' ALLOW FILTERING; 

Unauthorized: Error from server: code=2100 [Unauthorized] message="User has no UNMASK nor SELECT_UNMASK permission on <table k.patients>"

Although we might not support it in an initial version, materialized views should inherit the same masking as its base table, so:

> CREATE TABLE patients (
      id timeuuid PRIMARY KEY,
      name text MASKED WITH default(),
      birth date MASKED WITH default()
      );
> CREATE MATERIALIZED VIEW patients_by_name AS 
      SELECT * FROM patients 
      WHERE id IS NOT NULL AND name IS NOT NULL 
      PRIMARY KEY (name, id);
> INSERT INTO patients(id, name, birth) VALUES (now(), 'alice', '1982-12-21);

> SELECT name, birth FROM patients_by_name;

 name | birth
------+------------
 XXXX | 1900-01-01

UNMASK and SELECT_MASKED permissions

There will be a special table-level permission for unmasking values, so users would need that permission to see the clear data. That permission will be named UNMASK. Users having the UNMASK (and SELECT) permission will see the real data, as if the masking function didn’t exist. For example:

> CREATE TABLE patients (
      id timeuuid PRIMARY KEY,
      name text MASKED WITH default(),
      birth date MASKED WITH default()
      );

> INSERT INTO patients(id, name, birth) VALUES (now(), 'alice', '1982-12-21');

> CREATE USER unprivileged_user WITH PASSWORD 'xyz';
> CREATE USER privileged_user WITH PASSWORD 'zyx';

> GRANT SELECT ON ALL TABLE patients TO unprivileged_user;
> GRANT SELECT ON ALL TABLE patients TO privileged_user;
> GRANT UNMASK ON ALL TABLE patients TO privileged_user;

> LOGIN unprivileged_user
> SELECT name, birth FROM patients WHERE name='alice' ALLOW FILTERING; 

Unauthorized: Error from server: code=2100 [Unauthorized] message="User has no UNMASK nor SELECT_UNMASK permission on <table k.patients>"

> SELECT name, birth FROM patients WHERE id=db2b372f-f91b-4537-b46b-c478f8330c29;

 name    | birth
---------+------------
 alXXXXe | 1900-01-01  
            
> LOGIN privileged_user
> SELECT name, birth FROM patients WHERE name='alice' ALLOW FILTERING;

 name  | birth
-------+------------
 alice | 1982-12-21

The UNMASK permission will also allow using the masked columns in the WHERE/IF clauses of statements.

There will also be a new SELECT_MASKED permission that will allow using the masked columns in the WHERE/IF clauses of statements. However, it will not allow to see the clear data in the query results. This permission is meant to be granted to not-malicious users that are not trying to exploit the data, but generating masked data for its consumption by other agents. For example:

> CREATE TABLE patients (
      id timeuuid PRIMARY KEY,
      name text MASKED WITH default(),
      birth date MASKED WITH default()
      );

> INSERT INTO patients(id, name, birth) VALUES (now(), 'alice', '1982-12-21');  

> CREATE USER privileged_user WITH PASSWORD 'zyx';
> GRANT SELECT ON TABLE patients TO privileged_user;
> GRANT SELECT_MASKED ON TABLE patients TO privileged_user;  

> LOGIN privileged_user
> SELECT name, birth FROM patients WHERE name='alice' ALLOW FILTERING;

 name    | birth
---------+------------
 alXXXXe | 1900-01-01

New or Changed Public Interfaces

CQL’s syntax for creating and altering tables will be extended to allow the specification of masked columns. This will be done by adding new clauses to the existing statements, and they will keep backward compatibility.

The IAuthorizer, IAuthenticator and IResource interfaces will be extended with two new types of permission. This should be done without breaking binary compatibility. Existing implementations will keep working as long as DDM is not used. However, implementations will need to be updated if they want to support DDM. Masking table columns will be forbidden if the authenticator or the authorizer don’t explicitly declare that they support it.

Compatibility, Deprecation, and Migration Plan

The new feature will be backward compatible, so anything that worked before should continue working if we don’t use DDM.

We might forbid enabling DDM during rolling upgrades to avoid problems with the additions to the schema. However, depending on the implementation details, we might allow it so only the upgraded coordinator nodes will mask the data.

Clients reading tables with masked columns should be prepared to receive columns with data types different to those defined in the table metadata, depending on what masking function is used. The default masking function won’t change the column type.

Test Plan

We will include the usual unit and integration tests, testing a reasonable number of use cases. We will also include upgrade tests covering both partial and full upgrades.

Rejected Alternatives

Columns are presented with their original name after masking, so one can't know that they are masked by just looking at them. Users would only know that they are seeing masked data because they know the table schema:

> SELECT name, birth FROM patients;

name     | birth
---------+------------
alXXXXe  | 1900-01-01

An alternative approach would be replacing the name of the column with the call to the function, this way:

> SELECT name, birth FROM patients;

 mask_with_partial(name, 2, 1) | mask_with_default(birth)
-------------------------------+--------------------------
 alXXXXe                       | 1900-01-01

That would make it clear that the column is masked. However, that might be problematic for applications receiving unexpected column names, especially if the column masking has been added after the application was written. Also, keeping the original column name seems to be standard practice of other databases out there.

Other databases

The following table contains some information and links about how other databases out there deal with data masking. The (incomplete) list of databases comes from the top of DB-Engines ranking. Most of the databases on the top of the list support data masking some way or another, with Apache Cassandra being an exception.

Microsoft SQL Server / Azure SQL	SQL server has been the main reference for this implementation. It has very similar syntax for attaching masking functions to columns on CREATE TABLE. It also has a mostly identical UNMASK permission that can be assigned to users. However, as long as I know it doesn’t allow you to use the masking functions as regular database functions. It seems that UDFs cannot be used as masking functions. Tables with materialized views don’t support masking. The names of the masked columns are not changed. The documentation for DDM can be found here: https://docs.microsoft.com/en-us/sql/relational-databases/security/dynamic-data-masking
MySQL	MySQL data masking is based on a set of functions available as a server-side plugin. There isn’t the possibility of specifying masking functions as part of the column definition, nor are there specific permissions. However, the masking functions, as any other function, can be used with stored procedures and materialized views. The documentation can be found here: https://dev.mysql.com/doc/refman/8.0/en/data-masking.html This blog post is also useful: https://dev.mysql.com/blog-archive/data-masking-in-mysql/
PostgreSQL	There are extensions for PostgreSQL providing data masking through security labels. Security labels are a mechanism for allowing third parties to provide their security extensions. Interesting as it is, I think that the general approach of providing such a framework is out of scope. Other than using separate statements for specifying how the columns are masked, the syntax is not very different. Also, the masking rules can be different for each user/role. The names of the masked columns are not changed when presenting the data. Documentation can be found here: https://www.postgresql.org/about/news/postgresql-anonymizer-10-privacy-by-design-for-postgres-2452/ https://postgresql-anonymizer.readthedocs.io/en/latest/dynamic_masking/
MongoDB	MongoDB provides data masking through aggregation pipelines. Given how different MongoDB and Apache Cassandra are, I don’t think this is very relevant. Some reading can be found in these links: https://webassets.mongodb.com/mongodb_data_masking.pdf https://studio3t.com/knowledge-base/articles/mongodb-data-masking-techniques/ https://github.com/pkdone/mongo-data-masking There is also a feature called field redaction: https://www.mongodb.com/docs/manual/reference/operator/aggregation/redact/#mongodb-pipeline-pipe.-redact
IBM Db2	IBM Db2 uses a dedicated CREATE MASK statement to add masking to specific columns. The masks are defined as a kind of procedure where it’s possible to define different behaviours depending on the calling user, inside a CASE statement. The names of the columns don't change when they are masked. Documentation and examples can be found here: https://www.ibm.com/docs/en/i/7.3?topic=statements-create-mask
Oracle	Oracle database supports data masking through a feature called data redaction. Masking policies can be set on a per-user basis. The names of the masked columns are not changed. Documentation: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/asoag/configuring-oracle-data-redaction-policies.html
MariaDB	Although MariaDB doesn't support this feature, the commercial MaxScale proxy solution does. Columns are masked with rules set in a MaxScale config file. The names of the masked columns are not changed. Documentation: https://mariadb.com/resources/blog/sensitive-data-masking-with-mariadb-maxscale/
Snowflake	Masking policies are set with specific statements. The policies can be applied to specific users. The names of the masked columns are not changed. Documentation: https://docs.snowflake.com/en/user-guide/security-column-ddm-use.html

Space shortcuts

Page tree

Status

Motivation

Audience

Goals

Non-Goals

Timeline

Proposed Changes

Masking functions

Masked table columns

UNMASK and SELECT_MASKED permissions

New or Changed Public Interfaces

Compatibility, Deprecation, and Migration Plan

Test Plan

Rejected Alternatives

Other databases

Space shortcuts

Page tree

CEP-20: Dynamic Data Masking

Status

Motivation

Audience

Goals

Non-Goals

Timeline

Proposed Changes

Masking functions

Masked table columns

UNMASK and SELECT_MASKED permissions

New or Changed Public Interfaces

Compatibility, Deprecation, and Migration Plan

Test Plan

Rejected Alternatives

Other databases