Pre-set masking algorithms

Delphix Compliance Services comes with a growing number of pre-set algorithms, as shown in the selection when creating a new domain. Below are some examples of these. 

The full list of pre-set algorithms can be found by navigating to Compliance → Algorithms. The listed algorithms can be expanded to show Details and Utilization, where the associated Domain is shown.

Algorithm name, followed by examples:

Account

  • "6379315274824970" → "64893"

  • "ABCxyz123" → "72345"

  • "ID3938491" → "72433"

Address Line

  • "49 Main St" → "55 BLUE DR"

  • "1947 Highway 5" → "92 GREEN ST"

  • "9 County Route 52.5" → "1049 ORANGE CIRCLE"

Email

  • "bob@gmail.com" → "Andy.Samberg@nytimes.edu"

Last, First

  • "Lincoln, Abe" → "Campbell, Allison"

  • "George Washington" → "Douglas, Alfred"

  • "teddy" → "Smith, Jack"

Driver License

  • "6379315274824970" → "865345234"

  • "ABCxyz123" → "952731585"

  • "US949382" → "164927562"

Digits

  • "6379315274824970" → "8345698341375224"

  • "99" → "05"

  • "ABCxyz123" → "ABCxyz391"

  • "0" → "6"

Web URLs

  • "www.google.com" → "http://www.blogspot.com"

  • "delphix.com" → "http://www.gaurdian.co.uk"

  • "https://en.wikipedia.org/wiki/Syslog#References" → "http://www.newegg.com"

Different frameworks can be used to create algorithms that target the same data domains. For example, both a secure lookup algorithm and a character mapping algorithm can be used to de-identify first names. Some frameworks are best suited for certain tasks, have different parameters, or use a different masking approach. Below are some of the most popular algorithms and a description of their function.

Supported framework details

Character Mapping

The Character Mapping framework maps text values, defined by a set of character groups, to other text values generated from the same character groups. For example, an algorithm that defines a character group of [0-9] will find all characters between 0 and 9, and replace them with values also contained within that character group.  

To elaborate further, a Character Mapping algorithm could be defined with a single character group, "[0-9]", and it might mask as follows:

  • "(603) 867-5309" → "(463) 638-0193"

  • "999-12-3456" → "453-71-6283"

  • "Call Tom at 8:00PM" → "Call Tom at 2:45PM"

Mappings are calculated algorithmically, so it is not necessary to provide the set of mapping values. The algorithm preserves any characters not assigned to a group. Any characters from the first Unicode plane can be mapped, which covers most characters used in modern languages. Other (supplementary) characters can only be preserved.

The particular set of permutations used is determined by the algorithm's key, so rekeying the algorithm will cause different outputs to be generated for each input.

The algorithm has the following properties:

  • The masked value for each input is consistent unless the algorithm is rekeyed.

  • No two text inputs produce the same text output. Collisions are possible for some data types, such as Numeric, where multiple text values, such as "001" and "1", are treated as the same value.

  • As long as at least one maskable character is present in the input, the masked value will never match the input.

  • Each masked position influences the mapping done at every other masked position.

For these reasons, this algorithm is useful for masking columns with uniqueness requirements, such as primary and foreign key columns.
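Some of the properties listed above can be illustrated with a much simpler stand-in than the product's actual mapping, which is not described here. The sketch below is a toy under stated assumptions: it shifts each digit by a key-derived amount between 1 and 9, which keeps the output consistent, collision-free, and different from the input, but it does not reproduce the property that each masked position influences every other position. The names `mask_digits` and `DIGITS` are hypothetical.

```python
import hashlib
import hmac

DIGITS = "0123456789"  # the single character group [0-9]


def mask_digits(value: str, key: bytes) -> str:
    """Toy keyed mapping over [0-9]; characters outside the group are preserved."""
    out = []
    position = 0  # counts only maskable characters
    for ch in value:
        if ch in DIGITS:
            # Derive a per-position shift from the key (assumption: HMAC-SHA256).
            digest = hmac.new(key, str(position).encode(), hashlib.sha256).digest()
            shift = 1 + digest[0] % 9  # 1..9, so a digit never maps to itself
            out.append(DIGITS[(DIGITS.index(ch) + shift) % 10])
            position += 1
        else:
            out.append(ch)  # not in any character group: keep as-is
    return "".join(out)


masked = mask_digits("(603) 867-5309", b"example-key")
print(masked)  # same punctuation and spacing, different digits
```

Because every position uses a bijective shift, two different digit strings of the same shape cannot produce the same output, and changing the key changes the shifts.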

Binary Lookup

A Binary Lookup algorithm is much like the Secure Lookup algorithm, but is used when entire files are stored in a specific column. This algorithm replaces objects that appear in object columns. For example, if a bank has an object column that stores images of checks, you can use a Binary Lookup algorithm to mask those images. The Delphix Engine cannot change data within images themselves, such as the names on X-rays or driver’s licenses. However, you can replace all such images with a new, fictional image. This fictional image is provided by the owner of the original data.

Date Replacement

The Date Replacement framework masks a date value based on specified beginning and end dates. Masked output values are calculated algorithmically using the algorithm's key, so rekeying the algorithm will cause a different output value to be generated for each input. It is possible for an input to be masked to itself.

As an example, a Date Replacement algorithm with a minimum range of "2020-01-01 00:00:00" and a maximum range of "2020-01-05 00:00:00" with the unit set to Days will replace the input value with a date in the specified range. Dates may mask as follows:

  • "1995-03-05 13:25:00" → "2020-01-02 00:00:00"

  • "2021-10-13 01:59:59" → "2020-01-04 00:00:00"

  • "1856-07-31 00:00:00" → "2020-01-01 00:00:00"

Another example with a minimum range of "2020-01-01 01:00:00" and a maximum range of "2020-01-01 03:00:00" with the unit set to Hours provides 3 possible mask values:

  • "2020-01-01 01:00:00"

  • "2020-01-01 02:00:00"

  • "2020-01-01 03:00:00"

Using the same range of "2020-01-01 01:00:00" to "2020-01-01 03:00:00" but with the unit set to Minutes, there are 121 possible output values as the unit is the granularity at which time is subdivided. Note that the range is inclusive of both range values. Possible masked values may be as follows:

  • "2020-01-01 01:00:00"

  • "2020-01-01 01:14:00"

  • "2020-01-01 01:59:00"

  • "2020-01-01 02:23:00"

  • "2020-01-01 03:00:00"

All inputs with the same value masked with the same algorithm configuration will result in the same output values.
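The counts quoted above (3 values for Hours, 121 for Minutes) follow from dividing the inclusive range by the unit. A minimal sketch of that arithmetic, plus one possible way to pick an output deterministically, is shown below. The selection step (HMAC-SHA256 of the input, modulo the number of candidates) is an assumption for illustration and not the product's documented method; `possible_values` and `replace_date` are hypothetical names.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

UNITS = {
    "Days": timedelta(days=1),
    "Hours": timedelta(hours=1),
    "Minutes": timedelta(minutes=1),
}


def possible_values(minimum: datetime, maximum: datetime, unit: str) -> int:
    """Number of candidate outputs; the range is inclusive of both ends."""
    return (maximum - minimum) // UNITS[unit] + 1


def replace_date(value: str, minimum: datetime, maximum: datetime,
                 unit: str, key: bytes) -> datetime:
    """Pick one candidate deterministically from the input and the key."""
    digest = hmac.new(key, value.encode(), hashlib.sha256).digest()
    count = possible_values(minimum, maximum, unit)
    index = int.from_bytes(digest[:8], "big") % count
    return minimum + index * UNITS[unit]


low = datetime(2020, 1, 1, 1, 0, 0)
high = datetime(2020, 1, 1, 3, 0, 0)
print(possible_values(low, high, "Hours"))    # 3
print(possible_values(low, high, "Minutes"))  # 121
print(replace_date("1995-03-05 13:25:00", low, high, "Minutes", b"example-key"))
```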

Date Shift

The Date Shift framework masks date values to different dates based on a specified range around the input value. Masked values are calculated algorithmically using the algorithm's key, so rekeying the algorithm will cause different outputs to be generated for each input. All valid input values will be masked to a new value, and the new value will never match the input.

As an example, a Date Shift algorithm with a minimum value of 3 and a maximum value of 5 with the unit set to Days will shift the input value from 3 to 5 days into the future. Dates may mask as follows:

  • "2021-02-03 12:30:00" → "2021-02-06 12:30:00"

  • "1905-12-10 00:00:00" → "1905-12-15 00:00:00"

  • "2001-07-31 23:45:30" → "2001-08-04 23:45:30"

With roll enabled and the same configuration, a date at the end of a month will wrap around to the beginning of the month. Dates may mask as follows:

  • "2021-02-25 10:00:00" → "2021-02-01 10:00:00"

  • "1932-05-03 01:15:15" → "1932-05-08 01:15:15"

  • "1999-08-31 18:30:00" → "1999-08-03 18:30:00"

All inputs with the same value masked with the same algorithm configuration will result in the same output values.
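The shift and roll behavior in the examples above can be sketched as follows. The wrap rule used here (day-of-month arithmetic modulo the length of the month) is inferred from the examples and is an assumption, as is the function name `shift_date`. How the shift amount is derived from the algorithm's key is outside this sketch, so the shift is passed in directly.

```python
import calendar
from datetime import datetime, timedelta


def shift_date(value: datetime, days: int, roll: bool = False) -> datetime:
    """Shift a date forward by `days`; with roll, wrap within the same month."""
    if not roll:
        return value + timedelta(days=days)
    days_in_month = calendar.monthrange(value.year, value.month)[1]
    # Assumed wrap rule: count days from 0, add the shift, wrap at month length.
    rolled_day = (value.day - 1 + days) % days_in_month + 1
    return value.replace(day=rolled_day)


print(shift_date(datetime(2001, 7, 31, 23, 45, 30), 4))          # 2001-08-04 23:45:30
print(shift_date(datetime(2021, 2, 25, 10, 0, 0), 4, roll=True))  # 2021-02-01 10:00:00
print(shift_date(datetime(1999, 8, 31, 18, 30, 0), 3, roll=True)) # 1999-08-03 18:30:00
```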

Email

The Email framework masks string values by splitting the input on the '@' symbol and independently masking the name and domain portions of the email address. Masked values are calculated algorithmically using the algorithm's key, so rekeying the algorithm will cause different outputs to be generated for each input. All inputs to this framework are valid and the framework will not generate non-conformant data events. 

As an example, an Email algorithm that uses Lookup Value to mask the name portion and Replacement Text to mask the domain portion with the following configuration:

Lookup file:

  • Amy

  • Bob

  • Jake

  • Katherine

Replacement text: example.com

May mask as:

  • "albert@delphix.com" → "Bob@example.com"

  • "albert@gmail.com" → "Bob@example.com"

  • "andrew_smith_123@delphix.com" → "Katherine@example.com"

Another example that uses the Algorithm option for both the name and domain portion with the following configuration:

Name algorithm: dlpx-core:FirstName

Domain algorithm: dlpx-core:CM Alpha-Numeric

May mask as follows:

  • "bob@gmail.com" → "alton@dqpnx.fsy"

  • "bob@hotmail.com" → "alton@poatzdw.bya"

  • "alex@gmail.com" → "jameel@dqpnx.fsy"

  • "joe_123@yahoo.com" → "miryam@wbpaq.kts"

The Email framework will not generate non-conformant data events, but the chained algorithm may generate such events.

All inputs with the same value masked with the same algorithm configuration will result in the same output values.
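The split-and-mask flow of the first configuration (Lookup Value for the name, Replacement Text for the domain) might look like the sketch below. The hash-based selection from the lookup list is an assumption, as is the handling of inputs without an '@' symbol, which the text above does not specify. `lookup` and `mask_email` are hypothetical names.

```python
import hashlib
import hmac


def lookup(value: str, values: list[str], key: bytes) -> str:
    """Pick a replacement deterministically (assumed: HMAC-SHA256 modulo list size)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    return values[int.from_bytes(digest[:8], "big") % len(values)]


def mask_email(address: str, names: list[str], replacement_domain: str,
               key: bytes) -> str:
    """Mask the name and domain portions independently."""
    name, at, _domain = address.partition("@")
    masked_name = lookup(name, names, key)
    if not at:
        # Assumption: with no '@', treat the whole input as the name portion.
        return masked_name
    return f"{masked_name}@{replacement_domain}"


names = ["Amy", "Bob", "Jake", "Katherine"]
for address in ("albert@delphix.com", "albert@gmail.com", "andrew_smith_123@delphix.com"):
    print(address, "->", mask_email(address, names, "example.com", b"example-key"))
```

Because the name portion is masked on its own, "albert@delphix.com" and "albert@gmail.com" receive the same masked name regardless of domain.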

Free Text Redaction

A Free Text Redaction framework helps to remove sensitive data that appears in free-text columns such as “Notes.” This type of algorithm requires some expertise to use because it must be set to recognize sensitive data within a block of text.

The algorithm uses a list of lookup words to determine what information it needs to mask. Decide which words the algorithm should search for to find sensitive material, such as addresses. For example, the algorithm can be set to look for “St,” “Cir,” “Blvd,” and other words that suggest an address. Pattern matching can also be used to identify potentially sensitive information. For example, a number that takes the form 123-45-6789 is likely to be a Social Security Number. Lookup words and regular expressions match individual words within the input text, rather than phrases.

This framework can either hide or show designated information, depending on whether it is configured with a DenyList or an AllowList.

DenyList

Designated material will be redacted (removed). For example, a deny list can be set to hide patient names and addresses. The deny list feature will match the data in the lookup file to the input.

AllowList

ONLY designated material will be visible. For example, if a drug company wants to assess how often a particular drug is being prescribed, an allow list can be used so that only the name of the drug will appear in the notes.

Below is an example of the Free Text Redaction framework.

Input: The customer Bob Jones is satisfied with the terms of the sales agreement. Please call to confirm at 718-223-7896.

Redact type: DenyList

Lookup file:

  • Bob

  • Jones

  • agreement

Lookup file redaction value: XXXX

Regular expressions entry: [0-9]{3}-[0-9]{3}-[0-9]{4}

Regular expression redaction value: YYYY

MASKING RESULT: The customer XXXX XXXX is satisfied with the terms of the sales XXXX. Please call to confirm at YYYY.

"Bob", "Jones", "agreement" and the phone number are redacted.
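The DenyList example above can be reproduced with a short sketch. It assumes case-sensitive, whole-word matching for lookup words and uses Python's `re` module for the regular expression; the function name `redact_deny_list` and its parameters are hypothetical.

```python
import re


def redact_deny_list(text, lookup_words, lookup_redaction, patterns, pattern_redaction):
    """Redact regex matches first, then any individual word found in the lookup list."""
    for pattern in patterns:
        # A callable replacement avoids treating the redaction value as a template.
        text = re.sub(pattern, lambda _m: pattern_redaction, text)

    words = set(lookup_words)

    def replace_word(match):
        word = match.group(0)
        return lookup_redaction if word in words else word

    # Match word by word so surrounding punctuation is preserved.
    return re.sub(r"\w+", replace_word, text)


source = ("The customer Bob Jones is satisfied with the terms of the sales "
          "agreement. Please call to confirm at 718-223-7896.")
print(redact_deny_list(
    source,
    lookup_words=["Bob", "Jones", "agreement"],
    lookup_redaction="XXXX",
    patterns=[r"[0-9]{3}-[0-9]{3}-[0-9]{4}"],
    pattern_redaction="YYYY",
))
```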

Full Name

The Full Name framework has logic to recognize the parts of the input that correspond to first and last names, and to handle particles (which are imported from the chained Last Name algorithm instance). It can also limit the number of masked first names (removing the rest) and smart-trim the masked output to the required length.

After distinguishing the parts of the input string, the Full Name algorithm feeds each word of the first name part (which also includes middle names, treated the same as first names) to the instance of the First Name algorithm, and the whole last name part to the instance of the Last Name algorithm. It then combines the masking results according to the embedded logic and the configuration.

If the input string contains only a single word, this word is considered a first name or a last name (depending on the Consider Single Word Input as Last Name flag) and is forwarded for masking to the corresponding chained algorithm instance. Single-word input is always masked, even if it contains a configured particle.

Main features of the Full Name Framework:

  • Deterministic output: The masked result for each input is consistent when using the same algorithm key, same configuration and same chained algorithm instances.

  • Not unique: The masked result might be the same for different inputs.

  • Garbage in, garbage out: The algorithm returns the unmasked input, null, or an empty string if the input is one of the following: null, an empty string “”, white space only “ ”, or a single non-alphanumeric symbol (for example “!”).

  • Single word input: Considered either as a Last Name (default) or as a First Name (even if configured in one of the particles files).

  • When particle is configured in both particles files: The remove action takes precedence.

  • Multiple first names: Masks only the first N names (1-4, as configured; default = 2); the rest are ignored.

  • Full Name Convention: If the configured last name separator is detected, or the configured convention is “last-first-middle”, then the input is treated as last-first-middle; otherwise it is treated as first-middle-last (default). Leading and trailing white space is not preserved.

  • Smart trim: If the masked value must be trimmed, trimming is done in a way that keeps a realistic-looking full name for as long as possible. For instance, the leading and trailing preserved particles are trimmed first. If that is not enough, the masked first and middle names are abbreviated (one by one, starting with the last one). If that is still not enough, the particles before the last name are removed, and so on.

Below is an example of smart trim. Let's suppose our masked result (prior to checking of the maxLength) is:

“President George Herbert Walker Van Bush Jr.”

The chained instances used for First Name and Last Name masking must be existing extensible algorithm instances that mask the String type. Although any String type extensible algorithm instance can be used, instances based on the Name framework are recommended.

Name

The extensible Name algorithm framework co-exists with the legacy FIRST NAME SL and LAST NAME SL algorithms. The Name framework provides masking functionality for String type input. It is based on the Secure Lookup mechanism and includes additional configuration flags that make it more flexible and robust.

Similar to Secure Lookup, it creates masking results which are deterministic (i.e. the same algorithm with the same configuration and security key will provide the same result for the same input) and not unique. If a framework with algorithm(s) that provide unique masking results is needed, consider exploring others like Character Mapping.

This framework uses the SHA-256 hashing method and allows case configurations for input and output (i.e. masked) values. It also allows filtering accents and configuring the maximum length of the masked value. If the input name is a multi-word string, it might contain particles related to the name. Particles are any prefixes, suffixes, titles, etc. This framework allows configuring which particles are removed and which are preserved.
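To make the accent-filtering and case options concrete, the sketch below normalizes the input before hashing it with SHA-256 and selecting a lookup value. It is an illustration only: how the key is combined with the input, the flag names `filter_accents` and `case_sensitive`, and the function names are all assumptions, and particle handling and maximum length are left out.

```python
import hashlib
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def mask_name(value: str, lookup_values: list[str], key: bytes,
              filter_accents: bool = True, case_sensitive: bool = False) -> str:
    """Normalize the input, hash it with SHA-256, and pick a lookup value."""
    normalized = strip_accents(value) if filter_accents else value
    if not case_sensitive:
        normalized = normalized.casefold()
    # Assumption: the key is simply prepended to the normalized input.
    digest = hashlib.sha256(key + normalized.encode("utf-8")).digest()
    return lookup_values[int.from_bytes(digest[:8], "big") % len(lookup_values)]


lookup_values = ["Alton", "Jameel", "Miryam", "Allison"]
# With accents filtered and case ignored, these inputs hash identically.
print(mask_name("José", lookup_values, b"example-key"))
print(mask_name("JOSE", lookup_values, b"example-key"))
```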

Payment Card

The Payment Card framework masks payment card numbers based on the starting digits to be preserved and the minimum number of positions to be masked. This framework is built on top of the Character Mapping Algorithm Framework with a character set of [0-9]. All characters outside of this character group remain unmasked. Masked values are calculated algorithmically using the algorithm's key, so rekeying the algorithm will cause different outputs to be generated for each input. The last digit may remain the same if the calculated check digit is equivalent to the last digit of the input. Any inputs with more than one digit will never mask to the original value.

Any inputs with a single digit will remain unmasked.

This framework preserves the validity of the payment card number using the Luhn check. All input values with valid Luhn checks will be masked to values with valid Luhn checks. All input values with invalid Luhn checks will be masked to values with invalid Luhn checks.

As an example, a Payment Card algorithm with a minMaskedPositions value of 6 and a preserve value of 6 may mask as follows:

  • "5419033646326699" → "5419036803270758"

  • "5419-0336-4632-6699" → "5419-0368-0327-0758"

  • "5319abc0339def4632ghi6599!" → "5319abc0364def1507ghi4137!"

All inputs with the same sequence of digits masked with the same algorithm configuration will result in the same output values.
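The Luhn check that this framework preserves is a standard checksum, and the examples above can be verified against it. The sketch below covers only the check and the check-digit calculation, not the masking itself; the function names are hypothetical.

```python
def digits_of(value: str) -> str:
    """Keep only characters in [0-9]; everything else is ignored by the check."""
    return "".join(ch for ch in value if ch in "0123456789")


def luhn_sum(digits: str) -> int:
    """Sum digits from the right, doubling every second one."""
    total = 0
    for index, ch in enumerate(reversed(digits)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total


def luhn_valid(value: str) -> bool:
    digits = digits_of(value)
    return bool(digits) and luhn_sum(digits) % 10 == 0


def luhn_check_digit(payload: str) -> int:
    """Digit that makes payload + digit pass the Luhn check."""
    return (10 - luhn_sum(payload + "0") % 10) % 10


print(luhn_valid("5419-0336-4632-6699"), luhn_valid("5419-0368-0327-0758"))  # True True
print(luhn_valid("5319abc0339def4632ghi6599!"))                              # False
print(luhn_check_digit("541903680327075"))                                   # 8
```

In the examples above, the valid input masks to a valid output and the invalid input masks to an invalid output, which matches the stated behavior.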

Regex Decompose

The Regex Decompose framework masks values that match specified Java 8 regular expressions. The algorithm attempts to match the algorithm input against each regular expression, and once a match is found, the associated action is applied to transform either the entire input, or each capturing group (parts of the input) defined by the expression. A fallback action may be provided for use when none of the defined regular expressions match the input. If no fallback action is defined and an input fails to match any of the defined regular expressions, the algorithm may be configured to generate a non-conformant data exception.

Capturing groups are used in regular expressions to create subgroups. These can be expressed in regular expressions using parentheses to group characters together. This algorithm allows for different capturing groups to be assigned different mask actions. Nested capturing groups are unsupported and may lead to unpredictable behavior. If no capturing groups are defined, the first action is applied to the entire match. In this case, the action list should contain only one action.

Regex Decompose algorithms can only be created through the API; see API Calls for Creating Algorithms - Regex Decompose.

As an example, a Regex Decompose algorithm with the following configuration:

  • Mask Pattern:

    • Regular Expression: "[0-9]*"

    • Action: Redact

    • Redact String: "redacted"

  • Require Mask: false

  • Trim Input: true

  • Maximum Input Length: 10

Will produce masked results as follows:

  • "12345" → "redacted"

  • " 6789 " → " redacted "

  • "12345678901" → non-conformant data

    • exceeds maximum input length

  • "abc123" → "abc123"

    • remains unmasked since it does not match the regex pattern

The provided regular expression matches any inputs with 0 or more digits in the range [0-9] and any inputs that match will be replaced with the string "redacted". Any inputs that contain characters outside of the range [0-9] will not be masked. If require mask was set to true, the last example "abc123" would trigger a non-conformant data event as the value would not be masked by the algorithm.

Another example that includes capturing groups with the following configuration:

  • Mask Pattern:

    • Regular Expression: "([1-9]*)-([a-z]*)"

    • Action 1: Redact

      • Redact Character: 'X'

    • Action 2: Preserve

  • Require Mask: true

  • Trim Input: true

  • Maximum Input Length: 10

  • Fallback Action: Redact

    • Redact String: "redacted"

Will produce masked results as follows:

  • "12345-abc" → "XXXXX-abc"

  • "abc-123" → "redacted"

    • does not match the pattern so the fallback action is applied

  • "1-a" → "X-a"

  • "-" → "redacted"

    • does match the pattern but the masked output would be "-" which breaks the requirement that the output must be different from the input so the fallback action is applied

  • "redacted" → non-conformant data

    • does not match the pattern so the fallback action is applied but the fallback action does not change the value so it fails the requirement that the input must be masked

The provided regular expression matches any inputs with 0 or more digits in the range [1-9], a dash, and 0 or more characters in the range [a-z]. Any inputs that do not match that pattern will be masked by the fallback action. If the fallback action fails to change the input, a non-conformant data event will occur.

All inputs with the same input value masked with the same algorithm configuration will result in the same output values.
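The match, per-group action, fallback, and require-mask flow from the two examples above can be sketched as follows. This is an illustration under assumptions, not the product's implementation: Python's `re` stands in for Java regular expressions, the whole (trimmed) input must match the pattern, the length limit is checked after trimming, and trimmed whitespace is reattached to the result (inferred from the " 6789 " example). All function and parameter names are hypothetical.

```python
import re


class NonConformantData(Exception):
    """Stands in for a non-conformant data event."""


def redact_with(char):
    return lambda text: char * len(text)


def replace_with(string):
    return lambda text: string


def preserve(text):
    return text


def regex_decompose(value, pattern, actions, fallback=None,
                    require_mask=False, trim_input=True, max_length=None):
    core = value.strip() if trim_input else value
    lead = value[: len(value) - len(value.lstrip())] if trim_input else ""
    trail = value[len(lead) + len(core):] if trim_input else ""
    if max_length is not None and len(core) > max_length:
        raise NonConformantData("exceeds maximum input length")

    match = re.fullmatch(pattern, core)
    if match is None:
        masked = None
    elif match.re.groups == 0:
        masked = actions[0](core)  # no capturing groups: act on the whole match
    else:
        parts, last = [], 0
        for index, action in enumerate(actions, start=1):
            if match.group(index) is None:
                continue  # group did not participate in the match
            parts.append(core[last:match.start(index)])  # text between groups
            parts.append(action(match.group(index)))
            last = match.end(index)
        parts.append(core[last:])
        masked = "".join(parts)

    if masked is None or (require_mask and masked == core):
        if fallback is not None:
            masked = fallback(core)
        elif masked is None:
            masked = core  # no match and no fallback: leave unmasked
        if require_mask and masked == core:
            raise NonConformantData("value was not masked")
    return lead + masked + trail


second = dict(pattern=r"([1-9]*)-([a-z]*)", actions=[redact_with("X"), preserve],
              fallback=replace_with("redacted"), require_mask=True, max_length=10)
print(regex_decompose("12345-abc", **second))  # XXXXX-abc
print(regex_decompose("-", **second))          # redacted
```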

Secure Lookup

This is the most commonly used type of algorithm because it is easy to generate and works with different languages. When this algorithm replaces real, sensitive data with fictional data, different inputs can be masked to the same output; these repeats are known as “collisions.” For example, the names “Tom” and “Peter” could both be masked as “Matt”. Because names and addresses naturally recur in real data, this mimics an actual data set. However, when the Continuous Compliance engine must mask all data into unique outputs, the Character Mapping algorithm is better suited.

Case sensitive lookups are optional. Masked output casing can also be customized to preserve case of the lookup file, preserve input value case, force all uppercase, or force all lowercase. 
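Collisions follow directly from mapping many inputs onto a smaller list of replacement values. The sketch below shows the idea with a keyed hash; the use of HMAC-SHA256, the `case_sensitive` flag name, and the function name `secure_lookup` are assumptions for illustration rather than the product's documented internals.

```python
import hashlib
import hmac


def secure_lookup(value: str, lookup_values: list[str], key: bytes,
                  case_sensitive: bool = True) -> str:
    """Map an input to one of the lookup values, deterministically for a given key."""
    text = value if case_sensitive else value.casefold()
    digest = hmac.new(key, text.encode("utf-8"), hashlib.sha256).digest()
    return lookup_values[int.from_bytes(digest[:8], "big") % len(lookup_values)]


inputs = ["Tom", "Peter", "Alice", "Maria"]
lookup_values = ["Matt", "Jane", "Omar"]
masked = {name: secure_lookup(name, lookup_values, b"example-key") for name in inputs}
print(masked)  # four inputs, three possible outputs: at least two inputs collide
```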

Tokenization

The Tokenization framework allows for masking data and reversing its masking. For example, a Tokenization algorithm can be used to mask data before it is sent to an external vendor for analysis. The vendor can then identify accounts that need attention without having any access to the original, sensitive data. Once the vendor’s feedback is obtained, you can reverse the masking and take action on the appropriate accounts.

The Tokenization algorithm is designed to be used in Tokenization/Re-Identification jobs, though it can also be used in Masking.

The algorithm tokenizes values using AES-128 encryption in CBC-CTS mode, with an optional initialization vector (IV), and Base64 encoding. The results are alpha-numeric strings that are longer than the original values. If the result is too long to fit in the field, the algorithm can be configured to either (a) fall back to a reversible masking algorithm, which produces a result that is the same length as the original value, or (b) fail the job.

The algorithm has the following properties:

  • The masked value for each input is consistent when using the same algorithm and the initialization vector length is 0. Changing the key for the algorithm or using an initialization vector length greater than 0 will result in different masked values.

  • As long as at least one maskable character is present in the input, the masked value will never match the input.

  • The algorithm used to mask a value can change depending on the length of the input.

  • The algorithm only works on string data types. Numbers can be masked if the column data type is a String type, such as VARCHAR or TEXT.

This new algorithm framework was introduced in version 6.0.13.0 to replace the existing Tokenization algorithm and adds the ability to select a fallback algorithm. Below is example data showing before and after Tokenization:

Before

1,Erasmus,245 Park Ave,123-45-6789
2,Salathiel,245 park ave,123-45-6789
3,Salathiel,1003 Stant Drive,111-11-1111

After

ID,fname,address,ssn
1,FQL71CmqK/pkd8B2vVP903O4+/krT91dscS0rKQRACQ=,XFLst0IcSbOa2UlEOmlACPkcaOEVczZsEdxl225kF1M=,x6tJ4eyL4it4ji84h8PzoCW4QBZphEqDOy3hEj4h1jE=
2,4bGZoCLpbV2zAMsTkcc5lMTBKksvOP+tfAWucq+BnKM=,OA9dJ5HN5oRx18ZYo1f5Y8DofvhFoRo98cuQHZ7YeEo=,Evj+LnETt7ABbXlTDPyNvvJe8WJnrhEWeS0lqtqrr4U=
3,Ll4T49FrCBYRibOAKOY4vbnswbOn1RpqBU97EGg4RvA=,f6AR0T+HBoTW7+l0e8ok9rImj872PUnYYNYMDYSy4dw=,wYMvEhktV371kqH607afJHZloT+4DYNJxehWIcPZJzI=