Delphix Compliance Service (or DCS) for Azure is a highly scalable API-based masking service built for Azure cloud.
Azure Data Factory (or ADF) is Azure's cloud extract, transform, and load (ETL) service which enables connecting to, moving, and transforming data seamlessly.
Together, DCS for Azure and ADF can be used to deliver secure masked data to environments more quickly than ever before. This article describes the performance factors that should be considered when implementing masking in ADF pipelines with DCS for Azure and Azure SQL.
SQL database configuration
Database configuration and the throughput that a target database is able to achieve is one of the largest performance factors for overall ADF pipeline performance. The ADF pipelines described in this article do not mask data in place, but rather, copy data from one table/database to another table/database (applying masking to sensitive data between source and target table/database). This amounts to a “load” of all the rows from one table to another and can generate high CPU usage and large amounts of database log I/O on the target database. This is especially true when processing large tables/databases. Larger SQL server sizes provide more CPU and have higher log I/O throughput which can reduce the overall pipeline execution time.
Batch size (the number of rows in an API request) is an important factor in optimizing the performance of the REST stage of an ADF Mapping Data Flow. The REST stage is the portion of an ADF Data Flow where the Data Flow sends all masking requests to the DCS for Azure service. The time taken for this stage of the Data Flow represents the time taken to mask the source data.
The DCS for Azure batch masking endpoint (/v1/batchMask) enforces a maximum payload limit of 2MB. Identifying a batch size that is close to the 2MB payload limit (without exceeding it) reduces the number of overall API calls that need to be made to the DCS for Azure service. Using too small a batch size will mean the Mapping Data Flows will have increased round trips, making superfluous API calls to the DCS for Azure service.
Factors such as the number of columns to be masked, the character length of the column identifiers, and the character length of the data to be masked, all affect how many rows you will be able to post in a single batch payload.
See the DCS for Azure Best Practices document for help determining appropriate batch sizes for your datasets.
Integration Runtimes (IRs) are Apache Spark clusters that execute Data Flows. Running Data Flows in parallel will result in multiple running IRs, one for each Data Flow. Larger IRs have more memory and CPU and can parallelize operations across more Spark nodes. Larger IRs can improve masking performance and reduce execution time of the masking workload (REST stage) of an ADF pipeline, but will be more costly to run. Larger IRs and memory optimized IRs are recommended for stability and performance with larger tables.
ADF Pipeline configuration
ADF Pipelines can be configured to run one or many Data Flows; sequentially or in parallel. Executing Data Flows in parallel allows for masking multiple tables or subsets of tables in parallel, therefore masking data faster.
If it is necessary to mask tables in a certain order, ADF Pipelines will need to be configured to run Data Flows sequentially in the required order.
A single ADF Pipeline can run up to 40 activities. For masking more than 40 tables, or depending on the vCore limit and IR sizes, it could be necessary to break the Data Flows up into multiple ADF Pipelines that are executed sequentially.