When running Presidio on a Kubernetes cluster, you can set up a Kubernetes CronJob to scan your data periodically. You will need to configure the scan's input source and the destination where the analyzed and anonymized results will be stored.
A detailed design of the Ingress Controller and the API Service can be found here.

Each scheduled scan:

- Retrieves all new items in the provided storage.
- Analyzes and (optionally) anonymizes these new items.
- Outputs the data to a configured destination.
- Marks the items as scanned using a Redis cache.
To schedule a periodic data scan, create the following JSON request.
Note: the example below uses HTTPie syntax.
```sh
echo -n '{
  "Name": "scan-job",
  "trigger": {
    "schedule": {
      "recurrencePeriod": "* * * * *"
    }
  },
  "scanRequest": {
    "analyzeTemplate": {
      "fields": [
        {
          "name": "PHONE_NUMBER"
        },
        {
          "name": "LOCATION"
        },
        {
          "name": "EMAIL_ADDRESS"
        }
      ]
    },
    "anonymizeTemplate": {
      "fieldTypeTransformations": [
        {
          "fields": [
            {
              "name": "PHONE_NUMBER"
            }
          ],
          "transformation": {
            "replaceValue": {
              "newValue": "<PHONE_NUMBER>"
            }
          }
        },
        {
          "fields": [
            {
              "name": "LOCATION"
            }
          ],
          "transformation": {
            "redactValue": {}
          }
        },
        {
          "fields": [
            {
              "name": "EMAIL_ADDRESS"
            }
          ],
          "transformation": {
            "hashValue": {}
          }
        }
      ]
    },
    "scanTemplate": {
      "cloudStorageConfig": {
        "blobStorageConfig": {
          "accountName": "<ACCOUNT_NAME>",
          "accountKey": "<ACCOUNT_KEY>",
          "containerName": "<CONTAINER_NAME>"
        }
      }
    },
    "datasinkTemplate": {
      "analyzeDatasink": [
        {
          "dbConfig": {
            "connectionString": "<CONNECTION_STRING>",
            "tableName": "<TABLE_NAME>",
            "type": "<DB_TYPE>"
          }
        }
      ],
      "anonymizeDatasink": [
        {
          "cloudStorageConfig": {
            "blobStorageConfig": {
              "accountName": "<ACCOUNT_NAME>",
              "accountKey": "<ACCOUNT_KEY>",
              "containerName": "<CONTAINER_NAME>"
            }
          }
        }
      ]
    }
  }
}' | http <api-service-address>/api/v1/projects/proj1/schedule-scanner-cronjob
```
`analyzeTemplate` defines which fields the input should be scanned for. A list of all the supported fields can be found here.

`anonymizeTemplate` defines the anonymization method to execute for each field. If it is not provided, anonymization will not be performed.
`scanTemplate` defines the job's input source. Use the configuration that matches the desired input; see the storage configuration examples below.
- Supported storage solutions:
- Azure Blob Storage
- AWS S3
- More data types will be added soon!
`datasinkTemplate` defines the job's output destinations. The `analyzeDatasink` and `anonymizeDatasink` arrays define the output destinations of the analyze and anonymize results, respectively. Use the configuration that matches the desired output.
- Supported storage solutions:
- Azure Blob Storage
- AWS S3
- Supported database solutions:
- MySQL
- SQL Server
- SQLite3
- PostgreSQL
- Oracle
- Supported streams solutions:
- Azure EventHub
- Kafka
For AWS S3, use the following configuration:
"cloudStorageConfig": {
"S3Config": {
"accessId": "<AccessId>",
"accessKey": "<AccessKey>",
"region": "<Region>",
"bucketName": "<BucketName>"
}
}
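For example, to scan an S3 bucket instead of an Azure Blob Storage container, the `scanTemplate` from the request above would look like the following sketch (the placeholder values are illustrative):

```json
"scanTemplate": {
  "cloudStorageConfig": {
    "S3Config": {
      "accessId": "<AccessId>",
      "accessKey": "<AccessKey>",
      "region": "<Region>",
      "bucketName": "<BucketName>"
    }
  }
}
```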
For Azure Blob Storage, use the following configuration:
"cloudStorageConfig": {
"blobStorageConfig": {
"accountName": "<AccountName>",
"accountKey": "<AccountKey>",
"containerName": "<ContainerName>"
}
}
We use the Xorm library for DB operations. Please refer to Xorm's documentation for additional information about the DB configuration. Example connection string formats:
- MySQL: `<userName>@<serverName>:<password>@tcp(<serverName>.<hostName>:3306)/<databaseName>?allowNativePasswords=true&tls=true`
- PostgreSQL: `postgres://<userName>@<serverName>:<password>@<serverName>.<hostName>/<databaseName>?sslmode=verify-full`
- SQL Server: `odbc:server=<serverName>.database.windows.net;user id=<userId>;password=<password>;port=1433;database=<databaseName>`
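Putting it together, an `analyzeDatasink` entry that writes results to a MySQL table might look like the sketch below. The `type` value is assumed here to be the Xorm driver name (`mysql`); the table name and connection string placeholders are illustrative.

```json
"analyzeDatasink": [
  {
    "dbConfig": {
      "connectionString": "<userName>@<serverName>:<password>@tcp(<serverName>.<hostName>:3306)/<databaseName>?allowNativePasswords=true&tls=true",
      "tableName": "<TABLE_NAME>",
      "type": "mysql"
    }
  }
]
```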
For Azure Event Hub, use the following configuration:
"streamConfig": {
"ehConfig": {
"ehConnectionString": "<ehConnectionString>", // EH connection string. It is recommended to generate a connection string from EH and NOT from EH namespace.
"storageAccountName": "<storageAccountName>", // Storage account name for Azure EH EPH pattern
"storageAccountKeyValue": "<storageAccountKeyValue>", // Storage account key for Azure EH EPH pattern
"containerValue": "<containerValue>" // Storage container name for Azure EH EPH pattern
}
}
For Kafka, use the following configuration:

```json
"streamConfig": {
  "kafkaConfig": {
    "address": "<address>",
    "saslUsername": "<saslUsername>",
    "saslPassword": "<saslPassword>"
  }
}
```
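As with the database example, a stream configuration plugs into the datasink arrays. For instance, an `anonymizeDatasink` entry that publishes anonymized results to Kafka might look like the following sketch (the nesting mirrors the `cloudStorageConfig` and `dbConfig` entries shown above; placeholder values are illustrative):

```json
"anonymizeDatasink": [
  {
    "streamConfig": {
      "kafkaConfig": {
        "address": "<address>",
        "saslUsername": "<saslUsername>",
        "saslPassword": "<saslPassword>"
      }
    }
  }
]
```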
Set the `recurrencePeriod` to a cron expression matching the execution interval you'd like.
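For example, to run the scan once a day at 02:00, use a standard cron expression:

```json
"trigger": {
  "schedule": {
    "recurrencePeriod": "0 2 * * *"
  }
}
```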
Parallelism is not supported! A new job won't be triggered until the previous job is finished.
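Once the request succeeds, the scheduler creates a Kubernetes CronJob in the cluster. As a quick sanity check, you can list the CronJobs and the pods spawned by their runs with `kubectl` (the namespace below is illustrative and depends on how Presidio was deployed):

```sh
# List the scheduled CronJobs and see when each last ran
kubectl get cronjobs -n presidio

# Inspect the pods created by the most recent run
kubectl get pods -n presidio
```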