Merge branch 'main' into edgarrmondragon/feat/test-valid-schema
edgarrmondragon authored Aug 13, 2024
2 parents 39b28ea + 1333278 commit d03834d
Showing 2 changed files with 45 additions and 0 deletions.
42 changes: 42 additions & 0 deletions docs/stream_maps.md
@@ -249,6 +249,8 @@ can be referenced directly by mapping expressions.
- `fake` - a [`Faker`](inv:faker:std:doc#index) instance, configurable via `faker_config`
(see previous example) - see the built-in [standard providers](inv:faker:std:doc#providers)
for available methods
- `Faker` - the [`Faker`](inv:faker:std:doc#fakerclass) class. This was made available to enable consistent data
masking by allowing users to call `Faker.seed()`.

```{tip}
The `fake` object is only available if the plugin specifies `faker` as an additional dependency (through the `singer-sdk` `faker` extra, or directly).
@@ -435,6 +437,46 @@ stream_maps:
```
````

### Masking data with Faker

It is best practice (or even a legal requirement) to mask PII/PHI in lower environments. Stream mappers have access to the `Faker` library, which can be used to generate random data in various forms/formats.

```yaml
stream_maps:
  customers:
    # IMPORTANT: the `fake` variable name will only be available if faker_config is defined
    first_name: fake.first_name() # generates a new random name each time
faker_config:
  # set specific seed
  seed: 0
  # set specific locales
  locale:
  - en_US
  - en_GB
```
Be sure to check out the [`faker` documentation](https://faker.readthedocs.io/en/master/) for all the fake data generation possibilities.

Note that in the example above, `faker` will generate a new random value each time the `first_name()` function is invoked. This means if 3 records have a `first_name` value of `Mike`, then they will each have a different name after being mapped (for example, `Alistair`, `Debra`, `Scooby`). This can actually lead to issues when developing in the lower environments.

Some users require consistent masking (for example, the first name `Mike` is always masked as `Debra`). Consistent masking preserves the relationship between tables and rows, while still hiding the real value. When a random mask is generated every time, relationships between tables/rows are effectively lost, making it impossible to test things like SQL `JOIN`s. This can cause highly unpredictable behavior when running the same code in lower environments vs production.
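To make the join argument concrete, here is a minimal sketch that uses only Python's standard `random` module as a stand-in for Faker. The name pool, the `consistent_mask` helper, and the two sample tables are all hypothetical and exist only for illustration; this is not how the SDK masks data.

```python
import random

# Hypothetical name pool standing in for Faker's first-name provider.
NAMES = ["Alistair", "Debra", "Scooby", "Priya", "Tomas", "Ingrid"]


def consistent_mask(value: str) -> str:
    """Seed a private generator with the original value, then draw a name."""
    rng = random.Random(value)  # same input -> same seed -> same draw
    return rng.choice(NAMES)


# Two hypothetical tables that share the `first_name` value "Mike".
customers = [{"id": 1, "first_name": "Mike"}, {"id": 2, "first_name": "Ana"}]
orders = [{"order_id": 10, "customer_first_name": "Mike"}]

masked_customers = [
    {**c, "first_name": consistent_mask(c["first_name"])} for c in customers
]
masked_orders = [
    {**o, "customer_first_name": consistent_mask(o["customer_first_name"])}
    for o in orders
]

# Joining on the masked column still pairs order 10 with customer 1.
# With a pool this small, two different originals may share a mask,
# so extra matches are possible; the true match is never lost.
joined = [
    (o["order_id"], c["id"])
    for o in masked_orders
    for c in masked_customers
    if o["customer_first_name"] == c["first_name"]
]
```

Had each row been masked with an unseeded random draw instead, the order and its customer would usually end up with different names and the join would find nothing.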

To generate consistent masked values, you must provide the **same seed each time** before invoking the faker function.

```yaml
stream_maps:
  customers:
    # will always generate the same value for the same seed
    first_name: Faker.seed(_['first_name']) or fake.first_name()
faker_config:
  # IMPORTANT: `fake` and `Faker` names are only available if faker_config is defined.
  locale: en_US
```
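Remember, these expressions are evaluated by the [`simpleeval`](https://github.com/danthedeckie/simpleeval) expression library, which only allows a single Python expression. That is the reason for the `or` syntax above: `Faker.seed()` returns `None`, so the `or` falls through to `fake.first_name()`, and both calls run within one expression.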
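The seed-then-generate idiom can be tried outside a stream map. The sketch below is an illustration only: `seed()` and `first_name()` are hypothetical stand-ins built on Python's standard `random` module, not the Faker API, and `NAMES` is a made-up pool.

```python
import random

# Hypothetical name pool; Faker would supply real provider data instead.
NAMES = ["Alistair", "Debra", "Scooby", "Priya", "Tomas", "Ingrid"]


def seed(value: str) -> None:
    """Stand-in for a seeding call: reseeds the shared generator, returns None."""
    random.seed(value)


def first_name() -> str:
    """Stand-in for a generator call: draws from the shared generator."""
    return random.choice(NAMES)


def mask(original: str) -> str:
    # One expression, two steps: seed() runs first and yields None (falsy),
    # so `or` goes on to evaluate first_name() and returns its result.
    return seed(original) or first_name()
```

Calling `mask("Mike")` repeatedly returns the same name each time, because the generator is reseeded with the same value immediately before every draw.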

If you need more advanced masking logic that cannot be written as a single Python expression, consider a custom stream mapper.

### Aliasing a stream using `__alias__`

To alias a stream, simply add the operation `"__alias__": "new_name"` to the stream
3 changes: 3 additions & 0 deletions singer_sdk/mapper.py
@@ -337,7 +337,10 @@ def _eval(
        names["config"] = self.map_config  # Allow map config access within transform

        if self.fake:
            from faker import Faker  # noqa: PLC0415

            names["fake"] = self.fake
            names["Faker"] = Faker

        if property_name and property_name in record:
            # Allow access to original property value if applicable