From 1333278007c8e4daf03f82049de102f041bda878 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edgar=20Ram=C3=ADrez=20Mondrag=C3=B3n?= <16805946+edgarrmondragon@users.noreply.github.com>
Date: Tue, 13 Aug 2024 14:36:37 -0600
Subject: [PATCH] feat(mappers): Stream map expressions now have access to the `Faker` class, rather than just a faker instance (#2598)

* faker library, rather than just faker instance, is now available to mapper expressions (provided a faker config is found).

* added section on using faker for data masking, including how to use the faker library to re-seed

* Update docs/stream_maps.md

* Update docs/stream_maps.md

---------

Co-authored-by: michael_calvo
Co-authored-by: Michael Calvo
---
 docs/stream_maps.md  | 42 ++++++++++++++++++++++++++++++++++++++++++
 singer_sdk/mapper.py |  3 +++
 2 files changed, 45 insertions(+)

diff --git a/docs/stream_maps.md b/docs/stream_maps.md
index e348833cf..9d4a7d8a3 100644
--- a/docs/stream_maps.md
+++ b/docs/stream_maps.md
@@ -249,6 +249,8 @@ can be referenced directly by mapping expressions.
 - `fake` - a [`Faker`](inv:faker:std:doc#index) instance, configurable via `faker_config`
   (see previous example) - see the built-in [standard providers](inv:faker:std:doc#providers)
   for available methods
+- `Faker` - the [`Faker`](inv:faker:std:doc#fakerclass) class. This was made available to enable consistent data
+  masking by allowing users to call `Faker.seed()`.
 
   ```{tip}
   The `fake` object is only available if the plugin specifies `faker` as an additional dependency (through the `singer-sdk` `faker` extra, or directly).
@@ -435,6 +437,46 @@ stream_maps:
 ```
 ````
 
+### Masking data with Faker
+
+It is best practice (or even a legal requirement) to mask PII/PHI in lower environments. Stream mappers have access to the `Faker` library, which can be used to generate random data in various forms/formats.
+
+```yaml
+stream_maps:
+  customers:
+    # IMPORTANT: the `fake` variable name will only be available if faker_config is defined
+    first_name: fake.first_name() # generates a new random name each time
+faker_config:
+  # set specific seed
+  seed: 0
+  # set specific locales
+  locale:
+  - en_US
+  - en_GB
+```
+
+Be sure to check out the [`faker` documentation](https://faker.readthedocs.io/en/master/) for all the fake data generation possibilities.
+
+Note that in the example above, `faker` will generate a new random value each time the `first_name()` function is invoked. This means if 3 records have a `first_name` value of `Mike`, then they will each have a different name after being mapped (for example, `Alistair`, `Debra`, `Scooby`). This can actually lead to issues when developing in the lower environments.
+
+Some users require consistent masking (for example, the first name `Mike` is always masked as `Debra`). Consistent masking preserves the relationship between tables and rows, while still hiding the real value. When a random mask is generated every time, relationships between tables/rows are effectively lost, making it impossible to test things like SQL `JOIN`s. This can cause highly unpredictable behavior when running the same code in lower environments vs production.
+
+To generate consistent masked values, you must provide the **same seed each time** before invoking the faker function.
+
+```yaml
+stream_maps:
+  customers:
+    # will always generate the same value for the same seed
+    first_name: Faker.seed(_['first_name']) or fake.first_name()
+faker_config:
+  # IMPORTANT: `fake` and `Faker` names are only available if faker_config is defined.
+  locale: en_US
+```
+
+Remember, these expressions are evaluated by the [`simpleeval`](https://github.com/danthedeckie/simpleeval) expression library, which only allows a single Python expression (which is the reason for the `or` syntax above).
+
+This means if you require more advanced masking logic, which cannot be defined in a single Python expression, you may need to consider a custom stream mapper.
+
 ### Aliasing a stream using `__alias__`
 
 To alias a stream, simply add the operation `"__alias__": "new_name"` to the stream
diff --git a/singer_sdk/mapper.py b/singer_sdk/mapper.py
index fce1277fb..a2e7bc956 100644
--- a/singer_sdk/mapper.py
+++ b/singer_sdk/mapper.py
@@ -337,7 +337,10 @@ def _eval(
         names["config"] = self.map_config  # Allow map config access within transform
 
         if self.fake:
+            from faker import Faker  # noqa: PLC0415
+
             names["fake"] = self.fake
+            names["Faker"] = Faker
 
         if property_name and property_name in record:
             # Allow access to original property value if applicable
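
Reviewer note, not part of the patch: the `Faker.seed(_['first_name']) or fake.first_name()` expression documented above relies on the seeding call returning `None`, so that `or` falls through to the generator call within a single expression. The sketch below illustrates that reseed-then-generate pattern with the standard-library `random` module as a stand-in, so it runs without the `faker` dependency. `MASK_NAMES` and `mask_first_name` are hypothetical names invented for this illustration, and Faker's actual output will differ.

```python
import random

# Hypothetical pool of replacement names (an assumption for this sketch);
# Faker would draw from its locale data instead.
MASK_NAMES = ["Alistair", "Debra", "Scooby", "Priya", "Tomas", "Wen"]


def mask_first_name(original: str) -> str:
    """Return a masked name that is stable for a given original value."""
    # random.seed() returns None, so `or` evaluates and returns the right-hand
    # side. This mirrors the shape of the stream map expression
    # `Faker.seed(_['first_name']) or fake.first_name()`.
    return random.seed(original) or random.choice(MASK_NAMES)


if __name__ == "__main__":
    # The same input is seeded identically each time, so the mask repeats.
    print(mask_first_name("Mike") == mask_first_name("Mike"))
```

Because the seed is derived from the original value, equal inputs always produce equal masks, which is what keeps joins across tables intact. Distinct inputs can still collide on the same mask, and reseeding the shared generator on every call affects any other code that uses it.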