Unique References #684

Merged
merged 32 commits into from
Jun 8, 2022
Commits
9e9a945
WIP: deep references using SQLAlchemy
Apr 1, 2022
a8431da
WIP: deep references using SQL Lite
Apr 2, 2022
384ef26
Start of support for nicknames
Apr 2, 2022
cbdbdff
Tests of corner cases
Apr 3, 2022
29c9875
Lazy loading
Apr 6, 2022
091f343
Lookup in nicknames
Apr 6, 2022
be8f726
Recreate just_once objects after continuation
Apr 7, 2022
2c91ae9
Dump defaultdict
Apr 7, 2022
ad35da8
One table per sobject and lazy-loaded references to sobjects
Apr 8, 2022
c4982d1
Indexes
Apr 8, 2022
9461f40
Cleanups in memory feature
Apr 8, 2022
2c74746
Docs and tests
Apr 9, 2022
a0d9772
Add nickname_id for looking up by nickname
Apr 13, 2022
f69955e
Start roughing in unique feature
Apr 13, 2022
66751ad
Merge remote-tracking branch 'origin/main' into feature/deep-references
May 21, 2022
95388a9
Add tests and remove comments
May 21, 2022
83a82fe
Minor refactorings
May 30, 2022
48d72b7
Minor cleanups
May 30, 2022
bd45550
Support serializing LazyLoadedObjectReference's.
May 30, 2022
1cff911
Lazy load stuff even Nicknames
May 30, 2022
6c6be64
Fix bug & benchmark
May 31, 2022
997d957
Missing file
May 31, 2022
dfd9dd5
Unique random_references
Jun 2, 2022
ef63376
Merge remote-tracking branch 'origin/main' into feature/unique-refere…
Jun 3, 2022
98d9b04
Fix tests and error messages.
Jun 3, 2022
0384dd0
Merge remote-tracking branch 'origin/main' into feature/unique-refere…
Jun 8, 2022
fe34a8a
Fix pragma
Jun 8, 2022
9d408bf
Add a feature for scoping to parent
Jun 8, 2022
4fe0097
Docs
Jun 8, 2022
d6a8e84
Docs
Jun 8, 2022
cd3508d
Fix minor bugs
Jun 8, 2022
64b30bb
Update terminology
Jun 8, 2022
82 changes: 79 additions & 3 deletions docs/index.md
@@ -596,12 +596,88 @@ The `random_reference` property creates a reference to a random, existing row fr

To create a reference, `random_reference` looks for a row created in the current iteration of the recipe and matching the specified object type or nickname. In the above recipe, each `random_reference` specified in `ownedBy` will point to one of the ten `Owner` objects created in the same iteration. In other words, if you iterate over the recipe multiple times, each `Pet` object is matched with one of the ten `Owner` objects created during the same iteration.

If `random_reference` finds no matches in the current iteration, it looks in previous iterations. This can happen, for example, when you try to create a reference to an object created with the `just_once` flag.
If `random_reference` finds no matches in the current iteration, it looks in previous iterations. This can happen, for example, when you try to create a reference to an object created with the `just_once` flag. Snowfakery cannot currently generate a `random_reference` to a row that will be created in a future iteration of a recipe.

Snowfakery cannot currently generate a `random_reference` to a row that will be created in a future iteration of a recipe.
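
The lookup order described here (current iteration first, then earlier iterations, never future ones) can be sketched in a few lines of Python. This is a simplified illustration, not Snowfakery's implementation: `pick_reference` and its parameters are hypothetical names, and rows are represented by bare integer ids.

```python
import random


def pick_reference(row_ids, first_id_this_iteration, rand):
    """Hypothetical helper: prefer rows from the current iteration."""
    # Rows created in the current iteration have the highest ids.
    current = [i for i in row_ids if i >= first_id_this_iteration]
    # Fall back to rows from earlier iterations (for example `just_once` rows).
    candidates = current or list(row_ids)
    if not candidates:
        # Nothing exists yet; rows from future iterations cannot be referenced.
        raise LookupError("no existing row to reference")
    return rand.choice(candidates)


rand = random.Random(0)
print(pick_reference([1, 2, 3, 4], 3, rand))  # picks 3 or 4
print(pick_reference([1, 2], 3, rand))        # falls back to 1 or 2
```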
#### Unique random references

Performance tip: Tables and nicknames that are referred to by `random_reference` are indexed, which makes them slightly slower to generate than normal. This should seldom be a problem in practice, but if you experience performance problems you could switch to a normal reference to see if that improves things.
`random_reference` has a `unique` parameter that ensures each target row is used only once.

```yaml
- object: Owner
  count: 10
  fields:
    name:
      fake: Name
- object: Pet
  count: 10
  fields:
    ownedBy:
      random_reference:
        to: Owner
        unique: True
```

In the recipe above, Owners and Pets are paired one-to-one in random order. Without `unique`, the distribution would be fully random, so some Owners would tend to end up with multiple Pets and others with none.
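
The difference between the two behaviours can be sketched in plain Python. This is an illustration only, not how Snowfakery implements the feature; `assign_random` and `assign_unique` are hypothetical helper names, and Owners are represented by integer ids.

```python
import random
from collections import Counter


def assign_random(owner_ids, pet_count, rand):
    # Independent draws: the same owner can be picked more than once.
    return [rand.choice(owner_ids) for _ in range(pet_count)]


def assign_unique(owner_ids, pet_count, rand):
    # Draws without replacement: each owner is used at most once.
    if pet_count > len(owner_ids):
        raise ValueError("not enough owners for unique references")
    return rand.sample(owner_ids, pet_count)


rand = random.Random(0)
owners = list(range(1, 11))
plain = assign_random(owners, 10, rand)
unique = assign_unique(owners, 10, rand)
print(Counter(plain).most_common(3))  # owners picked most often, with counts
print(sorted(unique) == owners)       # every owner appears exactly once
```

Asking for more unique references than there are target rows cannot succeed, which is why the sketch raises an error in that case.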

In the case above, the scope of the uniqueness is clearly the Pets. For join tables, such as Salesforce's Campaign Member, the scope is ambiguous and must be specified like this:

```yaml
# examples/salesforce/campaign-member.yml
- object: Campaign
  count: 5
  fields:
    Name: Campaign ${{child_index}}
- object: Contact
  count: 3
  fields:
    FirstName:
      fake: FirstName
    LastName:
      fake: LastName
  friends:
    - object: CampaignMember
      count: 5
      fields:
        ContactId:
          reference: Contact
        CampaignId:
          random_reference:
            to: Campaign
            parent: Contact
            unique: True
```

The `parent` parameter clarifies that the scope of the uniqueness is the local Contact.
Each Contact will have CampaignMembers that point to unique Campaigns, like
this:

```sh
Campaign(id=1, Name=Campaign 0)
Campaign(id=2, Name=Campaign 1)
Campaign(id=3, Name=Campaign 2)
Campaign(id=4, Name=Campaign 3)
Campaign(id=5, Name=Campaign 4)
Contact(id=1, FirstName=Catherine, LastName=Hanna)
CampaignMember(id=1, ContactId=Contact(1), CampaignId=Campaign(2))
CampaignMember(id=2, ContactId=Contact(1), CampaignId=Campaign(5))
CampaignMember(id=3, ContactId=Contact(1), CampaignId=Campaign(3))
CampaignMember(id=4, ContactId=Contact(1), CampaignId=Campaign(4))
CampaignMember(id=5, ContactId=Contact(1), CampaignId=Campaign(1))
Contact(id=2, FirstName=Mary, LastName=Valencia)
CampaignMember(id=6, ContactId=Contact(2), CampaignId=Campaign(1))
CampaignMember(id=7, ContactId=Contact(2), CampaignId=Campaign(4))
CampaignMember(id=8, ContactId=Contact(2), CampaignId=Campaign(5))
CampaignMember(id=9, ContactId=Contact(2), CampaignId=Campaign(2))
CampaignMember(id=10, ContactId=Contact(2), CampaignId=Campaign(3))
Contact(id=3, FirstName=Jake, LastName=Mullen)
CampaignMember(id=11, ContactId=Contact(3), CampaignId=Campaign(1))
CampaignMember(id=12, ContactId=Contact(3), CampaignId=Campaign(4))
CampaignMember(id=13, ContactId=Contact(3), CampaignId=Campaign(3))
CampaignMember(id=14, ContactId=Contact(3), CampaignId=Campaign(5))
CampaignMember(id=15, ContactId=Contact(3), CampaignId=Campaign(2))
```
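
The effect of scoping uniqueness to a parent can be approximated in Python as a fresh draw without replacement for every parent row. This is a sketch under that assumption rather than Snowfakery's actual mechanism; `campaigns_for_contacts` is a hypothetical helper and rows are reduced to `(contact_id, campaign_id)` pairs.

```python
import random


def campaigns_for_contacts(contact_ids, campaign_ids, members_per_contact, rand):
    rows = []
    for contact_id in contact_ids:
        # Uniqueness is tracked per Contact, so every Contact starts
        # with the full set of Campaigns available again.
        picks = rand.sample(campaign_ids, members_per_contact)
        rows.extend((contact_id, campaign_id) for campaign_id in picks)
    return rows


rand = random.Random(1)
rows = campaigns_for_contacts([1, 2, 3], [1, 2, 3, 4, 5], 5, rand)
for contact_id in (1, 2, 3):
    print(contact_id, [campaign for owner, campaign in rows if owner == contact_id])
```

A Campaign can appear once under each Contact, but never twice under the same Contact, which matches the output shown above.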

Performance tip: Tables and nicknames that are referred to by `random_reference` are indexed, which makes them slightly slower to generate than normal. This should seldom be a problem in practice, but if you experience performance problems you could switch to a normal reference to see if that improves things.
### `fake`

The `fake` function generates fake data. This function is defined further in the [Fake Data Tutorial](fakedata.md).
22 changes: 22 additions & 0 deletions examples/salesforce/campaign-member.yml
@@ -0,0 +1,22 @@
- object: Campaign
  count: 5
  fields:
    Name: Campaign ${{child_index}}
- object: Contact
  count: 3
  fields:
    FirstName:
      fake: FirstName
    LastName:
      fake: LastName
  friends:
    - object: CampaignMember
      count: 5
      fields:
        ContactId:
          reference: Contact
        CampaignId:
          random_reference:
            to: Campaign
            parent: Contact
            unique: True
3 changes: 2 additions & 1 deletion snowfakery/data_generator_runtime.py
@@ -505,7 +505,7 @@ class RuntimeContext:
current_template = None
local_vars = None
unique_context_identifier = None
recalculate_every_time = False
recalculate_every_time = False # by default, data is recalculated constantly

def __init__(
self,
@@ -521,6 +521,7 @@ def __init__(
self.parent = parent_context
if self.parent:
self._plugin_context_vars = self.parent._plugin_context_vars.new_child()
# are we in a re-calculate everything context?
self.recalculate_every_time = parent_context.recalculate_every_time
else:
self._plugin_context_vars = ChainMap()
66 changes: 62 additions & 4 deletions snowfakery/row_history.py
@@ -6,8 +6,10 @@
from random import randint

from snowfakery import data_gen_exceptions as exc
from snowfakery.object_rows import LazyLoadedObjectReference
from snowfakery.object_rows import LazyLoadedObjectReference, ObjectReference, ObjectRow
from snowfakery.plugins import PluginResultIterator
from snowfakery.utils.pickle import restricted_dumps, restricted_loads
from snowfakery.utils.randomized_range import UpdatableRandomRange


class RowHistory:
@@ -64,7 +66,7 @@ def save_row(self, tablename: str, nickname: T.Optional[str], row: dict):
(row_id, nickname, nickname_id, data),
)

def random_row_reference(self, name: str, scope: str, unique: bool):
def random_row_reference(self, name: str, scope: str, randint: callable):
"""Find a random row and load it"""
if scope not in ("prior-and-current-iterations", "current-iteration"):
raise exc.DataGenError(
@@ -95,8 +97,6 @@ def random_row_reference(self, name: str, scope: str, unique: bool):
self.already_warned = True
min_id = 1
elif nickname:
# nickname counters are reset every loop, so 1 is the right choice
# OR they are just_once in which case
min_id = self.local_counters.get(nickname, 0) + 1
else:
min_id = self.local_counters.get(tablename, 0) + 1
@@ -161,3 +161,61 @@ def _make_history_table(conn, tablename):
c.execute(
f'CREATE UNIQUE INDEX "{tablename}_nickname_id" ON "{tablename}" (nickname, nickname_id);'
)


class RandomReferenceContext(PluginResultIterator):
    # initialize the object's state.
    rng = None

    def __init__(
        self,
        row_history: RowHistory,
        to: str,
        scope: str = "current-iteration",
        unique: bool = False,
    ):
        self.row_history = row_history
        self.to = to
        self.scope = scope
        self.unique = unique
        if unique:
            self.random_func = self.unique_random
        else:
            self.random_func = randint

    def next(self) -> T.Union[ObjectReference, ObjectRow]:
        try:
            return self.row_history.random_row_reference(
                self.to, self.scope, self.random_func
            )
        except StopIteration as e:
            if self.random_func == self.unique_random:
                raise exc.DataGenError(
                    f"Cannot find an unused `{self.to}` to link to"
                ) from e
            else:  # pragma: no cover
                raise e

    def unique_random(self, a, b):
        """Goal: use a uniquifying RNG until all of its values have been
        used up, then make a new one, with higher values.

        e.g. random_range(1,5) then random_range(5, 10)

        The parent might call it like:
        unique_random(1,2) -> random_range(1,3) -> 2
        unique_random(1,4) -> random_range(1,3) -> 1
        unique_random(1,6) -> random_range(3,7) -> 5 # reset
        unique_random(1,8) -> random_range(3,7) -> 3
        unique_random(1,10) -> random_range(3,7) -> 4
        unique_random(1,12) -> random_range(3,7) -> 6
        unique_random(1,14) -> random_range(7,14) -> 13 # reset
        ...
        """
        b += 1  # randint uses top-inclusive semantics,
        # random_range uses top-exclusive semantics
        if self.rng is None:
            self.rng = UpdatableRandomRange(a, b)
        else:
            self.rng.set_new_range(a, b)
        return next(self.rng)
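
The docstring of `unique_random` describes a range that hands out each value once and, when exhausted, continues with only the values added since. The sketch below illustrates that idea under stated assumptions: `SketchUpdatableRange` is a hypothetical stand-in written for this example, and the real `UpdatableRandomRange` in `snowfakery.utils.randomized_range` may work differently.

```python
import random


class SketchUpdatableRange:
    """Hypothetical stand-in for illustration; not Snowfakery's class."""

    def __init__(self, start, stop, seed=None):
        self._rand = random.Random(seed)
        self._stop = stop          # latest known upper bound (exclusive)
        self._window_stop = start  # values below this are already in a window
        self._pending = []         # shuffled values not yet returned

    def set_new_range(self, start, stop):
        # Only the upper bound grows; `start` is accepted to mirror the call site.
        self._stop = max(self._stop, stop)

    def __iter__(self):
        return self

    def __next__(self):
        if not self._pending:
            if self._window_stop >= self._stop:
                raise StopIteration  # every value in the range has been used
            # Open a new window covering only the values added since the last one.
            window = list(range(self._window_stop, self._stop))
            self._rand.shuffle(window)
            self._pending = window
            self._window_stop = self._stop
        return self._pending.pop()


rng = SketchUpdatableRange(1, 3, seed=0)
first = [next(rng), next(rng)]          # 1 and 2 in some order
rng.set_new_range(1, 7)
second = [next(rng) for _ in range(4)]  # 3, 4, 5 and 6 in some order
print(sorted(first), sorted(second))    # prints [1, 2] [3, 4, 5, 6]
```

Once the range is exhausted and has not grown, `next` raises `StopIteration`, which is the condition `RandomReferenceContext.next` converts into a `DataGenError`.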
36 changes: 17 additions & 19 deletions snowfakery/template_funcs.py
@@ -1,24 +1,24 @@
import sys
import random
from functools import lru_cache
import sys
from ast import literal_eval
from datetime import date, datetime
from functools import lru_cache
from typing import Any, List, Tuple, Union

import dateutil.parser
from dateutil.relativedelta import relativedelta
from ast import literal_eval

from typing import Union, List, Tuple, Any

from faker import Faker
from faker.providers.date_time import Provider as DateProvider

from .data_gen_exceptions import DataGenError

import snowfakery.data_generator_runtime # noqa
from snowfakery.plugins import SnowfakeryPlugin, PluginContext, lazy
from snowfakery.object_rows import ObjectReference, ObjectRow
from snowfakery.utils.template_utils import StringGenerator
from snowfakery.standard_plugins.UniqueId import UniqueId
from snowfakery.fakedata.fake_data_generator import UTCAsRelDelta, _normalize_timezone
from snowfakery.object_rows import ObjectReference
from snowfakery.plugins import PluginContext, SnowfakeryPlugin, lazy, memorable
from snowfakery.row_history import RandomReferenceContext
from snowfakery.standard_plugins.UniqueId import UniqueId
from snowfakery.utils.template_utils import StringGenerator

from .data_gen_exceptions import DataGenError

FieldDefinition = "snowfakery.data_generator_runtime_object_model.FieldDefinition"

@@ -256,13 +256,15 @@ def choice(
probability = parse_weight_str(self.context, probability)
return probability or when, pick

@memorable
def random_reference(
self,
to: str,
*,
parent: str = None,
scope: str = "current-iteration",
unique: bool = False,
) -> Union[ObjectReference, ObjectRow]:
) -> "RandomReferenceContext":
"""Select a random, already-created row from 'sobject'

- object: Owner
@@ -278,12 +280,8 @@ def random_reference(

See the docs for more info.
"""
if unique:
# next feature to implement
raise NotImplementedError()

return self.context.interpreter.row_history.random_row_reference(
to, scope, unique
return RandomReferenceContext(
self.context.interpreter.row_history, to, scope, unique
)

@lazy
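
With this change `random_reference` returns a context object instead of a row, which appears to let state such as "which rows are already used" persist between calls. The following self-contained sketch shows that pattern in isolation. `SketchReferenceContext` is a hypothetical class for illustration and does not use Snowfakery's `PluginResultIterator` or `@memorable` machinery.

```python
import random


class SketchReferenceContext:
    """Hypothetical illustration; not Snowfakery's RandomReferenceContext."""

    def __init__(self, row_ids, unique=False, seed=None):
        self._rand = random.Random(seed)
        self._row_ids = list(row_ids)
        self._unused = list(row_ids)
        self._rand.shuffle(self._unused)
        # Choose the picking strategy once, when the context is created.
        self._pick = self._pick_unique if unique else self._pick_any

    def _pick_any(self):
        # Stateless: any row may be returned any number of times.
        return self._rand.choice(self._row_ids)

    def _pick_unique(self):
        # Stateful: each row is returned at most once.
        if not self._unused:
            raise LookupError("Cannot find an unused row to link to")
        return self._unused.pop()

    def next(self):
        return self._pick()


ctx = SketchReferenceContext([1, 2, 3], unique=True, seed=0)
picks = [ctx.next() for _ in range(3)]
print(sorted(picks))  # prints [1, 2, 3]
```

A plain function call that returned a row directly would have nowhere to remember earlier picks, which is one plausible reason for the switch to a returned context object.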