Unique References #684

Merged: 32 commits, Jun 8, 2022
Commits
9e9a945 WIP: deep references using SQLAlchemy (Apr 1, 2022)
a8431da WIP: deep references using SQL Lite (Apr 2, 2022)
384ef26 Start of support for nicknames (Apr 2, 2022)
cbdbdff Tests of corner cases (Apr 3, 2022)
29c9875 Lazy loading (Apr 6, 2022)
091f343 Lookup in nicknames (Apr 6, 2022)
be8f726 Recreate just_once objects after continuation (Apr 7, 2022)
2c91ae9 Dump defaultdict (Apr 7, 2022)
ad35da8 One table per sobject and lazy-loaded references to sobjects (Apr 8, 2022)
c4982d1 Indexes (Apr 8, 2022)
9461f40 Cleanups in memory feature (Apr 8, 2022)
2c74746 Docs and tests (Apr 9, 2022)
a0d9772 Add nickname_id for looking up by nickname (Apr 13, 2022)
f69955e Start roughing in unique feature (Apr 13, 2022)
66751ad Merge remote-tracking branch 'origin/main' into feature/deep-references (May 21, 2022)
95388a9 Add tests and remove comments (May 21, 2022)
83a82fe Minor refactorings (May 30, 2022)
48d72b7 Minor cleanups (May 30, 2022)
bd45550 Support serializing LazyLoadedObjectReference's. (May 30, 2022)
1cff911 Lazy load stuff even Nicknames (May 30, 2022)
6c6be64 Fix bug & benchmark (May 31, 2022)
997d957 Missing file (May 31, 2022)
dfd9dd5 Unique random_references (Jun 2, 2022)
ef63376 Merge remote-tracking branch 'origin/main' into feature/unique-refere… (Jun 3, 2022)
98d9b04 Fix tests and error messages. (Jun 3, 2022)
0384dd0 Merge remote-tracking branch 'origin/main' into feature/unique-refere… (Jun 8, 2022)
fe34a8a Fix pragma (Jun 8, 2022)
9d408bf Add a feature for scoping to parent (Jun 8, 2022)
4fe0097 Docs (Jun 8, 2022)
d6a8e84 Docs (Jun 8, 2022)
cd3508d Fix minor bugs (Jun 8, 2022)
64b30bb Update terminology (Jun 8, 2022)
61 changes: 58 additions & 3 deletions docs/index.md

@@ -596,12 +596,67 @@ The `random_reference` property creates a reference to a random, existing row fr

To create a reference, `random_reference` looks for a row created in the current iteration of the recipe and matching the specified object type or nickname. In the above recipe, each `random_reference` specified in `ownedBy` will point to one of the ten `Owner` objects created in the same iteration. In other words, if you iterate over the recipe multiple times, each `Pet` object will be matched with one of the ten `Owner` objects created during the same iteration.

If `random_reference` finds no matches in the current iteration, it looks in previous iterations. This can happen, for example, when you try to create a reference to an object created with the `just_once` flag. Snowfakery cannot currently generate a `random_reference` to a row that will be created in a future iteration of a recipe.
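
For example, in a minimal sketch like the following (hypothetical object and field names; `just_once` is the real flag), the single `Admin` row is created only in the first iteration, so `random_reference` calls in later iterations fall back to it:

```yaml
# sketch: Admin exists only in iteration 1, so later iterations
# resolve the reference via the prior-iteration fallback
- object: Admin
  just_once: True
  fields:
    name:
      fake: Name
- object: Pet
  count: 3
  fields:
    caretaker:
      random_reference:
        to: Admin
```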

#### Unique random references

`random_reference` has a `unique` parameter which ensures that each target row is used only once.

```yaml
- object: Owner
  count: 10
  fields:
    name:
      fake: Name
- object: Pet
  count: 10
  fields:
    ownedBy:
      random_reference:
        to: Owner
        unique: True
```

In the case above, the relationship between Owners and Pets will be one-to-one, in a random order, rather than a totally random distribution, which would tend to give some Owners multiple Pets.
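
For instance, with `count: 3` on both objects, one run might produce output shaped like this (names and pairings are random and will differ from run to run; the point is that each Owner appears exactly once):

```json
Owner(id=1, name=Ana Reyes)
Owner(id=2, name=Li Wei)
Owner(id=3, name=Sam Cole)
Pet(id=1, ownedBy=Owner(2))
Pet(id=2, ownedBy=Owner(3))
Pet(id=3, ownedBy=Owner(1))
```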

In the case above, it is clear that the scope of the uniqueness should be the Pets. But for join tables, such as Salesforce's Campaign Member, the scope is ambiguous and must be specified explicitly:

```yaml
# examples/salesforce/campaign-member.yml
- object: Campaign
  count: 5
  fields:
    Name: Campaign ${{child_index}}
- object: Contact
  count: 3
  fields:
    FirstName:
      fake: FirstName
    LastName:
      fake: LastName
  friends:
    - object: CampaignMember
      count: 5
      fields:
        ContactId:
          reference: Contact
        CampaignId:
          random_reference:
            to: Campaign
            parent: Contact
            unique: True
```

The `parent` parameter clarifies that the scope of the uniqueness is the local Contact.
Each of the Contacts will have CampaignMembers that point to unique campaigns, like
this:

```json
Campaign(id=1, Name=Campaign 0)
Campaign(id=2, Name=Campaign 1)
Campaign(id=3, Name=Campaign 2)
Campaign(id=4, Name=Campaign 3)
Campaign(id=5, Name=Campaign 4)
Contact(id=1, FirstName=Catherine, LastName=Hanna)
CampaignMember(id=1, ContactId=Contact(1), CampaignId=Campaign(2))
CampaignMember(id=2, ContactId=Contact(1), CampaignId=Campaign(5))
CampaignMember(id=3, ContactId=Contact(1), CampaignId=Campaign(3))
CampaignMember(id=4, ContactId=Contact(1), CampaignId=Campaign(4))
CampaignMember(id=5, ContactId=Contact(1), CampaignId=Campaign(1))
Contact(id=2, FirstName=Mary, LastName=Valencia)
CampaignMember(id=6, ContactId=Contact(2), CampaignId=Campaign(1))
CampaignMember(id=7, ContactId=Contact(2), CampaignId=Campaign(4))
CampaignMember(id=8, ContactId=Contact(2), CampaignId=Campaign(5))
CampaignMember(id=9, ContactId=Contact(2), CampaignId=Campaign(2))
CampaignMember(id=10, ContactId=Contact(2), CampaignId=Campaign(3))
Contact(id=3, FirstName=Jake, LastName=Mullen)
CampaignMember(id=11, ContactId=Contact(3), CampaignId=Campaign(1))
CampaignMember(id=12, ContactId=Contact(3), CampaignId=Campaign(4))
CampaignMember(id=13, ContactId=Contact(3), CampaignId=Campaign(3))
CampaignMember(id=14, ContactId=Contact(3), CampaignId=Campaign(5))
CampaignMember(id=15, ContactId=Contact(3), CampaignId=Campaign(2))
```

Performance tip: Tables and nicknames that are referred to by `random_reference` are indexed, which makes them slightly slower to generate than normal. This should seldom be a problem in practice, but if you experience performance problems you could switch to a normal reference to see if that improves things.
### `fake`

The `fake` function generates fake data. This function is described in more detail in the [Fake Data Tutorial](fakedata.md).
22 changes: 22 additions & 0 deletions examples/salesforce/campaign-member.yml

@@ -0,0 +1,22 @@
- object: Campaign
  count: 5
  fields:
    Name: Campaign ${{child_index}}
- object: Contact
  count: 3
  fields:
    FirstName:
      fake: FirstName
    LastName:
      fake: LastName
  friends:
    - object: CampaignMember
      count: 5
      fields:
        ContactId:
          reference: Contact
        CampaignId:
          random_reference:
            to: Campaign
            parent: Contact
            unique: True
3 changes: 2 additions & 1 deletion snowfakery/data_generator_runtime.py

@@ -505,7 +505,7 @@ class RuntimeContext:
    current_template = None
    local_vars = None
    unique_context_identifier = None
    recalculate_every_time = False  # by default, data is recalculated constantly

    def __init__(
        self,
@@ -521,6 +521,7 @@ def __init__(
        self.parent = parent_context
        if self.parent:
            self._plugin_context_vars = self.parent._plugin_context_vars.new_child()
            # are we in a re-calculate everything context?
            self.recalculate_every_time = parent_context.recalculate_every_time
        else:
            self._plugin_context_vars = ChainMap()
66 changes: 62 additions & 4 deletions snowfakery/row_history.py

@@ -6,8 +6,10 @@
from random import randint

from snowfakery import data_gen_exceptions as exc
from snowfakery.object_rows import LazyLoadedObjectReference, ObjectReference, ObjectRow
from snowfakery.plugins import PluginResultIterator
from snowfakery.utils.pickle import restricted_dumps, restricted_loads
from snowfakery.utils.randomized_range import UpdatableRandomRange


class RowHistory:
@@ -64,7 +66,7 @@ def save_row(self, tablename: str, nickname: T.Optional[str], row: dict):
            (row_id, nickname, nickname_id, data),
        )

    def random_row_reference(self, name: str, scope: str, randint: callable):
        """Find a random row and load it"""
        if scope not in ("prior-and-current-iterations", "current-iteration"):
            raise exc.DataGenError(
@@ -95,8 +97,6 @@ def random_row_reference(self, name: str, scope: str, unique: bool):
                self.already_warned = True
            min_id = 1
        elif nickname:
            min_id = self.local_counters.get(nickname, 0) + 1
        else:
            min_id = self.local_counters.get(tablename, 0) + 1
@@ -161,3 +161,61 @@ def _make_history_table(conn, tablename):
    c.execute(
        f'CREATE UNIQUE INDEX "{tablename}_nickname_id" ON "{tablename}" (nickname, nickname_id);'
    )


class RandomReferenceContext(PluginResultIterator):
    # lazily created uniquifying RNG; see unique_random below
    rng = None

    def __init__(
        self,
        row_history: RowHistory,
        to: str,
        scope: str = "current-iteration",
        unique: bool = False,
    ):
        self.row_history = row_history
        self.to = to
        self.scope = scope
        self.unique = unique
        if unique:
            self.random_func = self.unique_random
        else:
            self.random_func = randint

    def next(self) -> T.Union[ObjectReference, ObjectRow]:
        try:
            return self.row_history.random_row_reference(
                self.to, self.scope, self.random_func
            )
        except StopIteration as e:
            if self.random_func == self.unique_random:
                raise exc.DataGenError(
                    f"Cannot find an unused `{self.to}` to link to"
                ) from e
            else:  # pragma: no cover
                raise e

    def unique_random(self, a, b):
        """Goal: use a uniquifying RNG until all of its values have been
        used up, then make a new one with higher values.

        e.g. random_range(1, 5) then random_range(5, 10)

        The parent might call it like:

        unique_random(1, 2)  -> random_range(1, 3)  -> 2
        unique_random(1, 4)  -> random_range(1, 3)  -> 1
        unique_random(1, 6)  -> random_range(3, 7)  -> 5   # reset
        unique_random(1, 8)  -> random_range(3, 7)  -> 3
        unique_random(1, 10) -> random_range(3, 7)  -> 4
        unique_random(1, 12) -> random_range(3, 7)  -> 6
        unique_random(1, 14) -> random_range(7, 14) -> 13  # reset
        ...
        """
        b += 1  # randint uses top-inclusive semantics,
        # random_range uses top-exclusive semantics
        if self.rng is None:
            self.rng = UpdatableRandomRange(a, b)
        else:
            self.rng.set_new_range(a, b)
        return next(self.rng)
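
To make the growing-window behavior concrete, here is a standalone sketch of the same calling pattern (an illustration, not the class above; it assumes only that `snowfakery.utils.randomized_range` is importable as added in this PR):

```python
from snowfakery.utils.randomized_range import UpdatableRandomRange

rng = None

def unique_random(a: int, b: int) -> int:
    # same body as RandomReferenceContext.unique_random, at module level
    global rng
    b += 1  # randint is top-inclusive; random_range is top-exclusive
    if rng is None:
        rng = UpdatableRandomRange(a, b)
    else:
        rng.set_new_range(a, b)
    return next(rng)

# the table grows by two rows between references, as in the docstring trace
ids = [unique_random(1, top) for top in (2, 4, 6)]
# e.g. a later iteration: the bottom jumps past the old top
ids.append(unique_random(7, 14))
assert len(set(ids)) == len(ids)  # every id is handed out at most once
```
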
36 changes: 17 additions & 19 deletions snowfakery/template_funcs.py

@@ -1,24 +1,24 @@
import random
import sys
from ast import literal_eval
from datetime import date, datetime
from functools import lru_cache
from typing import Any, List, Tuple, Union

import dateutil.parser
from dateutil.relativedelta import relativedelta

from faker import Faker
from faker.providers.date_time import Provider as DateProvider

import snowfakery.data_generator_runtime  # noqa
from snowfakery.fakedata.fake_data_generator import UTCAsRelDelta, _normalize_timezone
from snowfakery.object_rows import ObjectReference
from snowfakery.plugins import PluginContext, SnowfakeryPlugin, lazy, memorable
from snowfakery.row_history import RandomReferenceContext
from snowfakery.standard_plugins.UniqueId import UniqueId
from snowfakery.utils.template_utils import StringGenerator

from .data_gen_exceptions import DataGenError

FieldDefinition = "snowfakery.data_generator_runtime_object_model.FieldDefinition"

@@ -256,13 +256,15 @@ def choice(
            probability = parse_weight_str(self.context, probability)
        return probability or when, pick

    @memorable
    def random_reference(
        self,
        to: str,
        *,
        parent: str = None,
        scope: str = "current-iteration",
        unique: bool = False,
    ) -> "RandomReferenceContext":
        """Select a random, already-created row from 'sobject'

        - object: Owner

@@ -278,12 +280,8 @@

        See the docs for more info.
        """
        return RandomReferenceContext(
            self.context.interpreter.row_history, to, scope, unique
        )

    @lazy
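
For reference, a hypothetical recipe fragment exercising the `scope` parameter accepted here (the two legal values, per the validation in `row_history.py`, are `current-iteration`, the default, and `prior-and-current-iterations`):

```yaml
- object: Pet
  fields:
    ownedBy:
      random_reference:
        to: Owner
        scope: prior-and-current-iterations
```
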
118 changes: 118 additions & 0 deletions snowfakery/utils/randomized_range.py

@@ -0,0 +1,118 @@
import typing as T
import random
import math


class UpdatableRandomRange:
    def __init__(self, start: int, stop: int = None):
        assert stop > start
        self.start = start
        self._set_new_range_immediately(start, stop)

    def set_new_top(self, new_top: int):
        # do not replace the RNG until the old one is exhausted
        assert new_top >= self.cur_stop
        self.cur_stop = new_top

    def set_new_range(self, new_bottom: int, new_top: int):
        """Update the range subject to constraints

        There are two modes:

        If you update the range by changing only the top value,
        the generator will finish generating the first list before
        expanding its scope.

        So if you configured it with range(0, 10) and then
        range(0, 20) you would get

        shuffle(list(range(0, 10))) + shuffle(list(range(10, 20)))

        Not:

        shuffle(list(range(0, 10)) + list(range(10, 20)))

        If you update the range by changing both values, the previous
        generator is just discarded, because you presumably don't
        want those values anymore. The new bottom must be higher
        than the old top. This preserves the rule that no value is
        ever produced twice.
        """
        if new_bottom == self.start:
            self.set_new_top(new_top)
        else:
            assert new_bottom >= self.orig_stop, (new_bottom, self.orig_stop)
            self._set_new_range_immediately(new_bottom, new_top)

    def _set_new_range_immediately(self, new_bottom: int, new_top: int):
        assert new_top > new_bottom
Review thread on this line:

Contributor: "I don't want to get caught up in semantics, but 'lower limit' and 'upper limit' are the mathematical terms. Should we stay true to this here?"

Contributor Author: "@jofsky Changed in 64b30bb"
        self.start = new_bottom
        self.orig_stop = self.cur_stop = new_top
        self.num_generator = random_range(self.start, self.orig_stop)

    def __iter__(self):
        return self

    def __next__(self):
        rv = next(self.num_generator, None)

        if rv is not None:
            return rv

        if self.cur_stop <= self.orig_stop:
            raise StopIteration()

        self.start = self.orig_stop
        self.num_generator = random_range(self.start, self.cur_stop)
        self.orig_stop = self.cur_stop
        return next(self.num_generator)


def random_range(start: int, stop: int) -> T.Generator[int, None, None]:
    """
    Return a randomized "range" using a Linear Congruential Generator
    to produce the number sequence. Parameters are the same as for
    python builtin "range".

    Memory -- storage for 8 integers, regardless of parameters.
    Compute -- at most 2*"maximum" steps required to generate sequence.

    Based on https://stackoverflow.com/a/53551417/113477

    Set default values the same way "range" does.
    """
    step = 1  # step is hard-coded to "1" because it seemed to be buggy
    # and not important for our use-case

    # Use a mapping to convert a standard range into the desired range.
    def mapping(i):
        return (i * step) + start

    # Compute the number of numbers in this range.
    maximum = (stop - start) // step

    # Seed range with a random integer.
    value = random.randint(0, maximum)
    #
    # Construct an offset, multiplier, and modulus for a linear
    # congruential generator. These generators are cyclic and
    # non-repeating when they maintain the properties:
    #
    # 1) "modulus" and "offset" are relatively prime.
    # 2) ["multiplier" - 1] is divisible by all prime factors of "modulus".
    # 3) ["multiplier" - 1] is divisible by 4 if "modulus" is divisible by 4.
    #
    offset = random.randint(0, maximum) * 2 + 1  # Pick a random odd-valued offset.
    multiplier = (
        4 * (maximum // 4) + 1
    )  # Pick a multiplier 1 greater than a multiple of 4.
    modulus = int(
        2 ** math.ceil(math.log2(maximum))
    )  # Pick a modulus just big enough to generate all numbers (power of 2).

    # Track how many random numbers have been returned.
    found = 0
    while found < maximum:
        # If this is a valid value, yield it in generator fashion.
        if value < maximum:
            found += 1
            yield mapping(value)
        # Calculate the next value in the sequence.
        value = (value * multiplier + offset) % modulus
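
A quick sanity check of the coverage property, as a sketch (assuming the module is importable as `snowfakery.utils.randomized_range`): the generator emits each value in its range exactly once, and raising the top never repeats an old value.

```python
from snowfakery.utils.randomized_range import UpdatableRandomRange, random_range

# random_range is a permutation of the underlying range
values = list(random_range(0, 10))
assert sorted(values) == list(range(10))

# raising the top finishes the current window before widening it
rng = UpdatableRandomRange(1, 6)
first = [next(rng) for _ in range(5)]  # 1..5 in random order
rng.set_new_range(1, 11)  # same bottom, higher top
more = [next(rng) for _ in range(5)]  # 6..10 in random order
assert sorted(first + more) == list(range(1, 11))
```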