Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use typed choice sequence in database #4241

Merged
merged 5 commits into from
Jan 16, 2025
Merged

Conversation

tybug
Copy link
Member

@tybug tybug commented Jan 12, 2025

The database now stores entries as (sort_key_ir(data.ir_nodes), data.choices), letting us sort entries by shrink ordering. If ir_from_bytes deserialization fails, return None. Caller is responsible for handling this case (usually by clearing the bad entry from the db).

I think it's worth storing the choice complexity in the db, for better sorting. The case of interest is when serialization is large but the complexity is low. This happens mainly via min_size= or shrink_towards. I'm not certain how big of a problem this is - probably most issues require storing some (small + high complexity) or (large + low complexity) value, then changing the test case to add or remove a min_size, changing the relative complexity baseline. I think the worst this would manifest as is us repeatedly trying clearly-suboptimal examples from the database, or rejecting possible-shrinks from the secondary corpus as "too complex" when they aren't. We'd still like to avoid these though.

Also bundles c0f8979, which is an independent cleanup commit.

@tybug
Copy link
Member Author

tybug commented Jan 13, 2025

I've addressed a readthedocs brownout-soon-to-be-blackout (ci, blog) here as well since it was so simple. If this pull doesn't get merged next then we can cherrypick that over to whatever does.

@Zac-HD
Copy link
Member

Zac-HD commented Jan 14, 2025

I think it's worth storing the choice complexity in the db, for better sorting. The case of interest is when serialization is large but the complexity is low. This happens mainly via min_size= or shrink_towards. I'm not certain how big of a problem this is - probably most issues require storing some (small + high complexity) or (large + low complexity) value, then changing the test case to add or remove a min_size, changing the relative complexity baseline. I think the worst this would manifest as is us repeatedly trying clearly-suboptimal examples from the database, or rejecting possible-shrinks from the secondary corpus as "too complex" when they aren't. We'd still like to avoid these though.

I'm not convinced that storing choice complexity this is worth the complication to our code and almost doubling the size of database entries - shortlex over our serialization format is already usually pretty close to shrink ordering in reasonable cases. If we don't have some reason to think it helps a lot, I'd strongly prefer to drop it.

PR otherwise looks ready to merge 👍

@tybug
Copy link
Member Author

tybug commented Jan 14, 2025

ok, I'm good with that! I think there were a few db test failures when I tried, but they were in the ~slightly bad category, not the ~moderately bad category. We can come back to this pull if we see bad regressions as a result.

hypothesis-python/tests/conjecture/test_ir.py Outdated Show resolved Hide resolved
Comment on lines +99 to +100
def shortlex(s):
return (len(s), s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems ridiculous that we didn't have this defined anywhere before, but apparently not!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not by this name, but this is really just sort_key renamed!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we already document that the database might be lost when you update Hypothesis, it feels to me like this is spending too much time explaining what we mean; I'd rather give a relatively brief description like the first paragraph and then focus instead on what users should do (ie: do not rely on the database across versions, consider using @example(), ensure that any environments with a shared DB (eg CI and local) use the same version where possible.)

@tybug tybug enabled auto-merge January 16, 2025 05:23
@tybug tybug merged commit cca3c71 into HypothesisWorks:master Jan 16, 2025
47 checks passed
@tybug tybug deleted the db-choices branch January 16, 2025 05:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants