Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

migrate processors from common assets #931

Merged
merged 29 commits into from
Nov 13, 2024
Merged

migrate processors from common assets #931

merged 29 commits into from
Nov 13, 2024

Conversation

jsnoble
Copy link
Member

@jsnoble jsnoble commented Oct 24, 2024

  • date_guard
  • count_unique
  • copy_metadata_field
  • filter
  • filter_by_required_fields
  • filter_by_unknown_fields
  • sample_exact
  • sample_random
  • json_parse
  • remove_empty_fields
  • set_field_conditional

@godber
Copy link
Member

godber commented Oct 29, 2024

Here are some suggestions for renaming drop_doc. Given that the existing algorithm is probabilistic, we probably ought to convey that to the user. So I suggest we begin implementing a number of sample_FOO processors. Where the existing dropdoc processor is sample_random with the property probability, and we describe probability as the chance between 1 and 100 of the record being included in the final result set.

Edit: I think we should not include the shuffle option in this processor.

@godber
Copy link
Member

godber commented Oct 29, 2024

Then we can implement sample_exact, which would have a percentage property, which we would describe as the percentage of records to be kept.

We would add an index to each record, shuffle the records, slice the result up to the percentage, then return that set, sorted by the indices.

The difference between sample_random and sample_exact is that random won't return a fixed percentage and sample_exact will.

Edit: renamed sample_shuffle to sample_exact.

@godber
Copy link
Member

godber commented Oct 29, 2024

Maybe:

  • keep_probability, probability_keep
  • keep_percent, percent_keep

@godber
Copy link
Member

godber commented Nov 4, 2024

@jsnoble I talked with Kimbro about the naming of the sample_random, and that seems OK. He suggested we name sample_shuffle to sample_exact to better match the users point of view rather than being implementation oriented.

We can file an issue to implement sample_exact in the future.

@godber
Copy link
Member

godber commented Nov 5, 2024

You didn't add anything in the docs/operations folder.

asset/src/count_unique/schema.ts Outdated Show resolved Hide resolved
asset/src/count_unique/schema.ts Outdated Show resolved Hide resolved
asset/src/date_guard/processor.ts Outdated Show resolved Hide resolved
asset/src/date_guard/schema.ts Outdated Show resolved Hide resolved
asset/src/date_guard/schema.ts Outdated Show resolved Hide resolved
asset/src/set_field_conditional/schema.ts Outdated Show resolved Hide resolved
asset/src/set_field_conditional/schema.ts Outdated Show resolved Hide resolved
asset/src/set_field_conditional/schema.ts Outdated Show resolved Hide resolved
asset/src/set_field_conditional/processor.ts Outdated Show resolved Hide resolved
test/remove_empty_fields/processor-spec.ts Outdated Show resolved Hide resolved
Copy link
Member

@ciorg ciorg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ciorg ciorg merged commit 6a683b0 into master Nov 13, 2024
7 checks passed
@ciorg ciorg deleted the merge-common-asset branch November 13, 2024 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants