-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Docs: Polars GroupBy #1836
Docs: Polars GroupBy #1836
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great structure! Some formatting comments, and a couple places where the discussion could be expanded; Also check out the CI failures.
"- Multiple Variable Groupby\n", | ||
"- Filtering\n", | ||
"\n", | ||
"For each method, we will compare the actual values to the computed differentially private values to demonstrate utility. The [documentation](https://docs.opendp.org/en/nightly/api/python/opendp.polars.html#module-opendp.polars) provides more information about the methods. We will use the [sample data](https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip) from the Labour Force Survey in France. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use relative links rather than pointing to nightly. If we were to change the location of the polars module, we want to get a CI warning on that PR, rather than waiting till the next day.
(This will be easier to fix with a local build. This could be in my court?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chuck you said you'll finish up the last bit here!
- For #1722 This should be merged first: #1812 #1834 #1836 --------- Co-authored-by: Gurman Dhaliwal <[email protected]> Co-authored-by: gurman-dhaliwal <[email protected]> Co-authored-by: Chuck McCallum <[email protected]> Co-authored-by: Gurman Dhaliwal <[email protected]> Co-authored-by: Chuck McCallum <[email protected]>
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In this section, you will learn how to compute differentially private statistics while applying key data manipulation techniques such as: \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add one-liners for each setting, explaining what it means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the compositor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its all in another introduction, that chuck will link here later
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Filtering can also be viewed as a type of partitioning. If the result of filtering a smaller subset, it can lead to an increase in noise and decrease in utility. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last sentence doesn't make much sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about this?
Similar to partitioning, if the result of filtering is a smaller subset, it can lead to an increase in noise and a decrease in utility.
CI failure:
I'll fix this much. |
"![ -e sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip )\n", | ||
"\n", | ||
"# Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n", | ||
"# so we'll set `ignore_errors` to true to avoid conversion errors. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a more elegant way to parse correctly and avoid conversion errors (than ignoring them). maybe we only pick the columns we're interesting in using.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In response to Chike's comment on ignore_errors: This feels more like a library feature TODO than a notebook TODO. We want this functionality in the library, but since it isn't in the library, this is the next best option in a notebook context.
"source": [ | ||
"Now to get the differentially private statistics, add `dp.noise` after the aggregate function is specified and `.release` after the entire query before `.collect`. \n", | ||
"\n", | ||
"Calling `.release` is always the final step in compiling your differentially private data in a usable form and ensuring it is compliant with differential privacy guarantees. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think "compiling" is the right word in this context. Also "in a usable form" seems like elusive language. Why is it usable? Usability seems like a different concept from "safety" or "privacy" which is the relevant concept being discussed right here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you recommend adding?
Previewed locally, and made some copy edits. Looking for approval from one more person before merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work on this. I've left some higher-level feedback that may take some time to incorporate.
"![ -e sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip )\n", | ||
"\n", | ||
"# Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n", | ||
"# so we'll set `ignore_errors` to true to avoid conversion errors. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In response to Chike's comment on ignore_errors: This feels more like a library feature TODO than a notebook TODO. We want this functionality in the library, but since it isn't in the library, this is the next best option in a notebook context.
"- add `.dp.noise()` after the aggregate function.\n", | ||
"- add `.release()` after the entire query, but before `.collect()`. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also mention that you construct the query off of context.query()
instead of df
.
"sns.lineplot(x=count_year_actual[\"YEAR\"].to_list(), y=count_year_actual[\"Actual Count\"].to_list(), marker=\"o\", label=\"Actual\")\n", | ||
"sns.lineplot(x=count_year_dp[\"YEAR\"].to_list(), y=count_year_dp[\"DP Count\"].to_list(), marker=\"o\", label=\"DP\")\n", | ||
"plt.title('Actual vs Differentially Private Counts by Year')\n", | ||
"plt.xlabel('Year')\n", | ||
"plt.ylabel('Count')\n", | ||
"plt.legend(title='Legend')\n", | ||
"plt.show()" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future reference, can we generally write documentation that uses error bars from accuracy estimates when possible, instead of comparing against the real data? We want to avoid settings where the user manipulates and looks at real data when not necessary. The error bars would accomplish the same thing here, but without ever touching the data.
"metadata": {}, | ||
"source": [ | ||
"Filtering down to small groups has the same risks as partitioning to small groups:\n", | ||
"it can lead to an increase in noise and a decrease in utility. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The absolute amount of noise remains the same, but when counts are small the relative noise is higher.
"source": [ | ||
"In this section, you will learn how to manipulate the data before computing differentially private statistics. We will cover: \n", | ||
"\n", | ||
"- Singular Variable `group_by`\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The notebook should start with a demo without public keys, and then only later add margin descriptors for public keys. This way in the typical flow the user encounters code that requires fewer descriptors on the input data, and the user can choose to add descriptors if they feel the descriptors don't affect the integrity of the privacy guarantee.
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Aggregrations and Filtering" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"# Aggregrations and Filtering" | |
"# Group By" |
Why is filtering in here?
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"While the differentially private counts still follow the same trend as the actual values, there is more variance because of the smaller sizes of the year-quarter groups. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variance is the same, but for smaller counts the same amount of noise is relatively larger.
"\n", | ||
"### Fine-grained grouping adds more noise\n", | ||
"\n", | ||
"Larger datasets require less noise to be added to preserve privacy. Grouping by multiple keys can lead to smaller partitions, and adding noise to the smaller partitions may lower the utility of your results. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This depends on the query. Count queries require a constant amount of noise relative to the dataset size. Grouping by more keys can lead to adding the same quantity of noise to more (smaller) bins.
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Filtering" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section feels like it belongs in a data pre-processing notebook.
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite. This stack of pull requests is managed by Graphite. Learn more about stacking. Join @GKD-stack and the rest of your teammates on Graphite |
31e3d03
to
f1778a5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't actually looked at the notebook yet, but some comments on the surrounding files.
pyarrow | ||
hvplot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had prepared a PR that gets rid of a separate requirements file for notebooks:
That doesn't need to move forward, but if it did, we'd probably want to be more conservative about adding new dependencies.
@@ -100,20 +100,18 @@ The specific methods that will be demonstrated are: | |||
* Quantiles | |||
|
|||
* Grouping | |||
* Protected Group Keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Protected Group Keys | |
* Protected Group Keys |
Sphinx needs a line break between indent levels for correct rendering.
* Limitations with ``filter`` | ||
|
||
This section will explain the limitations and properties of common Polars functions that are unique to their usage in OpenDP. | ||
This section explains how to build stable dataframe transformations with Polars. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to use RST toctrees here? I could do that, if you don't have all the installs for the doc build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could do this once the misc notebooks are merged?
@@ -66,6 +66,7 @@ where | |||
lazyframe_utility(&lf, alpha) | |||
} | |||
|
|||
#[derive(Clone)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer not to edit the Rust in this PR, if at all possible: if the examples in the docs rely on changes here, then we should get another release out before we point people to the nightly docs. It's adding more steps.
(But if this is a change we really need, don't let me block!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broke it out into a separate PR. It also killed the commit history here, unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh: Github seems to be confused. The base PR in the stack is merged, and then checked out main
, and confirmed that this #[derive(Clone)]
is in there... So it seems like it shouldn't be marked as a change here, as well? Which makes me wonder about how well this UI is representing other changes in this PR.
I'm going to try diffing this branch with main
locally, and will see what that looks like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few more comments.
"- Public group keys\n", | ||
"- Public group lengths\n", | ||
"\n", | ||
"The [documentation](https://docs.opendp.org/en/nightly/api/python/opendp.polars.html#module-opendp.polars) provides more information about the methods. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The [documentation](https://docs.opendp.org/en/nightly/api/python/opendp.polars.html#module-opendp.polars) provides more information about the methods. \n", | |
"The [API Reference](../../api/python/opendp.polars.rst#module-opendp.polars) provides more information about the methods. \n", |
- "documentation" -> "API Reference"
- Prefer a relative link instead of nightly. Sphinx will resolve the
rst
tohtml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat how it will resolve through notebook conversion.
" # grouping keys by \"YEAR\" and \"QUARTER\" are public information\n", | ||
" (\"YEAR\", \"QUARTER\"): dp.polars.Margin(\n", | ||
" public_info=\"keys\",\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make any sense to do age and employment status again, so we can have more of an apples-to-apples comparison here? Or else emphasize that we are grouping on different keys in this example, precisely because we want things that are not personal characteristics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the query to illustrate how in this different setting, it is reasonable to have public keys.
actually added everything Apply suggestions from code review Co-authored-by: Chuck McCallum <[email protected]> revisions done Delete docs/source/getting-started/tabular-data/agg_and_filter.ipynb added paragraph for excessive grouping caution changed data s changed data s remove toctree, add metadata more linebreaks reran added margin parameter more revisions in dp.Margin -> dp.polars.Margin revisions in but polars error persists Delete docs/source/getting-started/assessing-utility/noise.ipynb fix m conflict mike revisions
54802f4
to
0909cf1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last comments!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm up-to-date locally, but I'm not reproducing the error in the docs build.
@@ -66,6 +66,7 @@ where | |||
lazyframe_utility(&lf, alpha) | |||
} | |||
|
|||
#[derive(Clone)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh: Github seems to be confused. The base PR in the stack is merged, and then checked out main
, and confirmed that this #[derive(Clone)]
is in there... So it seems like it shouldn't be marked as a change here, as well? Which makes me wonder about how well this UI is representing other changes in this PR.
I'm going to try diffing this branch with main
locally, and will see what that looks like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI passes - A little confused about the git history, and there will be some followup edits in the un-orphan pass, but this is good to merge.
Mike made the changes he needed
closes #1835