Docs: Polars GroupBy #1836

Merged: 4 commits into main from 1835-ga on Aug 14, 2024

Conversation

GKD-stack
Contributor

closes #1835

@GKD-stack GKD-stack requested a review from mccalluc July 22, 2024 19:35
@GKD-stack GKD-stack self-assigned this Jul 22, 2024
Contributor

@mccalluc mccalluc left a comment

Great structure! Some formatting comments, and a couple of places where the discussion could be expanded. Also, check out the CI failures.

"- Multiple Variable Groupby\n",
"- Filtering\n",
"\n",
"For each method, we will compare the actual values to the computed differentially private values to demonstrate utility. The [documentation](https://docs.opendp.org/en/nightly/api/python/opendp.polars.html#module-opendp.polars) provides more information about the methods. We will use the [sample data](https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip) from the Labour Force Survey in France. "
Contributor

Use relative links rather than pointing to nightly. If we were to change the location of the polars module, we want to get a CI warning on that PR, rather than waiting till the next day.

(This will be easier to fix with a local build. This could be in my court?)

Contributor Author

Chuck, you said you'll finish up the last bit here!

This was referenced Jul 24, 2024
@mccalluc mccalluc self-requested a review July 24, 2024 19:37
@GKD-stack GKD-stack requested a review from Shoeboxam July 24, 2024 21:39
mccalluc added a commit that referenced this pull request Jul 30, 2024
- For #1722 

This should be merged first: #1812
#1834
#1836

---------

Co-authored-by: Gurman Dhaliwal <[email protected]>
Co-authored-by: gurman-dhaliwal <[email protected]>
Co-authored-by: Chuck McCallum <[email protected]>
Co-authored-by: Gurman Dhaliwal <[email protected]>
Co-authored-by: Chuck McCallum <[email protected]>
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, you will learn how to compute differentially private statistics while applying key data manipulation techniques such as: \n",

Maybe add one-liners for each setting, explaining what it means.

Contributor Author

For the compositor?

Contributor Author

It's all in another introduction, which Chuck will link here later.

"cell_type": "markdown",
"metadata": {},
"source": [
"Filtering can also be viewed as a type of partitioning. If the result of filtering a smaller subset, it can lead to an increase in noise and decrease in utility. "

The last sentence doesn't make much sense.

Contributor Author

@GKD-stack GKD-stack Aug 1, 2024

what about this?

Similar to partitioning, if the result of filtering is a smaller subset, it can lead to an increase in noise and a decrease in utility.

@mccalluc
Contributor

mccalluc commented Aug 1, 2024

CI failure:

----> 9         ("YEAR", ): dp.Margin(public_info= "keys", max_partition_length=estimated_max_partition_len, max_partition_contributions=4),
     10         ("YEAR", "QUARTER",): dp.Margin(public_info= "keys", max_partition_length=estimated_max_partition_len, max_partition_contributions=1),
     11         (): dp.Margin(public_info= "lengths",max_partition_length=estimated_max_partition_len, max_num_partitions=1),
     12     },
     13 )

AttributeError: module 'opendp.prelude' has no attribute 'Margin'

I'll fix this much.
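
(For reference, a minimal sketch of the corrected margin spec, using `dp.polars.Margin` as adopted later in this PR; `estimated_max_partition_len` is a placeholder value, not the notebook's actual estimate:)

import opendp.prelude as dp

estimated_max_partition_len = 150_000  # hypothetical bound; the notebook derives its own estimate

margins = {
    # grouping keys for "YEAR" and for ("YEAR", "QUARTER") are treated as public information
    ("YEAR",): dp.polars.Margin(
        public_info="keys",
        max_partition_length=estimated_max_partition_len,
        max_partition_contributions=4,
    ),
    ("YEAR", "QUARTER"): dp.polars.Margin(
        public_info="keys",
        max_partition_length=estimated_max_partition_len,
        max_partition_contributions=1,
    ),
    # the overall dataset length is public
    (): dp.polars.Margin(
        public_info="lengths",
        max_partition_length=estimated_max_partition_len,
        max_num_partitions=1,
    ),
}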

"![ -e sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip )\n",
"\n",
"# Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n",
"# so we'll set `ignore_errors` to true to avoid conversion errors. \n",
Contributor

Is there a more elegant way to parse correctly and avoid conversion errors (other than ignoring them)? Maybe we only pick the columns we're interested in using.

Member

@Shoeboxam Shoeboxam Aug 13, 2024

In response to Chike's comment on ignore_errors: This feels more like a library feature TODO than a notebook TODO. We want this functionality in the library, but since it isn't in the library, this is the next best option in a notebook context.
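
(A minimal sketch of the column-selection alternative floated above: scan the CSV lazily and select only the columns the notebook uses, so the mixed string/number columns elsewhere may never need to be parsed. The column list is illustrative, not the notebook's final choice:)

import polars as pl

# Hypothetical subset of columns; the notebook groups on YEAR and QUARTER.
# With lazy scanning, projection pushdown means only the selected columns
# are materialized, which can sidestep parse errors in columns we never use.
lf = pl.scan_csv("sample_FR_LFS.csv").select(["YEAR", "QUARTER"])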

"source": [
"Now to get the differentially private statistics, add `dp.noise` after the aggregate function is specified and `.release` after the entire query before `.collect`. \n",
"\n",
"Calling `.release` is always the final step in compiling your differentially private data in a usable form and ensuring it is compliant with differential privacy guarantees. \n",
Contributor

I don't think "compiling" is the right word in this context. Also "in a usable form" seems like elusive language. Why is it usable? Usability seems like a different concept from "safety" or "privacy" which is the relevant concept being discussed right here.

Contributor Author

what do you recommend adding?

@mccalluc mccalluc added the docs and docs-polars labels Aug 12, 2024
@mccalluc
Contributor

Previewed locally, and made some copy edits. Looking for approval from one more person before merging.

Member

@Shoeboxam Shoeboxam left a comment

Thanks for your work on this. I've left some higher-level feedback that may take some time to incorporate.

"![ -e sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip )\n",
"\n",
"# Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n",
"# so we'll set `ignore_errors` to true to avoid conversion errors. \n",
Copy link
Member

@Shoeboxam Shoeboxam Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In response to Chike's comment on ignore_errors: This feels more like a library feature TODO than a notebook TODO. We want this functionality in the library, but since it isn't in the library, this is the next best option in a notebook context.

Comment on lines 153 to 154
"- add `.dp.noise()` after the aggregate function.\n",
"- add `.release()` after the entire query, but before `.collect()`. "
Member

Maybe also mention that you construct the query off of context.query() instead of df.
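
(A minimal sketch of the pattern under discussion, assuming a configured OpenDP `Context` named `context` as elsewhere in the notebook; the grouping column and aggregate are illustrative:)

import polars as pl

# Build the query off of context.query() rather than off the DataFrame itself.
query = (
    context.query()
    .group_by("YEAR")
    .agg(pl.len().dp.noise())   # .dp.noise() goes after the aggregate
)

# .release() comes after the entire query, but before .collect().
dp_counts = query.release().collect()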

Comment on lines 232 to 238
"sns.lineplot(x=count_year_actual[\"YEAR\"].to_list(), y=count_year_actual[\"Actual Count\"].to_list(), marker=\"o\", label=\"Actual\")\n",
"sns.lineplot(x=count_year_dp[\"YEAR\"].to_list(), y=count_year_dp[\"DP Count\"].to_list(), marker=\"o\", label=\"DP\")\n",
"plt.title('Actual vs Differentially Private Counts by Year')\n",
"plt.xlabel('Year')\n",
"plt.ylabel('Count')\n",
"plt.legend(title='Legend')\n",
"plt.show()"
Member

For future reference, can we generally write documentation that uses error bars from accuracy estimates when possible, instead of comparing against the real data? We want to avoid settings where the user manipulates and looks at real data when not necessary. The error bars would accomplish the same thing here, but without ever touching the data.
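
(A rough sketch of the error-bar approach suggested here: plot only the DP estimates, with bars taken from an accuracy estimate. The `accuracy` value below is assumed to come from an OpenDP accuracy/utility helper at some significance level; obtaining it is not shown in this thread:)

import matplotlib.pyplot as plt

# count_year_dp: the DP release with columns "YEAR" and "DP Count", as in the notebook.
# accuracy: assumed half-width of a confidence interval for each noisy count.
plt.errorbar(
    count_year_dp["YEAR"].to_list(),
    count_year_dp["DP Count"].to_list(),
    yerr=accuracy,
    fmt="o-",
    capsize=4,
    label="DP estimate",
)
plt.title("Differentially Private Counts by Year (with accuracy bars)")
plt.xlabel("Year")
plt.ylabel("Count")
plt.legend()
plt.show()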

"metadata": {},
"source": [
"Filtering down to small groups has the same risks as partitioning to small groups:\n",
"it can lead to an increase in noise and a decrease in utility. "
Member

The absolute amount of noise remains the same, but when counts are small the relative noise is higher.
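
(A tiny numeric illustration of this point, assuming Laplace noise on a counting query with sensitivity 1 and epsilon 1, so the noise scale is fixed while its relative size grows as groups shrink:)

# Assumed parameters: a counting query has sensitivity 1; epsilon is the privacy budget.
sensitivity, epsilon = 1, 1.0
scale = sensitivity / epsilon  # noise scale is the same regardless of group size

for true_count in [10_000, 100, 10]:
    relative_noise = scale / true_count
    print(f"count={true_count:>6}  noise scale={scale}  relative noise={relative_noise:.2%}")

# count= 10000  noise scale=1.0  relative noise=0.01%
# count=   100  noise scale=1.0  relative noise=1.00%
# count=    10  noise scale=1.0  relative noise=10.00%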

"source": [
"In this section, you will learn how to manipulate the data before computing differentially private statistics. We will cover: \n",
"\n",
"- Singular Variable `group_by`\n",
Member

The notebook should start with a demo without public keys, and then only later add margin descriptors for public keys. This way in the typical flow the user encounters code that requires fewer descriptors on the input data, and the user can choose to add descriptors if they feel the descriptors don't affect the integrity of the privacy guarantee.

"cell_type": "markdown",
"metadata": {},
"source": [
"# Aggregrations and Filtering"
Member

Suggested change:
- "# Aggregrations and Filtering"
+ "# Group By"

Why is filtering in here?

"cell_type": "markdown",
"metadata": {},
"source": [
"While the differentially private counts still follow the same trend as the actual values, there is more variance because of the smaller sizes of the year-quarter groups. \n",
Member

The variance is the same, but for smaller counts the same amount of noise is relatively larger.

"\n",
"### Fine-grained grouping adds more noise\n",
"\n",
"Larger datasets require less noise to be added to preserve privacy. Grouping by multiple keys can lead to smaller partitions, and adding noise to the smaller partitions may lower the utility of your results. \n",
Member

@Shoeboxam Shoeboxam Aug 13, 2024

This depends on the query. Count queries require a constant amount of noise relative to the dataset size. Grouping by more keys can lead to adding the same quantity of noise to more (smaller) bins.

"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering"
Member

This section feels like it belongs in a data pre-processing notebook.

Member

Shoeboxam commented Aug 13, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.

@Shoeboxam Shoeboxam requested a review from mccalluc August 13, 2024 22:29
@Shoeboxam Shoeboxam force-pushed the 1835-ga branch 4 times, most recently from 31e3d03 to f1778a5 Compare August 14, 2024 02:03
Contributor

@mccalluc mccalluc left a comment

Haven't actually looked at the notebook yet, but some comments on the surrounding files.

pyarrow
hvplot
Contributor

I had prepared a PR that gets rid of a separate requirements file for notebooks:

That doesn't need to move forward, but if it did, we'd probably want to be more conservative about adding new dependencies.

@@ -100,20 +100,18 @@ The specific methods that will be demonstrated are:
* Quantiles

* Grouping
* Protected Group Keys
Contributor

Suggested change (insert a blank line before the nested item):
- * Protected Group Keys
+
+ * Protected Group Keys

Sphinx needs a line break between indent levels for correct rendering.

* Limitations with ``filter``

This section will explain the limitations and properties of common Polars functions that are unique to their usage in OpenDP.
This section explains how to build stable dataframe transformations with Polars.
Contributor

Would it make sense to use RST toctrees here? I could do that, if you don't have all the installs for the doc build.

Member

Maybe we could do this once the misc notebooks are merged?

@@ -66,6 +66,7 @@ where
lazyframe_utility(&lf, alpha)
}

#[derive(Clone)]
Contributor

I'd prefer not to edit the Rust in this PR, if at all possible: if the examples in the docs rely on changes here, then we should get another release out before we point people to the nightly docs. It's adding more steps.

(But if this is a change we really need, don't let me block!)

Member

Broke it out into a separate PR. It also killed the commit history here, unfortunately.

Contributor

Huh: GitHub seems to be confused. The base PR in the stack is merged; I then checked out main and confirmed that this #[derive(Clone)] is in there... So it seems like it shouldn't be marked as a change here as well? Which makes me wonder how well this UI is representing other changes in this PR.

I'm going to try diffing this branch with main locally, and will see what that looks like.

@Shoeboxam Shoeboxam changed the base branch from main to 1921-accuracy-means August 14, 2024 14:53
Contributor

@mccalluc mccalluc left a comment

a few more comments.

"- Public group keys\n",
"- Public group lengths\n",
"\n",
"The [documentation](https://docs.opendp.org/en/nightly/api/python/opendp.polars.html#module-opendp.polars) provides more information about the methods. \n",
Contributor

Suggested change:
- "The [documentation](https://docs.opendp.org/en/nightly/api/python/opendp.polars.html#module-opendp.polars) provides more information about the methods. \n",
+ "The [API Reference](../../api/python/opendp.polars.rst#module-opendp.polars) provides more information about the methods. \n",
  • "documentation" -> "API Reference"
  • Prefer a relative link instead of nightly. Sphinx will resolve the rst to html.

Member

Neat how it will resolve through notebook conversion.

Comment on lines +460 to +462
" # grouping keys by \"YEAR\" and \"QUARTER\" are public information\n",
" (\"YEAR\", \"QUARTER\"): dp.polars.Margin(\n",
" public_info=\"keys\",\n",
Contributor

Would it make any sense to do age and employment status again, so we can have more of an apples-to-apples comparison here? Or else emphasize that we are grouping on different keys in this example, precisely because we want things that are not personal characteristics?

Member

I changed the query to illustrate how in this different setting, it is reasonable to have public keys.

actually added everything

Apply suggestions from code review

Co-authored-by: Chuck McCallum <[email protected]>

revisions done

Delete docs/source/getting-started/tabular-data/agg_and_filter.ipynb

added paragraph for excessive grouping caution

changed data s

changed data s

remove toctree, add metadata

more linebreaks

reran

added margin parameter

more revisions in

dp.Margin -> dp.polars.Margin

revisions in but polars error persists

Delete docs/source/getting-started/assessing-utility/noise.ipynb

fix m conflict

mike revisions
Contributor

@mccalluc mccalluc left a comment

Last comments!

@Shoeboxam Shoeboxam requested a review from mccalluc August 14, 2024 16:00
@Shoeboxam Shoeboxam changed the title Docs: Grouping & Filtering Docs: Polars GroupBy Aug 14, 2024
Base automatically changed from 1921-accuracy-means to main August 14, 2024 19:08
Contributor

@mccalluc mccalluc left a comment

I think I'm up-to-date locally, but I'm not reproducing the error in the docs build.

Contributor

@mccalluc mccalluc left a comment

CI passes - A little confused about the git history, and there will be some followup edits in the un-orphan pass, but this is good to merge.

@mccalluc mccalluc dismissed stale reviews from Shoeboxam and vikrantsinghal August 14, 2024 20:50

Mike made the changes he needed

@mccalluc mccalluc merged commit fc7bfde into main Aug 14, 2024
16 checks passed
@mccalluc mccalluc deleted the 1835-ga branch August 14, 2024 20:50
Labels: docs, docs-polars

Successfully merging this pull request may close these issues: Docs: Grouping + Agg Notebook

5 participants