Docs: Polars GroupBy #1836

Merged: 4 commits into main from 1835-ga on Aug 14, 2024

Conversation

GKD-stack
Contributor

closes #1835

@GKD-stack GKD-stack requested a review from mccalluc July 22, 2024 19:35
@GKD-stack GKD-stack self-assigned this Jul 22, 2024
Contributor

@mccalluc mccalluc left a comment

Great structure! Some formatting comments, and a couple of places where the discussion could be expanded. Also, check out the CI failures.

"- Multiple Variable Groupby\n",
"- Filtering\n",
"\n",
"For each method, we will compare the actual values to the computed differentially private values to demonstrate utility. The [documentation](https://docs.opendp.org/en/nightly/api/python/opendp.polars.html#module-opendp.polars) provides more information about the methods. We will use the [sample data](https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip) from the Labour Force Survey in France. "
Contributor

Use relative links rather than pointing to nightly. If we were to change the location of the polars module, we want to get a CI warning on that PR, rather than waiting till the next day.

(This will be easier to fix with a local build. This could be in my court?)

Contributor Author

Chuck, you said you'll finish up the last bit here!

This was referenced Jul 24, 2024
@mccalluc mccalluc self-requested a review July 24, 2024 19:37
@GKD-stack GKD-stack requested a review from Shoeboxam July 24, 2024 21:39
mccalluc added a commit that referenced this pull request Jul 30, 2024
- For #1722 

This should be merged first: #1812
#1834
#1836

---------

Co-authored-by: Gurman Dhaliwal <[email protected]>
Co-authored-by: gurman-dhaliwal <[email protected]>
Co-authored-by: Chuck McCallum <[email protected]>
Co-authored-by: Gurman Dhaliwal <[email protected]>
Co-authored-by: Chuck McCallum <[email protected]>
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, you will learn how to compute differentially private statistics while applying key data manipulation techniques such as: \n",

Maybe add one-liners for each setting, explaining what it means.

Contributor Author

For the compositor?

Contributor Author

It's all in another introduction, which Chuck will link here later.

"cell_type": "markdown",
"metadata": {},
"source": [
"Filtering can also be viewed as a type of partitioning. If the result of filtering a smaller subset, it can lead to an increase in noise and decrease in utility. "

The last sentence doesn't make much sense.

Contributor Author

@GKD-stack GKD-stack Aug 1, 2024

what about this?

Similar to partitioning, if the result of filtering is a smaller subset, it can lead to an increase in noise and a decrease in utility.

@mccalluc
Contributor

mccalluc commented Aug 1, 2024

CI failure:

----> 9         ("YEAR", ): dp.Margin(public_info= "keys", max_partition_length=estimated_max_partition_len, max_partition_contributions=4),
     10         ("YEAR", "QUARTER",): dp.Margin(public_info= "keys", max_partition_length=estimated_max_partition_len, max_partition_contributions=1),
     11         (): dp.Margin(public_info= "lengths",max_partition_length=estimated_max_partition_len, max_num_partitions=1),
     12     },
     13 )

AttributeError: module 'opendp.prelude' has no attribute 'Margin'

I'll fix this much.
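
(For reference, a minimal sketch of the corrected margin spec, using `dp.polars.Margin` as adopted later in this PR; `estimated_max_partition_len` is a placeholder value, not the notebook's actual estimate:)

import opendp.prelude as dp

estimated_max_partition_len = 150_000  # hypothetical bound; the notebook derives its own estimate

margins = {
    # grouping keys for "YEAR" and for ("YEAR", "QUARTER") are treated as public information
    ("YEAR",): dp.polars.Margin(
        public_info="keys",
        max_partition_length=estimated_max_partition_len,
        max_partition_contributions=4,
    ),
    ("YEAR", "QUARTER"): dp.polars.Margin(
        public_info="keys",
        max_partition_length=estimated_max_partition_len,
        max_partition_contributions=1,
    ),
    # the overall dataset length is public
    (): dp.polars.Margin(
        public_info="lengths",
        max_partition_length=estimated_max_partition_len,
        max_num_partitions=1,
    ),
}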

"![ -e sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip )\n",
"\n",
"# Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n",
"# so we'll set `ignore_errors` to true to avoid conversion errors. \n",
Contributor

Is there a more elegant way to parse correctly and avoid conversion errors (other than ignoring them)? Maybe we only pick the columns we're interested in using.

Member

@Shoeboxam Shoeboxam Aug 13, 2024

In response to Chike's comment on ignore_errors: This feels more like a library feature TODO than a notebook TODO. We want this functionality in the library, but since it isn't in the library, this is the next best option in a notebook context.
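
(A minimal sketch of the column-selection alternative floated above: scan the CSV lazily and select only the columns the notebook uses, so the mixed string/number columns elsewhere may never need to be parsed. The column list is illustrative, not the notebook's final choice:)

import polars as pl

# Hypothetical subset of columns; the notebook groups on YEAR and QUARTER.
# With lazy scanning, projection pushdown means only the selected columns
# are materialized, which can sidestep parse errors in columns we never use.
lf = pl.scan_csv("sample_FR_LFS.csv").select(["YEAR", "QUARTER"])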

"source": [
"Now to get the differentially private statistics, add `dp.noise` after the aggregate function is specified and `.release` after the entire query before `.collect`. \n",
"\n",
"Calling `.release` is always the final step in compiling your differentially private data in a usable form and ensuring it is compliant with differential privacy guarantees. \n",
Contributor

I don't think "compiling" is the right word in this context. Also "in a usable form" seems like elusive language. Why is it usable? Usability seems like a different concept from "safety" or "privacy" which is the relevant concept being discussed right here.

Contributor Author

what do you recommend adding?

@mccalluc mccalluc added the docs and docs-polars labels Aug 12, 2024
@mccalluc
Contributor

Previewed locally, and made some copy edits. Looking for approval from one more person before merging.

Member

@Shoeboxam Shoeboxam left a comment

Thanks for your work on this. I've left some higher-level feedback that may take some time to incorporate.

"![ -e sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip )\n",
"\n",
"# Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n",
"# so we'll set `ignore_errors` to true to avoid conversion errors. \n",
Copy link
Member

@Shoeboxam Shoeboxam Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In response to Chike's comment on ignore_errors: This feels more like a library feature TODO than a notebook TODO. We want this functionality in the library, but since it isn't in the library, this is the next best option in a notebook context.

Comment on lines 153 to 154
"- add `.dp.noise()` after the aggregate function.\n",
"- add `.release()` after the entire query, but before `.collect()`. "
Member

Maybe also mention that you construct the query off of context.query() instead of df.
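
(A minimal sketch of the pattern under discussion, assuming a configured OpenDP `Context` named `context` as elsewhere in the notebook; the grouping column and aggregate are illustrative:)

import polars as pl

# Build the query off of context.query() rather than off the DataFrame itself.
query = (
    context.query()
    .group_by("YEAR")
    .agg(pl.len().dp.noise())   # .dp.noise() goes after the aggregate
)

# .release() comes after the entire query, but before .collect().
dp_counts = query.release().collect()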

Comment on lines 232 to 238
"sns.lineplot(x=count_year_actual[\"YEAR\"].to_list(), y=count_year_actual[\"Actual Count\"].to_list(), marker=\"o\", label=\"Actual\")\n",
"sns.lineplot(x=count_year_dp[\"YEAR\"].to_list(), y=count_year_dp[\"DP Count\"].to_list(), marker=\"o\", label=\"DP\")\n",
"plt.title('Actual vs Differentially Private Counts by Year')\n",
"plt.xlabel('Year')\n",
"plt.ylabel('Count')\n",
"plt.legend(title='Legend')\n",
"plt.show()"
Member

For future reference, can we generally write documentation that uses error bars from accuracy estimates when possible, instead of comparing against the real data? We want to avoid settings where the user manipulates and looks at real data when not necessary. The error bars would accomplish the same thing here, but without ever touching the data.
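
(A rough sketch of the error-bar approach suggested here: plot only the DP estimates, with bars taken from an accuracy estimate. The `accuracy` value below is assumed to come from an OpenDP accuracy/utility helper at some significance level; obtaining it is not shown in this thread:)

import matplotlib.pyplot as plt

# count_year_dp: the DP release with columns "YEAR" and "DP Count", as in the notebook.
# accuracy: assumed half-width of a confidence interval for each noisy count.
plt.errorbar(
    count_year_dp["YEAR"].to_list(),
    count_year_dp["DP Count"].to_list(),
    yerr=accuracy,
    fmt="o-",
    capsize=4,
    label="DP estimate",
)
plt.title("Differentially Private Counts by Year (with accuracy bars)")
plt.xlabel("Year")
plt.ylabel("Count")
plt.legend()
plt.show()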

"metadata": {},
"source": [
"Filtering down to small groups has the same risks as partitioning to small groups:\n",
"it can lead to an increase in noise and a decrease in utility. "
Member

The absolute amount of noise remains the same, but when counts are small the relative noise is higher.
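
(A tiny numeric illustration of this point, assuming Laplace noise on a counting query with sensitivity 1 and epsilon 1, so the noise scale is fixed while its relative size grows as groups shrink:)

# Assumed parameters: a counting query has sensitivity 1; epsilon is the privacy budget.
sensitivity, epsilon = 1, 1.0
scale = sensitivity / epsilon  # noise scale is the same regardless of group size

for true_count in [10_000, 100, 10]:
    relative_noise = scale / true_count
    print(f"count={true_count:>6}  noise scale={scale}  relative noise={relative_noise:.2%}")

# count= 10000  noise scale=1.0  relative noise=0.01%
# count=   100  noise scale=1.0  relative noise=1.00%
# count=    10  noise scale=1.0  relative noise=10.00%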

"source": [
"In this section, you will learn how to manipulate the data before computing differentially private statistics. We will cover: \n",
"\n",
"- Singular Variable `group_by`\n",
Member

The notebook should start with a demo without public keys, and then only later add margin descriptors for public keys. This way in the typical flow the user encounters code that requires fewer descriptors on the input data, and the user can choose to add descriptors if they feel the descriptors don't affect the integrity of the privacy guarantee.

"cell_type": "markdown",
"metadata": {},
"source": [
"# Aggregrations and Filtering"
Member

Suggested change:
- "# Aggregrations and Filtering"
+ "# Group By"

Why is filtering in here?

"cell_type": "markdown",
"metadata": {},
"source": [
"While the differentially private counts still follow the same trend as the actual values, there is more variance because of the smaller sizes of the year-quarter groups. \n",
Member

The variance is the same, but for smaller counts the same amount of noise is relatively larger.

"\n",
"### Fine-grained grouping adds more noise\n",
"\n",
"Larger datasets require less noise to be added to preserve privacy. Grouping by multiple keys can lead to smaller partitions, and adding noise to the smaller partitions may lower the utility of your results. \n",
Member

@Shoeboxam Shoeboxam Aug 13, 2024

This depends on the query. Count queries require a constant amount of noise relative to the dataset size. Grouping by more keys can lead to adding the same quantity of noise to more (smaller) bins.

"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering"
Member

This section feels like it belongs in a data pre-processing notebook.

Member

Shoeboxam commented Aug 13, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.

@Shoeboxam Shoeboxam requested a review from mccalluc August 13, 2024 22:29
@Shoeboxam Shoeboxam force-pushed the 1835-ga branch 4 times, most recently from 31e3d03 to f1778a5 Compare August 14, 2024 02:03
Contributor

@mccalluc mccalluc left a comment

Haven't actually looked at the notebook yet, but some comments on the surrounding files.

pyarrow
hvplot
Contributor

I had prepared a PR that gets rid of a separate requirements file for notebooks:

That doesn't need to move forward, but if it did, we'd probably want to be more conservative about adding new dependencies.

@@ -100,20 +100,18 @@ The specific methods that will be demonstrated are:
* Quantiles

* Grouping
* Protected Group Keys
Contributor

Suggested change (insert a blank line before the nested item):
- * Protected Group Keys
+
+ * Protected Group Keys

Sphinx needs a line break between indent levels for correct rendering.

* Limitations with ``filter``

This section will explain the limitations and properties of common Polars functions that are unique to their usage in OpenDP.
This section explains how to build stable dataframe transformations with Polars.
Contributor

Would it make sense to use RST toctrees here? I could do that, if you don't have all the installs for the doc build.

Member

Maybe we could do this once the misc notebooks are merged?

@@ -66,6 +66,7 @@ where
lazyframe_utility(&lf, alpha)
}

#[derive(Clone)]
Contributor

I'd prefer not to edit the Rust in this PR, if at all possible: if the examples in the docs rely on changes here, then we should get another release out before we point people to the nightly docs. It's adding more steps.

(But if this is a change we really need, don't let me block!)

Member

Broke it out into a separate PR. It also killed the commit history here, unfortunately.

Contributor

Huh: GitHub seems to be confused. The base PR in the stack is merged; I then checked out main and confirmed that this #[derive(Clone)] is in there... So it seems like it shouldn't be marked as a change here as well? Which makes me wonder how well this UI is representing other changes in this PR.

I'm going to try diffing this branch with main locally, and will see what that looks like.

@Shoeboxam Shoeboxam changed the base branch from main to 1921-accuracy-means August 14, 2024 14:53
Contributor

@mccalluc mccalluc left a comment

a few more comments.

"- Public group keys\n",
"- Public group lengths\n",
"\n",
"The [documentation](https://docs.opendp.org/en/nightly/api/python/opendp.polars.html#module-opendp.polars) provides more information about the methods. \n",
Contributor

Suggested change:
- "The [documentation](https://docs.opendp.org/en/nightly/api/python/opendp.polars.html#module-opendp.polars) provides more information about the methods. \n",
+ "The [API Reference](../../api/python/opendp.polars.rst#module-opendp.polars) provides more information about the methods. \n",
  • "documentation" -> "API Reference"
  • Prefer a relative link instead of nightly. Sphinx will resolve the rst to html.

Member

Neat how it will resolve through notebook conversion.

Comment on lines +460 to +462
" # grouping keys by \"YEAR\" and \"QUARTER\" are public information\n",
" (\"YEAR\", \"QUARTER\"): dp.polars.Margin(\n",
" public_info=\"keys\",\n",
Contributor

Would it make any sense to do age and employment status again, so we can have more of an apples-to-apples comparison here? Or else emphasize that we are grouping on different keys in this example, precisely because we want things that are not personal characteristics?

Member

I changed the query to illustrate how in this different setting, it is reasonable to have public keys.

actually added everything

Apply suggestions from code review

Co-authored-by: Chuck McCallum <[email protected]>

revisions done

Delete docs/source/getting-started/tabular-data/agg_and_filter.ipynb

added paragraph for excessive grouping caution

changed data s

changed data s

remove toctree, add metadata

more linebreaks

reran

added margin parameter

more revisions in

dp.Margin -> dp.polars.Margin

revisions in but polars error persists

Delete docs/source/getting-started/assessing-utility/noise.ipynb

fix m conflict

mike revisions
Contributor

@mccalluc mccalluc left a comment

Last comments!

@Shoeboxam Shoeboxam requested a review from mccalluc August 14, 2024 16:00
@Shoeboxam Shoeboxam changed the title Docs: Grouping & Filtering Docs: Polars GroupBy Aug 14, 2024
Base automatically changed from 1921-accuracy-means to main August 14, 2024 19:08
Contributor

@mccalluc mccalluc left a comment

I think I'm up-to-date locally, but I'm not reproducing the error in the docs build.

Contributor

@mccalluc mccalluc left a comment

CI passes - A little confused about the git history, and there will be some followup edits in the un-orphan pass, but this is good to merge.

@mccalluc mccalluc dismissed stale reviews from Shoeboxam and vikrantsinghal August 14, 2024 20:50

Mike made the changes he needed

@mccalluc mccalluc merged commit fc7bfde into main Aug 14, 2024
16 checks passed
@mccalluc mccalluc deleted the 1835-ga branch August 14, 2024 20:50
Labels: docs, docs-polars

Successfully merging this pull request may close these issues: Docs: Grouping + Agg Notebook

5 participants