Docs: Polars GroupBy #1836

Merged · 4 commits · Aug 14, 2024
**docs/requirements_notebooks.txt** (+2 −1)

```diff
@@ -6,4 +6,5 @@ seaborn
 scipy
 scikit-learn
 polars==1.1.0
-pyarrow
+pyarrow
+hvplot
```
Contributor:
I had prepared a PR that gets rid of a separate requirements file for notebooks:

That doesn't need to move forward, but if it did, we'd probably want to be more conservative about adding new dependencies.

**docs/source/getting-started/tabular-data/group-by.ipynb** (+928 −0)

(Large diff not rendered by default.)

**docs/source/getting-started/tabular-data/index.rst** (+8 −10)

```diff
@@ -100,20 +100,18 @@ The specific methods that will be demonstrated are:
 * Quantiles

 * Grouping
+  * Protected Group Keys
-  * Public Group Keys
-  * Public Group Lengths

-* Grouping By Multiple Variables
-* Filtering
+  This section explains strategies for how to release statistics on grouped data.

-* Public vs. Private Grouping Lengths
+* Data Preparation

-  This section will explain the implications and limitations of having public and private keys and/or lengths when grouping.
+  * using ``with_columns``
+  * using ``filter``

-* Data Preparation Limitations

-  * Limitations with ``with_columns``
-  * Limitations with ``filter``

-  This section will explain the limitations and properties of common Polars functions that are unique to their usage in OpenDP.
+  This section explains how to build stable dataframe transformations with Polars.
```

Contributor (review comment on `* Protected Group Keys`):

Suggested change: add a blank line before `* Protected Group Keys`.

Sphinx needs a line break between indent levels for correct rendering.
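To illustrate the reviewer's Sphinx point, a nested bullet in reStructuredText only starts a sub-list when a blank line separates the indent levels:

```rst
* Grouping

  * Protected Group Keys
```

Without the blank line, docutils folds the indented line into the parent item's paragraph instead of rendering a nested list.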
Contributor:
Would it make sense to use RST toctrees here? I could do that, if you don't have all the installs for the doc build.

Member:
Maybe we could do this once the misc notebooks are merged?
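For reference, the toctree the Contributor suggests might look something like the following sketch (the `group-by` entry matches the notebook added in this PR; any other document names would depend on what else lands in the section):

```rst
.. toctree::
   :maxdepth: 1

   group-by
```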


Compositor Overview
-------------------
**docs/source/getting-started/tabular-data/keys.ipynb** (+0 −280)

This file was deleted.
**rust/src/accuracy/polars/mod.rs** (+20 −7)

```diff
@@ -66,6 +66,7 @@
     lazyframe_utility(&lf, alpha)
 }

+#[derive(Clone)]
```
Contributor:
I'd prefer not to edit the Rust in this PR, if at all possible: if the examples in the docs rely on changes here, then we should get another release out before we point people to the nightly docs. It's adding more steps.

(But if this is a change we really need, don't let me block!)

Member:
Broke it out into a separate PR. It also killed the commit history here, unfortunately.

Contributor:
Huh: GitHub seems to be confused. The base PR in the stack is merged; I then checked out main and confirmed that this #[derive(Clone)] is in there... so it seems like it shouldn't be marked as a change here as well? Which makes me wonder how well this UI is representing the other changes in this PR.

I'm going to try diffing this branch with main locally, and will see what that looks like.

```diff
 struct UtilitySummary {
     pub name: String,
     pub aggregate: String,
@@ -188,26 +189,38 @@ fn expr_utility<'a>(
         }]);
     }

-    match expr {
-        Expr::Len => Ok(vec![UtilitySummary {
-            name,
+    Ok(match expr {
+        Expr::Len => vec![UtilitySummary {
+            name: name.clone(),
             aggregate: "Len".to_string(),
             distribution: None,
             scale: None,
             accuracy: alpha.is_some().then_some(0.0),
             threshold: t_value,
-        }]),
+        }],

-        Expr::Function { input, .. } => Ok(input
+        Expr::Function { input, .. } => input
             .iter()
             .map(|e| expr_utility(e, alpha, threshold.clone()))
             .collect::<Fallible<Vec<_>>>()?
             .into_iter()
             .flatten()
-            .collect()),
+            .collect(),

-        _ => fallible!(FailedFunction, "unrecognized primitive"),
+        Expr::BinaryExpr { left, op: _, right } => [
+            expr_utility(&left, alpha, threshold.clone())?,
+            expr_utility(&right, alpha, threshold)?,
+        ]
+        .concat(),
+
+        e => return fallible!(FailedFunction, "unrecognized primitive: {:?}", e),
     }
+    .into_iter()
+    .map(|mut summary| {
+        summary.name = name.clone();
+        summary
+    })
+    .collect())
 }

 fn expr_aggregate(expr: &Expr) -> Fallible<&'static str> {
```
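The shape of this refactor is worth noting: rather than stamping `name` inside every match arm, each arm now produces bare summaries and a single trailing `map` assigns the name uniformly. A minimal standalone sketch of the pattern, using hypothetical `Summary` and `Expr` types rather than the real OpenDP definitions:

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Summary {
    pub name: String,      // filled in once, after the match
    pub aggregate: String,
}

pub enum Expr {
    Len,
    Binary(Box<Expr>, Box<Expr>),
}

// Each arm yields summaries without setting `name`; the trailing map
// assigns it uniformly, mirroring the PR's expr_utility refactor.
pub fn utility(expr: &Expr, name: &str) -> Result<Vec<Summary>, String> {
    Ok(match expr {
        Expr::Len => vec![Summary {
            name: String::new(),
            aggregate: "Len".to_string(),
        }],
        // A binary expression recurses into both sides and concatenates.
        Expr::Binary(left, right) => {
            [utility(left, name)?, utility(right, name)?].concat()
        }
    }
    .into_iter()
    .map(|mut s| {
        s.name = name.to_string();
        s
    })
    .collect())
}

fn main() {
    let expr = Expr::Binary(Box::new(Expr::Len), Box::new(Expr::Len));
    let out = utility(&expr, "A").unwrap();
    assert_eq!(out.len(), 2);
    assert!(out.iter().all(|s| s.name == "A" && s.aggregate == "Len"));
    println!("{:?}", out);
}
```

The payoff is that new arms (like the `BinaryExpr` case added in this PR) cannot forget to set the name.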
**rust/src/accuracy/polars/test.rs** (+45 −0)

```diff
@@ -64,3 +64,48 @@ fn test_describe_polars_measurement_accuracy() -> Fallible<()> {

     Ok(())
 }
+
+#[test]
+fn test_describe_polars_measurement_accuracy_mean() -> Fallible<()> {
+    let lf_domain = LazyFrameDomain::new(vec![
+        SeriesDomain::new("A", AtomDomain::<i32>::default()),
+        SeriesDomain::new("B", AtomDomain::<f64>::default()),
+    ])?
+    .with_margin::<&str>(
+        &[],
+        Margin::new()
+            .with_public_lengths()
+            .with_max_partition_length(10),
+    )?;
+
+    let lf = df!("A" => &[3, 4, 5], "B" => &[1., 3., 7.])?.lazy();
+
+    let meas = make_private_lazyframe(
+        lf_domain,
+        SymmetricDistance,
+        MaxDivergence::default(),
+        lf.select([col("A").dp().mean((3, 5), Some(1.0))]),
+        None,
+        None,
+    )?;
+
+    let description = describe_polars_measurement_accuracy(meas.clone(), None)?;
+
+    let mut expected = df![
+        "column" => &["A", "A"],
+        "aggregate" => &["Sum", "Len"],
+        "distribution" => &[Some("Integer Laplace"), None],
+        "scale" => &[Some(1.0), None]
+    ]?;
+    println!("{:?}", expected);
+    assert_eq!(expected, description);
+
+    let description = describe_polars_measurement_accuracy(meas.clone(), Some(0.05))?;
+
+    let accuracy = discrete_laplacian_scale_to_accuracy(1.0, 0.05)?;
+    expected.with_column(Series::new("accuracy", &[Some(accuracy), Some(0.0)]))?;
+    println!("{:?}", expected);
+    assert_eq!(expected, description);
+
+    Ok(())
+}
```
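As background for the `discrete_laplacian_scale_to_accuracy(1.0, 0.05)` call in this test, the continuous-Laplace analogue of the accuracy bound is the following (the discrete mechanism adds an integer-support correction, so the exact value returned by the function differs slightly):

```latex
% For Laplace noise X \sim \mathrm{Lap}(s):
\Pr[\lvert X \rvert \ge a] = e^{-a/s}
\quad\Longrightarrow\quad
a(\alpha) = s \ln(1/\alpha)
% e.g. s = 1.0,\ \alpha = 0.05:\ a = \ln 20 \approx 3.00
```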