Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add access example to DataFrame docs (#1001) #1004

Merged
merged 3 commits into from
Oct 24, 2024

Conversation

mooreryan
Copy link
Contributor

To start, I added one example that shows what happens when trying to access a data frame column that does not exist.

I'm not sure how much to go into some of the comments in #1001, e.g., the fact that an error is raised instead of returning nil and other aspects of the Access behaviour that might be surprising.

What do you think, should it be kept simple like this or should I add something specifically about Access behaviour?

Add an example showing what happens when trying to access a data frame column that does not exist.
Copy link
Contributor

@billylanchantin billylanchantin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I requested a slight simplification we can do.

What other aspects of the Access implementation did you maybe want to highlight?

lib/explorer/data_frame.ex Outdated Show resolved Hide resolved
@mooreryan
Copy link
Contributor Author

What other aspects of the Access implementation did you maybe want to highlight?

Mainly some of the aspects from this discussion: https://elixirforum.com/t/access-behaviour-for-explorer-dataframe/66691. (Though, looking at the username, I think you might already be in that discussion!)

To summarize that thread, I was surprised by the fact that DataFrame had a different Access behavior than what the docs for the Access behavior would imply. E.g., raising rather than returning nil, fetch raising rather than returning :error, Access.fetch! raising ArgumentError rather than KeyError, etc.

I wasn't sure if these doc change was the spot to get into all of that though.

@billylanchantin
Copy link
Contributor

Yep that's me!

I wanted to clarify because some points (like not needing to implement fetch!) were cleared up in the post but others weren't. I figured it'd be a good idea to make a fresh list all the stuff we might want to document here.

If we do want to include more than what's ["not_a_column"] raising vs. returning nil, I think I'd prefer to include a brief overview of the reasoning behind the design choices. Something like:

### Missing columns

Explorer will generally raise an `ArgumentError` if the argument to `[]` is not
found, but not always. For example:

...your example...

This deviates from other `Access` implementations which return `nil` when the
key is not found. We raise early to help users catch errors quicker since most
DataFrame operations could not meaningfully continue if the result was `nil`.

We also raise an `ArgumentError` rather than a `KeyError`. This is because `[]`
is designed to take a wider collection of arguments than a single key, e.g.
lists and regexes.

However, note that some `[]` calls will not raise if their argument is not
found. This occurs when argument to `[]` is more of a search criteria like
a regex:

    iex> df = Explorer.DataFrame.new(col_1: [1], col_2: [2], other: ["a"])
    iex> col_n_regex = ~r/^col_\d$/
    iex> df[col_n_regex]
    #Explorer.DataFrame<
      Polars[1 x 2]
      col_1 s64 [1]
      col_2 s64 [2]
    >
    iex> nothing_will_match_regex = ~r/not_a_pattern/
    iex> df[nothing_will_match_regex]
    #Explorer.DataFrame<
      Polars[0 x 0]
    >

I'm not married to any of that language. But explaining the "why" might be nice (if the team agrees!).

@mooreryan
Copy link
Contributor Author

But explaining the "why" might be nice

I like the idea of adding some clarification of the reasoning of why things were designed this way. For me, it was pretty surprising to see the same action (trying to access a column) have different result depending on the argument, (sometimes raising, sometimes returning an empty dataframe, etc.).

(note: this got a bit rambling, there is a summary at the bottom)

More examples

To be honest, I don't understand the reasoning behind sometimes raising and sometimes returning an empty dataframe. E.g., for this dataframe:

iex(1)> df = DataFrame.new(a: [1, 2])
#Explorer.DataFrame<
  Polars[2 x 1]
  a s64 [1, 2]
>

This raises:

iex(2)> df["b"]
** (ArgumentError) could not find column name "b". The available columns are: ["a"].
If you are attempting to interpolate a value, use ^b.

    (explorer 0.9.2) lib/explorer/shared.ex:252: Explorer.Shared.maybe_raise_column_not_found/3
    (explorer 0.9.2) lib/explorer/shared.ex:187: anonymous fn/4 in Explorer.Shared.to_existing_columns/3
    (elixir 1.17.2) lib/enum.ex:1829: Enum."-map_reduce/3-lists^mapfoldl/2-0-"/3
    (explorer 0.9.2) lib/explorer/shared.ex:175: Explorer.Shared.to_existing_columns/3
    (explorer 0.9.2) lib/explorer/data_frame.ex:3984: Explorer.DataFrame.pull/2
    (explorer 0.9.2) lib/explorer/data_frame.ex:373: Explorer.DataFrame.fetch/2
    (elixir 1.17.2) lib/access.ex:322: Access.get/3
    iex:2: (file)

But this returns an empty data frame:

iex(2)> df[~r/b/]
#Explorer.DataFrame<
  Polars[0 x 0]
>

Asking for the 2nd column raises:

iex(3)> df[1]
** (ArgumentError) no column exists at index 1
    (explorer 0.9.2) lib/explorer/shared.ex:240: Explorer.Shared.fetch_column_at!/2
    (explorer 0.9.2) lib/explorer/shared.ex:178: anonymous fn/4 in Explorer.Shared.to_existing_columns/3
    (elixir 1.17.2) lib/enum.ex:1829: Enum."-map_reduce/3-lists^mapfoldl/2-0-"/3
    (explorer 0.9.2) lib/explorer/shared.ex:175: Explorer.Shared.to_existing_columns/3
    (explorer 0.9.2) lib/explorer/data_frame.ex:3984: Explorer.DataFrame.pull/2
    (explorer 0.9.2) lib/explorer/data_frame.ex:373: Explorer.DataFrame.fetch/2
    (elixir 1.17.2) lib/access.ex:322: Access.get/3
    iex:3: (file)

But this returns a data frame with just column a even though I'm asking for the first both the 1st and 2nd columns:

iex(3)> df[0..1]
#Explorer.DataFrame<
  Polars[2 x 1]
  a s64 [1, 2]
>

Comparing to R's tidyverse

I'm not making a value judgement on this, rather trying to explain what I found confusing. Consider this R tidyverse example. Here is a data frame.

 > df <- tibble(a = 1:2)
> df
# A tibble: 2 × 1
      a
  <int>
1     1
2     2

And here attempting to access a column that doesn't exist:

 > df["b"]
Error in `df["b"]`:
! Can't subset columns that don't exist.
✖ Column `b` doesn't exist.

That gives an error similar to Explorer. And the result of selecting columns from a data frame with regex also behaves like Explorer, by giving an "empty" dataframe.

> df |> select(matches("b"))
# A tibble: 2 × 0

Same behavior when accessing a column by index when that index is out of bounds (R using 1-based indexing):

> df[2]
Error in `df[2]`:
! Can't subset columns past the end.
ℹ Location 2 doesn't exist.
ℹ There is only 1 column.

But doing the equivalent of the df[0..1] example from Explorer:

> df[1:2]
Error in `df[1:2]`:
! Can't subset columns past the end.
ℹ Location 2 doesn't exist.
ℹ There is only 1 column.

That raises an error in R, whereas Explorer gives the data frame with just column a returned.

Summary

Explorer's access behavior is surprising to me in some ways. Not making a judgement on how it should be, since I don't have enough experience with Explorer, just trying to think of ways that docs could make it clearer.

I think some of this stuff is out of the scope for this small PR though. In fact, it would probably make sense as part of a cheat sheet or something like Explorer for Tidyverse users. I may have a go at putting something like that together, but it can be a big task.

(There was a discussion here about cheat sheets too.)

@billylanchantin
Copy link
Contributor

To summarize and expand a little, here are your uses in the three relevant libraries:

Use Explorer Tidy Polars
Specific column df["b"] df["b"] df["b"]
String match df[~r/b/] select(df, matches("b")) df.select(cs.matches("b"))
Range past end df[0..1] df[1:2] df[0:2]

And here are their outputs:

Use Explorer Tidy Polars Agreement?
Specific column ArgumentError Error ColumnNotFoundError ✅ Yes
String match DataFrame DataFrame DataFrame ✅ Yes
Range past end DataFrame Error DataFrame ❌ No

So there's only a mismatch on the range one if I'm following. And I agree it's a bit borderline (and admit I was surprised ranges didn't raise). But when designing the API, the Explorer team is balancing a few tensions: what makes sense in Elixir, what data scientists are used to, what's feasible to implement today via Polars, etc. In this case, though I can't say for sure, I think we just defaulted to how Polars handles ranges rather than the tidyverse. Both seem somewhat reasonable to me.


But regardless of the reasoning behind individual design decisions, clearer/better docs are almost always the right call.

And as for the cheatsheet stuff, it would be appreciated! Even a start that can be improved upon later.

@josevalim
Copy link
Member

I think we should raise for ranges out of bounds, for consistency within DataFrames and also with Nx.

@billylanchantin
Copy link
Contributor

Issue for raising with ranges: #1005

Thanks for pointing it out, @mooreryan!

@mooreryan
Copy link
Contributor Author

But when designing the API, the Explorer team is balancing a few tensions: what makes sense in Elixir, what data scientists are used to, what's feasible to implement today via Polars, etc.
...
But regardless of the reasoning behind individual design decisions, clearer/better docs are almost always the right call.

This sort of info would be pretty cool to have in docs somewhere--I'm just not sure where they would ideally go.

@billylanchantin
Copy link
Contributor

Yeah I'm not sure. Maybe our contributor's guidelines?

The trailing newline causes issues in doctests.  See comments of PR elixir-explorer#1004 for context.
@josevalim josevalim merged commit 6e40319 into elixir-explorer:main Oct 24, 2024
3 checks passed
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants