Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Table.add_group_number, with operations Unique and Equal_Count #11818

Merged
merged 25 commits into from
Dec 18, 2024

Conversation

GregoryTravis
Copy link
Contributor

@GregoryTravis GregoryTravis commented Dec 9, 2024

Pull Request Description

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    TypeScript,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • Unit tests have been written where possible.
  • If meaningful changes were made to logic or tests affecting Enso Cloud integration in the libraries,
    or the Snowflake database integration, a run of the Extra Tests has been scheduled.
    • If applicable, it is suggested to paste a link to a successful run of the Extra Tests.

@GregoryTravis
Copy link
Contributor Author

Screen Shot 2024-12-09 at 4 37 03 PM


## Create the specified number of buckets with the same number of rows in
each bucket (except possibly the last one).
Equal_Count bucket_count:Integer
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@radeusgd Suggested we use the term 'bucket' here, such as Equal_Size_Bucket. I agree that 'bucket' might be a good word here, but since this is just one of several grouping methods, I'm not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this whole component should be talking about buckets instead of groups. I think group is too closely assocaited with aggregating,

@GregoryTravis GregoryTravis marked this pull request as ready for review December 10, 2024 17:36
GROUP Standard.Base.Values
ICON column_add
Adds a new column to the table enumerating groups of rows, assigning each
row to one group number. All rows in each group will get the sane number.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
row to one group number. All rows in each group will get the sane number.
row to one group number. All rows in each group will get the same number.

@radeusgd
Copy link
Member

Are we sure that group_by should be a parameter to the method?

Looking at #11782, it seems that it is only used in Unique grouping method. All other grouping methods seem to be constructing groups based on other conditions. Thus including this argument as a 'standalone' argument is confusing since it can be used only in this scenario. I suggest to move it to make it an argument of the Unique constructor, so that it only shows up in that specific mode.

- name: The name of the new column. Defaults to "Group".
- from: The starting value for the enumeration. Defaults to 0.
- step: The amount to increment the enumeration by. Defaults to 1.
- group_by: Specifies the columns to group by, for grouping methods that
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make more sense to put this into Grouping_Method.Unique rather than as a separate argument -- and perhaps the same with order_by -- because each method has different requirements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to send this -- @radeusgd you also suggested it, so I did it.

@GregoryTravis
Copy link
Contributor Author

GregoryTravis commented Dec 11, 2024

Moving group_by into Unique is a better UX, but I'm not sure how to create a widget for the on parameter within Grouping_Method. I imagine I would have to define the widget like this:

    @on [some widget]
    Unique (on:(Vector | Text | Integer | Regex)=(Missing_Argument.throw "on"))

But then it doesn't have the table argument to get the column names from.

Is there an example of something like this?

EDIT: never mind, I am looking at the on paramter to Table.join.

@GregoryTravis
Copy link
Contributor Author

Widgets:

Screen Shot 2024-12-11 at 3 07 06 PM

@from (Widget.Numeric_Input display=..Always)
@group_by (Widget_Helpers.make_column_name_multi_selector display=..When_Modified)
@order_by (Widget_Helpers.make_order_by_selector display=..When_Modified)
add_group_number self (grouping_method:Grouping_Method=..Unique) (name:Text="Group") (from:Integer=0) (step:Integer=1) (on_problems:Problem_Behavior=..Report_Warning) -> Table =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
add_group_number self (grouping_method:Grouping_Method=..Unique) (name:Text="Group") (from:Integer=0) (step:Integer=1) (on_problems:Problem_Behavior=..Report_Warning) -> Table =
add_group_number self (grouping_method:Grouping_Method) (name:Text="Group") (from:Integer=0) (step:Integer=1) (on_problems:Problem_Behavior=..Report_Warning) -> Table =

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to Missing_Argument.

Copy link
Member

@radeusgd radeusgd Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to default to the Unique some grouping method?

Perhaps Equal_Count is a good candidate for a default as it works regardless of what columns are there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do have a default, I would say Unique because it's used much more.

In general, do we prefer to have defaults for things, or is it also ok to have no default, which forces the user to choose, but also it shows up as a warning, which the user might not want to see every time they use this method. I'm fine either way -- no default, or Unique [first column].

@AdRiley
Copy link
Member

AdRiley commented Dec 12, 2024

What do we think of calling it bucket?
Or add_bucket_number?

The on widget isn't working for me.

image

@AdRiley
Copy link
Member

AdRiley commented Dec 12, 2024

order_by also not working
image

@GregoryTravis
Copy link
Contributor Author

@AdRiley I had forgotten to push my widget changes.

I'm not sure about bucket; I think it also connotes collapsing things into groups. This method just adds a column, so I think add_group_number or add_bucket_number would be better.

@jdunkerley I know this is step 1 in implementing something like Alteryx's "tile" tool, which sounds more like arranging and labeling then combining, what do you think?

@GregoryTravis
Copy link
Contributor Author

Renamed to Equal_Bucket_Count.

@AdRiley
Copy link
Member

AdRiley commented Dec 17, 2024

I'm confused what the difference is between group and bucket? Is there a difference?

This reverts commit 7ada84c.
@GregoryTravis
Copy link
Contributor Author

Revered to "group" after discussion with @jdunkerley and @AdRiley

Comment on lines +39 to +48
_prepare_group_by table problem_builder group_by =
table.columns_helper.select_columns_helper group_by Case_Sensitivity.Default True problem_builder . map c->c.java_column

_prepare_ordering table problem_builder order_by =
ordering = Table_Helpers.resolve_order_by table.columns order_by problem_builder
ordering_columns = ordering.map c->c.column.java_column
directions = ordering.map c->c.associated_selector.direction.to_sign
[ordering_columns, directions]

_illegal_if b msg ~cont = if b then Error.throw (Illegal_Argument.Error msg) else cont
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should either mark the whole module private (if applicable) or mark the internal helper methods private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marked module as private.

Comment on lines +95 to +98
group_builder.specify "must specify group_by with Unique" <|
t = table_builder_from_rows ['x', 'y', 'z'] [[1, 0, 2], [0, 1, 0], [1, 2, 0], [0, 1, 1], [1, 0, 1], [1, 2, 1]]

t.add_group_number ..Unique "g" . should_fail_with Missing_Argument
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to also test edge case where on is specified but is empty (on=[])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +50 to +75
group_builder.specify "should add group number by equal counts" <|
t = table_builder [['x', [1, 2, 3, 4, 5]], ['y', [5, 4, 3, 2, 1]], ['z', [1, 5, 4, 2, 3]]]

g0 = t.add_group_number (..Equal_Count 3) "g"
g0.at 'g' . to_vector . should_equal [0, 0, 1, 1, 2]

g1 = t.add_group_number (..Equal_Count 3 order_by=['x']) "g"
g1.at 'g' . to_vector . should_equal [0, 0, 1, 1, 2]

g2 = t.add_group_number (..Equal_Count 3 order_by=['y']) "g"
g2.at 'g' . to_vector . should_equal [2, 1, 1, 0, 0]

g3 = t.add_group_number (..Equal_Count 3 order_by='z') "g"
g3.at 'g' . to_vector . should_equal [0, 2, 1, 0, 1]

g4 = t.add_group_number (..Equal_Count 2) "g"
g4.at 'g' . to_vector . should_equal [0, 0, 0, 1, 1]

g5 = t.add_group_number (..Equal_Count 2 order_by=['x']) "g"
g5.at 'g' . to_vector . should_equal [0, 0, 0, 1, 1]

g6 = t.add_group_number (..Equal_Count 2 order_by=['y']) "g"
g6.at 'g' . to_vector . should_equal [1, 1, 0, 0, 0]

g7 = t.add_group_number (..Equal_Count 2 order_by='z') "g"
g7.at 'g' . to_vector . should_equal [0, 1, 1, 0, 0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to also test edge cases:

  • ..Equal_Count 0 -> should fail with illegal argument I imagine
  • ..Equal_Count -1 -> also should fail, cannot be negative
  • ..Equal_Count 1 -> should work and just assign the same group number to all entries - not very useful on its own, but may be useful when the number of groups is a parameter coming from somewhere - splitting into 1 group is technically a valid thing to do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 126 to 131
group_builder.specify "should rename existing column upon a name clash, and attach a warning" <|
t = table_builder_from_rows ['x', 'y', 'z'] [[1, 0, 2], [0, 1, 0], [1, 2, 0], [0, 1, 1], [1, 0, 1], [1, 2, 1]]

g0 = t.add_group_number (..Unique on=['x', 'y']) "y"
g0.at 'y 1' . to_vector . should_equal [0, 1, 2, 1, 0, 2]
Problems.expect_warning Duplicate_Output_Column_Names g0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the values in columns were something else the numbers 0, 1, 2 (e.g. a, b, c) it would be far easier to tell what this test does - as I wasn't really sure if it was checking if y 1 is the new column or existing column - both will have values 0, 1, 2 just in maybe different order.
Also it would still be good to check not only y 1 but also y.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@jdunkerley jdunkerley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth addressing Radek's test comments and lets make the helper private for now

Comment on lines +88 to +102
private static class StepIterator {
private final long step;
private long current;

public StepIterator(long start, long step) {
this.step = step;
this.current = start;
}

public long next() {
var toReturn = current;
current = Math.addExact(current, step);
return toReturn;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be more or less equivalent to AddRowNumber.RangeIterator. Is there any reason to keep these 2 separate or maybe we shall merge them? I like that StepIterator avoid boxing, perhaps that is enough to keep it separate - but then we may add a comment on it explaining the why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They behave differently -- RangeIterator allows you to call currentValue before next and is perhaps in an uninitialized state -- and it seems like that was possibly done on purpose; and both next and currentValue return value.

Arrays.stream(orderingColumns).map(Column::getStorage).toArray(Storage[]::new);
List<OrderedMultiValueKey> keys =
new ArrayList<>(
IntStream.range(0, (int) numRows)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are doing casts like this all over the place, but I always feel a bit uneasy, shall we prefer Math.toIntExact?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

private static class EqualCountGenerator {
private final long start;
private final long step;
private long current = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd call this currentIndex. Otherwise it is hard to say if this is current index or value and these are different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jdunkerley jdunkerley merged commit 739ee3a into develop Dec 18, 2024
38 checks passed
@jdunkerley jdunkerley deleted the wip/gmt/11782-add-group-number branch December 18, 2024 16:35
jdunkerley pushed a commit that referenced this pull request Dec 18, 2024
…al_Count` (#11818)

* wip

* one test

* wip

* wip

* more tests

* changelog

* docs, more tests

* switch in enso

* counter overflow

* run under db tests

* fmt

* merge

* move group + order into method

* wip

* bucket

* Revert "bucket"

This reverts commit 7ada84c.

* aliases

* review

(cherry picked from commit 739ee3a)
somebody1234 pushed a commit that referenced this pull request Dec 24, 2024
…al_Count` (#11818)

* wip

* one test

* wip

* wip

* more tests

* changelog

* docs, more tests

* switch in enso

* counter overflow

* run under db tests

* fmt

* merge

* move group + order into method

* wip

* bucket

* Revert "bucket"

This reverts commit 7ada84c.

* aliases

* review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants