Refactor GroupBy, findSegments, and Unique #846

Closed
reuster986 opened this issue Jun 16, 2021 · 0 comments · Fixed by #1330
Assignees
Labels
enhancement New feature or request

Comments

@reuster986
Collaborator

I have two goals for this issue, and wanted to start a discussion about them:

  • Separate grouping and sorting semantics

    • argsort and coargsort should actually sort the array(s). Currently, calling coargsort on a list that includes a Strings or Categorical only groups the rows; it does not sort them.
    • GroupBy should guarantee grouping, but not necessarily sorting. Strings and Categorical should each have separate APIs for sorting and for grouping, and GroupBy should call the grouping one.
  • Consolidate GroupBy/findSegments logic and migrate to Chapel

    • The uniqueMsg function in Chapel actually does 95% of what GroupBy needs. I propose refactoring uniqueMsg and its sub-functions to
      • handle all groupable types: int64 pdarray, Strings, and Categorical, as well as lists of these
      • optionally return a permutation, segments, and unique key indices, in addition to the unique values. These are already computed internally (or are trivially derivable from what is computed), and comprise all the information necessary to construct a GroupBy.
    • Doing so will reduce code (by rendering findSegmentsMsg unnecessary) and improve performance in some cases (e.g. when arrays are packed into Chapel tuples for coargsort).
    • This work would be greatly simplified by having a MultiArray class in the server -- a GenSymEntry that holds multiple equal-length columns, like a dataframe but without all the methods.
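
To make the proposal concrete, here is a minimal pure-Python sketch of a "unique" routine that also returns the pieces a GroupBy needs. It is not Arkouda's API and not the Chapel implementation: the function name `unique_with_grouping`, its return layout, and the dict-based bucketing are all assumptions chosen for illustration. It takes a list of equal-length columns (standing in for the MultiArray idea) and deliberately groups without sorting, so the permutation it returns makes each group contiguous but does not put the keys in order.

```python
from typing import Hashable, List, Sequence, Tuple


def unique_with_grouping(
    columns: Sequence[Sequence[Hashable]],
) -> Tuple[List[tuple], List[int], List[int], List[int]]:
    """Hypothetical helper: group rows of equal-length columns without sorting.

    Returns (unique_keys, permutation, segments, unique_key_indices):
      unique_keys        - one key tuple per group, in first-appearance order
      permutation        - row indices ordered so each group is contiguous
      segments           - start offset of each group in the permuted rows
      unique_key_indices - original index of the first row of each group
    """
    if not columns:
        raise ValueError("need at least one column")
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")

    # Bucket row indices by key. Dicts preserve insertion order, so groups
    # come out in first-appearance order rather than sorted order.
    buckets: dict = {}
    for i, key in enumerate(zip(*columns)):
        buckets.setdefault(key, []).append(i)

    unique_keys: List[tuple] = []
    permutation: List[int] = []
    segments: List[int] = []
    unique_key_indices: List[int] = []
    for key, rows in buckets.items():
        unique_keys.append(key)
        segments.append(len(permutation))  # where this group starts
        unique_key_indices.append(rows[0])
        permutation.extend(rows)
    return unique_keys, permutation, segments, unique_key_indices


col = ["b", "a", "b", "c", "a"]
keys, perm, segs, first = unique_with_grouping([col])
# Grouping permutation: equal keys are adjacent, but keys are not ordered.
print(keys, perm, segs, first)
# A true (stable) sort of the same column yields a different permutation.
print(sorted(range(len(col)), key=lambda i: col[i]))
```

In this sketch the grouping permutation is `[0, 2, 1, 4, 3]` while the sorting permutation is `[1, 4, 0, 2, 3]`; both are valid inputs for building a GroupBy, which is why the two semantics can be separated. Passing several columns groups on the tuple of values per row, roughly what a multi-column key would need.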
3 participants