Move to keyset pagination #220

d33bs · 2024-08-01T17:03:42Z

Description

This PR changes "chunk" behavior to follow keyset pagination patterns. For each table we gather a set of pages (a "pageset") based on a key column, and that key is used to split the table's rows into pages. Pagesets are gathered ahead of time to retain the ability to parallelize where appropriate. This change requires a new configuration parameter to be specified per table and for the overall join.
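
As a rough illustration of the pageset idea (not the code in this PR), the sketch below splits a table's key values into inclusive `(start, end)` bounds that can be computed up front and then handed to parallel workers. The helper name `make_pagesets` and the `page_size` parameter are hypothetical, and deduplicating the keys so that bounds never overlap is an assumption of this sketch.

```python
from typing import List, Tuple


def make_pagesets(keys: List[int], page_size: int) -> List[Tuple[int, int]]:
    """Hypothetical helper: build inclusive (start, end) key bounds.

    Keys are deduplicated and sorted so that no two bounds overlap.
    If a key repeats in the table, a page may hold more than
    `page_size` rows, but no row is ever read twice.
    """
    ordered = sorted(set(keys))
    return [
        (ordered[i], ordered[min(i + page_size, len(ordered)) - 1])
        for i in range(0, len(ordered), page_size)
    ]


# Keys need not be contiguous; bounds follow the values that exist.
pagesets = make_pagesets([1, 2, 3, 5, 8, 13, 21], page_size=3)
print(pagesets)  # [(1, 3), (5, 13), (21, 21)]
```

Because every bound is known before any page is read, each `(start, end)` pair can be processed independently.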

Operationally, this reduces the amount of data stored in tables throughout, reduces the complexity of the sorting performed, and avoids the unnecessary data loading that occurs when OFFSET is used.
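
A minimal sketch of the difference between the two query styles, using Python's built-in `sqlite3` as a stand-in for the engine used in this project. The `cells` table, the `object_id` key, and the page bounds are made up for illustration.

```python
import sqlite3

# In-memory table with a unique integer key (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (object_id INTEGER PRIMARY KEY, area REAL)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?)",
    [(i, i * 1.5) for i in range(1, 11)],
)

page_size = 4

# OFFSET style: the engine has to step over the skipped rows
# before it can return the requested page.
offset_page = conn.execute(
    "SELECT object_id FROM cells ORDER BY object_id LIMIT ? OFFSET ?",
    (page_size, 4),
).fetchall()

# Keyset style: the page is defined by key values, so the engine can
# filter on the key directly instead of counting past earlier rows.
keyset_page = conn.execute(
    "SELECT object_id FROM cells WHERE object_id BETWEEN ? AND ? "
    "ORDER BY object_id",
    (5, 8),
).fetchall()

# Both approaches select the same rows for the second page.
assert offset_page == keyset_page
print(keyset_page)  # [(5,), (6,), (7,), (8,)]
```

The keyset query also depends only on the data in the key column, not on the row order produced by a particular LIMIT and OFFSET evaluation.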

These changes made the metadata columns unnecessary, so I've removed them from every spot I could find.

These changes are intended to address #214, which turned out to be more related to data size than to the sorting implementation. This is confirmed by a now-passing large_data_tests test of JUMP data using the full dataset (instead of a truncated version).

Closes #214

What is the nature of your change?

Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).
This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

I have read the CONTRIBUTING.md guidelines.
My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
New and existing unit tests pass locally with my changes.
I have added tests that prove my fix is effective or that my feature works.
I have deleted all non-relevant text in this pull request template.

falquaddoomi

Neat solution to the problem you were facing; I agree that it's less brittle than using LIMIT and OFFSET since you're now relying on the data rather than implementation-specific definitions of LIMIT and OFFSET.

I think you could add a bit more description of what columns are good candidates for keys; from my reading of your code I assume they'd be unique integers.

Anyway, nice work, and glad to see this issue put to rest!

cytotable/convert.py

Co-Authored-By: Faisal Alquaddoomi <[email protected]>

d33bs · 2024-08-02T20:20:04Z

Thanks @falquaddoomi for the review and comments! I made some adjustments based on what you mentioned.

jenna-tomkinson

I left a few comments, mainly clarification about the fix. Nice job! 🎉

cytotable/constants.py

cytotable/convert.py

tests/test_convert_threaded.py

Co-Authored-By: Jenna Tomkinson <[email protected]>

gwaybio

Exciting to see this enhancement! I made several comments, mostly aimed at improving clarity in comments and variable names. After addressing, feel free to merge 👍

cytotable/convert.py

cytotable/utils.py

docs/source/overview.md

Co-Authored-By: Gregory Way <[email protected]>

d33bs · 2024-08-26T22:08:06Z

Thank you @gwaybio, @jenna-tomkinson and @falquaddoomi for your reviews! After making adjustments for all comments, I believe this is now ready and I'll merge it in.

d33bs added 12 commits July 30, 2024 17:05

move to keyset pagination

67d774f

linting

1a4c6ed

adjustments for further data integration

df1f7a5

test adjustments

23e8f0b

linting corrections

0d82ecb

run dev workflow

e8e9565

fix docs

b809695

remove custom sql join for large data test

c3bfea5

test with rowcount update

1e391c2

linting, docs, cleanup

c2aaadb

cleanup

df677f4

remove test branch

0501350

d33bs requested review from falquaddoomi, kenibrewer and jenna-tomkinson August 1, 2024 17:38

d33bs marked this pull request as ready for review August 1, 2024 17:38

falquaddoomi reviewed Aug 1, 2024

d33bs requested a review from gwaybio August 1, 2024 22:24

d33bs and others added 4 commits August 2, 2024 13:08

rework pageset generation

6fbf1b2

linting

2517df6

fix types and docstring

9001152

Co-Authored-By: Faisal Alquaddoomi <[email protected]>

docs and types

2deef1e

d33bs requested a review from falquaddoomi August 5, 2024 14:20

jenna-tomkinson reviewed Aug 5, 2024

update to use pageset instead of chunk fxn names

73f9ac4

Co-Authored-By: Jenna Tomkinson <[email protected]>

gwaybio approved these changes Aug 25, 2024

d33bs and others added 3 commits August 26, 2024 08:45

update variable names

b59aac5

Co-Authored-By: Gregory Way <[email protected]>

better exception docs

c04ebf3

better docstring parameter description

d0c6892

Co-Authored-By: Gregory Way <[email protected]>

d33bs and others added 7 commits August 26, 2024 10:23

better documentation and exception msg

e9681f7

remove exception

85678e2

point to the example in docs

178149c

Co-Authored-By: Gregory Way <[email protected]>

add further notes about when/why to customize

e33de20

linting

22c326b

remove test raise

f3a671d

sorted results from concatenation by natural sort

c41796e

d33bs merged commit 81584ba into cytomining:main Aug 26, 2024
12 checks passed

d33bs deleted the pagination-changes branch August 26, 2024 22:08

d33bs mentioned this pull request Sep 1, 2024

Reduce join operation memory consumption #223

Open
