Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(field_index): get field index w.r.t. pre-join table schemata #1078

Merged
merged 3 commits into from
Jul 29, 2024

Conversation

gforsyth
Copy link
Member

@gforsyth gforsyth commented Jul 22, 2024

JoinChains provide the schema of the joined table (which is great for Ibis)
but for substrait we need the Field index computed with respect to
the original table schemata. In practice, this means rolling through
the tables in a JoinChain and computing the field index without
removing the join key

Given

Table 1
  a: int
  b: int

Table 2
  a: int
  c: int

JoinChain[r0]
 JoinLink[inner, r1]
   r0.a == r1.a
 values:
   a: r0.a
   b: r0.b
   c: r1.c

If we ask for the field index of c, the JoinChain schema will give
us an index of 2, but it should be 3 because

 0: table 1 a
 1: table 1 b
 2: table 2 a
 3: table 2 c

So now we pull out the correct JoinReference object and use that to
index into the tables in the JoinChain and offset by the length of the
schema of those preceding tables.

Resolves #1072

JoinChains provide the schema of the joined table (which is great for Ibis)
but for substrait we need the Field index computed with respect to
the original table schemata.  In practice, this means rolling through
the tables in a JoinChain and computing the field index _without_
removing the join key

Given
Table 1
  a: int
  b: int

Table 2
  a: int
  c: int

JoinChain[r0]
 JoinLink[inner, r1]
   r0.a == r1.a
 values:
   a: r0.a
   b: r0.b
   c: r1.c

If we ask for the field index of `c`, the JoinChain schema will give
us an index of `2`, but it should be `3` because

 0: table 1 a
 1: table 1 b
 2: table 2 a
 3: table 2 c

So now we pull out the correct JoinReference object and use that to
index into the tables in the JoinChain and offset by the length of the
schema of those preceding tables.
@gforsyth gforsyth force-pushed the fix_join_chain_field_offsets branch from bdf1acf to e10b9b8 Compare July 22, 2024 16:06
@gforsyth gforsyth requested a review from cpcloud July 23, 2024 12:50
Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, only thing I think should definitely be changed is the .get call!

ibis_substrait/compiler/translate.py Outdated Show resolved Hide resolved
ibis_substrait/compiler/translate.py Outdated Show resolved Hide resolved
ibis_substrait/compiler/translate.py Outdated Show resolved Hide resolved
@@ -732,7 +732,7 @@
"selection": {
"directReference": {
"structField": {
"field": 1
"field": 45
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A 45x improvement in accuracy!!

@gforsyth gforsyth requested a review from cpcloud July 26, 2024 12:16
@gforsyth gforsyth force-pushed the fix_join_chain_field_offsets branch from 47df9f7 to 5418744 Compare July 26, 2024 12:22
@gforsyth gforsyth merged commit 7095b19 into ibis-project:main Jul 29, 2024
12 checks passed
@gforsyth gforsyth deleted the fix_join_chain_field_offsets branch July 29, 2024 14:24
gforsyth pushed a commit that referenced this pull request Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: Aggregation field index off by one when joining two tables.
2 participants