Simple JOINs not working #248

Closed
jacopotagliabue opened this issue Oct 17, 2024 · 7 comments · Fixed by #250
Labels: bug (Something isn't working), high priority (Prompt attention needed)

Comments

@jacopotagliabue

Hey, thanks for your work, I really appreciate the effort!

I am having trouble with very simple JOINs. The code is proprietary, so I need to replace a few labels, but the TL;DR is that I dropped in the library as a PySpark replacement, as suggested in the "getting started" guide. Installation was relatively simple!

Unfortunately, I'm stuck at the first DataFrame join. The only column in the DataFrame to be joined is myfield:

 print(_df.columns) -> [myfield]

But then the join operation on the other df gives me this error:

header = header.join(
    _df,
    ['myfield'],
    'inner'
)
pyspark.errors.exceptions.connect.SparkRuntimeException: type_coercion
caused by
Schema error: No field named "myfield". Valid fields are "#117", "#118", "#119", "?table?"."#18", "?table?"."#27", "?table?"."#10", "?table?"."#17", "?table?"."#13", "?table?"."#9", "?table?"."#120".

Something is obviously wrong with the field names, as none of my columns correspond to the names in this error message (note that I did not obfuscate anything in the error message except myfield).

More generally, as the library matures it would be great to have an explicit compatibility table with Spark ops (in the README?), so users can get a sense of how far the "drop-in replacement" support goes before embarking on a conversion project.

Happy to test fixes and other stuff if needed.

@linhr
Contributor

linhr commented Oct 17, 2024

Thanks for reaching out!

Yeah, there may be issues with join operations. The field names starting with # are internal field names used when generating the query plan. Our Spark test suites have a few known errors with a message similar to the one you posted, so there is something wrong in join operation planning. We will post our findings here and let you know when the fix is ready.

Having a compatibility table is a good idea! I've created an issue (#249) to track this.
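
A side note for anyone hitting a similar message: the point above about `#`-prefixed names can be checked mechanically. The sketch below is a hypothetical helper (`summarize_schema_error` is not part of Sail or PySpark). It assumes the error text has the same shape as the one posted in this issue, and it splits the listed fields into internal placeholders and ordinary names.

```python
import re

# A field reference is one or more double-quoted parts joined by dots,
# e.g. "#117" or "?table?"."#18".
_FIELD_REF = re.compile(r'(?:"[^"]*"\.)*"[^"]*"')


def summarize_schema_error(message: str) -> dict:
    """Split the 'Valid fields are ...' list into internal and named fields.

    Assumption: names whose last part starts with '#' are internal
    placeholders, as described in the comment above.
    """
    missing = re.search(r'No field named "([^"]*)"', message)
    # Only look at the text after the "Valid fields are" marker.
    _, _, tail = message.partition("Valid fields are")
    internal, named = [], []
    for ref in _FIELD_REF.findall(tail):
        column = re.findall(r'"([^"]*)"', ref)[-1]  # last dotted part
        (internal if column.startswith("#") else named).append(ref)
    return {
        "missing": missing.group(1) if missing else None,
        "internal": internal,
        "named": named,
    }


# The error message from this issue, copied as posted.
message = (
    'Schema error: No field named "myfield". Valid fields are "#117", "#118", '
    '"#119", "?table?"."#18", "?table?"."#27", "?table?"."#10", '
    '"?table?"."#17", "?table?"."#13", "?table?"."#9", "?table?"."#120".'
)
summary = summarize_schema_error(message)
print(summary["missing"], len(summary["internal"]), summary["named"])
```

If every valid field is internal and the missing field is one of your own column names, the failure more likely comes from plan generation than from your code, which matches what was reported here.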

@shehabgamin
Contributor

I'll start looking into this within the next hour. Please let us know if you run into any other issues and we'll make sure to prioritize!

@shehabgamin shehabgamin self-assigned this Oct 17, 2024
@shehabgamin shehabgamin added the bug (Something isn't working) and high priority (Prompt attention needed) labels Oct 17, 2024
@shehabgamin shehabgamin linked a pull request Oct 17, 2024 that will close this issue
@jacopotagliabue
Author

Thanks! No rush on my side - look forward to seeing how this evolves.

@shehabgamin
Contributor

> Thanks! No rush on my side - look forward to seeing how this evolves

#250 now passes all tests in the DataFrame.join() doctest (from the Spark codebase), except for one.

This is the failing test case:

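from pyspark.sql import Row
from pyspark.sql.functions import desc
# Full outer join on "name"; "Tom" exists only in df2 (assumes an active `spark` session).
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name")
df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")])
df.join(df2, 'name', 'outer').select('name', 'height').sort(desc("name")).show()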

Differences (unified diff with -expected +actual):
    @@ -2,6 +2,7 @@
     | name|height|
     +-----+------+
    -|  Tom|    80|
     |  Bob|    85|
     |Alice|  NULL|
    +| NULL|    80|
     +-----+------+

We'll merge this PR and roll out a new release shortly. Follow-up work will continue afterward to address the last test case!
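
For readers following along, the expected output in the failing test case above implies that an outer join on a column name returns a single `name` column, filled from whichever side has the row. That is why `Tom` is expected rather than `NULL`. The sketch below models that behavior in plain Python. It is only an illustration under that assumption, not Spark's or Sail's implementation, and `full_outer_join_using` is a hypothetical name.

```python
def full_outer_join_using(left, right, key, left_cols, right_cols):
    """Full outer join of two lists of dicts that share the column `key`.

    The output has a single `key` column holding the value from whichever
    side has the row, i.e. coalesce(left[key], right[key]).
    """
    joined = []
    matched_right = set()
    for l_row in left:
        # SQL-style equality: a NULL (None) key never matches anything.
        partners = [
            i for i, r_row in enumerate(right)
            if l_row[key] is not None and r_row[key] == l_row[key]
        ]
        for i in partners:
            matched_right.add(i)
            r_row = right[i]
            joined.append({key: l_row[key],
                           **{c: l_row[c] for c in left_cols},
                           **{c: r_row[c] for c in right_cols}})
        if not partners:
            # Left-only row: right-side columns are NULL.
            joined.append({key: l_row[key],
                           **{c: l_row[c] for c in left_cols},
                           **{c: None for c in right_cols}})
    for i, r_row in enumerate(right):
        if i not in matched_right:
            # Right-only row: the key falls back to the right-side value.
            joined.append({key: r_row[key],
                           **{c: None for c in left_cols},
                           **{c: r_row[c] for c in right_cols}})
    return joined


# Same data as the failing test case.
left = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]
right = [{"height": 80, "name": "Tom"}, {"height": 85, "name": "Bob"}]

rows = full_outer_join_using(left, right, "name", ["age"], ["height"])
result = sorted(rows, key=lambda r: r["name"], reverse=True)
for r in result:
    print(r["name"], r["height"])
```

In the actual output shown in the diff, the right-only row has a `NULL` name, which would be consistent with the key being taken from the left side only. That reading is an inference from the diff, not a confirmed diagnosis.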

@shehabgamin
Contributor

Sorry, didn't mean to close!

@linhr
Contributor

linhr commented Oct 17, 2024

> This is the failing test case:

I've created a separate issue (#251) to track the remaining work.

@shehabgamin
Contributor

@jacopotagliabue Sail v0.1.5 is now released. Please let us know if there is anything else we can do for ya!

3 participants