Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] improve Spark Connect compatibility for types and count behavior #3352

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

andrewgazelka
Copy link
Member

@andrewgazelka andrewgazelka commented Nov 20, 2024

This commit enhances Spark Connect compatibility in two key areas:

  1. Type Compatibility:

    • Add conversion of unsigned integers to signed integers for Spark compatibility
    • Implement to_spark_compatible_datatype for proper type casting
    • Cast columns before generating responses in execute stream
  2. Count Behavior:

    • Fix special case for count(lit(1)) to match Spark's behavior
    • When counting literal(1), now returns total row count instead of always 1
    • Add test case to verify count behavior matches Spark

This addresses compatibility issues when running Daft through Spark Connect,
particularly for unsigned integer types which aren't supported in Spark's type
system and count aggregations that need to match Spark's semantics.

Related: #3421

Copy link
Member Author

andrewgazelka commented Nov 20, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

@andrewgazelka andrewgazelka marked this pull request as ready for review November 21, 2024 00:32
@andrewgazelka andrewgazelka changed the base branch from andrew/connect-alias to graphite-base/3352 November 21, 2024 04:13
@andrewgazelka andrewgazelka changed the base branch from graphite-base/3352 to main November 21, 2024 04:17
@andrewgazelka andrewgazelka force-pushed the andrew/fix-cast-arrow-conversion branch 2 times, most recently from 57a2dd0 to 72aa46c Compare November 21, 2024 04:21
Copy link

codspeed-hq bot commented Nov 21, 2024

CodSpeed Performance Report

Merging #3352 will degrade performances by 24.95%

Comparing andrew/fix-cast-arrow-conversion (ae31759) with main (fa1d9d7)

Summary

⚡ 1 improvements
❌ 1 regressions
✅ 15 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main andrew/fix-cast-arrow-conversion Change
test_iter_rows_first_row[100 Small Files] 364.1 ms 273.2 ms +33.3%
test_show[100 Small Files] 23.8 ms 31.7 ms -24.95%

Copy link

graphite-app bot commented Nov 21, 2024

Graphite Automations

"Notify author when CI fails" took an action on this PR • (11/21/24)

1 teammate was notified to this PR based on Andrew Gazelka's automation.

@andrewgazelka andrewgazelka force-pushed the andrew/fix-cast-arrow-conversion branch 5 times, most recently from ab26cbc to b625819 Compare November 25, 2024 21:06
@andrewgazelka andrewgazelka changed the title [FIX] (WIP) casting of arrays from daft to arrow with unsigned [FIX] Nov 25, 2024
@andrewgazelka andrewgazelka changed the title [FIX] [FIX] improve Spark Connect compatibility for types and count behavior Nov 25, 2024
@andrewgazelka andrewgazelka changed the title [FIX] improve Spark Connect compatibility for types and count behavior [BUG] improve Spark Connect compatibility for types and count behavior Nov 25, 2024
@github-actions github-actions bot added the bug Something isn't working label Nov 25, 2024
Copy link

codecov bot commented Nov 25, 2024

Codecov Report

Attention: Patch coverage is 73.33333% with 12 lines in your changes missing coverage. Please review.

Project coverage is 77.44%. Comparing base (fa1d9d7) to head (ae31759).

Files with missing lines Patch % Lines
src/daft-connect/src/translation/datatype.rs 37.50% 10 Missing ⚠️
...onnect/src/translation/expr/unresolved_function.rs 80.00% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3352      +/-   ##
==========================================
+ Coverage   77.35%   77.44%   +0.08%     
==========================================
  Files         684      684              
  Lines       83637    83680      +43     
==========================================
+ Hits        64694    64802     +108     
+ Misses      18943    18878      -65     
Files with missing lines Coverage Δ
src/daft-connect/src/lib.rs 64.21% <ø> (ø)
src/daft-connect/src/op/execute/root.rs 98.36% <100.00%> (+2.70%) ⬆️
...onnect/src/translation/expr/unresolved_function.rs 75.60% <80.00%> (+75.60%) ⬆️
src/daft-connect/src/translation/datatype.rs 11.29% <37.50%> (+3.88%) ⬆️

... and 8 files with indirect coverage changes

Copy link
Contributor

@universalmind303 universalmind303 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need to do a physical cast on the tables. Shouldn't this all be handled during planning, not execution?

Iterating over the materialized tables like this is extremely expensive and inefficient.

Copy link
Member Author

@universalmind303 how would I do casting on a LogicalPlanBuilder directly? I might be overthinking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants