[comet-parquet-exec] Merge upstream/main and resolve conflicts #1183

mbutrovich · 2024-12-18T21:41:33Z

This PR merges upstream/main (as of this morning) and resolves the resulting conflicts. It catches the comet-parquet-exec feature branch up on about a month of changes, including a release.

COMET_FULL_NATIVE_SCAN_ENABLED:
Tests: succeeded 728, failed 64, canceled 2, ignored 52, pending 0

COMET_NATIVE_RECORDBATCH_READER_ENABLED:
Tests: succeeded 722, failed 70, canceled 2, ignored 52, pending 0

Most of the new test failures are timestamp-to-timestamp conversions where only the time zone differs, for example:

Cause: org.apache.comet.CometNativeException: Cannot cast file schema field _19 of type Timestamp(Microsecond, Some("UTC")) to required schema field of type Timestamp(Microsecond, Some("America/Los_Angeles"))

@andygrove understands the issue, so we'll fix it after this PR merges.
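For context on the error above: in the Arrow data model, a `Timestamp(Microsecond, Some(tz))` value is stored as microseconds since the Unix epoch in UTC, and the time zone is an annotation on the type. The sketch below models that with a hypothetical `TimestampMicros` struct and `relabel_tz` helper (neither is an arrow-rs or Comet API) to show that moving between two zone-annotated types can leave the stored values untouched. It illustrates the data model only and is not the fix planned for the feature branch.

```rust
/// Hypothetical stand-in for an Arrow `Timestamp(Microsecond, tz)` value.
/// This is not the arrow-rs API; it only models the storage layout.
#[derive(Debug, Clone, PartialEq)]
pub struct TimestampMicros {
    /// Microseconds since the Unix epoch, measured in UTC.
    pub micros: i64,
    /// Time zone annotation, e.g. "UTC" or "America/Los_Angeles".
    pub tz: Option<String>,
}

/// Re-annotate a zone-aware timestamp with a different zone.
/// The instant (`micros`) is unchanged; only the annotation differs.
/// Zone-less timestamps are rejected here because they carry wall-clock
/// semantics and would need a real conversion rather than a relabel.
pub fn relabel_tz(ts: &TimestampMicros, target_tz: &str) -> Result<TimestampMicros, String> {
    match &ts.tz {
        Some(_) => Ok(TimestampMicros {
            micros: ts.micros,
            tz: Some(target_tz.to_string()),
        }),
        None => Err(format!(
            "cannot relabel zone-less timestamp as {target_tz} without a conversion"
        )),
    }
}
```

Under these assumptions, relabelling a `"UTC"` value as `"America/Los_Angeles"` returns the same `micros`, which is why a schema adapter can in principle accept this pair of types without rewriting the column data.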

* feat: support array_append * formatted code * rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde * remove unwrap * Fix for Spark 3.3 * refactor array_append binary expression serde code * Disabled array_append test for spark 4.0+

…ry allocator (apache#1063)

apache#1062) * Require offHeap memory * remove unused import * use off heap memory in stability tests * reorder imports

… config (apache#1087)

* Update version number for build * update docs

apache#1091)

…che#1093)

* update TPC-H results * update Maven links * update benchmarking guide and add TPC-DS results * include q72

## Which issue does this PR close?

Closes apache#1067

## Rationale for this change

Bug fix. A few expressions were failing some unsigned-type-related tests.

## What changes are included in this PR?

- For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
- `u64` becomes `Decimal(20, 0)` but there was a bug in `round()` (`>` vs `>=`)

## How are these changes tested?

Put back tests for unsigned types
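The target types named in this commit message follow from the value ranges of the unsigned types. A minimal sketch in plain Rust, using hypothetical helper names (`widen_u8`, `widen_u16`, `u64_to_decimal_unscaled`, `decimal_digits`) rather than Comet's `generate_cast_to_signed!` macro, which is not reproduced here:

```rust
/// Every u8 value fits in an i16, so the widening is lossless.
pub fn widen_u8(v: u8) -> i16 {
    i16::from(v)
}

/// Every u16 value fits in an i32, so the widening is lossless.
pub fn widen_u16(v: u16) -> i32 {
    i32::from(v)
}

/// A u64 does not always fit in an i64, but it always fits in an i128,
/// the integer width typically used for the unscaled value of a 128-bit
/// decimal. With scale 0 the unscaled value equals the number itself.
pub fn u64_to_decimal_unscaled(v: u64) -> i128 {
    i128::from(v)
}

/// Number of decimal digits needed to print a u64.
pub fn decimal_digits(v: u64) -> u32 {
    v.to_string().len() as u32
}
```

Since `u64::MAX` has 20 decimal digits and exceeds `i64::MAX`, a precision of 20 with scale 0 is the smallest decimal type that holds every `u64` value.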

* include first batch in ScanExec metrics * record row count metric * fix regression

* Add native metrics for plan creation * make messages consistent * Include get_next_batch cost in metrics * formatting * fix double count of rows

…apache#1108)

* Part of the implementation of array_insert
* Missing methods
* Working version
* Reformat code
* Fix code-style
* Add comments about spark's implementation.
* Implement negative indices + fix tests for spark < 3.4
* Fix code-style
* Fix scalastyle
* Fix tests for spark < 3.4
* Fixes & tests
  - added test for the negative index
  - added test for the legacy spark mode
* Use assume(isSpark34Plus) in tests
* Test else-branch & improve coverage
* Update native/spark-expr/src/list.rs

  Co-authored-by: Andy Grove
* Fix fallback test

  In one case there is a zero in index and test fails due to spark error
* Adjust the behaviour for the NULL case to Spark
* Move the logic of type checking to the method
* Fix code-style

---------

Co-authored-by: Andy Grove

…apache#1086)

* enable decimal to decimal cast of different precision and scale
* add more test cases for negative scale and higher precision
* add check for compatibility for decimal to decimal
* fix code style
* Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala

  Co-authored-by: Andy Grove
* fix the nit in comment

---------

Co-authored-by: himadripal

Co-authored-by: Andy Grove

* fix: Use RDD partition index * fix * fix * fix

…pache#1129) * Use exact class comparison for parquet scan * Add test * Add comment

* fix metrics issues * clippy * update tests

…iew (apache#1119) * Add more technical detail and new diagram to Comet plugin overview * update diagram * add info on Arrow IPC * update diagram * update diagram * update docs * address feedback

* save * remove shuffle jvm metric and update tuning guide * docs * add source for all ScanExecs * address feedback * address feedback

* Remove unused StringView struct * remove more dead code

* add some notes on shuffle * reads * improve docs

## Which issue does this PR close?

Part of apache#372 and apache#551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR enables more Spark 4.0 tests that were fixed by recent changes

## How are these changes tested?

tests enabled

* Refactor cast to use SparkCastOptions param * update tests * update benches * update benches * update benches

…che#1152) * move aggregate expressions to spark-expr crate * move more expressions * move benchmark * normalize_nan * bitwise not * comet scalar funcs * update bench imports

## Which issue does this PR close?

Part of apache#372 and apache#551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR fixes the new test SPARK-47120 added in Spark 4.0

## How are these changes tested?

tests enabled

…e#1164) * Move string kernels and expressions to spark-expr crate * remove unused hash kernel * remove unused dependencies

…factoring (apache#1165) * move CheckOverflow to spark-expr crate * move NegativeExpr to spark-expr crate * move UnboundColumn to spark-expr crate * move ExpandExec from execution::datafusion::operators to execution::operators * refactoring to remove datafusion subpackage * update imports in benches * fix * fix

…he#1167) * Add ignored tests for reading structs from Parquet * add basic map test * add tests for Map and Array

…ache#1169) * Add Spark-compatible SchemaAdapterFactory implementation * remove prototype code * fix * refactor * implement more cast logic * implement more cast logic * add basic test * improve test * cleanup * fmt * add support for casting unsigned int to signed int * clippy * address feedback * fix test

…e#1176)

## Which issue does this PR close?

## Rationale for this change

After apache#1062, we have not been running Spark tests for native execution

## What changes are included in this PR?

Removed the off-heap requirement for testing

## How are these changes tested?

Bringing back Spark tests for native execution

* improve shuffle metrics * docs * more metrics * refactor * address feedback

# Conflicts:
# native/Cargo.lock
# native/Cargo.toml
# native/core/src/execution/jni_api.rs
# native/core/src/execution/planner.rs
# native/core/src/execution/schema_adapter.rs
# native/spark-expr/src/cast.rs
# native/spark-expr/src/lib.rs
# native/spark-expr/src/test_common/mod.rs
# native/spark-expr/src/utils.rs
# spark/src/main/scala/org/apache/comet/CometExecIterator.scala
# spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala
# spark/src/main/scala/org/apache/comet/Native.scala
# spark/src/main/scala/org/apache/spark/sql/comet/operators.scala
# spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala
# spark/src/test/scala/org/apache/comet/exec/CometExecSuite.scala

mbutrovich · 2024-12-18T21:45:59Z

See #1182 for the diff of this branch against upstream/main. It should give an idea of what the comet-parquet-exec feature branch's diff against upstream/main will be once this PR merges.


NoeB and others added 30 commits November 13, 2024 16:57

chore: Simplify CometShuffleMemoryAllocator to use Spark unified memo…

c32bf0c

…ry allocator (apache#1063)

docs: Update benchmarking.md (apache#1085)

f3da844

feat: Require offHeap memory to be enabled (always use unified memory) (

2c832b4

apache#1062) * Require offHeap memory * remove unused import * use off heap memory in stability tests * reorder imports

test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE…

7cec285

… config (apache#1087)

Add changelog for 0.4.0 (apache#1089)

10ef62a

chore: Prepare for 0.5.0 development (apache#1090)

0c9a403

* Update version number for build * update docs

build: Skip installation of spark-integration and fuzz testing modules (

406ffef

apache#1091)

Add hint for finding the GPG key to use when publishing to maven (apa…

bfd7054

…che#1093)

docs: Update documentation for 0.4.0 release (apache#1096)

59da6ce

* update TPC-H results * update Maven links * update benchmarking guide and add TPC-DS results * include q72

chore: Include first ScanExec batch in metrics (apache#1105)

b64c13d

* include first batch in ScanExec metrics * record row count metric * fix regression

chore: Improve CometScan metrics (apache#1100)

19dd58d

* Add native metrics for plan creation * make messages consistent * Include get_next_batch cost in metrics * formatting * fix double count of rows

chore: Add custom metric for native shuffle fetching batches from JVM (…

e602305

…apache#1108)

docs: fix readme FGPA/FPGA typo (apache#1117)

7b1a290

fix: Use RDD partition index (apache#1112)

5400fd7

* fix: Use RDD partition index * fix * fix * fix

fix: Various metrics bug fixes and improvements (apache#1111)

ebdde77

fix: Don't create CometScanExec for subclasses of ParquetFileFormat (a…

9b250c4

…pache#1129) * Use exact class comparison for parquet scan * Add test * Add comment

fix: Fix metrics regressions (apache#1132)

95727aa

* fix metrics issues * clippy * update tests

docs: Add more technical detail and new diagram to Comet plugin overv…

36a2307

…iew (apache#1119) * Add more technical detail and new diagram to Comet plugin overview * update diagram * add info on Arrow IPC * update diagram * update diagram * update docs * address feedback

Stop passing Java config map into native createPlan (apache#1101)

2671e0c

feat: Improve ScanExec native metrics (apache#1133)

8d7bcb8

* save * remove shuffle jvm metric and update tuning guide * docs * add source for all ScanExecs * address feedback * address feedback

chore: Remove unused StringView struct (apache#1143)

587c29b

* Remove unused StringView struct * remove more dead code

docs: Add some documentation explaining how shuffle works (apache#1148)

b95dc1d

* add some notes on shuffle * reads * improve docs

chore: Refactor cast to use SparkCastOptions param (apache#1146)

8d83cc1

* Refactor cast to use SparkCastOptions param * update tests * update benches * update benches * update benches

Enable more scenarios in CometExecBenchmark. (apache#1151)

21503ca

chore: Move more expressions from core crate to spark-expr crate (apa…

73f1405

…che#1152) * move aggregate expressions to spark-expr crate * move more expressions * move benchmark * normalize_nan * bitwise not * comet scalar funcs * update bench imports

andygrove and others added 16 commits December 9, 2024 17:45

remove dead code (apache#1155)

5c45fdc

chore: Move string kernels and expressions to spark-expr crate (apach…

49cf0d7

…e#1164) * Move string kernels and expressions to spark-expr crate * remove unused hash kernel * remove unused dependencies

chore: Add ignored tests for reading complex types from Parquet (apac…

f1d0879

…he#1167) * Add ignored tests for reading structs from Parquet * add basic map test * add tests for Map and Array

fix: Document enabling comet explain plan usage in Spark (4.0) (apach…

46a28db

…e#1176)

feat: Improve shuffle metrics (second attempt) (apache#1175)

e297d23

* improve shuffle metrics * docs * more metrics * refactor * address feedback

Fix redundancy in Cargo.lock.

3b0bda3

Format, more post-merge cleanup.

1ea24dd

Compiles

2f4768d

Compiles

858f0de

Remove empty file.

360c16d

Attempt to fix JNI issue and native test build issues.

f8eee9e

mbutrovich mentioned this pull request Dec 18, 2024

[do-not-merge] Diff updated comet-parquet-exec feature branch against main #1182

Closed

parthchandra and others added 4 commits December 18, 2024 16:29

Test Fix

c13d6a0

Update planner.rs

6814a99

Remove println from test.

Merge pull request #4 from parthchandra/merge_upstream_main

a8355f0

Test Fix

Merge remote-tracking branch 'upstream/main' into merge_upstream_main

1630632

andygrove merged commit 3c43234 into apache:comet-parquet-exec Dec 20, 2024

mbutrovich deleted the merge_upstream_main branch January 2, 2025 21:37
