
chore: [comet-parquet-exec] merge from main 20240116 #1299

Conversation

parthchandra (Contributor):

Merge from main

NoeB and others added 30 commits November 13, 2024 16:57
* feat: support array_append

* formatted code

* rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde

* remove unwrap

* Fix for Spark 3.3

* refactor array_append binary expression serde code

* Disabled array_append test for spark 4.0+
apache#1062)

* Require offHeap memory

* remove unused import

* use off heap memory in stability tests

* reorder imports
* Update version number for build

* update docs
* update TPC-H results

* update Maven links

* update benchmarking guide and add TPC-DS results

* include q72
## Which issue does this PR close?

Closes apache#1067

## Rationale for this change

Bug fix: a few expressions were failing some unsigned-type-related tests.

## What changes are included in this PR?

 - For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
 - `u64` becomes `Decimal(20, 0)` but there was a bug in `round()`  (`>` vs `>=`)

## How are these changes tested?

Put back tests for unsigned types
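The `round()` bug mentioned above (`>` vs `>=`) is the classic half-up rounding off-by-one: using `>` silently truncates values whose remainder is exactly half the scale divisor. This is a minimal illustrative sketch, not Comet's actual code; the function name and signature are hypothetical.

```rust
// Hypothetical sketch of half-up rounding on an unscaled integer value.
// Using `>` instead of `>=` in the comparison would mis-round values whose
// fractional part is exactly one half (e.g. 2.5 would become 2, not 3).
fn round_half_up(value: i128, divisor: i128) -> i128 {
    let quotient = value / divisor; // truncates toward zero
    let remainder = (value % divisor).abs();
    // Correct: `>=` so an exact half rounds away from zero.
    if remainder * 2 >= divisor {
        quotient + value.signum()
    } else {
        quotient
    }
}
```

For example, with a divisor of 10, an unscaled value of 25 (i.e. 2.5) rounds to 3 and -25 rounds to -3; with `>` in place of `>=`, both exact halves would have truncated to 2 and -2.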
* include first batch in ScanExec metrics

* record row count metric

* fix regression
* Add native metrics for plan creation

* make messages consistent

* Include get_next_batch cost in metrics

* formatting

* fix double count of rows
* Part of the implementation of array_insert

* Missing methods

* Working version

* Reformat code

* Fix code-style

* Add comments about spark's implementation.

* Implement negative indices

+ fix tests for spark < 3.4

* Fix code-style

* Fix scalastyle

* Fix tests for spark < 3.4

* Fixes & tests

- added test for the negative index
- added test for the legacy spark mode

* Use assume(isSpark34Plus) in tests

* Test else-branch & improve coverage

* Update native/spark-expr/src/list.rs

Co-authored-by: Andy Grove <[email protected]>

* Fix fallback test

In one case there is a zero index and the test fails due to a Spark error

* Adjust the behaviour for the NULL case to Spark

* Move the logic of type checking to the method

* Fix code-style

---------

Co-authored-by: Andy Grove <[email protected]>
…apache#1086)

* enable decimal to decimal cast of different precision and scale

* add more test cases for negative scale and higher precision

* add check for compatibility for decimal to decimal

* fix code style

* Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala

Co-authored-by: Andy Grove <[email protected]>

* fix the nit in comment

---------

Co-authored-by: himadripal <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
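Casting between decimals of different precision and scale, as enabled above, boils down to rescaling the unscaled integer by a power of ten and then checking that the result still fits the target precision. This is a hedged sketch under assumed semantics (truncation when scaling down, `None` on overflow), not the actual Comet/DataFusion implementation:

```rust
// Hypothetical sketch: cast a decimal's unscaled i128 value from scale s1
// to (precision p2, scale s2). Returns None when the value cannot be
// represented, which a real implementation might map to null or an error.
fn cast_decimal(unscaled: i128, s1: u32, p2: u32, s2: u32) -> Option<i128> {
    let rescaled = if s2 >= s1 {
        // Scaling up: multiply by 10^(s2 - s1), watching for overflow.
        unscaled.checked_mul(10i128.checked_pow(s2 - s1)?)?
    } else {
        // Scaling down loses fractional digits; this sketch truncates
        // toward zero, whereas a real cast would round first.
        unscaled / 10i128.pow(s1 - s2)
    };
    // The result must fit within p2 decimal digits.
    let max = 10i128.checked_pow(p2)? - 1;
    if rescaled.abs() <= max { Some(rescaled) } else { None }
}
```

For instance, casting 123.45 (unscaled 12345, scale 2) to Decimal(7, 4) yields unscaled 1234500, while casting it to Decimal(3, 2) overflows the 3-digit precision and fails.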
* fix: Use RDD partition index

* fix

* fix

* fix
…pache#1129)

* Use exact class comparison for parquet scan

* Add test

* Add comment
* fix metrics issues

* clippy

* update tests
…iew (apache#1119)

* Add more technical detail and new diagram to Comet plugin overview

* update diagram

* add info on Arrow IPC

* update diagram

* update diagram

* update docs

* address feedback
* save

* remove shuffle jvm metric and update tuning guide

* docs

* add source for all ScanExecs

* address feedback

* address feedback
* Remove unused StringView struct

* remove more dead code
* add some notes on shuffle

* reads

* improve docs
## Which issue does this PR close?

Part of apache#372 and apache#551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR enables more Spark 4.0 tests that were fixed by recent changes

## How are these changes tested?

tests enabled
* Refactor cast to use SparkCastOptions param

* update tests

* update benches

* update benches

* update benches
…che#1152)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports
rluvaton and others added 20 commits January 7, 2025 23:10
…ark grouping (apache#1218)

* extract predicate_functions expressions to folders based on spark grouping

* code review changes

---------

Co-authored-by: Andy Grove <[email protected]>
## Which issue does this PR close?

## Rationale for this change

Because `isCometShuffleEnabled` is false by default, some tests were not reached

## What changes are included in this PR?

Removed `isCometShuffleEnabled` and updated spark test diff

## How are these changes tested?

existing test
…ing (apache#1221)

* extract hash_funcs expressions to folders based on spark grouping

* extract hash_funcs expressions to folders based on spark grouping

---------

Co-authored-by: Andy Grove <[email protected]>
* wip: array remove

* added comet expression test

* updated test cases

* fixed array_remove function for null values

* removed commented code

* remove unnecessary code

* updated the test for 'array_remove'

* added test for array_remove in case the input array is null

* wip: case array is empty

* removed test case for empty array
* fall back to Spark for distinct aggregates

* update expected plans for 3.4

* update expected plans for 3.5

* force build

* add comment
…formance (apache#1190)

* Implement faster encoder for shuffle blocks

* make code more concise

* enable fast encoding for columnar shuffle

* update benches

* test all int types

* test float

* remaining types

* add Snappy and Zstd(6) back to benchmark

* fix regression

* Update native/core/src/execution/shuffle/codec.rs

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* address feedback

* support nullable flag

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>
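The idea behind the faster shuffle-block encoder above is to skip general-purpose serialization for fixed-width columns and copy raw little-endian values after a small header. The following is a deliberately simplified sketch of that technique; the function names and exact wire format are illustrative, not Comet's actual codec:

```rust
// Hypothetical minimal encoder for a fixed-width i32 column:
// a u32 row-count header followed by raw little-endian values.
fn encode_i32_column(values: &[i32], out: &mut Vec<u8>) {
    out.extend_from_slice(&(values.len() as u32).to_le_bytes());
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
}

// Inverse of encode_i32_column: read the count, then each value.
fn decode_i32_column(buf: &[u8]) -> Vec<i32> {
    let n = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
    (0..n)
        .map(|i| {
            let start = 4 + i * 4;
            i32::from_le_bytes(buf[start..start + 4].try_into().unwrap())
        })
        .collect()
}
```

Because the encoded form is just a memcpy-friendly byte run, it can be cheaper than routing every row through a generic serializer, and optional compression (Snappy, Zstd) can still be layered on top, as the benchmarks in this PR compare.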
* fix: disable initCap by default

* Update spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

Co-authored-by: Andy Grove <[email protected]>

* address review comments

---------

Co-authored-by: Andy Grove <[email protected]>
* Add changelog

* revert accidental change

* move 2 items to performance section
* fix: cast timestamp to decimal is unsupported

* fix style

* revert test name and mark as ignore

* add comment
* start 0.6.0 development

* update some docs

* Revert a change

* update CI
* fix links and provide complete scripts

* fix path

* fix incorrect text
parthchandra changed the title from "Comet parquet exec merge 20240116" to "chore: Comet parquet exec merge 20240116" on Jan 17, 2025
andygrove (Member) left a comment:


Thanks @parthchandra

andygrove changed the title from "chore: Comet parquet exec merge 20240116" to "chore: [comet-parquet-exec] merge from main 20240116" on Jan 17, 2025
@@ -190,7 +190,7 @@ class CometSparkSessionExtensions

          // data source V1
          case scanExec @ FileSourceScanExec(
-               HadoopFsRelation(_, partitionSchema, _, _, _: ParquetFileFormat, _),
+               HadoopFsRelation(_, partitionSchema, _, _, _, _),

This doesn't seem correct, because it would make us replace non-parquet scans as well.

main branch has:

          // data source V1
          case scanExec @ FileSourceScanExec(
                HadoopFsRelation(_, partitionSchema, _, _, fileFormat, _),
                _: Seq[_],
                requiredSchema,
                _,
                _,
                _,
                _,
                _,
                _)
              if CometScanExec.isFileFormatSupported(fileFormat)
                && CometScanExec.isSchemaSupported(requiredSchema)
                && CometScanExec.isSchemaSupported(partitionSchema) =>

parthchandra (Contributor, Author) replied:

Oh, right

parthchandra (Contributor, Author):

@andygrove I added another fix since you approved. This was also a merge issue and caused 6 test failures if native_datafusion was enabled.
Latest count:

native_comet: Tests: succeeded 797, failed 0, canceled 2, ignored 50, pending 0
native_datafusion: Tests: succeeded 628, failed 40, canceled 2, ignored 50, pending 0
native_iceberg_compat: Tests: succeeded 623, failed 45, canceled 2, ignored 50, pending 0

@andygrove andygrove merged commit c17a0f6 into apache:comet-parquet-exec Jan 18, 2025
74 checks passed