Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support unary_dyn_mut in arth #3708

Closed
Ted-Jiang opened this issue Feb 13, 2023 · 15 comments
Closed

Support unary_dyn_mut in arth #3708

Ted-Jiang opened this issue Feb 13, 2023 · 15 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@Ted-Jiang
Copy link
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
apache/datafusion@48732b4 all math compute kernel like add_dyn_scalar, multi_dyn_scalar...
I try to use use copy one write (mock func add_scalar_mut use

pub fn unary_mut<I, F>(
)
run select t.a + 1 from t
got

warning: `datafusion` (bench "math_query_sql") generated 1 warning
    Finished bench [optimized] target(s) in 7m 26s
     Running benches/math_query_sql.rs (target/release/deps/math_query_sql-64399df00616f72b)
Benchmarking tess_add: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
tess_add                time:   [1.2643 ms 1.2664 ms 1.2685 ms]
                        change: [-78.711% -78.544% -78.361%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low severe
  1 (1.00%) high mild
  2 (2.00%) high severe

(venv) yangjiang@LM-SHC-15009782 arrow-datafusion %

Describe the solution you'd like

Add unary_dyn_mut which take ownership of array then support cow math operator.

Describe alternatives you've considered

Additional context

@Ted-Jiang Ted-Jiang added the enhancement Any new improvement worthy of a entry in the changelog label Feb 13, 2023
@tustvold
Copy link
Contributor

Could you provide a link to the benchmark you were running, when this was previously proposed the numbers were not nearly so flattering - #3134 (comment)

I'm mainly apprehensive as we already have issues with the amount of codegen for the arithmetic kernels 😅

@viirya
Copy link
Member

viirya commented Feb 13, 2023

Cannot it just use unary_mut? Looks like unary_dyn_mut simply calls unary_mut.

@Ted-Jiang
Copy link
Member Author

Ted-Jiang commented Feb 14, 2023

Cannot it just use unary_mut? Looks like unary_dyn_mut simply calls unary_mut.

There is already a

pub fn unary_mut<I, F>(
it takes PrimitiveArray as input

@Ted-Jiang
Copy link
Member Author

Could you provide a link to the benchmark you were running, when this was previously proposed the numbers were not nearly so flattering - #3134 (comment)

I'm mainly apprehensive as we already have issues with the amount of codegen for the arithmetic kernels 😅

I used unary_mut in datafusion like

pub(crate) fn add_scalar_cow<T>(
    array: PrimitiveArray<T>,
    scalar: T::Native,
) -> Result<PrimitiveArray<T>, ArrowError>
where
    T: ArrowNumericType,
    T::Native: ArrowNativeTypeOp,
{
    match unary_mut(array, |value| value.add_wrapping(scalar)) {
        Ok(array) => Ok(array),
        // Fall back need copy array
        Err(array) => add_scalar(&array, scalar),
    }
}

But there a lot code change during that time, like support dictionaryArray.... So i try to modify less api in datafusion, support unary_dyn_mut

@Ted-Jiang
Copy link
Member Author

Ted-Jiang commented Feb 14, 2023

@tustvold
Copy link
Contributor

I think it would help move this forward if there we could get a standalone benchmark in arrow-rs, the level of performance improvement makes me sceptical that there isn't something else going on - e.g checked to unchecked overflow or something

@Ted-Jiang
Copy link
Member Author

I think it would help move this forward if there we could get a standalone benchmark in arrow-rs, the level of performance improvement makes me sceptical that there isn't something else going on - e.g checked to unchecked overflow or something

yeap, i prefer reimplement it in datafusion find the cause (I didn't notice before already bench in arrow-rs). By the way I think without performance gain, still need this api for user 😆

@tustvold
Copy link
Contributor

By the way I think without performance gain, still need this api for user

What makes you say this, the memory savings are likely irrelevant, so I'm not sure what the advantage would be if not performance?

@Ted-Jiang
Copy link
Member Author

By the way I think without performance gain, still need this api for user

What makes you say this, the memory savings are likely irrelevant, so I'm not sure what the advantage would be if not performance?

I mean even if without huge performance gain , it will still avoid re-allocate memory by manual.

@Ted-Jiang
Copy link
Member Author

@tustvold
I did bench in https://github.com/apache/arrow-rs/pull/3709/files#diff-d31b0761ffe79b72672cd516aa8ef0792c004328f31994dc616e677e8eb3c50cR31-R59
Sadly same as you mentioned before only little improvement 😢 .

But I did a poc test in datafusion apache/datafusion#5285
got

warning: `datafusion-physical-expr` (lib) generated 28 warnings
    Finished bench [optimized] target(s) in 4m 06s
     Running benches/math_query_sql.rs (target/release/deps/math_query_sql-2162bfde908d94df)
add_cow                 time:   [4.5467 ms 4.6540 ms 4.7887 ms]
                        change: [-84.767% -82.599% -79.807%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

@Ted-Jiang
Copy link
Member Author

Ted-Jiang commented Feb 15, 2023

One thing need mention, i do profile on https://github.com/apache/arrow-rs/pull/3709/files#diff-d31b0761ffe79b72672cd516aa8ef0792c004328f31994dc616e677e8eb3c50cR31-R59. I think they are do the same amount work:
add_scalar_dyn has 463 examples
image

add_scalar_dyn_mut has 292 examples
image

@tustvold I am not an expert about this 😭 , could your share your opinion 🆘

@Ted-Jiang
Copy link
Member Author

And test with datafusion-cli





0 rows in set. Query took 0.018 seconds.
❯ select l_partkey from test limit 21;
+-----------+
| l_partkey |
+-----------+
| 155190    |
| 67310     |
| 63700     |
| 2132      |
| 24027     |
| 15635     |
| 106170    |
| 4297      |
| 19036     |
| 128449    |
| 29380     |
| 183095    |
| 62143     |
| 88035     |
| 108570    |
| 123927    |
| 37531     |
| 139636    |
| 182052    |
| 145243    |
| 94780     |
+-----------+
21 rows in set. Query took 0.039 seconds.
❯ select l_partkey+1 from test limit 21;
+---------------------------+
| test.l_partkey + Int64(1) |
+---------------------------+
| 155191                    |
| 67311                     |
| 63701                     |
| 2133                      |
| 24028                     |
| 15636                     |
| 106171                    |
| 4298                      |
| 19037                     |
| 128450                    |
| 29381                     |
| 183096                    |
| 62144                     |
| 88036                     |
| 108571                    |
| 123928                    |
| 37532                     |
| 139637                    |
| 182053                    |
| 145244                    |
| 94781                     |
+---------------------------+
21 rows in set. Query took 0.034 seconds.
❯

Seems not error during add

@Dandandan
Copy link
Contributor

A difference with #3134 (comment) is that a batch size of 512 (in math_query_sql) vs 512000 (in the linked issue).

I would expect a larger difference to arise for simple arithmetic operations (like adding) when choosing a small batch size as the operation itself could be fast compared to the allocation. Also in the posted benchmark in the comments a allocation is done in the loop, whereas the query from @Ted-Jiang does a scalar addition.
Would be good if we have a benchmark to show the difference.

@Ted-Jiang
Copy link
Member Author

@Dandandan Thanks for your info ❤️ I try to update https://github.com/apache/arrow-rs/pull/3709/files#diff-d31b0761ffe79b72672cd516aa8ef0792c004328f31994dc616e677e8eb3c50cR31-R59 to

const BATCH_SIZE: i32 = 512;
const BATCH_NUM: i32 = 1024 * 10;

Got

warning: `arrow` (bench "cow") generated 8 warnings
    Finished bench [optimized] target(s) in 0.34s
     Running benches/cow.rs (target/release/deps/cow-506a8d861bd3158e)
add                     time:   [8.5384 ms 8.5754 ms 8.6142 ms]
                        change: [-4.1072% -3.3312% -2.6109%] (p = 0.00 < 0.05)
                        Performance has improved.

Still little improvement found 😢 . Seems there lost something in my poc in datafusion.

@tustvold
Copy link
Contributor

tustvold commented Jan 1, 2024

Closing this one for now, feel free to reopen if you wish to pursue this further

@tustvold tustvold closed this as not planned Won't fix, can't repro, duplicate, stale Jan 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants