
SVE optimised float WSSJ kernel #2917

Merged

Conversation

@rakshithgb-fujitsu (Contributor) commented on Sep 26, 2024:

This PR introduces performance optimizations for the float WSSJ kernel using SVE intrinsics, resulting in significant improvements for the SVM algorithms on ARM.

Key Improvements:
- Boser method: 22% performance gain, leading to faster computation and better resource utilization.
- Thunder method: 5% performance gain, enhancing efficiency in scenarios where this method is used.

Changes:
- Code updates: a new SVE-intrinsics-based float WSSJ kernel.

Impact:
- Performance: faster processing times and improved efficiency for SVM algorithms, observed and documented on a single core.

Performance on a single core (AWS instance: c7g.8xl, Graviton3):

[Figure: svm-perf benchmark chart]
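
For context, working-set selection (WSS) is essentially a masked argmax/argmin search over per-sample gradient values, which is the kind of loop that maps naturally onto SVE's predicated lanes. Below is a minimal sketch of that pattern, assuming a hypothetical `grad` array and ignoring the working-set masking the real kernel performs; it is not the PR's code.

```cpp
#include <arm_sve.h>
#include <cfloat>
#include <cstdint>

// Hedged sketch of a predicated argmax, the core pattern behind WSS.
// 'grad' and 'n' are illustrative; the real kernel also applies set masks.
static uint32_t argMaxSve(const float * grad, uint32_t n)
{
    svfloat32_t vMax       = svdup_n_f32(-FLT_MAX); // running per-lane maxima
    svuint32_t vIdx        = svdup_n_u32(0);        // global index of each lane's max
    const svuint32_t vLane = svindex_u32(0, 1);     // lane offsets 0, 1, 2, ...
    for (uint32_t i = 0; i < n; i += (uint32_t)svcntw())
    {
        const svbool_t pg     = svwhilelt_b32_u32(i, n);     // tail predicate
        const svfloat32_t v   = svld1_f32(pg, grad + i);
        const svbool_t pGt    = svcmpgt_f32(pg, v, vMax);    // improving lanes
        const svuint32_t vCur = svadd_n_u32_x(pg, vLane, i); // global indices
        vMax = svsel_f32(pGt, v, vMax);                      // keep better value
        vIdx = svsel_u32(pGt, vCur, vIdx);                   // and its index
    }
    const float best     = svmaxv_f32(svptrue_b32(), vMax);  // horizontal max
    const svbool_t pBest = svcmpeq_n_f32(svptrue_b32(), vMax, best);
    return svminv_u32(pBest, vIdx); // lowest index among tied lanes
}
```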

@rakshithgb-fujitsu (Contributor Author) commented:

@keeranroth have a look at this.

@keeranroth (Contributor) left a comment:

I can't comment too much on the algorithm, as I'm not familiar with the SVM code, but it looks believable. There are just some style points I picked up on; I'll let someone knowledgeable about the application area give more guidance.

I can't help but feel that some instruction-level parallelism is being left on the table. A lot of the instructions are dependent on each other. Since you already have the masking, you might be able to duplicate the work and simply select the result you want at the end. I suspect this implementation is only using one pipeline at the moment, though profiling would be needed to confirm.
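
To make the suggestion concrete, here is a minimal sketch of the compute-both-and-select pattern (one reading of the comment, not the PR's code; `vGrad`, `vDelta`, and `vSign` are hypothetical stand-ins):

```cpp
#include <arm_sve.h>

// Hedged sketch: both candidate updates are computed unconditionally,
// forming two independent dependency chains that can issue on separate
// pipelines; svsel picks the wanted lanes only at the end, with no branch.
static svfloat32_t selectBothPaths(svbool_t pg, svfloat32_t vGrad,
                                   svfloat32_t vDelta, svfloat32_t vSign)
{
    const svfloat32_t vLow = svadd_f32_x(pg, vGrad, vDelta); // hypothetical "low" path
    const svfloat32_t vUp  = svsub_f32_x(pg, vGrad, vDelta); // hypothetical "up" path
    const svbool_t pSel    = svcmpgt_n_f32(pg, vSign, 0.0f); // per-lane choice
    return svsel_f32(pSel, vLow, vUp);                       // late select
}
```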

(A review thread on cpp/daal/src/algorithms/svm/svm_train_common_sve_impl.i was marked outdated and resolved.)
```cpp
}
else
{
    DAAL_ASSERT((sign & (sign - 1)) == 0) // used to make sure sign is always having 1 bit set
```
@keeranroth (Contributor) commented:

I'm not sure what getSign returns, but this assert is also true when sign == 0, so the comment isn't correct. I suspect this might not be what you want to be checking on the result of getSign.

@rakshithgb-fujitsu (Contributor Author) replied:

The idea was to keep this optimization under a check: since low = 0x2, if that were ever to change, this debug assert would help catch it. getSign is defined here: https://github.com/oneapi-src/oneDAL/blob/f8a395311b4b19be1859545d7a19ee1969d8d9ee/cpp/daal/src/algorithms/svm/svm_train_common.h#L92
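
For reference, the reviewer's point is that `(x & (x - 1)) == 0` also holds for `x == 0`, not only for powers of two. A stricter assert matching the comment's stated intent ("exactly one bit set") would be, as a suggestion:

```cpp
// Excludes zero as well as multi-bit values; fires if getSign ever
// returns 0 or a value with more than one bit set.
DAAL_ASSERT(sign != 0 && ((sign & (sign - 1)) == 0));
```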

(Another review thread on cpp/daal/src/algorithms/svm/svm_train_common_sve_impl.i was marked outdated and resolved.)
@rakshithgb-fujitsu (Contributor Author) commented on Sep 26, 2024:

> I can't comment too much on the algorithm, as I'm not familiar with the SVM code, but it looks believable. There are just some style points I picked up on; I'll let someone knowledgeable about the application area give more guidance.
>
> I can't help but feel that some instruction-level parallelism is being left on the table. A lot of the instructions are dependent on each other. Since you already have the masking, you might be able to duplicate the work and simply select the result you want at the end. I suspect this implementation is only using one pipeline at the moment, though profiling would be needed to confirm.

We do think there is more room for improvement and are exploring further ideas, but we've finalized this version for now; if we optimize it further, we'll raise another PR. If you do see any instruction-level bottlenecks, please point them out.

@napetrov (Contributor) commented:

Great to see specialization for the algorithm.

@napetrov (Contributor) commented:

/intelci: run

@samir-nasibli (Contributor) commented:

@rakshithgb-fujitsu thank you! Please address the formatter check hints.

@rakshithgb-fujitsu (Contributor Author) commented:

> @rakshithgb-fujitsu thank you! Please address the formatter check hints.

Not sure what is failing; clang-format.sh shows there is nothing to format. Also, I'm unable to add labels.

@Alexsandruss added the enhancement and perf (Performance optimization) labels on Oct 3, 2024.
@Alexsandruss merged commit 84c9bbb into uxlfoundation:main on Oct 4, 2024; 17 checks passed.