Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduced callback to Pthread, Win32 and OpenMP backend #4577

Merged
merged 2 commits into from
Apr 22, 2024

Conversation

shivammonaka
Copy link
Contributor

Motivation: This pull request enables OpenBLAS threading to be composable with caller multithreading. It allows OpenBLAS threading to be composable with callers using Intel’s Thread Building Blocks (TBB), etc.

Brief description:

  1. This work is a completion of work-in-progress pull request (PR) WIP: allow threading backend to be replaced by caller #2255, aimed at achieving thread composability in OpenBLAS with all threading backends(OpenMP, Pthreads and win32).
  2. Caller is required to register a callback to dynamically change the threading backend. Instead of creating its own threads, OpenBLAS will now pass on the work (blas_queue_t) to the registered callback.
  3. The integration of TBB on the caller side for executing the callback has provided us with significant performance improvement, particularly in nested parallelism scenarios.
  4. A typical prototypical user callback would look like:
void myfunction_ (int sync, openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize, void *jobdata, int dojob_data)
{
    #pragma omp parallel for
    for(int i = 0; i < numjobs; i++)
    {
        void *element_adrr = (void *) (((char *)jobdata) + ((unsigned) i)*jobdata_elsize);
        dojob(i, element_adrr, dojob_data);
     }
    return;
}
  1. A typical callback registration would look like:
    openblas_set_threads_callback_function(myfunction_);

Kindly note: Extending this feature to Pthreads and Win32 threading backends required us to make additional changes, as outlined below:

  1. We have introduced exec_threads() function in pthreads and win32, which helps in setting the synchronization environment and calling the inner_thread routine. This change aligns with the existing OpenMP flow.
  2. We have introduced adjust_thread_buffers() function, similar to OpenMP, for initializing global thread buffers instead of the existing local buffers initialized in blas_thread_server.

@goplanid
Copy link
Contributor

goplanid commented Mar 25, 2024

@ stevengj @martin-frbg

@martin-frbg
Copy link
Collaborator

Thanks. CI suggests the changes currently make Windows/MSYS builds prone to hanging in/after a fork() for some reason - at least that is what the apparently hanging "test_kernel_regress" in utest does.
The failed jobs on Azure appear to be down to older tools not accepting C99 initializers, and wanting a dedicated build rule for the new blas_server_callback file in the Makefile.
I'll try to squeeze in some local testing...

@martin-frbg
Copy link
Collaborator

martin-frbg commented Apr 1, 2024

Sorry, I'm not sure I understand the latest update here, and which state of the respective files you are restoring with it - would you prefer to convert this to a draft for now ?

@shivammonaka
Copy link
Contributor Author

Hi @martin-frbg,
Sorry about the confusion, I was doing some local experiments and accidently pushed it on main branch. I will revert the changes and try to fix failing checks.

@shivammonaka shivammonaka marked this pull request as draft April 1, 2024 08:14
@martin-frbg
Copy link
Collaborator

Ok, thanks, it was indeed a bit confusing to see seemingly unrelated blocks copied around without explanation, and then removed again in the same style. Unfortunately I hadn't gotten further than checking that the code still works on Linux/Unix when the new callback is not used, and that it compiles and works locally on Windows with gcc (but no explanation for the segfaults with clang)

@shivammonaka shivammonaka marked this pull request as ready for review April 2, 2024 06:44
@shivammonaka
Copy link
Contributor Author

Hi @martin-frbg ,
There is 1 check still failing. Is it because of some resource constraint or Is my code hanging somewhere.

@martin-frbg
Copy link
Collaborator

these OSX jobs on Azure have been randomly timing out for a while - certainly unrelated to your PR. I still need to investigate if it is simply due to old hardware in the Azure cloud becoming too slow for dynamic_arch builds, or if some test started to hang on Nehalem

@martin-frbg martin-frbg added this to the 0.3.28 milestone Apr 4, 2024
@rageshhajela16
Copy link

these OSX jobs on Azure have been randomly timing out for a while - certainly unrelated to your PR. I still need to investigate if it is simply due to old hardware in the Azure cloud becoming too slow for dynamic_arch builds, or if some test started to hang on Nehalem

@martin-frbg Thanks for the review. Are there any other comments for us to incorporate?

@martin-frbg
Copy link
Collaborator

No concerns, I've just been very busy lately, sorry. I hope to merge this tomorrow after trying out and documenting what integration with TBB would involve.

@martin-frbg
Copy link
Collaborator

fixes #4066

@martin-frbg martin-frbg merged commit f6eadf0 into OpenMathLib:develop Apr 22, 2024
68 of 70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants