Replace fibers with threadpool #674

Merged: 46 commits into master, Mar 31, 2022
Conversation

@markaren (Contributor)

This PR replaces #671. It uses a thread pool rather than std::for_each so that compilers older than gcc9 will work.

Note that slave state is unimplemented in this PR. It will eventually be added back once I have figured out how best to reintroduce it after removing async_slave. Additionally, the concurrency test, which concerns file locking, has been commented out.
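
For context, here is a minimal sketch of the parallel std::for_each approach this PR avoids: the C++17 execution policies live in the <execution> header, which gcc only ships from version 9 (typically backed by TBB). The Simulator type and do_step call below are illustrative stand-ins, not the actual libcosim API.

// Sketch only: the parallel-algorithm alternative that requires gcc >= 9.
// `Simulator` and `do_step` are illustrative stand-ins, not libcosim types.
#include <algorithm>
#include <execution>
#include <vector>

struct Simulator
{
    void do_step(double t, double dt) { /* step the FMU from t to t + dt */ }
};

void step_all(std::vector<Simulator*>& simulators, double t, double dt)
{
    std::for_each(std::execution::par, simulators.begin(), simulators.end(),
        [=](Simulator* s) { s->do_step(t, dt); });
}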

Comment on lines 31 to 42
if (!work_queue_.empty()) {

auto task = work_queue_.front();
work_queue_.pop();
task();

lck.unlock();
cv_.notify_one();

} else {
std::this_thread::yield();
}
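
For comparison, here is a minimal sketch of a more conventional condition-variable worker loop that blocks instead of yielding and releases the lock before running the task. The class and member names below mirror the snippet above but are assumptions, not the PR's actual thread_pool implementation.

// Sketch only: a common worker-loop shape; not the PR's thread_pool code.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

class worker
{
public:
    void loop()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lck(mutex_);
                // Block until there is work or the pool is shutting down.
                cv_.wait(lck, [this] { return done_ || !work_queue_.empty(); });
                if (done_ && work_queue_.empty()) return;
                task = std::move(work_queue_.front());
                work_queue_.pop();
            } // lock released here
            // Run the task outside the lock so other workers can dequeue concurrently.
            task();
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> work_queue_;
    bool done_ = false;
};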
(Contributor)

Thanks for working on this PR. Would it be possible to have an option to run FMUs without spawning a separate worker thread (i.e. running everything in a main thread sequentially)? For some use cases, the communication overhead between the main and the worker threads seems quite significant.

(Member)

This is possible today, and was in fact the motivation for using fibers in the first place. (Not that fibers are required for sequential execution of FMU functions, but they provide a nice framework for combining sequential execution with asynchronous I/O so one can run both local and remote slaves without the overhead of threads.)

@kyllingstad (Member), Feb 14, 2022

The problem is just that the remote-slave implementation we're using today (proxyfmu) is not built around async I/O, so it requires a worker thread per slave anyway. That is why fibers are simply an extra overhead on top of the threading and I/O costs, which I suppose is what motivated @markaren's PR.

As I see it, there are two ways to improve performance in this area: Drop fibers and go all-in for proxyfmu and one-thread-per-slave (as this PR proposes), or keep fibers and implement async communication with slaves (as was the original plan).

@davidhjp01 (Contributor), Feb 14, 2022

Thanks for the explanation @kyllingstad.

I was able to achieve the required simulation performance by removing std::thread for slaves as well as the recursive fiber creation in the libcosim master branch. I was not able to achieve this via pseudo_async_slave, because it still creates fibers recursively in the FMU interface methods (the getters/setters for variables and do_step).

But now the problem is that I cannot simulate some FMUs, because they run directly in fibers with limited resources (created by slave_simulator::impl::do_step), and I cannot replace or extend slave_simulator, as it is always added in execution::add_slave.

This PR seems to have fixed my issue, but I still had to remove the worker thread to avoid communication overhead (the worker thread also consumes CPU time checking for messages).

@markaren (Contributor, Author), Feb 14, 2022

Would it be possible to have an option to run fmus without spawning a separate worker thread?

Yes, and this is one of the motivations for this PR. IMO the master algorithm itself should decide how it handles execution. In the case of this implementation, it could take numThreads as an argument, where a value < 1 means no pool (a hypothetical sketch follows below).

proxy-fmu was not a motivation for this PR. My motivation was to simplify the code base and make it run faster. I've found that the fiber solution is slower than a threaded solution, with or without proxy-fmu.
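
A hypothetical sketch of what such an option could look like; the class name, the numThreads parameter, and the run_tasks helper are purely illustrative and not the actual fixed_step_algorithm interface.

// Hypothetical sketch of a "numThreads < 1 means no pool" option.
// The names (step_algorithm, numThreads, run_tasks) are illustrative only.
#include <functional>
#include <vector>

class step_algorithm
{
public:
    explicit step_algorithm(int numThreads = 0)
        : numThreads_(numThreads)
    { }

    void run_tasks(const std::vector<std::function<void()>>& tasks)
    {
        if (numThreads_ < 1) {
            // No pool: execute everything sequentially in the calling thread.
            for (const auto& task : tasks) task();
        } else {
            // Otherwise, hand the tasks to a pool of numThreads_ workers
            // (pool implementation omitted in this sketch).
        }
    }

private:
    int numThreads_;
};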

@markaren (Contributor, Author)

It uses a threadpool rather than std::for_each so that compilers older than gcc9 will work

This is an important thing to address if this is something we want to move forward with. Do we have to support old compilers that are not C++17 feature complete?

@markaren (Contributor, Author) commented Feb 16, 2022

If the target branch gets the OK, we should also make it possible to specify the thread count in the XML config.

Edit: This was meant as a comment in #680.

@markaren (Contributor, Author)

Adding the functionality to set the number of worker threads for SSP was trivial, but not so much for the OSP alternative, as it does not provide algorithm-specific configuration options. This is related to #404.

@ljamt (Member) commented Mar 15, 2022

Adding the functionality to set the number of worker threads for SSP was trivial, but not so much for the OSP alternative, as it does not provide algorithm-specific configuration options. This is related to #404.

I see your point, but I don't think this can be solved separately. Should not be a blocker for this PR.

@ljamt (Member) commented Mar 15, 2022

What's the reason for this PR to remain in draft mode? Any remaining issues that must be resolved?

@markaren (Contributor, Author) commented Mar 15, 2022

What's the reason for this PR to remain in draft mode? Any remaining issues that must be resolved?

Nothing other than significant vetting.

And yeah, the concurrent file locking test needs to be fixed (currently commented out).

markaren marked this pull request as ready for review on March 15, 2022 10:47
ljamt requested a review from eidekrist on March 15, 2022 11:49
@davidhjp01 (Contributor) left a comment

Good work! Just added some comments

@ljamt (Member) commented Mar 28, 2022

utility_concurrency_unittest is still commented out. Should be included and fixed before merging.

@ljamt (Member) commented Mar 28, 2022

utility_concurrency_unittest is still commented out. Should be included and fixed before merging.

The test is now included and passing.

ljamt requested a review from kyllingstad on March 28, 2022 14:01
@ljamt (Member) commented Mar 29, 2022

As fibers are removed, I think cosim::utility::shared_mutex can be replaced with std::shared_mutex. I don't want to add more to this PR, so that can be pushed as a separate PR.
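
For reference, a rough sketch of what the standard-library replacement would look like; the guarded value and function names are illustrative, not libcosim code.

// Sketch only: std::shared_mutex usage in place of cosim::utility::shared_mutex.
#include <shared_mutex>

std::shared_mutex mutex_;
int shared_value_ = 0;

int read_value()
{
    std::shared_lock<std::shared_mutex> lock(mutex_); // many readers may hold this
    return shared_value_;
}

void write_value(int v)
{
    std::unique_lock<std::shared_mutex> lock(mutex_); // exclusive writer
    shared_value_ = v;
}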

@ljamt (Member) commented Mar 30, 2022

@markaren, if you are OK with the latest changes to thread_pool.hpp, are we then ready to merge?
@kyllingstad, @eidekrist, share your opinions if you disagree :)

@markaren (Contributor, Author)

How are the observed differences in usage/speed/accuracy on your side? All good?
We can probably set the default number of threads to std::thread::hardware_concurrency() - 1 in fixed_step_algorithm as suggested?

@restenb (Member) commented Mar 30, 2022

How are the observed differences in usage/speed/accuracy on your side? All good? We can probably set the default number of threads to std::thread::hardware_concurrency() - 1 in fixed_step_algorithm as suggested?

Yes. I've added an unsigned int max_threads_ = std::thread::hardware_concurrency() - 1 variable to fixed_step_algorithm. With what is now a blocking-only strategy it may no longer be necessary, but I'd like to extend this in the future to include a spinlock as well. Blocking and resuming threads has a non-negligible overhead when done at a high rate, such as in a simulation with a very small time step.

We seem to be seeing ~15-20% improvements in simulation speed with this PR over the fiber implementation, at least with the example projects like dp-ship.
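
One detail worth noting about that default: std::thread::hardware_concurrency() is allowed to return 0 when the value cannot be determined, in which case an unsigned hardware_concurrency() - 1 wraps around. A defensively clamped default might look like the sketch below (the function name is illustrative).

// Sketch only: a clamped default for the worker-thread count.
#include <thread>

unsigned int default_max_threads()
{
    const unsigned int hc = std::thread::hardware_concurrency(); // may be 0
    return hc > 1 ? hc - 1 : 1; // leave one core for the main thread, never 0
}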

@kyllingstad (Member) left a comment

Great work! I don't have much to add except some stylistic nitpicks here and there. I didn't look at the thread pool implementation in any detail, since it seems to have been thoroughly reviewed by others. Everything else looks good to merge as far as I'm concerned.

include/cosim/orchestration.hpp (outdated review comment; resolved)
boost::fibers::condition_variable condition_;
};


/**
* A shared mutex à la `std::shared_mutex`, but with support for fibers.
(Member)

I don't think we need this class at all anymore; we can just use std::shared_mutex.

@kyllingstad (Member), Mar 30, 2022

Nevermind, I see that @ljamt already suggested we do this as a separate PR.

@restenb (Member), Mar 30, 2022

Yes, we noticed this as well. In addition, cosim::utility::shared_mutex was used in utility_concurrency_unittest.cpp to test that file locking functioned correctly with the custom mutex, which was in turn only necessary because of fibers.

With the removal of fibers, that test can be removed as well, given that there is no point in unit testing std::shared_mutex ourselves. We wanted to push this as a separate PR to avoid more noise in this one, since cosim::utility::shared_mutex is used throughout concurrency.hpp/.cpp.

@kyllingstad (Member)

One more thing: you may want to consider running clang-format on everything before merging. I see there are some includes that are out of alphabetical order after the async_slave --> slave change, and possibly other things. If it's not fixed now, it's going to show up in someone else's PR later.

markaren merged commit a6f3fcc into master on Mar 31, 2022
markaren deleted the parallel-pool branch on March 31, 2022 07:12