Refactor the event channel #1912

yahoNanJing · 2022-03-03T10:45:30Z

Which issue does this PR close?

Closes #1909.

Rationale for this change

Mentioned in #1909.

What changes are included in this PR?

Introduce the following for event-based processing:

EventAction for common trait
EventLoop for common struct

Are there any user-facing changes?

It's blocked by #1911.

thinkharderdev · 2022-03-04T19:04:42Z

ballista/rust/core/src/event_loop.rs

+        tokio::spawn(async move {
+            info!("Starting the event loop {}", name);
+            while !stopped.load(Ordering::SeqCst) {
+                let event = rx_event.recv().await.unwrap();


So I understand, if the channel is closed, this will panic and the task will die?

Yes. In general, the channel won't be closed until the scheduler is down. The dangling task should be handled by another way, like speculative task execution.

In IOx and elsewhere instead of panic when the other end of a channel shuts down, we do something like

info!("Event Channel closed, shutting down"); return

so it is clear from the logs what is happening and gives the process / task a chance at a clean shutdown

Thanks @alamb. I'll add the informative log.

thinkharderdev · 2022-03-04T19:35:36Z

ballista/rust/scheduler/src/scheduler_server/mod.rs

+        {
+            if let Some(event_loop) = self.event_loop.as_mut() {
+                // It's OK here, since we are sure the mutable reference only be used in the initialization
+                unsafe {


Can we avoid this by just not wrapping the EventLoop in an Arc?

Actually, in this PR the EventLoop don't need Arc. However, from the whole processing figure in #1704, the event loop will be used in somewhere else. The related PR is under progress.

I see. In that case, does the EventLoop need to be passed around, or can we just pass a Sender to different producers? In that case, the EventLoop could still be owned and multiple producers could hold a clone of the event loops Sender.

Maybe @alamb could comment on how much we should care to avoiding unsafe.

The philosophy in Rust in general is to avoid unsafe as much as possible. Typically it unsafe is used for one of two reasons:

Interfacing with other languages (like C/C++) that don't have the same lifetime system as Rust

Performance where the overhead of doing things like array bounds checks is too great

Neither of those situations seems to apply here.

I would personally suggest figuring out some way to avoid this unsafe block - @thinkharderdev has one suggestion;

I don't know this code well enough to offer anything specific, but I would say another pattern for something like this is "Interior Mutability" which is a fancy rust way of saying use something like Mutex. Perhaps

event_loop: Option<Arc<Mutex<EventLoop<SchedulerServerEvent>>>>,

Hi @alamb, the reason I prefer not to add Mutex here is that the mutation only happens in the initialization phase. In other places, only it's only for reading. And we can avoid the overhead of locks.

@thinkharderdev, this PR is for providing some encapsulation for the channel processing and extracting some common behaviors in the EventLoop.

Right, I think it's a great idea to encapsulate the event loop, but I think we can avoid the unsafe and the overhead of a Mutex or RwLock by decoupling the producer and consumer interface. If we want to encapsulate the error handling around sending an event on the underlying channel then perhaps we can create an explicit EventProducer like:

pub struct EventProducer<E> { tx_event: mpsc::Sender<E> } impl<E> EventProducer<E> { pub async fn post_event(&self, event: E) -> Result<()> { if let Some(tx_event) = self.tx_event.as_ref() { tx_event.send(event).await.map_err(|e| { BallistaError::General(format!("Fail to send event due to {}", e)) })? } else { warn!("Event sender not exist!!!"); } Ok(()) } } impl<E> EventLoop<E> { pub fn producer(&self) -> Result<EventProducer<E>> { EventPrdocuer::new(self.tx_event.cloned().ok_or(...)?) } }

My personal recommendation is go with the safe approach (even if it seems slower at first as it may require locks). My rationale is:

I have spent countless hours tracking down strange, subtle and hard to reproduce bugs related to memory corruption when working on C/C++ systems. My Rust experience has largely been free of such pain and I think the avoidance of that pain a key advantage of Rust and makes some of the painful parts of rust (like the borrow checker) worthwhile. While the unsafe during initialization is fine now, if the code changes over time there may be a real chance of race conditions or other situations not currently anticipated

I also think the overhead of taking a mutex is likely to be low, and you could make it lower still by using using a RwLock instead.

Even better would be if you can refactor the code so that it is clear when the initialization has occurred so you need neither mutex nor unsafe, perhaps as @thinkharderdev is suggesting

I like to think in most matters I am pretty pragmatic, but the use of unsafe is something I do feel quite strongly about.

Thanks @thinkharderdev and @alamb. I'll refine the code according to the @thinkharderdev's suggestion.

alamb

Thanks @yahoNanJing -- I skimmed this PR and the basic idea of breaking the system down into smaller modules I think is a great direction.

One thing I wanted to mention that might help with review / merge speed in the future is to break such refactoring PRs into two parts:

A PR that just moves code around / breaks up files but doesn't change behavior
The PR that changes behavior

The reason is that often moving files around causes large diffs in github but if there is no change in behavior can be straightforward to review / merge

The changes in behavior may take some more thought / comment and can be harder to understand in a larger diff.

Also, a larger PR has a greater chance to accumulate conflicts so if behavior changes get delayed in review, that can also cause issues

alamb · 2022-03-06T10:59:35Z

ballista/rust/core/src/event_loop.rs

+        tokio::spawn(async move {
+            info!("Starting the event loop {}", name);
+            while !stopped.load(Ordering::SeqCst) {
+                let event = rx_event.recv().await.unwrap();


In IOx and elsewhere instead of panic when the other end of a channel shuts down, we do something like

info!("Event Channel closed, shutting down"); return

so it is clear from the logs what is happening and gives the process / task a chance at a clean shutdown

alamb · 2022-03-06T11:03:21Z

ballista/rust/scheduler/src/scheduler_server/event_loop.rs

+    EventAction<SchedulerServerEvent> for SchedulerServerEventAction<T, U>
+{
+    // TODO
+    fn on_start(&self) {}


I am not sure what the plan is here, but sometimes the todo!() macro gets used in situations so the TODO isn't silently forgotten. However, since todo!() panic's this may not be possible if the functions are called

Suggested change

fn on_start(&self) {}

fn on_start(&self) {

todo!();

}

Here, the reason not to use todo!() is to avoid panic. The reason to add TODO comments is for future error handling or other things to make the whole system more robust. If it makes confusing, we can just remove them currently.

alamb · 2022-03-06T11:12:14Z

ballista/rust/scheduler/src/scheduler_server/mod.rs

+        {
+            if let Some(event_loop) = self.event_loop.as_mut() {
+                // It's OK here, since we are sure the mutable reference only be used in the initialization
+                unsafe {


The philosophy in Rust in general is to avoid unsafe as much as possible. Typically it unsafe is used for one of two reasons:

Interfacing with other languages (like C/C++) that don't have the same lifetime system as Rust

Performance where the overhead of doing things like array bounds checks is too great

Neither of those situations seems to apply here.

I would personally suggest figuring out some way to avoid this unsafe block - @thinkharderdev has one suggestion;

I don't know this code well enough to offer anything specific, but I would say another pattern for something like this is "Interior Mutability" which is a fancy rust way of saying use something like Mutex. Perhaps

event_loop: Option<Arc<Mutex<EventLoop<SchedulerServerEvent>>>>,

liukun4515 · 2022-03-06T12:14:56Z

Thanks @yahoNanJing -- I skimmed this PR and the basic idea of breaking the system down into smaller modules I think is a great direction.

One thing I wanted to mention that might help with review / merge speed in the future is to break such refactoring PRs into two parts:

A PR that just moves code around / breaks up files but doesn't change behavior

The PR that changes behavior

The reason is that often moving files around causes large diffs in github but if there is no change in behavior can be straightforward to review / merge

The changes in behavior may take some more thought / comment and can be harder to understand in a larger diff.

Also, a larger PR has a greater chance to accumulate conflicts so if behavior changes get delayed in review, that can also cause issues

I think we can review this pull request after rebase.
@alamb

yahoNanJing · 2022-03-07T04:17:43Z

Thanks @yahoNanJing -- I skimmed this PR and the basic idea of breaking the system down into smaller modules I think is a great direction.

One thing I wanted to mention that might help with review / merge speed in the future is to break such refactoring PRs into two parts:

A PR that just moves code around / breaks up files but doesn't change behavior

The PR that changes behavior

The reason is that often moving files around causes large diffs in github but if there is no change in behavior can be straightforward to review / merge

The changes in behavior may take some more thought / comment and can be harder to understand in a larger diff.

Also, a larger PR has a greater chance to accumulate conflicts so if behavior changes get delayed in review, that can also cause issues

Thanks @alamb. The background of doing this refactoring is mainly for achieving the final goal of #1704. Without these refactoring, all of the complex things will messed up in one or two files, which is really hard for maintenance.

For the step by step refactoring, every PR will be tested and verified without changing the behavior.

yahoNanJing · 2022-03-07T04:55:50Z

Thanks @yahoNanJing -- I skimmed this PR and the basic idea of breaking the system down into smaller modules I think is a great direction.

One thing I wanted to mention that might help with review / merge speed in the future is to break such refactoring PRs into two parts:

A PR that just moves code around / breaks up files but doesn't change behavior

The PR that changes behavior

The reason is that often moving files around causes large diffs in github but if there is no change in behavior can be straightforward to review / merge

The changes in behavior may take some more thought / comment and can be harder to understand in a larger diff.

Also, a larger PR has a greater chance to accumulate conflicts so if behavior changes get delayed in review, that can also cause issues

Thanks @alamb. The background of doing this refactoring is mainly for achieving the final goal of #1704. Without these refactoring, all of the complex things will messed up in one or two files, which is really hard for maintenance.

For the step by step refactoring, every PR will be tested and verified without changing the behavior.

alamb · 2022-03-07T12:00:35Z

ballista/rust/core/src/event_loop.rs

-                    }
-                    _ => {}
+                } else {
+                    info!("Event Channel closed, shutting down");


alamb · 2022-03-07T21:39:43Z

I am happy with this PR now other than the unsafe -- if @thinkharderdev or @liukun4515 is ok with the unsafe approach, I'll be happy to merge this PR. Otherwise I really suggest going with a RwLock

thinkharderdev · 2022-03-08T00:41:41Z

I am happy with this PR now other than the unsafe -- if @thinkharderdev or @liukun4515 is ok with the unsafe approach, I'll be happy to merge this PR. Otherwise I really suggest going with a RwLock

I would also prefer we remove the unsafe block before merging.

yahoNanJing · 2022-03-08T03:40:33Z

Hi @thinkharderdev and @alamb, could you help review the latest commit for fixing unsafe.

alamb

Looks great to me. Thank you @yahoNanJing and @thinkharderdev

I believe @liukun4515 is planning to review this tomorrow so I will hold off on merging until his review is complete

thinkharderdev

Looks great!

liukun4515

LGTM

alamb · 2022-03-09T13:38:23Z

Thanks again @yahoNanJing and to @thinkharderdev and @liukun4515 for the reviews

github-actions bot added the ballista label Mar 3, 2022

thinkharderdev reviewed Mar 4, 2022

View reviewed changes

alamb reviewed Mar 6, 2022

View reviewed changes

Refactor the event channel

90f43e6

Fix for PR comments

2bccbeb

yahoNanJing mentioned this pull request Mar 7, 2022

Introduce Ballista query stage scheduler #1935

Merged

alamb reviewed Mar 7, 2022

View reviewed changes

ballista/rust/core/src/event_loop.rs

}

_ => {}

} else {

info!("Event Channel closed, shutting down");

Copy link

Contributor

alamb Mar 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Remove unsafe

5002624

alamb approved these changes Mar 8, 2022

View reviewed changes

thinkharderdev approved these changes Mar 8, 2022

View reviewed changes

liukun4515 approved these changes Mar 9, 2022

View reviewed changes

alamb merged commit 48d6450 into apache:master Mar 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor the event channel #1912

Refactor the event channel #1912

yahoNanJing commented Mar 3, 2022

thinkharderdev Mar 4, 2022

yahoNanJing Mar 4, 2022

alamb Mar 6, 2022

yahoNanJing Mar 7, 2022

thinkharderdev Mar 4, 2022

yahoNanJing Mar 4, 2022

thinkharderdev Mar 5, 2022

alamb Mar 6, 2022

yahoNanJing Mar 7, 2022

thinkharderdev Mar 7, 2022

alamb Mar 7, 2022 •

edited

Loading

yahoNanJing Mar 8, 2022

alamb left a comment

alamb Mar 6, 2022

alamb Mar 6, 2022

yahoNanJing Mar 7, 2022 •

edited

Loading

alamb Mar 6, 2022

liukun4515 commented Mar 6, 2022

yahoNanJing commented Mar 7, 2022

yahoNanJing commented Mar 7, 2022

alamb Mar 7, 2022

alamb commented Mar 7, 2022

thinkharderdev commented Mar 8, 2022

yahoNanJing commented Mar 8, 2022

alamb left a comment

thinkharderdev left a comment

liukun4515 left a comment

alamb commented Mar 9, 2022 •

edited

Loading

Refactor the event channel #1912

Refactor the event channel #1912

Conversation

yahoNanJing commented Mar 3, 2022

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb Mar 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yahoNanJing Mar 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liukun4515 commented Mar 6, 2022

yahoNanJing commented Mar 7, 2022

yahoNanJing commented Mar 7, 2022

Choose a reason for hiding this comment

alamb commented Mar 7, 2022

thinkharderdev commented Mar 8, 2022

yahoNanJing commented Mar 8, 2022

alamb left a comment

Choose a reason for hiding this comment

thinkharderdev left a comment

Choose a reason for hiding this comment

liukun4515 left a comment

Choose a reason for hiding this comment

alamb commented Mar 9, 2022 • edited Loading

alamb Mar 7, 2022 •

edited

Loading

yahoNanJing Mar 7, 2022 •

edited

Loading

alamb commented Mar 9, 2022 •

edited

Loading