Optimize txg_kick() process #12274
Conversation
38a1f1b to a7e8bb7
CentOS Stream 8, zfs_get_009_pos failed at 20:16:17.54: cannot mount 'testpool/depth_fs/fs_2_depth1/fs_3_depth2/fs_2_depth3': Cannot allocate memory. Filesystem successfully created, but not mounted.
We see that on occasion; I've resubmitted that build.
Appreciated it.
Aside from the quiescing changes, this looks good.
module/zfs/txg.c
Outdated
if (txg_is_quiescing(dp)) {
        txg_thread_wait(tx, &cpr,
            &tx->tx_quiesce_done_cv, 0);
        continue;
}
txg = tx->tx_open_txg;
tx->tx_quiescing_txg = txg;

mutex_exit(&tx->tx_sync_lock);
txg_quiesce(dp, txg);
mutex_enter(&tx->tx_sync_lock);

dprintf("quiesce done, txg %llu\n", (u_longlong_t)txg);
tx->tx_quiescing_txg = 0;
tx->tx_quiesced_txg = txg;
DTRACE_PROBE2(txg__quiesced, dsl_pool_t *, dp,
    uint64_t, txg);
cv_broadcast(&tx->tx_quiesce_done_cv);
The existing code was surprisingly tricky to get right. The proposed change here makes the code more complicated (we could either do the quiesce from the quiesce thread or from the sync thread). Can you share some performance results to motivate the added complexity?
The quiescing process is so fast that it costs even less than a context switch between threads. My point is that it may not be worth quiescing txgs on a separate thread.
Is the performance difference noticeable? Sadly, no. Quiescing/syncing only happens once every couple of seconds, and saving microseconds out of seconds is not measurable.
This portion of the code may be left as it is on current "openzfs/zfs".
Or do you suggest removing txg_quiesce_thread altogether? Maybe in a future pull request.
I would suggest removing these changes from this PR, and perhaps looking at removing txg_quiesce_thread altogether in a future PR. Since the performance implications are not important, I would make those changes with an eye to improving code clarity and design.
Just re-pushed the commit.
while (!tx->tx_exiting && !txg_has_quiesced_to_sync(dp)) {
+ if (txg_is_quiescing(dp)) {
+ txg_thread_wait(tx, &cpr,
+ &tx->tx_quiesce_done_cv, 0);
+ continue;
+ }
if (tx->tx_quiesce_txg_waiting < tx->tx_open_txg+1)
tx->tx_quiesce_txg_waiting = tx->tx_open_txg+1;
cv_broadcast(&tx->tx_quiesce_more_cv);
txg_thread_wait(tx, &cpr, &tx->tx_quiesce_done_cv, 0);
}
The reason I want to keep that portion is that I found a very rare race condition where tx_open_txg is changed (by the quiesce thread) between "if (tx->tx_quiesce_txg_waiting < tx->tx_open_txg+1)" and "tx->tx_quiesce_txg_waiting = tx->tx_open_txg+1;".
Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in original code. Extra parameter "txg" is added for txg_kick(), thus it knows which txg to kick. Also txg_kick() call is moved from dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can know the txg number assigned for txg_kick(). Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is also cleaned up.

Signed-off-by: jxdking <[email protected]>
See previous inline comment.
Looking at txg_quiesce() now, all I see it doing is opening a new transaction group, for which it takes all the per-CPU tc_open_locks, and after dropping them it waits for all open transactions to be released to sync. The first part we need to call quickly, without much delay, so as not to extend the currently open txg indefinitely, but it seems to be quite fast. The only question is in what context we can do it without causing lock ordering violations. Can't it be called from txg_kick() itself? Since txg_kick() already takes tx_sync_lock, I suspect the ordering should be fine. The second part of txg_quiesce() is just waiting, and it could be done in the sync thread, since it needs the quiesce to be completed and can't proceed otherwise. @ahrens @behlendorf Am I missing something?
@jxdking Looking again, I see the !txg_is_syncing(dp) in txg_kick(), so scratch that part. I really don't understand why we have a separate quiescing thread now. Maybe it made sense before, but that got lost over the years. Not your fault. Let's return to the previous version of your patch for now. I'm sorry for the noise.
@amotin , not a problem.
@jxdking But we still need to decide what to do with the quiescing state and the thread. If there is a way it can take time, pipelining it properly against syncing would be good; the question is by how much. If there is really none, then we just don't need the quiescing thread, and maybe not the state either. One downside I see with your patch is that it calls txg_kick() while db_mtx is held in dbuf_dirty(). It should not be a problem now, as far as I can see, but it may become a complication if we try to do anything more in the txg_kick() context than cv_broadcast().
Yes. Also, the extra txg parameter of txg_kick() comes from tx->tx_txg in dsl_pool_dirty_space(), which means the current dmu_tx is not released to sync yet. We should not put the quiesce routine in this txg_kick().
I didn't mean all of it. As I said above, it could be split into two pieces: quiescing could possibly be started in one place, but waited for in another (the sync thread).
I guess in zfs_write(), right before dmu_tx_commit(tx), zfs_log_write() may be the place you are looking for.
@amotin , perhaps this information may interest you. |
Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in original code. Extra parameter "txg" is added for txg_kick(), thus it knows which txg to kick. Also txg_kick() call is moved from dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can know the txg number assigned for txg_kick(). Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is also cleaned up.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: jxdking <[email protected]>
Closes openzfs#12274
Motivation and Context
Simplify txg_kick() logic and clean up code. Make related code more readable.
Potentially improve performance.
Description
Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in
original code. Extra parameter "txg" is added for txg_kick(), thus it
knows which txg to kick. Also txg_kick() call is moved from
dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can
know the txg number assigned for txg_kick().
Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is also cleaned up.
How Has This Been Tested?
Ran fio and observed txg_kick() behavior.
Types of changes
Checklist:
Signed-off-by