
Optimize txg_kick() process #12274

Merged
merged 1 commit into openzfs:master from the txg-kick-opt branch, Jul 1, 2021

Conversation

@jxdking (Contributor) commented Jun 23, 2021

Motivation and Context

Simplify the txg_kick() logic and clean up related code to make it more readable.
Potentially improve performance.

Description

Use dp_dirty_pertxg[] in txg_kick() instead of dp_dirty_total as in the
original code. An extra parameter, "txg", is added to txg_kick() so that it
knows which txg to kick. The txg_kick() call is also moved from
dsl_pool_need_dirty_delay() to dsl_pool_dirty_space(), where the txg number
assigned to the current transaction is known.

Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is also cleaned up.
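
A minimal sketch of the reworked interface, paraphrased from the description above (names follow the OpenZFS sources, but this is illustrative, not the exact patch):

    /* Before: txg_kick() looked at the pool-wide dp_dirty_total. */
    void txg_kick(dsl_pool_t *dp);

    /*
     * After: the caller passes the txg it just dirtied, so txg_kick()
     * can check dp_dirty_pertxg[txg & TXG_MASK] and kick that specific
     * open txg.
     */
    void txg_kick(dsl_pool_t *dp, uint64_t txg);

    void
    dsl_pool_dirty_space(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx)
    {
            /* ... account the dirty space against tx->tx_txg ... */
            txg_kick(dp, tx->tx_txg);   /* txg assigned to this dmu_tx */
    }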

How Has This Been Tested?

Ran fio and observed the txg_kick() behavior.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@jxdking changed the title from "Simplify txg_kick() process." to "Optimize txg_kick() process." Jun 24, 2021
@jxdking changed the title from "Optimize txg_kick() process." to "Optimize txg_kick() process" Jun 24, 2021
@jxdking marked this pull request as ready for review June 24, 2021 02:48
@jxdking force-pushed the txg-kick-opt branch 3 times, most recently from 38a1f1b to a7e8bb7, June 25, 2021 19:29
@jxdking (Contributor, Author) commented Jun 26, 2021

On CentOS Stream 8, zfs_get_009_pos failed with:

20:16:17.54 cannot mount 'testpool/depth_fs/fs_2_depth1/fs_3_depth2/fs_2_depth3': Cannot allocate memory filesystem successfully created, but not mounted
20:16:17.54 ERROR: zfs create testpool/depth_fs/fs_2_depth1/fs_3_depth2/fs_2_depth3 exited 1

@behlendorf (Contributor) commented:

We see that on occasion; I've resubmitted that build.

@behlendorf added the "Status: Code Review Needed" (Ready for review and testing) label Jun 26, 2021
@jxdking (Contributor, Author) commented Jun 26, 2021

> We see that on occasion; I've resubmitted that build.

Appreciate it.

@ahrens (Member) left a review:

Aside from the quiescing changes, this looks good.

module/zfs/txg.c (outdated)
Review comment on lines 558 to 575:

    if (txg_is_quiescing(dp)) {
            txg_thread_wait(tx, &cpr,
                &tx->tx_quiesce_done_cv, 0);
            continue;
    }
    txg = tx->tx_open_txg;
    tx->tx_quiescing_txg = txg;

    mutex_exit(&tx->tx_sync_lock);
    txg_quiesce(dp, txg);
    mutex_enter(&tx->tx_sync_lock);

    dprintf("quiesce done, txg %llu\n", (u_longlong_t)txg);
    tx->tx_quiescing_txg = 0;
    tx->tx_quiesced_txg = txg;
    DTRACE_PROBE2(txg__quiesced, dsl_pool_t *, dp,
        uint64_t, txg);
    cv_broadcast(&tx->tx_quiesce_done_cv);
@ahrens (Member) commented inline:
The existing code was surprisingly tricky to get right. The proposed change here makes the code more complicated (we could either do the quiesce from the quiesce thread or from the sync thread). Can you share some performance results to motivate the added complexity?

@jxdking (Contributor, Author) commented Jun 28, 2021:

The quiescing process is so fast that it costs less than a context switch between threads. My point is that it may not be worth quiescing txgs on a separate thread.
Is the performance difference noticeable? Sadly, no. Quiescing/syncing only happens once every couple of seconds, and saving microseconds out of seconds is not measurable.

This portion of the code could be left as it is in the current openzfs/zfs.
Or do you suggest removing txg_quiesce_thread altogether? Maybe in a future pull request.

@ahrens (Member) commented inline:

I would suggest removing these changes from this PR, and perhaps looking at removing txg_quiesce_thread altogether in a future PR. Since the performance implications are not important, I would make those changes with an eye to improving code clarity and design.

@jxdking (Contributor, Author) commented Jun 29, 2021:

Just re-pushed the commit.

                while (!tx->tx_exiting && !txg_has_quiesced_to_sync(dp)) {
+                       if (txg_is_quiescing(dp)) {
+                               txg_thread_wait(tx, &cpr,
+                                   &tx->tx_quiesce_done_cv, 0);
+                               continue;
+                       }
                        if (tx->tx_quiesce_txg_waiting < tx->tx_open_txg+1)
                                tx->tx_quiesce_txg_waiting = tx->tx_open_txg+1;
                        cv_broadcast(&tx->tx_quiesce_more_cv);
                        txg_thread_wait(tx, &cpr, &tx->tx_quiesce_done_cv, 0);
                }

The reason I want to keep that portion is that I did find a very rare race condition: tx_open_txg can be changed (by the quiesce thread) between the check "if (tx->tx_quiesce_txg_waiting < tx->tx_open_txg+1)" and the assignment "tx->tx_quiesce_txg_waiting = tx->tx_open_txg+1;".

Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in
original code. Extra parameter "txg" is added for txg_kick(), thus it
knows which txg to kick. Also txg_kick() call is moved from
dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can
know the txg number assigned for txg_kick().

Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is
also cleaned up.

Signed-off-by: jxdking <[email protected]>
@amotin (Member) left a review:

See previous inline comment.

@amotin (Member) commented Jun 29, 2021

Looking at txg_quiesce() now, all I see it doing is opening a new transaction group, for which it takes all the per-CPU tc_open_locks, and, after dropping them, waiting for all open transactions to be released to sync. The first part we need to call quickly, without delaying too much, so as not to extend the currently open txg indefinitely, but it seems to be quite fast. The only question is in what context we can do it without causing lock-ordering violations. Can't it be called from txg_kick() itself? Since txg_kick() already takes tx_sync_lock, I suspect the ordering should be fine. The second part of txg_quiesce() is just waiting, and it could be done in the sync thread, since the sync thread needs it to be completed and can't proceed otherwise.

@ahrens @behlendorf Am I missing something?
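
For reference, a condensed sketch of the two phases described above, paraphrased from txg_quiesce() in module/zfs/txg.c (tracing, timestamps, and other details elided; treat as illustrative):

    static void
    txg_quiesce(dsl_pool_t *dp, uint64_t txg)
    {
            tx_state_t *tx = &dp->dp_tx;
            uint64_t g = txg & TXG_MASK;
            int c;

            /*
             * Phase 1 (fast): take every per-CPU tc_open_lock so no new
             * transaction can enter this txg, advance tx_open_txg, then
             * drop the locks again.
             */
            for (c = 0; c < max_ncpus; c++)
                    mutex_enter(&tx->tx_cpu[c].tc_open_lock);
            tx->tx_open_txg++;
            for (c = 0; c < max_ncpus; c++)
                    mutex_exit(&tx->tx_cpu[c].tc_open_lock);

            /*
             * Phase 2 (waiting): block until every transaction still open
             * in this txg has been committed, i.e. released to sync.
             */
            for (c = 0; c < max_ncpus; c++) {
                    tx_cpu_t *tc = &tx->tx_cpu[c];
                    mutex_enter(&tc->tc_lock);
                    while (tc->tc_count[g] != 0)
                            cv_wait(&tc->tc_cv[g], &tc->tc_lock);
                    mutex_exit(&tc->tc_lock);
            }
    }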

@amotin (Member) commented Jun 30, 2021

@jxdking Looking now, I see the !txg_is_syncing(dp) in txg_kick(), so scratch that part. I really don't understand why we have a separate quiescing thread now; maybe it made sense before, but that got lost over the years. Not your fault. Let's return to the previous version of your patch for now. I'm sorry for the noise.

@jxdking (Contributor, Author) commented Jun 30, 2021

@amotin , not a problem.
I just rolled back the commit.

@behlendorf added the "Status: Accepted" (Ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (Ready for review and testing) label Jun 30, 2021
@amotin (Member) commented Jun 30, 2021

@jxdking But we still need to decide what to do with the quiescing state and the thread. If there is a way it can take real time, pipelining it properly against syncing would be good; the question is by how much. If there is really none, then we just don't need the quiescing thread, and maybe not the state either.

One downside I see in your patch is that it calls txg_kick() while db_mtx is held in dbuf_dirty(). It should not be a problem now, as far as I can see, but it may become a complication if we try to do anything more in the txg_kick() context than cv_broadcast().

@jxdking (Contributor, Author) commented Jun 30, 2021

> One downside I see in your patch is that it calls txg_kick() while db_mtx is held in dbuf_dirty(). It should not be a problem now, as far as I can see, but it may become a complication if we try to do anything more in the txg_kick() context than cv_broadcast().

Yes. Also, the extra txg parameter of txg_kick() comes from tx->tx_txg in dsl_pool_dirty_space(), which means the current dmu_tx has not yet been released to sync. We should not put the quiesce routine in this txg_kick().
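
To make this concrete, a hypothetical sketch (not code from the patch) of why quiescing inside this txg_kick() would deadlock -- the caller itself still holds an open tx in that txg:

    tx = dmu_tx_create(os);
    VERIFY0(dmu_tx_assign(tx, TXG_WAIT));   /* tx joins txg tx->tx_txg */
    /*
     * ... dirtying buffers runs dsl_pool_dirty_space() with this tx
     * still open, which ends up calling txg_kick(dp, tx->tx_txg) ...
     *
     * If txg_kick() quiesced here, it would wait for the open-count of
     * tx->tx_txg to reach zero -- but this tx is part of that count and
     * is not committed until after we return.
     */
    dmu_tx_commit(tx);      /* only now is the txg released to sync */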

@amotin (Member) commented Jun 30, 2021

> We should not put the quiesce routine in this txg_kick().

I didn't mean all of it. As I said above, it could be split into two pieces: quiescing could possibly be started in one place but waited for completion in another (the sync thread).

@jxdking (Contributor, Author) commented Jun 30, 2021

> If there is a way it can take real time

I guess zfs_log_write() in zfs_write(), right before dmu_tx_commit(tx), may be the one you are looking for.

@mmaybee merged commit 50e09ed into openzfs:master Jul 1, 2021
@jxdking (Contributor, Author) commented Jul 2, 2021

> But we still need to decide what to do with the quiescing state and the thread. If there is a way it can take real time

@amotin, perhaps this information may interest you.
I did come across a situation that causes a long quiescing state (somewhat reliably); I was testing with fsync or sync=on a couple of weeks ago.
Right now, I just spotted "lwb->lwb_tx" and "dmu_tx_commit(tx);" inside zil_lwb_flush_vdevs_done(). That tx is created in zil_lwb_write_issue() with dmu_tx_assign(tx, TXG_WAIT | TXG_NOTHROTTLE), which can block the quiesce stage if I am not wrong. Probably I can put a sleep() call there to verify the assumption.
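
A condensed sketch of the code path being described, paraphrased from module/zfs/zil.c (error handling and unrelated fields omitted; treat as illustrative):

    /*
     * zil_lwb_write_issue(): the lwb holds an assigned-but-uncommitted
     * tx across the entire write/flush I/O, keeping its txg open.
     */
    tx = dmu_tx_create(zilog->zl_os);
    VERIFY0(dmu_tx_assign(tx, TXG_WAIT | TXG_NOTHROTTLE));
    lwb->lwb_tx = tx;

    /*
     * zil_lwb_flush_vdevs_done(): only after the vdev flushes complete
     * is the tx committed, letting txg_quiesce() finish waiting on
     * that txg.
     */
    dmu_tx_commit(lwb->lwb_tx);
    lwb->lwb_tx = NULL;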

tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 21, 2022
Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in
original code. Extra parameter "txg" is added for txg_kick(), thus it
knows which txg to kick. Also txg_kick() call is moved from
dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can
know the txg number assigned for txg_kick().

Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is
also cleaned up.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: jxdking <[email protected]>
Closes openzfs#12274
@ghost mentioned this pull request Oct 11, 2022