
making sure the last quiesced txg is synced #8239

Closed
wants to merge 6 commits

Conversation

xwcal commented Jan 5, 2019

Motivation and Context

See #8233

Consider this scenario (see txg.c):
There is heavy write load while the pool is being exported.
After txg_sync_stop's call to txg_wait_synced returns, many more txgs get processed, but right before txg_sync_stop acquires tx_sync_lock, the following happens (a toy model of this interleaving is sketched after the list):

  • txg_sync_thread begins waiting on tx_sync_more_cv.
  • txg_quiesce_thread finishes txg_quiesce(dp, txg).
  • txg_sync_stop acquires tx_sync_lock first, issues its cv_broadcast calls with tx_exiting == 1, and waits for the threads to exit.
  • txg_sync_thread wakes up first and exits.
  • Finally, txg_quiesce_thread acquires tx_sync_lock and calls cv_broadcast(&tx->tx_sync_more_cv),
    but txg_sync_thread is already gone, and the txg from txg_quiesce(dp, txg) above never gets synced.
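
To make the interleaving concrete, here is a minimal userspace sketch of the race. It is not OpenZFS code: the thread and variable names only loosely mirror txg.c, "quiescing" and "syncing" a txg are reduced to bumping counters, and short sleeps stand in for real work so the missed wakeup usually reproduces.

/* Toy pthread model of the shutdown race; names only loosely follow txg.c. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t tx_sync_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t tx_sync_more_cv = PTHREAD_COND_INITIALIZER;
static int tx_exiting;
static int tx_quiesced_txg;     /* last txg handed to the sync thread */
static int tx_synced_txg;       /* last txg the sync thread wrote out */
static int quiescer_done;       /* set when the quiesce thread exits */

static void *
sync_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&tx_sync_lock);
        for (;;) {
                while (tx_synced_txg == tx_quiesced_txg && !tx_exiting)
                        pthread_cond_wait(&tx_sync_more_cv, &tx_sync_lock);
                if (tx_exiting && tx_synced_txg == tx_quiesced_txg)
                        break;          /* nothing pending right now, so leave */
                tx_synced_txg = tx_quiesced_txg;        /* "sync" the pending txg */
        }
        pthread_mutex_unlock(&tx_sync_lock);
        return NULL;
}

static void *
quiesce_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&tx_sync_lock);
        while (!tx_exiting) {
                int txg = tx_quiesced_txg + 1;  /* heavy write load: always another txg */
                pthread_mutex_unlock(&tx_sync_lock);
                usleep(50 * 1000);              /* "txg_quiesce(dp, txg)", outside the lock */
                pthread_mutex_lock(&tx_sync_lock);
                tx_quiesced_txg = txg;          /* hand the txg to the sync thread */
                pthread_cond_broadcast(&tx_sync_more_cv);       /* may arrive too late */
        }
        quiescer_done = 1;
        pthread_cond_broadcast(&tx_sync_more_cv);
        pthread_mutex_unlock(&tx_sync_lock);
        return NULL;
}

int
main(void)
{
        pthread_t s, q;

        pthread_create(&s, NULL, sync_thread, NULL);
        pthread_create(&q, NULL, quiesce_thread, NULL);
        usleep(120 * 1000);                     /* let a few txgs go through */

        /* "txg_sync_stop": grab the lock while the quiescer is mid-quiesce. */
        pthread_mutex_lock(&tx_sync_lock);
        tx_exiting = 1;
        pthread_cond_broadcast(&tx_sync_more_cv);
        pthread_mutex_unlock(&tx_sync_lock);
        pthread_join(s, NULL);
        pthread_join(q, NULL);

        printf("quiesced txg %d, synced txg %d%s\n", tx_quiesced_txg,
            tx_synced_txg, tx_quiesced_txg == tx_synced_txg ?
            "" : " (the last quiesced txg was never synced)");
        return 0;
}

Built with cc -pthread, most runs end with the quiesced counter one ahead of the synced counter, which mirrors the lost txg described above; the exact outcome depends on scheduling.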

Description

txg_sync_thread now waits for txg_quiesce_thread to exit, and may run one more sync, before exiting itself.
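
In terms of the toy model above (again only a sketch, not the actual patch), the change amounts to a drop-in replacement for its sync_thread(): do not leave while the quiesce thread may still hand over a txg, and sync whatever it handed over in the meantime.

/* Replacement for sync_thread() in the toy model above; illustrative only. */
static void *
sync_thread_fixed(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&tx_sync_lock);
        for (;;) {
                while (tx_synced_txg == tx_quiesced_txg && !tx_exiting)
                        pthread_cond_wait(&tx_sync_more_cv, &tx_sync_lock);
                if (tx_exiting && tx_synced_txg == tx_quiesced_txg) {
                        /* Wait until the quiesce thread is gone for good ... */
                        while (!quiescer_done)
                                pthread_cond_wait(&tx_sync_more_cv, &tx_sync_lock);
                        /* ... then sync anything it handed over in the meantime. */
                        tx_synced_txg = tx_quiesced_txg;
                        break;
                }
                tx_synced_txg = tx_quiesced_txg;        /* "sync" the pending txg */
        }
        pthread_mutex_unlock(&tx_sync_lock);
        return NULL;
}

Swapping this in for sync_thread() in the model's main() makes the final printout report matching quiesced and synced txg numbers.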

How Has This Been Tested?

Did not test.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

Fixed a potential bug as described in openzfs#8233:

Consider this scenario (see [txg.c](https://github.com/zfsonlinux/zfs/blob/06f3fc2a4b097545259935d54634c5c6f49ed20f/module/zfs/txg.c)):
There is heavy write load while the pool is being exported.
After `txg_sync_stop`'s call to `txg_wait_synced` returns, many more txgs get processed, but right before `txg_sync_stop` acquires `tx_sync_lock`, the following happens:

- `txg_sync_thread` begins waiting on `tx_sync_more_cv`.
- `txg_quiesce_thread` finishes `txg_quiesce(dp, txg)`.
- `txg_sync_stop` acquires `tx_sync_lock` first, issues its `cv_broadcast` calls with `tx_exiting` == 1, and waits for the threads to exit.
- `txg_sync_thread` wakes up first and exits.
- Finally, `txg_quiesce_thread` acquires `tx_sync_lock` and calls `cv_broadcast(&tx->tx_sync_more_cv)`,
but `txg_sync_thread` is already gone, and the txg from `txg_quiesce(dp, txg)` above never gets synced.

Signed-off-by: Leap Second <[email protected]>
codecov bot commented Jan 5, 2019

Codecov Report

Merging #8239 into master will decrease coverage by 0.11%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8239      +/-   ##
==========================================
- Coverage   78.57%   78.45%   -0.12%     
==========================================
  Files         379      379              
  Lines      114924   114927       +3     
==========================================
- Hits        90299    90166     -133     
- Misses      24625    24761     +136
Flag Coverage Δ
#kernel 78.98% <100%> (ø) ⬆️
#user 67.2% <100%> (-0.19%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b8e441...025861a. Read the comment docs.

xwcal added 3 commits January 5, 2019 04:10
Fixed checkstyle complaints:
./module/zfs/txg.c: 558: line > 80 characters
./module/zfs/txg.c: 562: line > 80 characters

Signed-off-by: Leap Second <[email protected]>
Addressed checkstyle complaints:
./module/zfs/txg.c: 559: continuation should be indented 4 spaces
./module/zfs/txg.c: 564: continuation should be indented 4 spaces

Signed-off-by: Leap Second <[email protected]>
Addressed checkstyle complaints:
./module/zfs/txg.c: 559: spaces instead of tabs
./module/zfs/txg.c: 559: continuation should be indented 4 spaces
./module/zfs/txg.c: 564: spaces instead of tabs
./module/zfs/txg.c: 564: continuation should be indented 4 spaces

Signed-off-by: Leap Second <[email protected]>
xwcal (Author) commented Jan 5, 2019

Kernel.org Built-in x86_64 (BUILD) keeps failing with the following error in the make log:

In file included from ./include/zfs/spl/sys/condvar.h:33:0,
from ./include/zfs/sys/zfs_context.h:38,
from ./include/zfs/sys/crypto/common.h:39,
from fs/zfs/icp/illumos-crypto.c:35:
./include/zfs/spl/sys/time.h: In function 'gethrestime':
./include/zfs/spl/sys/time.h:76:8: error: implicit declaration of function 'current_kernel_time64'; did you mean 'core_kernel_text'? [-Werror=implicit-function-declaration]
*ts = current_kernel_time64();
^~~~~~~~~~~~~~~~~~~~~
core_kernel_text
./include/zfs/spl/sys/time.h:76:6: error: incompatible types when assigning to type 'inode_timespec_t {aka struct timespec64}' from type 'int'
*ts = current_kernel_time64();
^
./include/zfs/spl/sys/time.h: In function 'gethrestime_sec':
./include/zfs/spl/sys/time.h:86:24: error: invalid initializer
inode_timespec_t ts = current_kernel_time64();
^~~~~~~~~~~~~~~~~~~~~
CC drivers/acpi/acpica/utresdecode.o
cc1: some warnings being treated as errors
make[3]: *** [fs/zfs/icp/illumos-crypto.o] Error 1
make[2]: *** [fs/zfs/icp] Error 2
make[1]: *** [fs/zfs] Error 2
make: *** [fs] Error 2

Any thoughts on why?

EDIT:
Never mind, this is an existing issue:
http://build.zfsonlinux.org/builders/Kernel.org%20Built-in%20x86_64%20%28BUILD%29/builds/24439
8501452
maybe relevant:
https://lkml.org/lkml/2018/12/17/110
https://lkml.org/lkml/2018/12/7/405
and
https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#c.current_kernel_time64

These are replaced by ktime_get_coarse_real_ts64() and ktime_get_coarse_ts64(). However, A lot of code that wants coarse-grained times can use the simple ‘jiffies’ instead, while some drivers may actually want the higher resolution accessors these days.
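
For what it's worth, a minimal sketch of how the SPL time shim could cope with the removal; the HAVE_KTIME_GET_COARSE_REAL_TS64 configure guard is hypothetical here, not necessarily what the eventual fix used:

/* Hypothetical sketch for the SPL's sys/time.h, not the actual fix. */
static inline void
gethrestime(inode_timespec_t *ts)
{
#if defined(HAVE_KTIME_GET_COARSE_REAL_TS64)
        /* current_kernel_time64() was removed upstream; use the replacement. */
        ktime_get_coarse_real_ts64(ts);
#else
        *ts = current_kernel_time64();
#endif
}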

xwcal (Author) commented Jan 6, 2019

Since this happened in my fork rather than in master, I am not sure about opening a new issue.
But since this could come up again I thought I might as well document it:

I was making trivial edits to please checkstyle (hope I didn't rival others on an early Saturday morning ;) ). After three successful runs on Ubuntu 18.04 x86_64 (TEST), the fourth one failed with

command timed out: 4200 seconds without output running ['runurl', 'https://raw.githubusercontent.com/zfsonlinux/zfs-buildbot/master/scripts/bb-test-zfstests.sh'], attempting to kill

success
success
success
fail

Here are some snippets from relevant logs:

Last few lines from 7.2.tests of the failed run:

Test: /usr/share/zfs/zfs-tests/tests/functional/snapshot/snapshot_005_pos (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/snapshot/snapshot_006_pos (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/snapshot/snapshot_007_pos (run as root) [00:01] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/snapshot/snapshot_008_pos (run as root) [00:00] [PASS]
Terminated

The last few lines from 7.3.log:

Test: /usr/share/zfs/zfs-tests/tests/functional/snapshot/snapshot_008_pos (run as root) [00:00] [PASS]
16:45:05.51 ASSERTION: Verify that destroying snapshots returns space to the pool.
16:45:05.52 NOTE: Populate the /mnt/testdir directory
16:45:05.52 SUCCESS: file_write -o create -f /mnt/testdir/file1 -b 8192 -c 20 -d 1
16:45:05.56 SUCCESS: zfs snapshot testpool/[email protected]
16:45:05.57 SUCCESS: file_write -o create -f /mnt/testdir/file2 -b 8192 -c 20 -d 2
16:45:05.60 SUCCESS: zfs snapshot testpool/[email protected]
16:45:05.61 SUCCESS: file_write -o create -f /mnt/testdir/file3 -b 8192 -c 20 -d 3
16:45:05.64 SUCCESS: zfs snapshot testpool/[email protected]
16:45:05.64 SUCCESS: file_write -o create -f /mnt/testdir/file4 -b 8192 -c 20 -d 4
16:45:05.67 SUCCESS: zfs snapshot testpool/[email protected]
16:45:05.67 SUCCESS: file_write -o create -f /mnt/testdir/file5 -b 8192 -c 20 -d 5
16:45:05.70 SUCCESS: zfs snapshot testpool/[email protected]
16:45:05.70 SUCCESS: file_write -o create -f /mnt/testdir/file6 -b 8192 -c 20 -d 6
16:45:05.73 SUCCESS: zfs snapshot testpool/[email protected]
16:45:05.74 SUCCESS: file_write -o create -f /mnt/testdir/file7 -b 8192 -c 20 -d 7
16:45:05.76 SUCCESS: zfs snapshot testpool/[email protected]
16:45:05.77 SUCCESS: file_write -o create -f /mnt/testdir/file8 -b 8192 -c 20 -d 8
16:45:05.80 SUCCESS: zfs snapshot testpool/[email protected]
16:45:05.80 SUCCESS: file_write -o create -f /mnt/testdir/file9 -b 8192 -c 20 -d 9
16:45:05.83 SUCCESS: zfs snapshot testpool/[email protected]
16:45:05.86 SUCCESS: zfs destroy testpool/[email protected]
16:45:05.89 SUCCESS: zfs destroy testpool/[email protected]
16:45:05.92 SUCCESS: zfs destroy testpool/[email protected]
16:45:05.95 SUCCESS: zfs destroy testpool/[email protected]
16:45:05.98 SUCCESS: zfs destroy testpool/[email protected]
16:45:06.02 SUCCESS: zfs destroy testpool/[email protected]
16:45:06.05 SUCCESS: zfs destroy testpool/[email protected]
16:45:06.08 SUCCESS: zfs destroy testpool/[email protected]
16:45:06.11 SUCCESS: zfs destroy testpool/[email protected]
16:45:06.15 SUCCESS: zpool sync testpool
16:45:06.15 After destroying snapshots, the space is returned to the pool.
16:45:06.15 NOTE: Performing local cleanup via log_onexit (cleanup)

And the relevant part in 7.4.console:

[13714.051945] WARNING: Unable to automount /mnt/testdir/.zfs/snapshot/testsnap.8/testpool/[email protected]: 256
[13714.147827] WARNING: Unable to automount /mnt/testdir/.zfs/snapshot/testsnap.9/testpool/[email protected]: 256
[13714.520884] WARNING: Unable to automount /mnt/testdir1/.zfs/snapshot/testsnap/testpool/testctr/testfs1@testsnap: 256
[13715.044997] WARNING: Unable to automount /mnt/testdir1/.zfs/snapshot/testsnap/testpool/testctr/testfs1@testsnap: 256
[13717.056079] VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg)) failed
[13717.061400] PANIC at spa.c:7869:spa_sync()
[13717.064799] Showing stack for process 10853
[13717.064801] CPU: 1 PID: 10853 Comm: txg_sync Tainted: P OE 4.15.0-1007-aws #7-Ubuntu
[13717.064802] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[13717.064803] Call Trace:
[13717.064812] dump_stack+0x63/0x8b
[13717.064824] spl_dumpstack+0x29/0x30 [spl]
[13717.064830] spl_panic+0xc8/0x110 [spl]
[13717.064835] ? _wake_up_common_lock+0x8e/0xc0
[13717.064928] ? queued_spin_unlock+0xb/0x10 [zfs]
[13717.064997] ? queued_spin_unlock+0xb/0x10 [zfs]
[13717.065064] spa_sync+0x1042/0x10b0 [zfs]
[13717.065136] txg_sync_thread+0x2ef/0x4a0 [zfs]
[13717.065205] ? txg_dispatch_callbacks+0x100/0x100 [zfs]
[13717.065214] thread_generic_wrapper+0x7d/0xc0 [spl]
[13717.065218] kthread+0x121/0x140
[13717.065225] ? IS_ERR+0x10/0x10 [spl]
[13717.065227] ? kthread_create_worker_on_cpu+0x70/0x70
[13717.065231] ? do_syscall_64+0x73/0x130
[13717.065234] ? SyS_exit_group+0x14/0x20
[13717.065237] ret_from_fork+0x35/0x40
[13897.124668] INFO: task kworker/u30:0:32082 blocked for more than 120 seconds.
[13897.129283] Tainted: P OE 4.15.0-1007-aws #7-Ubuntu
[13897.134130] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13897.139437] kworker/u30:0 D 0 32082 2 0x80000000
[13897.139446] Workqueue: events_freezable_power
disk_events_workfn

Looks like the zfs destroy in cleanup triggered the panic. But I haven't learned enough about ZFS to figure out the cause.

@behlendorf can you take a look when you get time?

behlendorf (Contributor) commented:

@seekfirstleapsecond thanks for opening the PR and looking into the failures. Don't worry about the kernel.org failures; they appear to be due to recent changes in an unreleased kernel and will need to be investigated independently. As for that last Ubuntu 18.04 x86_64 (TEST) failure, it's a known issue which is fairly hard to reproduce and which we occasionally see when running the ZTS. It needs to be investigated and resolved.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jan 8, 2019
xwcal added 2 commits January 7, 2019 17:37
Signed-off-by: Leap Second <[email protected]>
Signed-off-by: Leap Second <[email protected]>
@behlendorf behlendorf requested a review from ahrens January 8, 2019 17:31
c0d3z3r0 (Contributor) commented:

ping

amotin (Member) commented Nov 19, 2024

I fail to see why this is a problem. By the time txg_sync_stop() is called there should be no more pool activity, otherwise calling it makes no sense. If there is still something ongoing, then it does not matter how much we sync, it will never be enough. Also the fact that we quiesced some TXG does not mean we must sync it. It only means that it no longer has open transactions.

@amotin amotin closed this Nov 19, 2024
xwcal (Author) commented Dec 12, 2024

Concurrency is certainly not everybody's cup of tea. Conversations between those who get it and those who don't tend to get difficult. Whoever feels qualified to talk about concurrency might wish to look at this and this and self-assess their understanding of the arguments and of concurrency itself. (Remember Redis?)

Regarding this PR, I have nothing to say about its relevance today since it's been a long time since I last seriously looked at ZFS source code. But hopefully those who seriously considered the issue and understood my argument can eventually find it useful in some way.

amotin (Member) commented Dec 12, 2024

Concurrency is certainly not everybody's cup of tea. Conversations between those who get it and those who don't tend to get difficult.

@xwcal Shall I consider this an unasked personal attack? And why are those links about locking here?

those who seriously considered the issue and understood my argument

After txg_sync_stop() calls txg_wait_synced(dp, tx->tx_open_txg + TXG_DEFER_SIZE) there should be no transactions in the quiesced state. And if there are, it is not a question of syncing everything quiesced, it is a question of some process not being stopped.

xwcal (Author) commented Dec 14, 2024

@amotin

Shall I consider this an unasked personal attack?

It's certainly not up to me to restrict your freedom to interpret it according to your instincts, inasmuch as I have the freedom to interpret this sloppy ending of a five-year-long collective inquiry according to my instincts. Was there an urge to help improve ZFS or a rush to inflate/deflate certain metrics?

why are those links about locking here?

The links are technically very interesting, and also serve as a good example of a conversation on concurrency getting difficult.

there should be no transactions in the quiesced state

How so?

it is a question of some process not being stopped

What ZFS mechanism ensures that all writing processes get stopped BEFORE txg_sync_stop() gets called? If no such mechanism exists (since, obviously, ZFS is not an OS), how is this BEFORE/AFTER ordering enforced? Or, rather, how should it be enforced? By making sure all users think twice before ending things?

Regarding your previous comment:

If there is still something ongoing, then it does not matter how much we sync, it will never be enough.
the fact that we quiesced some TXG does not mean we must sync it

Of course writes don't always succeed. But a successful write, whether sync or not, should always be persisted in the absence of a crash. You seem to disagree?

amotin (Member) commented Dec 14, 2024

Was there an urge to help improve ZFS or a rush to inflate/deflate certain metrics?

The PR gained insufficient traction to be reviewed and merged in almost six years. Attacking me after I spent time looking at it is counterproductive. I am not paid to listen to this. But I am open to hearing technical arguments.

why are those links about locking here?

The links are technically very interesting, and also serve as a good example of a conversation on concurrency getting difficult.

I looked through it. I agree that locks with timeouts are a recipe for trouble. But I still see no relation to this topic. The only place where ZFS might use something similar is multi-mount protection, but we are not talking about that here.

there should be no transactions in the quiesced state

How so?

More charades?

it is a question of some process not being stopped

What ZFS mechanism ensures that all writing processes get stopped BEFORE txg_sync_stop() gets called? If no such mechanism exists (since, obviously, ZFS is not an OS), how is this BEFORE/AFTER ordering enforced? Or, rather, how should it be enforced? By making sure all users think twice before ending things?

By unmounting file systems if it is a live export, and by actually killing processes if the system reboots. And within ZFS itself:

        /*
         * If we have set the spa_final_txg, we have already performed the
         * tasks below in spa_export_common(). We should not redo it here since
         * we delay the final TXGs beyond what spa_final_txg is set at.
         */
        if (spa->spa_final_txg == UINT64_MAX) {
                /*
                 * If the log space map feature is enabled and the pool is
                 * getting exported (but not destroyed), we want to spend some
                 * time flushing as many metaslabs as we can in an attempt to
                 * destroy log space maps and save import time.
                 */
                if (spa_should_flush_logs_on_unload(spa))
                        spa_unload_log_sm_flush_all(spa);

                /*
                 * Stop async tasks.
                 */
                spa_async_suspend(spa);

                if (spa->spa_root_vdev) {
                        vdev_t *root_vdev = spa->spa_root_vdev;
                        vdev_initialize_stop_all(root_vdev,
                            VDEV_INITIALIZE_ACTIVE);
                        vdev_trim_stop_all(root_vdev, VDEV_TRIM_ACTIVE);
                        vdev_autotrim_stop_all(spa);
                        vdev_rebuild_stop_all(spa);
                        l2arc_spa_rebuild_stop(spa);
                }
        }

        /*
         * Stop syncing.
         */
        if (spa->spa_sync_on) {
                txg_sync_stop(spa->spa_dsl_pool);
                spa->spa_sync_on = B_FALSE;
        }

Regarding your previous comment:

If there is still something ongoing, then it does not matter how much we sync, it will never be enough.
the fact that we quiesced some TXG does not mean we must sync it

Of course writes don't always succeed. But a successful write, whether sync or not, should always be persisted in the absence of a crash. You seem to disagree?

I said nothing about successful or not. You cannot export the pool without unmounting, and unmounting means no new writes.

xwcal (Author) commented Dec 16, 2024

Frankly, whether this PR gets merged or not is no longer my concern. What is concerning is the attitude towards a potential problem of data loss.

Try as you may to unsay what you said:

If there is still something ongoing, then it does not matter how much we sync, it will never be enough.
the fact that we quiesced some TXG does not mean we must sync it

which has the obvious interpretation that it's no big deal to lose some data. Sorry but this is unacceptable to me. If a write succeeds, I want it to be persisted in the absence of a crash. Period.

Your empirical argument might be correct. But the spa_unload source code in your quote says nothing about unmounting. Maybe you'd like to spend more time traversing the call hierarchy and userland logic until you find definitive proof of the unmount -> export ordering. The moral of the story in the "locking" links should be obvious: when you have more than one thread/process, the ordering of events becomes highly nontrivial. Don't assume any ordering unless you can prove it.

And there is something called zvol. I'd like to hear your thoughts on that in the context of the current debate.

Let me also point out that contrary to what you said, this (complex and arguably low priority) PR has in fact been reviewed quite a few times and got an assignee (not you). Some had no difficulty understanding it. None of the others gratuitously closed it because they "failed to see" the problem.

robn (Member) commented Dec 16, 2024

If a write succeeds, I want it to be persisted in the absence of a crash. Period.

Easy to say. However, you seem to have defined "succeeded" as "quiesced", which doesn't make sense to me. As far as OpenZFS is concerned, the transition from open->quiesce->sync is an implementation detail.

As a user program, you can only rely on a write being fully committed to disk if you have explicitly taken steps to require it, for example, calling fsync() after a successful write(), setting O_SYNC when opening the file, or setting sync=always on the dataset. If these are in place, then the relevant calls will not even return to userspace until and unless the data is on disk one way or another (either through a ZIL write, or through the txg_sync completing the transaction that covers the write).
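
As a concrete illustration of that distinction (the path and data are made up), a successful write() by itself only means the data reached the DMU or page cache; it is the fsync() that carries the on-disk promise. Opening with O_SYNC, or setting sync=always on the dataset, moves that promise into the write() call itself.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        const char buf[] = "important record\n";
        int fd = open("/tank/data/app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0) {
                perror("open");
                return (1);
        }
        if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf)) {
                perror("write");        /* "success" here is not durability */
                return (1);
        }
        if (fsync(fd) != 0) {           /* this is the durability guarantee */
                perror("fsync");
                return (1);
        }
        (void) close(fd);
        return (0);
}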

If those "sync write" facilities are what you mean by "succeeded", then you are claiming that OpenZFS is not fulfilling this contract, which is a strong claim and will require evidence that I don't see here. If you meant something softer, like any successful call to write() should require the data is on disk, then your expectations are beyond those provided by any POSIX-like filesystem (and you can have this anyway, with sync=always).

(Since you mentioned zvol, block devices have syncing writes available as well. There is no difference there).

If we take a requirement for crash-persistence off the table, then we can consider if this patch is worth anything. I am ambivalent. If we can do "just one more sync" at export or other shutdown, without causing problems elsewhere, then maybe it's worth it for four lines of patch. But also, I don't care very much, because any software relying on writes to be on disk without the filesystem explicitly promising that, has a bug.

xwcal (Author) commented Dec 17, 2024

@robn I am not going to respond to the other points since:

  1. I am probably not smart enough to change your opinion that we need to sync all writes so as not to lose any data.

  2. You seem to be speaking on behalf of @amotin, exactly when he got silent on zvol. If there is some group thing going on then I'd better quit early since I certainly cannot handle a group.

But I do want to respond to one point you made:

you seem to have defined "succeeded" as "quiesced", which doesn't make sense to me. As far as OpenZFS is concerned, the transition from open->quiesce->sync is an implementation detail.

Let's look at the source code.

zfs_write (in zfs_vnops.c) calls dmu_tx_commit (in dmu_tx.c) before reporting a successful write. The latter calls txg_rele_to_sync (in txg.c), which tells txg_quiesce in a different thread "I am done" through a tc_lock and tc_cv as explained in the comment above the definition of struct tx_cpu in txg_impl.h:

 * The tx_cpu contains two locks, the tc_lock and tc_open_lock.
 * The tc_lock is used to protect all members of the tx_cpu structure with
 * the exception of the tc_open_lock. This lock should only be held for a
 * short period of time, typically when updating the value of tc_count.
 *
 * The tc_open_lock protects the tx_open_txg member of the tx_state structure.
 * This lock is used to ensure that transactions are only assigned into
 * the current open transaction group. In order to move the current open
 * transaction group to the quiesce phase, the txg_quiesce thread must
 * grab all tc_open_locks, increment the tx_open_txg, and drop the locks.

In other words, "succeeded", assuming zfs_write runs until completion without context switches, does precede "quiesced".
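
Here is a minimal userspace model of just that hold/release handshake (not OpenZFS code; the names echo tc_lock/tc_count/tc_cv from the comment above). The commit, after which the caller is told the write succeeded, is what allows the quiesce of that txg to finish, i.e. the data is in the txg before it is quiesced.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t tc_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t tc_cv = PTHREAD_COND_INITIALIZER;
static int tc_count;            /* writers still holding the open txg */

static void *
writer(void *arg)
{
        (void)arg;
        usleep(10 * 1000);      /* "zfs_write": copy data, dirty buffers, ... */

        /* "dmu_tx_commit" -> "txg_rele_to_sync": drop the hold, wake the quiescer. */
        pthread_mutex_lock(&tc_lock);
        if (--tc_count == 0)
                pthread_cond_broadcast(&tc_cv);
        pthread_mutex_unlock(&tc_lock);
        return NULL;            /* zfs_write would report success only after this point */
}

int
main(void)
{
        pthread_t w;

        /* The hold is taken when the tx is assigned, before quiescing can begin. */
        pthread_mutex_lock(&tc_lock);
        tc_count++;
        pthread_mutex_unlock(&tc_lock);
        pthread_create(&w, NULL, writer, NULL);

        /* "txg_quiesce": wait until every writer has committed into this txg. */
        pthread_mutex_lock(&tc_lock);
        while (tc_count != 0)
                pthread_cond_wait(&tc_cv, &tc_lock);
        pthread_mutex_unlock(&tc_lock);
        printf("txg quiesced: every successful commit happened before this point\n");

        pthread_join(w, NULL);
        return 0;
}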

Hope it makes some sense now. Of course, if you see something in the source code that tells a different story, please let me know.

Also, I never claimed ZFS is losing sync writes (which in fact does happen when the ZIL needs to be discarded for some reason). To avoid the straw man, let me repeat my statement above:

a successful write, whether sync or not, should always be persisted in the absence of a crash.

amotin (Member) commented Dec 17, 2024

@robn I am not going to respond to the other points since:

1. I am probably not smart enough to change your opinion that we need to sync all writes so as not to lose any data.

2. You seem to be speaking on behalf of @amotin, exactly when he got silent on zvol. If there is some group thing going on then I'd better quit early since I certainly cannot handle a group.

@xwcal You are probably not smart enough to understand that not attacking everybody around you would make people nicer and more helpful in response. Rob and I live on opposite sides of the globe, work for different companies, and have no relations outside this project. I haven't responded because I thought Rob had said enough, but apparently not.

In other words, "succeeded", assuming zfs_write runs until completion without context switches, does precede "quiesced".

Hope it makes some sense now. Of course, if you see something in the source code that tells a different story, please let me know.

OK. Let me tell you a story. You come into a grocery store after closing time and keep bringing more and more goods to the cashier. For a while the cashier will be polite and run more and more transactions. But at some point he will excuse himself, close the cash register, and go home. After that, the store guard will let you safely park your shopping cart and may even let you visit a restroom. But he won't sell you anything. He will either escort you out before closing the door, or call the police if you won't cooperate.

So in case it is not clear yet: the cashier here is the sync thread, while the quiesce thread is merely a guard who makes sure that customers get to the cashier in order and then leave. The quiesce thread just makes sure that all threads with transactions assigned to a certain transaction group have committed them and won't cause any churn during the sync process. It does not give any guarantees. As I said above, txg_sync_stop() calls txg_wait_synced(dp, tx->tx_open_txg + TXG_DEFER_SIZE), which forcefully syncs two transaction groups on top of the three already in progress (open/quiescing/syncing) and waits for the completions. It is quite a polite cashier, isn't it? After that both the quiesce and sync threads exit, and the rest is the duty of the police.

And if you are still curious about ZVOLs, then same as file system must be unmounted, ZVOL devices must be closed, or ZFS won't let you export the pool.

robn (Member) commented Dec 17, 2024

I am probably not smart enough to change your opinion that we need to sync all writes so as not to lose any data.

We need to do whatever is necessary to satisfy the API guarantees we make to user applications. If we have not promised that the data is definitely on disk, then the user application has no right to expect that.

In those terms, what is your claim?

Look at it another way. You've come in aggressively asserting that there is a data loss scenario here. I take that claim seriously. When asked, you do not present anything to support that claim.

Your original patch does nothing to rectify any meaningful data loss scenario, because none of these mechanisms have anything to do with userspace durability expectations.

Now, if you're not trying to claim a data loss scenario (which you called a strawman above), then you will need to clarify this:

a successful write, whether sync or not, should always be persisted in the absence of a crash.

A successful write, by definition, is one where the application has received an explicit guarantee that the data is on disk and will be available after a crash. There is no other way.

I'm totally prepared to believe that there's some curious concurrency issue in the tx locking, and we can certainly talk about that. First I need you to either show that this issue does lead to unexpected data loss on the part of a user application, or take "data loss" off the table. It's a big claim, one that we take seriously, and so far you haven't shown that there's an issue, and when challenged, you get combative. I am not interested in that kind of discussion.

I will not be commenting here again until and unless more information is provided.

xwcal (Author) commented Dec 18, 2024

@amotin I am certainly not smart enough for many things. BUT, I am smart enough to tell when something is too coincidental not to arouse suspicion that it is more than a mere coincidence. I am smart enough to question why someone clearly uninterested in a technical issue would waste their time hovering over it unless they were motivated by something non-technical. And I am smart enough not to jump onto a PR assigned to someone else and close it because I can't tell what it's for and I am too lazy to check.

Yeah it's been years but I still remember how txg works. The poor grocery store analogy wouldn't help anyone even if it were less emotionally charged. (When the emotion is over, you might gain a better understanding if you read txg.c again.)

Speaking of "smartness" I feel I am getting dumber as this conversation goes on. I guess I am either getting too much help or too much gaslighting. Either way, I am going to stop after this final comment.

As before, you clearly don't intend to provide any proof to back up your empirical argument:

then same as file system must be unmounted, ZVOL devices must be closed, or ZFS won't let you export the pool.

No need to look at 100 functions scattered across 10 files and called from different threads in and outside of kernel space. If you believe something is true, it must be true, right? (I'd like to believe ZFS has zero bugs too.) BUT if anyone cares, so that an innocent bystander trying to decide who to believe on the advanced things doesn't get more confused about the basic things (they might already be thinking they need to figure out a way to unmount a zvol before exporting their pool), explain the discrepancy between this snippet from man write:

RETURN VALUE

       On success, the number of bytes written is returned.  On error,
       -1 is returned, and errno is set to indicate the error.

       Note that a successful write() may transfer fewer than count
       bytes.  [...]

and the statement by @robn:

A successful write, by definition, is one where the application has received an explicit guarantee that the data is on disk and will be available after a crash.

amotin (Member) commented Dec 18, 2024

The poor grocery store analogy wouldn't help anyone even if it were less emotionally charged.

My analogy might be poor, but you are still ignoring what you've been told again and again.

As before, you clearly don't intend to provide any proof to back up your empirical argument:

then same as file system must be unmounted, ZVOL devices must be closed, or ZFS won't let you export the pool.

What proof do you want? For what empirical argument? Just open any file on a pool and you won't be able to export it on Linux, or you would need to force it on FreeBSD, which allows forceful unmount:

root@truenas[~]# zpool import -R /mnt zzz
root@truenas[~]# cd /mnt/zzz
root@truenas[/mnt/zzz]# zpool export zzz        
cannot unmount '/mnt/zzz': pool or dataset is busy
root@truenas[/mnt/zzz]# zpool export -f zzz 
cannot unmount '/mnt/zzz': pool or dataset is busy
root@truenas[/mnt/zzz]# cd /                    
root@truenas[/]# zpool export zzz    
root@truenas[/]# 

Just open any zvol device and pool export will hang until the device is closed (I actually consider this a bug, especially on FreeBSD, which is able to properly destroy an open device, but that is a different topic):

root@truenas[/]# zpool import -R /mnt zzz
root@truenas[/]# cat >/dev/zvol/zzz/vol &
[1] 8507
root@truenas[/]# 
[1]  + suspended (tty input)  cat > /dev/zvol/zzz/vol
root@truenas[/]# zpool export zzz        
[1]  + killed     cat > /dev/zvol/zzz/vol
root@truenas[/]# 

Brian Behlendorf, whom you are quoting, actually told you: "The way txg_sync_stop() is called today when exporting the quiesced txg which might be lost will always be empty." Rob Norris just told you: "Your original patch does nothing to rectify any meaningful data loss scenario". I am repeating to you again and again that by the time this code is called there should be no (valid) pool activity. So please, show any real evidence of a problem or forever hold your peace. I am out of here.
