deadlock between mm_sem and tx assign in zfs_write() and page fault #7939
Conversation
Force-pushed from 7372bd7 to 629bcae
@wgqimut thanks for doing the leg work on this lock inversion. I agree with your analysis of the issue, but let me suggest an alternate solution rather than disabling page faulting in `zfs_write()`. Normally in a system call context we'd want to drop the offending locks and then try everything again. That's not an option here, and a common caller is `zpl_dirty_inode()`, which must always succeed. What do you think about this proposed change instead, behlendorf/zfs@f7a900f? It works as expected for me locally with your reproducer (thanks for that!), but I'd really like to know how it fares in your environment.
@behlendorf Thanks for your quick reply and review. Actually, we considered a similar approach to your proposed fix at the very beginning. Our concern is that bypassing the throttle check for `zfs_dirty_inode` has a potential risk of building a huge txg with too many dirty pages in highly frequent mmap access scenarios. On the other hand, I don't see an obvious performance penalty from disabling page faults during `copy_from_user` in the `zfs_write()` context. There is a prefault before the txg assign for every write iteration inside `zfs_write()`; the prefault exists to avoid page faults in `uiomove()`, so most pages should have been faulted in before `uiomove()` runs. The txg reassign and prefault retry therefore ONLY happen when a prefaulted page is evicted. Copying data from user space with page faults disabled is not something we invented; other filesystems do the same in their write() paths, for example btrfs, ext4, fuse, and even `generic_perform_write()`. Our approach tries to fix the issue without breaking the zfs txg write throttle.
In addition, my test program cannot reproduce the original hang effectively and consistently; sometimes the hang triggers in minutes, and sometimes it takes weeks to reproduce in my tests. So far, my fixed version has been tested for about 2 weeks, so far so good. Do you have any good ideas for the testing?
That would be my major concern as well, in addition to potentially introducing the possibility for an inode update to be dropped. Regarding any performance penalty I have to say that was more of a gut feeling on my part. If you have performance data that says otherwise, I'm happy to withdraw that concern. Given the above I think the best solution here may be to use parts of both PRs. What do you think about something like this:
```c
	waited = B_FALSE;
top:
	error = dmu_tx_assign(tx,
	    waited ? (TXG_NOTHROTTLE | TXG_WAIT) : TXG_NOWAIT);
	if (error) {
		if (error == ERESTART && waited == B_FALSE) {
			waited = B_TRUE;
			dmu_tx_wait(tx);
			dmu_tx_abort(tx);
			goto top;
		}
		dmu_tx_abort(tx);
		ZFS_EXIT(zfsvfs);
		return (error);
	}
```
As for making the test failure more likely you might try decreasing the |
Thanks for the comments. I accept your suggestion and am working on the refinement. After the modification and testing, I will send you the test results.
I have done my modification and testing, and there is something to explain. 1. About the performance numbers:
| | old zfs write | new zfs write | old zfs rewrite | new zfs rewrite |
|---|---|---|---|---|
| | 747 | 774 | 731 | 772 |
| | 803 | 837 | 779 | 749 |
| | 769 | 795 | 765 | 758 |
| | 786 | 762 | 741 | 747 |
| | 797 | 776 | 726 | 738 |
| | 775 | 749 | 714 | 726 |
| | 763 | 782 | 694 | 728 |
| | 765 | 743 | 719 | 717 |
| | 755 | 742 | 717 | 721 |
| | 768 | 744 | 742 | 705 |
| Average | 773 | 770 | 733 | 736 |
The performance issue does not exist.
Excellent, thanks for confirming that.
module/zcommon/zfs_uio.c
Outdated
```c
	if (fault_disable) {
		pagefault_disable();
		if (__copy_from_user_inatomic(p,
		    iov->iov_base+skip, cnt)) {
```
Switching to `__copy_from_user_inatomic` will disable the `access_ok()` checks. Can you point me to where these checks are still being done for the write?
You are right, I should have checked `access_ok()` before calling `__copy_from_user_inatomic`. I found that `copy_from_user` is never called with page faults disabled by other kernel code, while `__copy_from_user_inatomic` can be called with page faults disabled.
module/zfs/zfs_vnops.c
Outdated
```c
	tx_bytes = uio->uio_resid;
	error = dmu_write_uio_dbuf(sa_get_db(zp->z_sa_hdl),
```
> what we can do is commit this tx, or unassign this tx.

It sounds like you investigated committing the empty tx and decided against it. Can you explain why? It looks like it would be relatively straightforward to slightly rework the existing `if (tx_bytes == 0) {}` case to `continue` on `EFAULT`.
I just thought it looks weird and not straightforward. If you think committing an empty txg is a better option, I'm okay with that. I will make the change.
Personally I find committing the empty txg to be clearer. Although, I can see the argument for adding a reassign function to handle cases like this. Let's get @ahrens' thoughts on this.
I'd prefer to commit the empty tx and then go back to where the "Start a transaction." comment is. From your previous conversation, it sounds like dmu_write_uio_dbuf() will fail infrequently, so performance is not a concern here.
module/zcommon/zfs_uio.c
Outdated
```c
@@ -79,8 +81,19 @@ uiomove_iov(void *p, size_t n, enum uio_rw rw, struct uio *uio)
		if (copy_to_user(iov->iov_base+skip, p, cnt))
			return (EFAULT);
	} else {
		if (copy_from_user(p, iov->iov_base+skip, cnt))
			return (EFAULT);
		if (fault_disable) {
```
Perhaps we could do this with a flag in the `uio_t` instead to avoid changing all these interfaces.
module/zcommon/zfs_uio.c
Outdated
```c
			pagefault_enable();
		} else {
			if (copy_from_user(p,
			    iov->iov_base+skip, cnt))
```
cstyle: add spaces around the `+` operator
Force-pushed from 498b443 to 4530de4
module/zfs/dmu_tx.c
Outdated
```c
@@ -1040,6 +1040,7 @@ dmu_tx_assign(dmu_tx_t *tx, uint64_t txg_how)
		return (0);
	}
```
nit: extra white space.
module/zfs/zfs_vnops.c
Outdated
```c
	if (error == EFAULT) {
		uio_prefaultpages(MIN(n, max_blksz), uio);
		dmu_tx_commit(tx);
		goto top;
```
I think it would be more correct to `continue` here and go all the way back to the top of the while loop to re-run the quota checks. It's possible we may now be over quota.
Force-pushed from cc157bc to 1dcdf88
Thanks the updated PR is looking almost ready to integrate. When addressing this round of review feedback, please go ahead and rebase the PR on the latest code in the master branch. That should resolve the kmemleak failure reported by the Coverage builder.
module/zcommon/zfs_uio.c
Outdated
```c
				return (EFAULT);
		if (uio->uio_fault_disable) {
			if (!access_ok(VERIFY_READ,
			    (iov->iov_base+skip), cnt)) {
```
nit: please add a space before and after the `+` for the style checker. You can run `make checkstyle` locally to run the same checks.
module/zcommon/zfs_uio.c
Outdated
```c
			pagefault_disable();
			if (__copy_from_user_inatomic(p,
			    (iov->iov_base+skip), cnt)) {
```
nit: same white space around `+`.
module/zcommon/zfs_uio.c
Outdated
```c
			pagefault_enable();
		} else {
			if (copy_from_user(p,
			    (iov->iov_base+skip), cnt))
```
nit: same white space around `+`.
module/zfs/zfs_vnops.c
Outdated
```c
	error = dmu_tx_assign(tx, TXG_WAIT);
	boolean_t waited = B_FALSE;
	error = dmu_tx_assign(tx,
	    waited ? (TXG_NOTHROTTLE | TXG_WAIT) : TXG_NOWAIT);
```
I think it would be more idiomatic, as well as easier to understand, if we do the same thing as other similar code, i.e.:
```c
	error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
	if (error) {
		if (error == ERESTART) {
			waited = B_TRUE;
			dmu_tx_wait(tx);
			dmu_tx_abort(tx);
			goto top;
		}
		dmu_tx_abort(tx);
		ZFS_EXIT(zfsvfs);
		return (error);
	}
```
If you search for NOTHROTTLE in this file, you'll see 8 other cases that do it this way.
There's an extra complication which isn't obvious when reading this code in isolation, which is why it was written a little differently. Despite the fact that this function allows an error to be returned, it's called from `zpl_dirty_inode()`, which is a Linux VFS callback function (`.dirty_inode`) that must always succeed:

```c
void (*dirty_inode) (struct inode *, int flags);
```

This is why the existing code used `TXG_WAIT`, since it is handled slightly differently in `dmu_tx_try_assign()`:
```c
	if (spa_suspended(spa)) {
		...
		if (spa_get_failmode(spa) == ZIO_FAILURE_MODE_CONTINUE &&
		    !(txg_how & TXG_WAIT))
			return (SET_ERROR(EIO));
		return (SET_ERROR(ERESTART));
	}
```
If we don't use `TXG_WAIT` it's possible the time update could be dropped and there's no way to report it. Perhaps we should instead add a comment explaining why this one is different.
In that case, please add a comment explaining why we need to use TXG_WAIT. And I think we would need to use TXG_WAIT every time, not just if the first NOWAIT call fails.
I'm sorry, I still don't understand how this works. The first time through, we call `dmu_tx_assign(TXG_NOWAIT)`. If the pool is suspended, it will return EIO (per the code @behlendorf quoted above). `zfs_dirty_inode()` will then `goto out` and return EIO. According to the comment you added, this would be incorrect behavior, because it "must always succeed".
You are right, we should always assign with `TXG_NOTHROTTLE | TXG_WAIT`; then `dmu_tx_assign` can only return `ERESTART`, and we try again and again until we assign successfully.
```c
	if (tx->tx_dir != NULL && asize != 0) {
		int err = dsl_dir_tempreserve_space(tx->tx_dir, memory,
		    asize, tx->tx_netfree, &tx->tx_tempreserve_cookie, tx);
		if (err != 0)
			return (err);
	}
```
`dmu_tx_try_assign` may return an error like `EDQUOT` or `ENOSPC` at this point, which will make the `dirty_inode` op fail. If I always retry no matter what error `dmu_tx_assign` returns, this may lead to an infinite loop. Do you have any suggestions?
If the required semantic is "must always succeed", then your only options are to retry (potentially leading to an infinite loop), or panic (halt operation).
Note that if you get ENOSPC or EDQUOT (rather than ERESTART), then retrying will only work if some other process happens to free up space. We could consider using dmu_tx_mark_netfree() to greatly reduce the chances of ENOSPC/EDQUOT, by allowing it to use half the slop space, like unlink(). But it's still possible to get ENOSPC, so we'll have to decide how to handle it. See also the comments above dmu_tx_mark_netfree() and dsl_synctask.h:ZFS_SPACE_CHECK_RESERVED.
My previous comment was not accurate. We have no way to guarantee that dirty_inode always succeeds; EDQUOT/ENOSPC cannot be entirely avoided. After looking at the implementations of other filesystems, I think the best effort is to retry once and then give up. What do you think?
> required semantic is "must always succeed"...

The required semantics are that the inode must be marked dirty so it will be written later. The code happens to take this opportunity to assign a transaction for the newly dirtied inode. However, the VFS does provide a second `.write_inode` callback where the transaction assignment could be moved to; it does allow for failures and properly handles them. From `Documentation/filesystems/vfs.txt` the required semantics are:

```c
void (*dirty_inode) (struct inode *, int flags);
int (*write_inode) (struct inode *, int);

dirty_inode: this method is called by the VFS to mark an inode dirty.

write_inode: this method is called when the VFS needs to write an
	inode to disc. The second parameter indicates whether the write
	should be synchronous or not, not all filesystems check this flag.
```

Splitting this up is something which has been on the list to investigate, but it hasn't been critical since the existing code works well in practice. Getting this right and tested on all the supported kernels might also be tricky.

For the purposes of this PR, why don't we leave this code unchanged and continue to use `TXG_WAIT`. This won't change the behavior and resolves the deadlock at hand. Then in a follow-up PR, if @wgqimut is available, he can investigate implementing the `.write_inode` callback. For frequently dirtied inodes there's potentially a significant performance win to be had.
module/zfs/zfs_vnops.c
Outdated
```c
	tx = dmu_tx_create(zfsvfs->z_os);

	dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
	zfs_sa_upgrade_txholds(tx, zp);

	error = dmu_tx_assign(tx, TXG_WAIT);
	boolean_t waited = B_FALSE;
```
Please move this to the beginning of the function
Note that this is not just a stylistic suggestion; if we `goto top`, we want `waited` to be `TRUE`.
Signed-off-by: Grady Wong <[email protected]>
Codecov Report
```
@@            Coverage Diff            @@
##           master    #7939    +/-   ##
==========================================
- Coverage   78.64%    78.5%   -0.14%
==========================================
  Files         377      377
  Lines      114333   114232     -101
==========================================
- Hits        89912    89679     -233
- Misses      24421    24553     +132
```

Continue to review full report at Codecov.
Signed-off-by: Grady Wong <[email protected]>
Signed-off-by: Grady Wong <[email protected]>
Thanks for your perseverance in getting all these last little issues addressed.
I'm marking this Accepted, but we need to check the test failures to make sure they are not related.

These are all known existing issues. This is good to go.
The bug time sequence: 1. thread #1, `zfs_write` assigns a txg "n". 2. In the same process, thread #2 takes an mmap page fault (which means `mm_sem` is held); `zfs_dirty_inode` fails to open a txg and waits for the previous txg "n" to complete. 3. thread #1 calls `uiomove` to write; however, a page fault occurs in `uiomove`, which needs `mm_sem`, but `mm_sem` is held by thread #2, so it gets stuck and can't complete, and txg "n" will not complete. So thread #1 and thread #2 are deadlocked. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Grady Wong <[email protected]> Closes openzfs#7939
The bug time sequence:

1. Thread #1: `zfs_write` assigns a txg "n".
2. In the same process, thread #2 takes an mmap page fault (which means `mm_sem` is held); `zfs_dirty_inode` fails to open a txg and waits for the previous txg "n" to complete.
3. Thread #1 calls `uiomove` to write; however, a page fault occurs in `uiomove`, which needs `mm_sem`. But `mm_sem` is held by thread #2, so thread #1 gets stuck and can't complete, and txg "n" will not complete.

So thread #1 and thread #2 deadlock.
Signed-off-by: Grady Wong [email protected]
Motivation and Context
#7512
Description
How Has This Been Tested?
Testing was done based on zfs-0.7.9:
Regression testing: fstress, ztest
Unit testing program: limit system memory to 1.5GB and try to simulate a race between file write and mmap access in one process. One thread keeps writing 1 byte of each page of file A; the other thread keeps modifying 10 bytes of one mmap'd page of file B. This method increases the page fault rate inside `uiomove()`. The unit test program is attached below.

I didn't find an easy way to reproduce the problem consistently, but I no longer hit the problem in my tests so far with the fix applied. If you have any good ideas for reproducing the problem, please shed some light.
Types of changes
Checklist:
Signed-off-by.