deadlock between mm_sem and tx assign in zfs_write() and page fault #7939

Merged: 6 commits merged into openzfs:master on Oct 16, 2018

Conversation

Contributor

@wgqimut wgqimut commented Sep 21, 2018

The bug time sequence:

  1. Thread #1: zfs_write assigns a txg "n".
  2. In the same process, thread #2 takes an mmap page fault (which means mm_sem
    is held); zfs_dirty_inode fails to open a txg and waits for the previous
    txg "n" to complete.
  3. Thread #1 calls uiomove to write; however, a page fault occurs inside
    uiomove, which means it needs mm_sem. But mm_sem is held by
    thread #2, so the copy is stuck and can't complete, and txg "n" will never complete.

So thread #1 and thread #2 are deadlocked.
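In code form, the inversion looks roughly like this (an illustrative sketch, not actual ZFS source):

	/* Thread #1: write(2) path */
	zfs_write()
	    dmu_tx_assign(tx, TXG_WAIT)    /* joins the open txg "n" */
	    uiomove(...)                   /* page fault: sleeps waiting for mm_sem */
	    dmu_tx_commit(tx)              /* never reached, so txg "n" cannot close */

	/* Thread #2: mmap page-fault path, mm_sem already held */
	zfs_dirty_inode()
	    dmu_tx_assign(tx, TXG_WAIT)    /* waits for txg "n" to complete, which
	                                      requires thread #1's commit: deadlock */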

Signed-off-by: Grady Wong [email protected]

Motivation and Context

#7512

Description

How Has This Been Tested?

Testing was done based on zfs-0.7.9:
Regression testing: fsstress, ztest
Unit test program: limit system memory to 1.5 GB and try to simulate the race condition between file write and mmap access in one process. One thread keeps writing 1 byte of each page of file A, while the other thread keeps modifying 10 bytes of one mmaped page of file B. This increases the page-fault rate inside uiomove().
The unit test program is attached below.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <err.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>


#define BUF_SIZE (128 * 1024)
#define MAP_THREAD 1

long get_tick()
{
	struct timespec now;
	if (clock_gettime(CLOCK_MONOTONIC, &now))
		return 0;
	return (now.tv_sec * 1000000000L + now.tv_nsec);	/* monotonic nanoseconds */
}


static char *rand_string(char *str, size_t size)
{
	const char charset[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
	size_t n = 0;

	if (size) {
		srand(get_tick());
		--size;
		for (n = 0; n < size; n++) {
			int key = rand() % (int) (sizeof charset - 1);
			str[n] = charset[key];
		}
		str[size] = '\0';
	}
	return str;
}

void *write_map_file_thread(void *data)
{
	int fd = -1;
	int ret = 0;
	char *buf = NULL;
	int page_size = getpagesize();
	int op_errno = 0;
	char *path = data;
	char map_write_file[4] = {0, };
	char file_path[255] = {0, };

	rand_string(map_write_file, 4);
	sprintf(file_path, "%s/%s", path, map_write_file);
	while (1) {
		ret = access(file_path, F_OK);
		if (ret) {
			op_errno = errno;
			if (op_errno == ENOENT) {
				fd = open(file_path, O_RDWR | O_CREAT, 0777);
				if (fd == -1) {
					err(1, "open file failed");
				}

				ret = ftruncate(fd, page_size);
				if (ret == -1) {
					err(1, "truncate file failed");
				}
			} else {
				err(1, "access file failed!");
			}
		} else {
			fd = open(file_path, O_RDWR, 0777);
			if (fd == -1) {
				err(1, "open file failed");
			}
		}

		buf = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (buf == MAP_FAILED) {
			err(1, "map file failed");
		}
		printf("map file %s\n", file_path);

		if (fd != -1)
			close(fd);

		char s[10];
		rand_string(s, sizeof (s));	/* fill the buffer so defined data is written */
		memcpy(buf, s, sizeof (s));
		printf("write %s to mapfile %s\n", s, file_path);
		ret = munmap(buf, page_size);
		if (ret != 0) {
			err(1, "unmap file failed");
		}
		printf("unmap file %s\n", file_path);
	}
}

void *write_file_thread(void *data)
{
	char *path = data;
	char file_name[4] = {0, };
	char file_path[255] = {0, };
	int fd = -1;
	ssize_t write_num = 0;
	int page_size = getpagesize();
	off_t offset = 0;

	rand_string(file_name, 4);
	sprintf(file_path, "%s/%s", path, file_name);

	fd = open(file_path, O_RDWR | O_CREAT, 0777);
	if (fd == -1) {
		err(1, "failed to open %s", file_path);
	}

	char *buf = malloc(1);
	while (1) {
		write_num = write(fd, buf, 1);
		if (write_num <= 0) {	/* write(2) returns -1 on error */
			err(1, "write failed!");
		}
		offset = lseek(fd, page_size, SEEK_CUR);	/* advance to the next page */
	}
}

int main(int argc, char *argv[])
{
	int i = 0;
	pthread_t write_file_th;
	pthread_t write_file_th2;
	pthread_t map_file_th[MAP_THREAD];

	if (argc != 2) {
		errx(1, "usage: %s <directory>", argv[0]);
	}

	pthread_create(&write_file_th, NULL, write_file_thread, argv[1]);
	pthread_create(&write_file_th2, NULL, write_file_thread, argv[1]);

	for (i = 0; i < MAP_THREAD; i++) {
		pthread_create(&map_file_th[i], NULL, write_map_file_thread, argv[1]);
	}

	pthread_join(write_file_th, NULL);
	pthread_join(write_file_th2, NULL);
	return (0);
}

I didn't find an easy way to reproduce the problem consistently, but I no longer hit the problem in my tests so far with this fix. If you have any good ideas for reproducing the problem reliably, please shed some light.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@wgqimut wgqimut force-pushed the master branch 4 times, most recently from 7372bd7 to 629bcae on September 21, 2018 11:07
@behlendorf
Contributor

@wgqimut thanks for doing the leg work on this lock inversion. I agree with your analysis of the issue, but let me suggest an alternate solution. Rather than disabling page faulting in copy_from_user(), which I believe we want to keep enabled for performance reasons, what if we instead set TXG_NOWAIT and then bypassed the write throttle in zfs_dirty_inode()? This would ensure we do not block waiting on the next TXG while holding the mmap_sem.

Normally in a system call context we'd want to drop the offending locks and then try everything again. Since that's not an option here, and because a common caller is filemap_page_mkwrite, I don't think we'd want to block here even if we could do it safely.

What do you think about this proposed change instead: behlendorf/zfs@f7a900f. It works as expected for me locally with your reproducer (thanks for that!), but I'd really like to know how it fares in your environment.

@javenwu
Contributor

javenwu commented Sep 22, 2018

@behlendorf Thanks for your quick reply and review. Actually, we considered a way similar to your proposed fix at the very beginning. Our concern is that bypassing throttle checking for zfs_dirty_inode might have a potential risk of making a huge txg with too many dirty pages in highly frequent mmap access scenarios.
For programs that heavily depend on mapped file operations, it's very possible to dirty too many pages in one transaction group. That fundamentally breaks the intention of the txg write-throttle design.

On the other hand, I don't see where the obvious performance penalty is if we disable page faults during copy_from_user in the zfs_write() context. There is a prefault before txg assign for every write iteration inside zfs_write(); this prefault exists to avoid page faults in uiomove(), so most pages should already have been faulted in before uiomove() runs. Txg reassign and prefault retry therefore ONLY happen when a prefaulted page has been evicted.

Copying data from userspace with page faults disabled is not something we invented; I found that other filesystems do the same in their write() contexts, for example btrfs, ext4, fuse, and even generic_perform_write():
------%<----
pagefault_disable();
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
pagefault_enable();
------%<------
I noticed the quoted code is referenced by several filesystems, which is why I assume disabling page faults during uiomove in zfs_write() is not a big performance issue, and that's why we propose this fix.
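The surrounding pattern in those filesystems is roughly the following (an illustrative sketch built only from the kernel helpers quoted above, not any one filesystem's exact code): prefault the pages while no transaction is held, copy with page faults disabled, and retry on a short copy instead of sleeping inside the copy:

	retry:
		/* prefault the user pages while no tx/locks are held */
		fault_in_pages_readable(ubuf, bytes);

		pagefault_disable();
		left = __copy_from_user_inatomic(kbuf, ubuf, bytes);
		pagefault_enable();

		if (left != 0)
			goto retry;	/* a prefaulted page was evicted; fault it back in */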

We are trying to fix the issue without breaking the zfs txg write throttle.
If you have any more concerns, please let us know; your advice is always important to us.

@wgqimut
Contributor Author

wgqimut commented Sep 22, 2018

@behlendorf

Rather than disabling page faulting in copy_from_user(), which I believe we want to keep enabled for performance reasons

uio_prefaultpages decreases the page-fault rate in copy_from_user; even when I disabled uio_prefaultpages while running my test, EFAULT was rarely triggered, so I am not sure I understand the performance reason you mentioned.
As for bypassing the write throttle, I agree with @javenwu.

bypassing throttle checking for zfs_dirty_inode might have a potential risk of making a huge txg with too many dirty pages in highly frequent mmap access scenarios.

In addition, my test program cannot reproduce the original hang effectively or consistently: sometimes the hang triggers in minutes, sometimes it takes weeks to reproduce. But my fixed version has been under test for about 2 weeks now, so far so good. Do you have any good ideas for the testing?

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Sep 24, 2018
@behlendorf
Contributor

Our concern is that bypassing throttle checking for zfs_dirty_inode might have a potential risk of making a huge txg with too many dirty pages in highly frequent mmap access scenarios.

That would be my major concern as well, in addition to potentially introducing the possibility for an inode update to be dropped.

Regarding any performance penalty, I have to say that was more of a gut feeling on my part. If you have performance data that says otherwise, I'm happy to withdraw that concern.

Given the above, I think the best solution here may be to use parts of both PRs. What do you think about something like this:

  1. Use pagefault_disable() around the copy as you've done. Since uiomove() is already ZFS specific, and only used in a few places, it would be reasonable to add a new flag to request that page faults be disabled rather than introducing a new function.

  2. In zfs_dirty_inode() we should switch to using a variant of TXG_NOTHROTTLE plus an explicit dmu_tx_wait(). This was a change made to the other zfs_* operations (and overlooked here at the time) to ensure that we do not get starved out of multiple TXGs when hitting the write throttle. In this case, since we never want to allow a failure, perhaps something like:

        waited = B_FALSE;

top:
        error = dmu_tx_assign(tx, waited ? (TXG_NOTHROTTLE | TXG_WAIT) : TXG_NOWAIT);
        if (error) {
                if (error == ERESTART && waited == B_FALSE) {
                        waited = B_TRUE;
                        dmu_tx_wait(tx);
                        dmu_tx_abort(tx);
                        goto top;
                }
                dmu_tx_abort(tx);
                ZFS_EXIT(zfsvfs);
                return (error);
        }
  3. Rather than introduce a new dmu_tx_reassign() which is only used here, it would be nice to rework the code to use dmu_tx_abort() and dmu_tx_assign(), again similar to the other operations.

As for making the test failure more likely, you might try decreasing the zfs_dirty_* module parameters significantly to force tiny but much more frequent TXGs. The idea is that this should make it more likely you'll need to wait for one to complete in zfs_mark_dirty().

@wgqimut
Contributor Author

wgqimut commented Sep 25, 2018

Thanks for the comments. I accept your suggestion and am working on the refinement. After the modification and testing, I will send you the test results.

@behlendorf behlendorf added Status: Revision Needed Changes are required for the PR to be accepted and removed Status: Code Review Needed Ready for review and testing labels Sep 25, 2018
@wgqimut
Contributor Author

wgqimut commented Sep 29, 2018

I have finished my modifications and testing. There are a few things to explain.

1. About dmu_tx_reassign

Rather than introduce a new dmu_tx_reassign() which is only used here. It would be nice to rework the code to use dmu_tx_abort() and dmu_tx_assign(). Again similar to the other operations.

I rechecked the code: a successfully assigned tx can't be destroyed by dmu_tx_abort; all we can do is commit the tx or unassign it. In this case dmu_tx_unassign can't be reused either, because we have already dropped tc->tc_open_lock once the txg was assigned successfully. So I had to introduce the new function dmu_tx_reassign.
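For reference, dmu_tx_abort() asserts that the tx has not been assigned (a sketch from memory; see dmu_tx.c for the exact code):

	void
	dmu_tx_abort(dmu_tx_t *tx)
	{
		ASSERT(tx->tx_txg == 0);	/* only an unassigned tx may be aborted */
		/* ... */
	}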

2. About __copy_from_user_inatomic and copy_from_user

In other filesystems (like btrfs and fuse), pagefault_disable is used together with __copy_from_user_inatomic, so I changed to that.

3. About testing

I disabled the ARC buf path in zfs_write so that iozone writes are processed by dmu_write_uio_dbuf. The result shows that the performance issue does not exist.

Configuration:

Mem: 64G
Arc max: 31G
Zpool status:

  pool: pool1
 state: ONLINE
  scan: none requested
config:

	NAME                                      STATE     READ WRITE CKSUM
	pool1                                     ONLINE       0     0     0
	  raidz2-0                                ONLINE       0     0     0
	    21c57141-dbea-48eb-b32d-320e2a6009fc  ONLINE       0     0     0
	    028a55c7-3722-4988-9823-9599b0750ef5  ONLINE       0     0     0
	    0546a512-27f4-46e7-a4b5-09acc65862c4  ONLINE       0     0     0
	    a991d797-cad8-4873-839e-6180658d4526  ONLINE       0     0     0
	    9a0ca566-449b-4506-9180-fc5f3794ef86  ONLINE       0     0     0
	    45ce1c41-c768-4b35-ab29-66d09d71818b  ONLINE       0     0     0
	  raidz2-1                                ONLINE       0     0     0
	    6593cfb4-113a-4291-a871-95c66cd4cd59  ONLINE       0     0     0
	    3303a051-3065-455c-b201-fe295c84a7b7  ONLINE       0     0     0
	    cbb9b4e6-0244-4be7-8d0b-61f5ce99b8e0  ONLINE       0     0     0
	    91253551-ba8f-4b83-955a-145bb713ab77  ONLINE       0     0     0
	    b04ebacd-4ad2-4647-892e-4cf975e841b2  ONLINE       0     0     0
	    021cb77d-3785-4585-9e75-9f28cf5934da  ONLINE       0     0     0

errors: No known data errors

Test cmd: iozone -s 100G -r 128k -i 0 -f /data/pool1/old.iozone -w

Test result (unit: MB/s):

	         old zfs write   new zfs write   old zfs rewrite   new zfs rewrite
	         747             774             731               772
	         803             837             779               749
	         769             795             765               758
	         786             762             741               747
	         797             776             726               738
	         775             749             714               726
	         763             782             694               728
	         765             743             719               717
	         755             742             717               721
	         768             744             742               705
	Average  773             770             733               736

@behlendorf behlendorf added Status: Code Review Needed Ready for review and testing and removed Status: Revision Needed Changes are required for the PR to be accepted labels Oct 1, 2018
Contributor

@behlendorf behlendorf left a comment

The result shows that the performance issue does not exist.

Excellent, thanks for confirming that.

if (fault_disable) {
pagefault_disable();
if (__copy_from_user_inatomic(p,
iov->iov_base+skip, cnt)) {
Contributor

Switching to __copy_from_user_inatomic will disable the access_ok() checks. Can you point me to where these checks are still being done for the write?

Contributor Author

@wgqimut wgqimut Oct 2, 2018

You are right, I should have checked access_ok() before calling __copy_from_user_inatomic. I found that copy_from_user is never called with page faults disabled by other kernel code; __copy_from_user_inatomic is what can be called with page faults disabled.
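For clarity, the shape the copy path takes with this change, assembled from the diff hunks quoted in this review (a sketch; the PR diff is authoritative):

	if (uio->uio_fault_disable) {
		if (!access_ok(VERIFY_READ, (iov->iov_base + skip), cnt))
			return (EFAULT);
		pagefault_disable();
		if (__copy_from_user_inatomic(p,
		    (iov->iov_base + skip), cnt)) {
			pagefault_enable();
			return (EFAULT);
		}
		pagefault_enable();
	} else {
		if (copy_from_user(p, (iov->iov_base + skip), cnt))
			return (EFAULT);
	}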

tx_bytes = uio->uio_resid;
error = dmu_write_uio_dbuf(sa_get_db(zp->z_sa_hdl),
Contributor

all we can do is commit the tx or unassign it.

It sounds like you investigated committing the empty tx and decided against it. Can you explain why? It looks like it would be relatively straightforward to slightly rework the existing if (tx_bytes == 0) {} case to continue on EFAULT.

Contributor Author

I just thought it looks weird and not straightforward. If you think committing an empty tx is a better option, I'm okay with that. I will make the change.

Contributor

Personally I find committing the empty tx to be clearer, although I can see the argument for adding a reassign function to handle some cases like this. Let's get @ahrens' thoughts on this.

Member

I'd prefer to commit the empty tx and then go back to where the "Start a transaction." comment is. From your previous conversation, it sounds like dmu_write_uio_dbuf() will fail infrequently, so performance is not a concern here.

@@ -79,8 +81,19 @@ uiomove_iov(void *p, size_t n, enum uio_rw rw, struct uio *uio)
if (copy_to_user(iov->iov_base+skip, p, cnt))
return (EFAULT);
} else {
if (copy_from_user(p, iov->iov_base+skip, cnt))
return (EFAULT);
if (fault_disable) {
Contributor

Perhaps we could do this with a flag in the uio_t instead, to avoid changing all these interfaces.
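Something like the following, presumably (a hypothetical sketch; uio_fault_disable is the field name that shows up in the later diff hunks):

	typedef struct uio {
		struct iovec	*uio_iov;
		/* ... existing fields elided ... */
		boolean_t	uio_fault_disable;	/* disable page faults in uiomove() */
	} uio_t;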

@behlendorf behlendorf requested a review from tuxoko October 1, 2018 21:02
@ahrens ahrens changed the title from "fix write IO hang when mmap dirty_inode waiting this writing commit." to "deadlock between mm_sem and tx assign in zfs_write() and page fault" on Oct 3, 2018
pagefault_enable();
} else {
if (copy_from_user(p,
iov->iov_base+skip, cnt))
Member

cstyle: add spaces around the + operator

@wgqimut wgqimut force-pushed the master branch 3 times, most recently from 498b443 to 4530de4 on October 3, 2018 18:57
@@ -1040,6 +1040,7 @@ dmu_tx_assign(dmu_tx_t *tx, uint64_t txg_how)
return (0);
}


Contributor

nit: extra white space.

if (error == EFAULT) {
uio_prefaultpages(MIN(n, max_blksz), uio);
dmu_tx_commit(tx);
goto top;
Contributor

I think it would be more correct to continue here and go all the way back to the top of the while loop to re-run the quota checks. It's possible we may now be over quota.
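That is, something along these lines (a sketch of the suggested control flow; the quota checks are assumed to sit at the top of the while loop):

	if (error == EFAULT) {
		/* commit the (empty) tx, fault the pages back in, and
		 * restart the iteration so the quota checks re-run */
		dmu_tx_commit(tx);
		uio_prefaultpages(MIN(n, max_blksz), uio);
		continue;
	}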

@wgqimut wgqimut force-pushed the master branch 3 times, most recently from cc157bc to 1dcdf88 on October 5, 2018 17:03
Contributor

@behlendorf behlendorf left a comment

Thanks, the updated PR is looking almost ready to integrate. When addressing this round of review feedback, please go ahead and rebase the PR on the latest code in the master branch; that should resolve the kmemleak failure reported by the Coverage builder.

return (EFAULT);
if (uio->uio_fault_disable) {
if (!access_ok(VERIFY_READ,
(iov->iov_base+skip), cnt)) {
Contributor

nit: please add a space before and after the + for the style checker. You can run make checkstyle locally to run the same checks.


pagefault_disable();
if (__copy_from_user_inatomic(p,
(iov->iov_base+skip), cnt)) {
Contributor

nit: same white space around +.

pagefault_enable();
} else {
if (copy_from_user(p,
(iov->iov_base+skip), cnt))
Contributor

nit: same white space around +.

@behlendorf behlendorf removed the Status: Code Review Needed Ready for review and testing label Oct 12, 2018
error = dmu_tx_assign(tx, TXG_WAIT);
boolean_t waited = B_FALSE;
error = dmu_tx_assign(tx,
waited ? (TXG_NOTHROTTLE | TXG_WAIT) : TXG_NOWAIT);
Member

@ahrens ahrens Oct 12, 2018

I think it would be more idiomatic, as well as easier to understand, if we do the same thing as other similar code, i.e.:

	error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
	if (error) {
		if (error == ERESTART) {
			waited = B_TRUE;
			dmu_tx_wait(tx);
			dmu_tx_abort(tx);
			goto top;
		}
		dmu_tx_abort(tx);
		ZFS_EXIT(zfsvfs);
		return (error);
	}

If you search for NOTHROTTLE in this file, you'll see 8 other cases that do it this way.

Contributor

There's an extra complication which isn't obvious when reading this code in isolation, which is why it was written a little differently. Despite the fact that this function allows an error to be returned, it's called from zpl_dirty_inode(), which is a Linux VFS callback function (.dirty_inode) that must always succeed.

        void (*dirty_inode) (struct inode *, int flags);

This is why the existing code used TXG_WAIT, since it is handled slightly differently in dmu_tx_try_assign():

       if (spa_suspended(spa)) {
                ...
                if (spa_get_failmode(spa) == ZIO_FAILURE_MODE_CONTINUE &&
                    !(txg_how & TXG_WAIT))
                        return (SET_ERROR(EIO));

                return (SET_ERROR(ERESTART));
        }

If we don't use TXG_WAIT, it's possible the time update could be dropped and there's no way to report it. Perhaps we should instead add a comment explaining why this one is different.

Member

In that case, please add a comment explaining why we need to use TXG_WAIT. And I think we would need to use TXG_WAIT every time, not just if the first NOWAIT call fails.

Member

I'm sorry, I still don't understand how this works. The first time through, we call dmu_tx_assign(TXG_NOWAIT). If the pool is suspended, it will return EIO (per the code @behlendorf quoted above). zfs_dirty_inode() will then goto out and return EIO. According to the comment you added, this would be incorrect behavior, because it "must always succeed".

Contributor Author

You are right; we should always assign with TXG_NOTHROTTLE | TXG_WAIT. Then dmu_tx_assign can only return ERESTART, and we try again and again until we assign successfully.

Contributor Author

@wgqimut wgqimut Oct 15, 2018

	if (tx->tx_dir != NULL && asize != 0) {
		int err = dsl_dir_tempreserve_space(tx->tx_dir, memory,
		    asize, tx->tx_netfree, &tx->tx_tempreserve_cookie, tx);
		if (err != 0)
			return (err);
	}

dmu_tx_try_assign may return an error like EDQUOT or ENOSPC here, which would make the dirty-inode op fail. If I always retry no matter what error dmu_tx_assign returns, it may lead to an infinite loop. Do you have any suggestions?

Member

If the required semantic is "must always succeed", then your only options are to retry (potentially leading to an infinite loop), or panic (halt operation).

Member

Note that if you get ENOSPC or EDQUOT (rather than ERESTART), then retrying will only work if some other process happens to free up space. We could consider using dmu_tx_mark_netfree() to greatly reduce the chances of ENOSPC/EDQUOT, by allowing it to use half the slop space, like unlink(). But it's still possible to get ENOSPC, so we'll have to decide how to handle it. See also the comments above dmu_tx_mark_netfree() and dsl_synctask.h:ZFS_SPACE_CHECK_RESERVED.

Contributor Author

My previous comment was not accurate. We have no way to always guarantee success of dirty_inode; EDQUOT/ENOSPC can't be ruled out. After looking at the implementations of other filesystems, I think the best effort is to retry once and then give up.
What do you think?

Contributor

@behlendorf behlendorf Oct 15, 2018

required semantic is "must always succeed"...

The required semantics are that the inode must be marked dirty so it will be written later. The code happens to take this opportunity to assign a transaction for the newly dirtied inode. However, the VFS does provide a second .write_inode callback where the transaction assignment could be moved to; it does allow for failures and properly handles them.

From Documentation/filesystems/vfs.txt the required semantics are:

        void (*dirty_inode) (struct inode *, int flags);
        int (*write_inode) (struct inode *, int);

  dirty_inode: this method is called by the VFS to mark an inode dirty.

  write_inode: this method is called when the VFS needs to write an
        inode to disc.  The second parameter indicates whether the write
        should be synchronous or not, not all filesystems check this flag.

Splitting this up is something that has been on the list to investigate, but it hasn't been critical since the existing code works well in practice. Getting this right and tested on all the supported kernels might also be tricky.

For the purposes of this PR, why don't we leave this code unchanged and continue to use TXG_WAIT. This won't change the behavior and it resolves the deadlock at hand. Then, in a follow-up PR, if @wgqimut is available, he can investigate implementing the .write_inode callback. For frequently dirtied inodes there's potentially a significant performance win to be had.

@ahrens ahrens added Status: Revision Needed Changes are required for the PR to be accepted and removed Status: Accepted Ready to integrate (reviewed, tested) labels Oct 12, 2018
tx = dmu_tx_create(zfsvfs->z_os);

dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
zfs_sa_upgrade_txholds(tx, zp);

error = dmu_tx_assign(tx, TXG_WAIT);
boolean_t waited = B_FALSE;
Contributor

Please move this to the beginning of the function

Member

Note that this is not just a stylistic suggestion; if we goto top, we want waited to be TRUE

@codecov

codecov bot commented Oct 13, 2018

Codecov Report

Merging #7939 into master will decrease coverage by 0.13%.
The diff coverage is 60.86%.


@@            Coverage Diff             @@
##           master    #7939      +/-   ##
==========================================
- Coverage   78.64%    78.5%   -0.14%     
==========================================
  Files         377      377              
  Lines      114333   114232     -101     
==========================================
- Hits        89912    89679     -233     
- Misses      24421    24553     +132
Flag Coverage Δ
#kernel 78.83% <60.86%> (-0.03%) ⬇️
#user 67.42% <ø> (-0.24%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0aa5916...9cc3613.

Contributor

@behlendorf behlendorf left a comment

Thanks, this LGTM. @tuxoko @ahrens, can you give this another look? I'd like to move forward with this to resolve the core issue; the zfs_dirty_inode() improvements can be left for a follow-up PR.

Member

@ahrens ahrens left a comment

Thanks for your perseverance in getting all these last little issues addressed.

@ahrens ahrens added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Revision Needed Changes are required for the PR to be accepted labels Oct 16, 2018
@ahrens
Member

ahrens commented Oct 16, 2018

I'm marking this Accepted, but we need to check the test failures to make sure they are not related. The failures are:

Tests with results other than PASS that are unexpected:
    FAIL cli_root/zfs_set/snapdir_001_pos (expected PASS)
    FAIL cli_root/zfs_set/user_property_002_pos (expected PASS)
    FAIL cli_root/zfs_set/user_property_004_pos (expected PASS)

ztest: aborting test after 600 seconds because the process is overdue for termination.
/usr/sbin/ztest(+0x8cd7)[0x556ffd413cd7]
/lib64/libpthread.so.0(+0x11fb0)[0x7f69059ebfb0]
/lib64/libc.so.6(gsignal+0x10b)[0x7f6905651f4b]
/lib64/libc.so.6(abort+0x12b)[0x7f690563c591]
/usr/sbin/ztest(+0x7be5)[0x556ffd412be5]
/usr/sbin/ztest(+0x9a7d)[0x556ffd414a7d]
/lib64/libpthread.so.0(+0x7564)[0x7f69059e1564]
/lib64/libc.so.6(clone+0x3f)[0x7f690571531f]

spa_open(ztest_opts.zo_pool, &spa, FTAG) == 0 (0x6 == 0)
ASSERT at ztest.c:6918:ztest_run()
/sbin/ztest[0x40969f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f62b5a4e390]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f62b56a8428]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f62b56aa02a]
/sbin/ztest[0x40ba49]
/sbin/ztest[0x4081aa]
/sbin/ztest[0x408f22]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f62b5693830]
/sbin/ztest[0x408f59]
child died with signal 6

@behlendorf
Contributor

These are all known existing issues. This is good to go.

@ahrens ahrens merged commit 779a6c0 into openzfs:master Oct 16, 2018
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this pull request Oct 17, 2018
The bug time sequence:
1. thread #1, `zfs_write` assign a txg "n".
2. In a same process, thread #2, mmap page fault (which means the
   `mm_sem` is hold) occurred, `zfs_dirty_inode` open a txg failed,
   and wait previous txg "n" completed.
3. thread #1 call `uiomove` to write, however page fault is occurred
   in `uiomove`, which means it need `mm_sem`, but `mm_sem` is hold by
   thread #2, so it stuck and can't complete,  then txg "n" will
   not complete.

So thread #1 and thread #2 are deadlocked.

Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Grady Wong <[email protected]>
Closes openzfs#7939
The same commit, with the same message, was subsequently cherry-picked by BrainSlayer (Oct 18, 2018), ghfields (Oct 29, 2018), GregorKopka (Jan 7, 2019), and tonyhutter (Jan 30, Feb 12, and Mar 4, 2019) in commits referencing this pull request.