Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deadlock between mm_sem and tx assign in zfs_write() and page fault #7939

Merged
merged 6 commits into from
Oct 16, 2018
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions include/spl/sys/uio.h
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@ typedef struct uio {
int uio_iovcnt;
offset_t uio_loffset;
uio_seg_t uio_segflg;
boolean_t uio_fault_disable;
uint16_t uio_fmode;
uint16_t uio_extflg;
offset_t uio_limit;
Expand Down
21 changes: 19 additions & 2 deletions module/zcommon/zfs_uio.c
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@
#include <sys/sysmacros.h>
#include <sys/strings.h>
#include <linux/kmap_compat.h>
#include <linux/uaccess.h>

/*
* Move "n" bytes at byte address "p"; "rw" indicates the direction
Expand Down Expand Up @@ -79,8 +80,24 @@ uiomove_iov(void *p, size_t n, enum uio_rw rw, struct uio *uio)
if (copy_to_user(iov->iov_base+skip, p, cnt))
return (EFAULT);
} else {
if (copy_from_user(p, iov->iov_base+skip, cnt))
return (EFAULT);
if (uio->uio_fault_disable) {
if (!access_ok(VERIFY_READ,
(iov->iov_base + skip), cnt)) {
return (EFAULT);
behlendorf marked this conversation as resolved.
Show resolved Hide resolved
}

pagefault_disable();
if (__copy_from_user_inatomic(p,
(iov->iov_base + skip), cnt)) {
pagefault_enable();
return (EFAULT);
behlendorf marked this conversation as resolved.
Show resolved Hide resolved
behlendorf marked this conversation as resolved.
Show resolved Hide resolved
}
pagefault_enable();
} else {
if (copy_from_user(p,
(iov->iov_base + skip), cnt))
return (EFAULT);
}
}
break;
case UIO_SYSSPACE:
Expand Down
20 changes: 19 additions & 1 deletion module/zfs/zfs_vnops.c
Original file line number Diff line number Diff line change
Expand Up @@ -809,8 +809,17 @@ zfs_write(struct inode *ip, uio_t *uio, int ioflag, cred_t *cr)
ssize_t tx_bytes;
if (abuf == NULL) {
tx_bytes = uio->uio_resid;
uio->uio_fault_disable = B_TRUE;
error = dmu_write_uio_dbuf(sa_get_db(zp->z_sa_hdl),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what we can do is commit this tx, or unassign this tx.

It sounds like you investigated committing the empty tx and decided against it. Can you explain why. It looks like it would be relatively straight forward to slightly rework the existing if (tx_bytes == 0) {} case to continue on EFAULT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just thought if looks werid and not straight forward. I you think committing a empty txg is a better option, I'm okay with that. I will make the change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I find committing the empty txg to be clearer. Although, I can see the argument for adding a reassign function to handle some cases like this. Let's get @ahrens thoughts on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to commit the empty tx and then go back to around to where the "Start a transaction." comment is. From your previous conversation, it sound like dmu_write_uio_dbuf() will fail infrequently, so performance is not a concern here.

uio, nbytes, tx);
if (error == EFAULT) {
dmu_tx_commit(tx);
behlendorf marked this conversation as resolved.
Show resolved Hide resolved
uio_prefaultpages(MIN(n, max_blksz), uio);
continue;
} else if (error != 0) {
dmu_tx_abort(tx);
break;
wgqimut marked this conversation as resolved.
Show resolved Hide resolved
}
tx_bytes -= uio->uio_resid;
} else {
tx_bytes = nbytes;
Expand Down Expand Up @@ -4636,13 +4645,22 @@ zfs_dirty_inode(struct inode *ip, int flags)
}
#endif

top:
tx = dmu_tx_create(zfsvfs->z_os);

dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
zfs_sa_upgrade_txholds(tx, zp);

error = dmu_tx_assign(tx, TXG_WAIT);
boolean_t waited = B_FALSE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this to the beginning of the function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is not just a stylistic suggestion; if we goto top, we want waited to be TRUE

error = dmu_tx_assign(tx,
waited ? (TXG_NOTHROTTLE | TXG_WAIT) : TXG_NOWAIT);
Copy link
Member

@ahrens ahrens Oct 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more idiomatic, as well as easier to understand, if we do the same thing as other similar code, i.e.:

	error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
	if (error) {
		if (error == ERESTART) {
			waited = B_TRUE;
			dmu_tx_wait(tx);
			dmu_tx_abort(tx);
			goto top;
		}
		dmu_tx_abort(tx);
		ZFS_EXIT(zfsvfs);
		return (error);
	}

If you search for NOTHROTTLE in this file, you'll see 8 other cases that do it this way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an extra complication which isn't obvious when reading this code in isolation which is why it was written a little differently. Despite the fact that this function allows an error to be returned, it's called from zpl_dirty_inode() which is a Linux VFS callback functions (.dirty_inode) which must always succeed.

        void (*dirty_inode) (struct inode *, int flags);

This is why the existing code used TXG_WAIT since it is handled slightly differently in dmu_tx_try_assign().

       if (spa_suspended(spa)) {
                ...
                if (spa_get_failmode(spa) == ZIO_FAILURE_MODE_CONTINUE &&
                    !(txg_how & TXG_WAIT))
                        return (SET_ERROR(EIO));

                return (SET_ERROR(ERESTART));
        }

If we don't use TXG_WAIT it's possible the time update could be dropped and there's no way to report it. Perhaps we should instead add a comment explaining why this one is different.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, please add a comment explaining why we need to use TXG_WAIT. And I think we would need to use TXG_WAIT every time, not just if the first NOWAIT call fails.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I still don't understand how this works. The first time through, we call dmu_tx_assign(TXG_NOWAIT). If the pool is suspended, it will return EIO (per the code @behlendorf quoted above). zfs_dirty_inode() will then goto out and return EIO. According to the comment you added, this would be incorrect behavior, because it "must always succeed".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, we should always assign by TXG_NOTHROTTLE | TXG_WAIT, then dmu_tx_assign can only return ERESTART, and we try again and agian until we assign successfully.

Copy link
Contributor Author

@wgqimut wgqimut Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (tx->tx_dir != NULL && asize != 0) {
		int err = dsl_dir_tempreserve_space(tx->tx_dir, memory,
		    asize, tx->tx_netfree, &tx->tx_tempreserve_cookie, tx);
		if (err != 0)
			return (err);
}

dmu_tx_try_assign may return like EDQUOT or ENOSPC error in this place, this will make dirty inode op failed, if I always retry no matter what error returned by dmu_tx_assign, this may lead to infinite loop, do you have any suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the required semantic is "must always succeed", then your only options are to retry (potentially leading to an infinite loop), or panic (halt operation).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that if you get ENOSPC or EDQUOT (rather than ERESTART), then retrying will only work if some other process happens to free up space. We could consider using dmu_tx_mark_netfree() to greatly reduce the chances of ENOSPC/EDQUOT, by allowing it to use half the slop space, like unlink(). But it's still possible to get ENOSPC, so we'll have to decide how to handle it. See also the comments above dmu_tx_mark_netfree() and dsl_synctask.h:ZFS_SPACE_CHECK_RESERVED.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous comments is not accurate. We have no way to guarantee success of dirty_inode always. EQUOTA/ENOSPACE is not inevitable. After reference the implementation of other filesystems, I think the best effort is retry once and give up.
How do you think?

Copy link
Contributor

@behlendorf behlendorf Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

required semantic is "must always succeed"...

The required semantics are that the inode must be marked dirty so it will be written latter. The code happens to take this opportunity to assign a transaction for the newly dirtied inode. However, the vfs does provide a second .write_inode callback where the transaction assignment could be moved too. It does allow for failures and properly handles them.

From Documentation/filesystems/vfs.txt the required semantics are:

        void (*dirty_inode) (struct inode *, int flags);
        int (*write_inode) (struct inode *, int);

  dirty_inode: this method is called by the VFS to mark an inode dirty.

  write_inode: this method is called when the VFS needs to write an
        inode to disc.  The second parameter indicates whether the write
        should be synchronous or not, not all filesystems check this flag.

Splitting this up in something which has been on the list to investigate, but hasn't been critical since the existing code in practice works well. Getting this right and tested on all the supported kernels might also be tricky.

For the purposes on this PR why don't we leaved this code unchanged and continue to use TXG_WAIT. This won't change the behavior and resolves the deadlock at hand. Then in a follow up PR, if @wgqimut is available he can investigate implementing the .write_inode callback. For frequently dirtied inode there's potentially a significant performance win to be had.

if (error) {
if (error == ERESTART && waited == B_FALSE) {
waited = B_TRUE;
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
goto out;
}
Expand Down