deadlock between mm_sem and tx assign in zfs_write() and page fault #7939
It sounds like you investigated committing the empty tx and decided against it. Can you explain why? It looks like it would be relatively straightforward to slightly rework the existing `if (tx_bytes == 0) {}` case to `continue` on `EFAULT`.
I just thought it looked weird and not straightforward. If you think committing an empty txg is a better option, I'm okay with that. I will make the change.
Personally I find committing the empty txg to be clearer, although I can see the argument for adding a reassign function to handle some cases like this. Let's get @ahrens's thoughts on this.
I'd prefer to commit the empty tx and then go back to around where the "Start a transaction." comment is. From your previous conversation, it sounds like dmu_write_uio_dbuf() will fail infrequently, so performance is not a concern here.
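To make the suggestion above concrete, here is a minimal, self-contained sketch of the "commit the empty tx, then start over" loop. It is not the PR's code: every `sk_` name is a hypothetical stand-in, the stubs only model state, and the sketch assumes that the copy runs with page faults disabled (so it fails with `EFAULT` instead of faulting) and that pages are faulted in only while no transaction is open.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_EFAULT 14	/* stand-in for the kernel's EFAULT */

/* Hypothetical model of the state the write loop cares about. */
struct sk_state {
	bool pages_resident;	/* false: a no-fault copy fails with EFAULT */
	int tx_open;		/* number of currently open transactions */
	int empty_commits;	/* transactions committed without any data */
};

static void sk_tx_begin(struct sk_state *s) { s->tx_open++; }
static void sk_tx_commit(struct sk_state *s) { s->tx_open--; }

/* Copy with page faults disabled: report EFAULT rather than faulting. */
static int sk_copy_nofault(struct sk_state *s, size_t n, size_t *copied)
{
	if (!s->pages_resident) {
		*copied = 0;
		return (SK_EFAULT);
	}
	*copied = n;
	return (0);
}

/* Fault the user pages in; refuse to do so while a tx is still open. */
static int sk_prefault(struct sk_state *s)
{
	if (s->tx_open != 0)
		return (-1);	/* this is the ordering the sketch avoids */
	s->pages_resident = true;
	return (0);
}

static int sk_write(struct sk_state *s, size_t resid, size_t *written)
{
	*written = 0;
	while (resid > 0) {
		size_t done = 0;
		int error;

		/* Start a transaction. */
		sk_tx_begin(s);
		error = sk_copy_nofault(s, resid, &done);
		if (error == SK_EFAULT && done == 0) {
			/* Nothing was copied: commit the empty tx ... */
			sk_tx_commit(s);
			s->empty_commits++;
			/* ... fault the pages in with no tx held, and retry. */
			if (sk_prefault(s) != 0)
				return (SK_EFAULT);
			continue;
		}
		sk_tx_commit(s);
		if (error != 0)
			return (error);
		resid -= done;
		*written += done;
	}
	return (0);
}
```

The retry costs one extra, empty transaction per failed copy, which matches the remark above that performance is not a concern if the failure is infrequent.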
Please move this to the beginning of the function
Note that this is not just a stylistic suggestion; if we `goto top`, we want `waited` to be `TRUE`.
I think it would be more idiomatic, as well as easier to understand, if we do the same thing as other similar code, i.e.:
If you search for NOTHROTTLE in this file, you'll see 8 other cases that do it this way.
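The snippet that followed "i.e.:" is not present in this copy of the thread. As a rough guess at the pattern being referred to, the sketch below shows an `ERESTART` retry loop in which `waited` is declared before the `top:` label, so that it is still true after `goto top`. The `sk_` functions and constants are hypothetical stand-ins with assumed values, not the real DMU API.

```c
#include <stdbool.h>

/* Stand-ins for the ZFS flags and errno; the values are assumptions. */
#define SK_TXG_NOWAIT		0u
#define SK_TXG_NOTHROTTLE	2u
#define SK_ERESTART		85

/* Hypothetical tx that answers ERESTART a fixed number of times. */
struct sk_tx {
	int restarts_left;	/* assigns that will still ask for a retry */
	unsigned last_flags;	/* flags passed to the most recent assign */
};

static int sk_tx_assign(struct sk_tx *tx, unsigned flags)
{
	tx->last_flags = flags;
	if (tx->restarts_left > 0) {
		tx->restarts_left--;
		return (SK_ERESTART);
	}
	return (0);
}

static void sk_tx_wait(struct sk_tx *tx) { (void) tx; }
static void sk_tx_abort(struct sk_tx *tx) { (void) tx; }

/* Returns 0 on success; *attempts counts how often assign was called. */
static int sk_assign_with_retry(struct sk_tx *tx, int *attempts)
{
	/* Declared before the label so the value survives "goto top". */
	bool waited = false;
	int error;

	*attempts = 0;
top:
	/* Real code would create the tx and add its holds here. */
	(*attempts)++;
	error = sk_tx_assign(tx,
	    (waited ? SK_TXG_NOTHROTTLE : 0u) | SK_TXG_NOWAIT);
	if (error != 0) {
		if (error == SK_ERESTART) {
			waited = true;
			sk_tx_wait(tx);
			sk_tx_abort(tx);
			goto top;
		}
		sk_tx_abort(tx);
		return (error);
	}
	return (0);
}
```

If `waited` were initialized after the label instead, every pass through `top` would reset it and the no-throttle flag would never be passed.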
There's an extra complication which isn't obvious when reading this code in isolation, which is why it was written a little differently. Despite the fact that this function allows an error to be returned, it's called from `zpl_dirty_inode()`, which is a Linux VFS callback function (`.dirty_inode`) that must always succeed.
This is why the existing code used `TXG_WAIT`, since it is handled slightly differently in `dmu_tx_try_assign()`.
If we don't use `TXG_WAIT`, it's possible the time update could be dropped and there's no way to report it. Perhaps we should instead add a comment explaining why this one is different.
In that case, please add a comment explaining why we need to use TXG_WAIT. And I think we would need to use TXG_WAIT every time, not just if the first NOWAIT call fails.
I'm sorry, I still don't understand how this works. The first time through, we call `dmu_tx_assign(TXG_NOWAIT)`. If the pool is suspended, it will return EIO (per the code @behlendorf quoted above). zfs_dirty_inode() will then `goto out` and return EIO. According to the comment you added, this would be incorrect behavior, because it "must always succeed".
You are right, we should always assign with `TXG_NOTHROTTLE | TXG_WAIT`; then `dmu_tx_assign` can only return `ERESTART`, and we try again and again until we assign successfully.
`dmu_tx_try_assign` may return an error such as `EDQUOT` or `ENOSPC` here, which would make the `dirty inode` op fail. If I always retry no matter what error `dmu_tx_assign` returns, this may lead to an infinite loop. Do you have any suggestions?
If the required semantic is "must always succeed", then your only options are to retry (potentially leading to an infinite loop), or panic (halt operation).
Note that if you get ENOSPC or EDQUOT (rather than ERESTART), then retrying will only work if some other process happens to free up space. We could consider using dmu_tx_mark_netfree() to greatly reduce the chances of ENOSPC/EDQUOT, by allowing it to use half the slop space, like unlink(). But it's still possible to get ENOSPC, so we'll have to decide how to handle it. See also the comments above dmu_tx_mark_netfree() and dsl_synctask.h:ZFS_SPACE_CHECK_RESERVED.
My previous comment was not accurate. We have no way to guarantee that dirty_inode always succeeds; EDQUOT/ENOSPC cannot be ruled out. After looking at how other filesystems implement this, I think the best we can do is retry once and then give up.
What do you think?
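For illustration only, the "retry once and give up" proposal could look like the sketch below. The `sk_` names are hypothetical and the assign call is a stub that fails a configurable number of times; it is not taken from the PR or from any other filesystem.

```c
#define SK_ENOSPC 28	/* stand-in for the kernel's ENOSPC */

/* Hypothetical assign stub: fails with ENOSPC while *failures_left > 0. */
static int sk_try_assign(int *failures_left)
{
	if (*failures_left > 0) {
		(*failures_left)--;
		return (SK_ENOSPC);
	}
	return (0);
}

/* Best effort: one attempt plus a single retry, then report the error. */
static int sk_dirty_inode_best_effort(int *failures_left)
{
	int error = sk_try_assign(failures_left);

	if (error != 0)
		error = sk_try_assign(failures_left);	/* the single retry */
	return (error);
}
```

A caller that cannot report errors would have to drop the update whenever this returns nonzero, which is the trade-off under discussion in this thread.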
The required semantics are that the inode must be marked dirty so it will be written later. The code happens to take this opportunity to assign a transaction for the newly dirtied inode. However, the VFS does provide a second `.write_inode` callback to which the transaction assignment could be moved. It does allow for failures and properly handles them.
From `Documentation/filesystems/vfs.txt` the required semantics are:
Splitting this up is something which has been on the list to investigate, but it hasn't been critical since the existing code works well in practice. Getting this right and tested on all the supported kernels might also be tricky.
For the purposes of this PR, why don't we leave this code unchanged and continue to use `TXG_WAIT`? This won't change the behavior and resolves the deadlock at hand. Then in a follow-up PR, if @wgqimut is available, he can investigate implementing the `.write_inode` callback. For frequently dirtied inodes there's potentially a significant performance win to be had.