-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor generic inode time updating #4916
Conversation
@lorddoskias Edit: see zfs_tstamp_update_setup |
Extending |
fc34576
to
6b1e4bc
Compare
Provided the tests for those 3 commits succeed I'd like to add another one which simplifies the zfs_inode_update_impl byt removing the code necessary to support the "new" variant. So don't merge until then. |
@@ -361,8 +361,10 @@ zpl_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos) | |||
ssize_t wrote; | |||
|
|||
crhold(cr); | |||
spl_inode_lock(file_inode(filp)); | |||
wrote = zpl_write_common(filp->f_mapping->host, buf, len, ppos, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tuxoko @behlendorf Is there any reason we are getting the inode from f_mapping->host and not directly from file_inode? That'd be one dereference and make the code more succinct and clear. In my testing I put an assert for f_inode == f_mapping_host and it never fired, so I assume it's safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backward compat.
Edit: Oh, we actually already have file_inode compat layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indee, and even in the compat layer the inode ref is taken via the dentry and not the f_mapping. Though it shouldn't really matter. So are you happy with me changing the argument to zpl_write_common to utilise file_inode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally it was for backwards compatibility but since this code was originally written we've added a file_inode()
compat wrapper to support all the way back to 2.6.32 kernels (which is the oldest kernel we support). You should be able to safely change this to file_inode()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could still use a file_dentry() compat wrapper, see #4935.
So are you happy with me changing the argument to zpl_write_common to utilise file_inode?
Yup.
Did this fail due to a bug (deadlock) or due to amazon problems? |
@lorddoskias Amazon problems which have been resolved (I hope). Can you rebase this on master. The patch to add edit: I lied, this failed almost certainly due to commit c484597. We shouldn't need to make this change at all. |
* Only read atime from SA if we are newly created inode (or rezget), | ||
* otherwise i_atime might be dirty. | ||
*/ | ||
if (new) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As part of this change you can remove new
from the function arguments (which was recently added) and the zfs_inode_update_new()
function. Or we can leave it as dead code for now and remove it when this function is completely removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, Intended to do this change and add it do this PR but I wanted to clear the other commits first. So, the reason why I locked the zfs_write is that the inode->i_mtime etc attributes are being changed. And just to be on the safe side I wanted to lock the vfs inode while the operations is in progress, otherwise it might be possible to initiate 2 separate operations which might lead to corruption. E.g. a write + truncate (maybe, I'm only speculating)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing range lock will guarantee a consistent ordering as long as you perform the inode updates while it's being held. Which is why zfs_inode_update()
always happens before zfs_range_unlock()
.
d3ca60e
to
8f12781
Compare
Definite progress, but it appears that |
@behlendorf I think I might have found the culprit and it might turned out to be a latent bug in zfs. So when a file is created in a zfs directory here are the situations where the mtime is being updated:
So looking at the stack trace (triggered by the following debug code zfs_tstamp_update_setup) :
As can be seen only the second branch is being triggered, hence the mtime is not being updated. Again looking at the initial backtrace the tstamp_update function is called from zfs_setattr, which has the following code:
So in this case we are in the second branch, where tstamp_update is called with STATE_CHANGED, which is defined to be CTIME. However, in the else clause we are only updating the ctime, and completely ignoring whether ATTR_MTIME is actually being set. And in this case ATTR_MTIME is in fact set. The following patch seems to fix the issue, does it look correct to you:
|
That looks correct to me, this is almost certainly an issue caused by the differences in how illumos and Linux handle these bits. My only suggestion is that something like the following patch may be more readable (untested). diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index 707a211..67ecd2f 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -3024,7 +3024,8 @@ top:
} else if (mask != 0) {
SA_ADD_BULK_ATTR(bulk, count, SA_ZPL_CTIME(zsb), NULL,
&ctime, sizeof (ctime));
- zfs_tstamp_update_setup(zp, STATE_CHANGED, mtime, ctime);
+ zfs_tstamp_update_setup(zp, mask & (ATTR_MTIME | ATTR_CTIME),
+ mtime, ctime);
if (attrzp) {
SA_ADD_BULK_ATTR(xattr_bulk, xattr_count,
SA_ZPL_CTIME(zsb), NULL, pro-tip: Use |
@behlendorf You suggestion works and I will use it. I also prefer this change to be in a separate commit for the sake of git history. Also I started looking a bit closer into zfs_setattr and wonder whether all the code dealing with XVATTR should be removed? Clearly this is an artifact from illumos and the zfs_setattr function is very large and hard to follow, let alone reason for correctness? |
Keeping this change in a separate commit makes sense to me. You're right, the XVATTR code really should go away. None of it makes sense for Linux. What's kept anyone from doing this so far is that the needed changes go much deeper than you might initially expect. And as you rightly guessed we've left it largely for reasons of correctness. That said, I wouldn't be at all opposed to seeing it go and that might be more practical now that we have better test coverage. |
@behlendorf I see, indeed it's going to be tricky. But looking even further I'm really perplexed as to how the mtime stuff should be handled. Since first we have:
Meaning whatever is the value passed from the VFS is going to be persisted on disk. Then in the |
That's troubling. That certainly could cause subtle issues. For instance things like To my knowledge I don't recall any reports of this causing problems but it's wouldn't be a terrible idea to address it. |
@behlendorf Actually I might be wrong. I just tested commenting the code in the "else" branch and always relying on whatever is passed into vap->va_mtime and the test kept failing. So what's in here now seems to be correct. I will squash the commits, rebase and resubmit and if everything is ok I guess this can be merged. |
8f12781
to
07a04ed
Compare
Yup, just make sure you get the cstyle issue too. |
07a04ed
to
a990133
Compare
So the non-simd test seemed to just have timeout. However, I cannot understand the leakage in the SIMD instance. I cannot see how my changes could affect this. Is this just some instance jitter? |
The non-SIMD failure was due to #4034 which is a rare longstanding issue we've seen in testing a few times. The leaking in the SIMD case I think is unrelated too. I've resubmitted both of those builds to see if it's reproducible. |
the time update in setattr is over complicated. It should be as simple as |
There's definitely room for simplifying this. But I think it makes sense to make that kind of change in the context of switching from the illumos vattr to the linux iattr as mentioned above. This change as is LGTM. @tuxoko do you have any remaining concerns about this cleanup? |
@behlendorf |
@tuxoko I tested an implementation that was very similar to what copy_attr was doing w.r.t to time handling and it still was failing the time test from ZFS. Though I agree that logic should be fairly simple. I will have another go at that, hopefully bringing better results. And you are right about your concern of ignoring what's in the iattr when mask != 0 and just using the current time. |
@lorddoskias |
@tuxoko You were right. I just did the code, but this time with the proper inode modification and the test passed. I've pushed a new branch, hopefully this one is the final version. |
b219bf0
to
441cc8f
Compare
@@ -3010,29 +3009,27 @@ zfs_setattr(struct inode *ip, vattr_t *vap, int flags, cred_t *cr) | |||
|
|||
if (mask & ATTR_MTIME) { | |||
ZFS_TIME_ENCODE(&vap->va_mtime, mtime); | |||
ZTOI(zp)->i_mtime = timespec_trunc(vap->va_mtime, | |||
ZTOI(zp)->i_sb->s_time_gran); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[cstyle] 4 space indent for continued lines. The same change should be made to the rest of the patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strance, my cstyle doesn't report anything, that's why I pushed it. Are you using a modified cstyle or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I just noticed while looking at the patch. cstyle just isn't quite smart enough to catch this case. Unlike the Linux kernel style the SunOS style is to simply indent 4 spaces to continue a line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, goot thing vim's "set list" option shows the discrepancies. Fixed in current version but will push once tests are finished so that I get a baseline results
I noticed that zfs_log_setattr doesn't store ctime, and it doesn't bother to do anything with ATTR_CTIME, heck, the original zfs_setattr doesn't bother either. So I guess in order for log replay to work, we need to revert the ctime part. |
@tuxoko You mean just remove the |
@lorddoskias |
That's really surprising, and the replay code here is downright convoluted. It would be so much clearer if |
@behlendorf I thought the same thing as I was reading the code. Any particular reason why this isn't the case? It seems a rather simple fix for the case? Any objections to doing it that way in zfsonlinux? |
Yeah, this would be much cleaner. |
I've no idea was it was originally done this way, this logic predates my involvement with ZFS. I don't have any objection to updating the code for ZoL to do the simpler thing here. |
BTW, I'm kind of curious about the mask in zfs_log_setattr. Is this even OS independent. It sure doesn't look like. |
No, it's definitely not designed to be OS independent. It's just adapted from the original Solaris code. |
Which means a pool with dirty log is not safe to migrate between OS? |
No that should behave consistently across OSs. We interpret the log contents the same way and map them to their Linux counterparts. But as you know a lot of this infrastructure is Solaris specific and is just emulated on Linux (vattr_t, xvattr_t, etc). It's be great to just do the native Linux thing or alternately some more OS independent interface we could all use. |
I'm talking about the mask value. I believe we just use the ATTR_UID and friends defined in Linux without any translation. This surely won't behave correctly when imported in Solaris. |
Oh, I see. You're right, those definitely don't match up and really should be translated. |
ZFS doesn't provide a custom update_time method meaning it delegates this job to the generic VFS layer. The only time when it needs to set the various *time values is when the inode is being marshalled to/from the disk. Do this by moving the relevant code from zfs_inode_update_impl to zfs_node_alloc and zfs_rezget. As a result from this change it is no longer necessary to have multiple versions of the zfs_inode_update function - so just nuke them and leave only one.
441cc8f
to
bd5a37b
Compare
The failures seem irrelevant to the changes. |
@@ -320,6 +320,7 @@ typedef struct { | |||
uint64_t lr_size; /* size to set */ | |||
uint64_t lr_atime[2]; /* access time */ | |||
uint64_t lr_mtime[2]; /* modification time */ | |||
uint64_t lr_ctime[2]; /* change time */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't change the log record structure.
@lorddoskias Please try this way. |
Simplify time handling in zfs_setattr by mimicking the logic in setattr_copy from the linux kernel. In order to achieve this in the case when ZFS' log is being replayed it is necessary to unconditionally set the ctime in zfs_replay_setattr. Also use the timespec_trunc function when assigning values to the generic inode struct. This is currently a noop since zfs sets s_time_gran to 1, however in the future rules about precision might change.
bd5a37b
to
e561a26
Compare
@tuxoko Should be looking much better now. What do you say ? |
@lorddoskias |
That's so much better, LGTM. |
Simplify time handling in zfs_setattr by mimicking the logic in setattr_copy from the linux kernel. In order to achieve this in the case when ZFS' log is being replayed it is necessary to unconditionally set the ctime in zfs_replay_setattr. Also use the timespec_trunc function when assigning values to the generic inode struct. This is currently a noop since zfs sets s_time_gran to 1, however in the future rules about precision might change. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Closes #4916
So here is my initial go at moving more code away from zfs_inode_update_impl. This shouldn't cause any visible behavior changes.