1.7.1 client: data corruption when the client chunked PUT times out #2676

Closed

moscicki opened this issue Jan 7, 2015 · 20 comments

@moscicki
Contributor

moscicki commented Jan 7, 2015

We have seen data corruption that happens when the chunked upload (of the first chunk) times out while talking to the nginx proxy (over SSL; the proxy terminates SSL). It originally happened due to a network problem during smashbox testing, and after some pain I managed to reproduce it manually.

Steps to reproduce (on Red Hat 6):

  • run owncloudcmd 1.7.1 which starts chunked upload (our test files are 30MB)
  • suspend the owncloudcmd process (and all its threads) immediately after it starts PUTting the first chunk, and keep it suspended for a bit more than 60 seconds (see the sketch after this list)
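
A minimal sketch of the suspend/resume sequence, assuming owncloudcmd is the only matching process; the local directory and server URL are placeholders:

owncloudcmd /local/dir https://user@server/owncloud &
sleep 2                              # let the first chunked PUT start
kill -STOP "$(pidof owncloudcmd)"    # SIGSTOP freezes the process and all its threads
sleep 65                             # a bit more than the 60 s proxy timeout
kill -CONT "$(pidof owncloudcmd)"    # resume; the client re-sends the chunks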

In the nginx logs we have records of three requests:

  • first PUT returning 408 after 60 seconds (client timeout) -- it never reached the upstream server
  • two subsequent PUTs (two chunks uploaded) returning 200 -- processed correctly by the upstream server

To check what goes on at the nginx proxy, I configured nginx to keep the temporary files containing the request bodies:

client_body_in_file_only on;              # save each request body to a file and keep it after the request
client_body_in_single_buffer on;
client_body_temp_path /var/spool/nginx/tmp/client_body/;

I then concatenated the two chunked PUT request bodies corresponding to my file:

cat /var/spool/nginx/tmp/client_body/0000000054 /var/spool/nginx/tmp/client_body/0000000055 > /tmp/FULL

The md5sum of /tmp/FULL corresponds exactly to the file saved by the upstream storage server, but it differs from the checksum of the original file that the sync client was uploading.
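
For completeness, the comparison as a sketch (the original file path is a placeholder):

md5sum /tmp/FULL                 # matches the file stored by the upstream server
md5sum /path/to/original-file    # differs, so the body was already corrupt when nginx received it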

Could you please examine the source code for the client timeout condition on chunked PUTs? It is hard to sniff the traffic with SSL enabled. Would it be plausible to add an option that saves the payload of the PUT body into temporary files for debugging?

@moscicki
Contributor Author

moscicki commented Jan 8, 2015

Two more observations:

  • running the exact same sequence of events with curl DOES NOT result in data corruption (the one difference known to us is curl's support for 100-Continue; a replay sketch follows this list)
  • re-running the exact same test with owncloudcmd 1.7.1 against a plain-HTTP nginx endpoint (SSL DISABLED) DOES NOT result in data corruption
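
For reference, a sketch of replaying one chunk with curl, assuming ownCloud's chunking URL scheme (filename-chunking-<transferid>-<chunkcount>-<chunkno>); the credentials, URL, and file names are placeholders:

# curl sends "Expect: 100-continue" for large bodies by default
curl -k -u user:pass -X PUT --data-binary @/tmp/chunk-0 \
    https://server/owncloud/remote.php/webdav/file.dat-chunking-4711-2-0
# suppress 100-continue to mimic the sync client more closely
curl -k -u user:pass -X PUT -H 'Expect:' --data-binary @/tmp/chunk-0 \
    https://server/owncloud/remote.php/webdav/file.dat-chunking-4711-2-0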

@moscicki
Contributor Author

moscicki commented Jan 8, 2015

The corruption pattern (a 16 KB block) in this case is the same one already reported by other users (and by us in the past): #2425 (comment)

This smells like an SSL problem on the client: is it possible that some 16 KB buffers are reused and not properly reinitialized by the underlying Qt layer? (One way to locate the corrupted region is sketched below.)
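
A sketch of locating and sizing the corrupted region (the original file path is a placeholder):

cmp -l /path/to/original-file /tmp/FULL | head -3    # offsets of the first differing bytes
cmp -l /path/to/original-file /tmp/FULL | wc -l      # total differing bytes; ~16K would fit a reused buffer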

@moscicki
Contributor Author

moscicki commented Jan 8, 2015

@ckamm: we have a reproducer for this problem now
@chandon: do you have nginx proxy in front of your server? if yes, which version?

@chandon

chandon commented Jan 8, 2015

@moscicki: I don't have any proxy on my server. I use an Apache server, and ownCloud is served over HTTPS with a self-signed certificate (without mod_proxy, as specified before).

@moscicki
Contributor Author

moscicki commented Jan 8, 2015

@ckamm, @chandon: OK, this is very clear now: it is an ownCloud sync client problem with SSL enabled, not an nginx proxy problem (the curl client does not have the problem either). The symptoms are so similar in both our cases that it is almost certain we are seeing the same problem. Now that we have a clear reproducer, I think the devs can investigate further.

@chandon: if you still have logs, you may check whether you see a 408 return code on PUT requests (or other symptoms of a network timeout during chunked upload).

Independently of that, we are now working on checksumming functionality with the sync client devs, and this will protect us from any form of weird corruption in the future. This case shows very clearly that checksumming is a must, as the entire stack on the data path is too complex and too diverse, and it is not owned by "us" (users or service providers).

@moscicki
Contributor Author

moscicki commented Jan 8, 2015

@dragotin: this is a hardcore corruption case but I think it is very urgent (it will silently corrupt data on shaky networks for all your users).

@dragotin dragotin added this to the 1.8 - UI Enhancements milestone Jan 8, 2015
@moscicki moscicki changed the title data corruption when the client chunked PUT times out -- nginx or 1.7.1 client problem 1.7.1 client: data corruption when the client chunked PUT times out Jan 8, 2015
@guruz
Contributor

guruz commented Jan 8, 2015

@moscicki We're not the only Qt users though... are there similar bug reports, e.g. on http://bugs.qt-project.org?
Can you reproduce this if you force the usage of the old propagator (export OWNCLOUD_USE_LEGACY_JOBS=1; see the sketch below)?
curl might not see it because of the HTTP 100-continue you mentioned?
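
For reference, forcing the old propagator is just the following; the sync arguments are placeholders:

export OWNCLOUD_USE_LEGACY_JOBS=1
owncloudcmd /local/dir https://user@server/owncloud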

@dragotin
Contributor

dragotin commented Jan 8, 2015

@moscicki I agree with all your conclusions.

@danimo, @guruz: can you investigate this on the Qt and SSL side of things? Another thing we need to double-check is the handling of the 408 error condition.

Thanks for the analysis, that's very helpful!

@chandon

chandon commented Jan 8, 2015

@moscicki: in my log files there is no 408 among the PUT requests on ownCloud:

Status code | Request count
201         | 63597
204         | 66004
400         | 68
412         | 258
500         | 255
507         | 5
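
A sketch of how such a tally can be produced, assuming an Apache combined access log where the method is the sixth field and the status code is the ninth:

awk '$6 == "\"PUT" { count[$9]++ } END { for (s in count) print s, count[s] }' access.log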

@moscicki
Contributor Author

moscicki commented Jan 8, 2015

@guruz: I would leave it to the developers to replay the reproducer and fix the case. Thanks!

@ckamm
Contributor

ckamm commented Jan 14, 2015

@moscicki I'm trying to reproduce this, but I haven't been successful yet.

My steps:

  • Use owncloudcmd to upload a 30MB file
  • Suspend immediately when PUT job starts
  • Wait for 408 to appear in nginx logs
  • Resume owncloudcmd, wait for PUT to finish
  • Compare checksum of original file and file on server

Questions:

  • Does it happen every time for you?
  • How important is the timing of the suspend? Will it fail to reproduce if a second passes before the suspend?

I found it odd that I don't see any mention of the first upload's timeout in the owncloudcmd logs. Do you see it there?

@moscicki
Contributor Author

@ckamm: sorry, I was a bit overloaded; I will get back to you on this tomorrow.

@guruz
Contributor

guruz commented Jan 20, 2015

@moscicki Could you tell us which Qt version you are using on the problematic machines? Also, does the issue persist if you use the OWNCLOUD_USE_LEGACY_JOBS export?

@ogoffart
Contributor

@moscicki which OS, and which version of OpenSSL?
Could it be a problem on the server?
Is there anything I can do for this issue?

@moscicki
Contributor Author

@ogoffart, @guruz, @ckamm: sorry, guys, for the delay; I have not yet had time to work on it. I will get back to you on this next week.

It happened on Red Hat 6, with client 1.7.1:

>rpm -qa | grep openssl
openssl-1.0.1e-30.el6_6.5.x86_64

>ldd /opt/qt-4.8/bin/owncloudcmd 
    linux-vdso.so.1 =>  (0x00007fff1dd58000)
    libowncloudsync.so.0 => /opt/qt-4.8/lib64/libowncloudsync.so.0 (0x00007f5e79528000)
    libQtWebKit.so.4 => /opt/qt-4.8/lib64/libQtWebKit.so.4 (0x00007f5e77bd0000)
    libQtXmlPatterns.so.4 => /opt/qt-4.8/lib64/libQtXmlPatterns.so.4 (0x00007f5e77558000)
    libQtGui.so.4 => /opt/qt-4.8/lib64/libQtGui.so.4 (0x00007f5e768a0000)
    libQtDBus.so.4 => /opt/qt-4.8/lib64/libQtDBus.so.4 (0x00007f5e76620000)
    libQtXml.so.4 => /opt/qt-4.8/lib64/libQtXml.so.4 (0x00007f5e763d8000)
    libQtSql.so.4 => /opt/qt-4.8/lib64/libQtSql.so.4 (0x00007f5e76198000)
    libQtNetwork.so.4 => /opt/qt-4.8/lib64/libQtNetwork.so.4 (0x00007f5e75e50000)
    libQtCore.so.4 => /opt/qt-4.8/lib64/libQtCore.so.4 (0x00007f5e75980000)
    libocsync.so.0 => /opt/qt-4.8/lib64/owncloud/libocsync.so.0 (0x00007f5e75760000)
    librt.so.1 => /lib64/librt.so.1 (0x0000003bb2a00000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000003bb2600000)
    libsqlite3.so.0 => /usr/lib64/libsqlite3.so.0 (0x0000003bb2e00000)
    libqtkeychain.so.0 => /opt/qt-4.8/lib64/libqtkeychain.so.0 (0x00007f5e75528000)
    libneon.so.27 => /opt/neon-0.30.0/lib64/libneon.so.27 (0x00007f5e752f8000)
    libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003bb4e00000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f5e75070000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003bb4600000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003bb1e00000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003bb2200000)
    libXrender.so.1 => /usr/lib64/libXrender.so.1 (0x0000003bb9600000)
    libX11.so.6 => /usr/lib64/libX11.so.6 (0x0000003bb9200000)
    libgthread-2.0.so.0 => /lib64/libgthread-2.0.so.0 (0x0000003bb5200000)
    libglib-2.0.so.0 => /lib64/libglib-2.0.so.0 (0x0000003bb3a00000)
    libpng12.so.0 => /usr/lib64/libpng12.so.0 (0x0000003bba600000)
    libz.so.1 => /lib64/libz.so.1 (0x0000003bb3200000)
    libfreetype.so.6 => /usr/lib64/libfreetype.so.6 (0x0000003bb7600000)
    libgobject-2.0.so.0 => /lib64/libgobject-2.0.so.0 (0x0000003bb4a00000)
    libSM.so.6 => /usr/lib64/libSM.so.6 (0x0000003bbca00000)
    libICE.so.6 => /usr/lib64/libICE.so.6 (0x0000003bbce00000)
    libXi.so.6 => /usr/lib64/libXi.so.6 (0x0000003bbc200000)
    libXrandr.so.2 => /usr/lib64/libXrandr.so.2 (0x0000003bbb200000)
    libXfixes.so.3 => /usr/lib64/libXfixes.so.3 (0x0000003bbba00000)
    libXcursor.so.1 => /usr/lib64/libXcursor.so.1 (0x0000003bbb600000)
    libXinerama.so.1 => /usr/lib64/libXinerama.so.1 (0x0000003bbbe00000)
    libfontconfig.so.1 => /usr/lib64/libfontconfig.so.1 (0x0000003bb7e00000)
    libXext.so.6 => /usr/lib64/libXext.so.6 (0x0000003bb5e00000)
    libdbus-1.so.3 => /lib64/libdbus-1.so.3 (0x0000003bb4200000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003bb1a00000)
    libgnutls.so.26 => /usr/lib64/libgnutls.so.26 (0x00007f5e74dc8000)
    libpakchois.so.0 => /usr/lib64/libpakchois.so.0 (0x00007f5e74bc0000)
    libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x0000003bba200000)
    libkrb5.so.3 => /lib64/libkrb5.so.3 (0x0000003bb8e00000)
    libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x0000003bb6e00000)
    libcom_err.so.2 => /lib64/libcom_err.so.2 (0x0000003bb5600000)
    libproxy.so.0 => /usr/lib64/libproxy.so.0 (0x00007f5e749b8000)
    libexpat.so.1 => /lib64/libexpat.so.1 (0x0000003bb8a00000)
    libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x0000003bb8200000)
    libuuid.so.1 => /lib64/libuuid.so.1 (0x0000003bb5a00000)
    libtasn1.so.3 => /usr/lib64/libtasn1.so.3 (0x00007f5e747a8000)
    libgcrypt.so.11 => /lib64/libgcrypt.so.11 (0x00007f5e74530000)
    libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x0000003bb6200000)
    libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x0000003bb6a00000)
    libresolv.so.2 => /lib64/libresolv.so.2 (0x0000003bb3e00000)
    libXau.so.6 => /usr/lib64/libXau.so.6 (0x0000003bbae00000)
    libgpg-error.so.0 => /lib64/libgpg-error.so.0 (0x00007f5e74328000)
    libselinux.so.1 => /lib64/libselinux.so.1 (0x0000003bb3600000)

More next week.

@moscicki
Contributor Author

And:

>rpm -qa | grep owncloud
libowncloudsync0-1.7.1-1.2.x86_64
owncloud-client-l10n-1.7.1-1.2.x86_64
owncloud-client-1.7.1-1.2.x86_64

@guruz
Contributor

guruz commented Feb 3, 2015

.oO(Note to self: The Qt 4.8 might be one reason.. just a guess)

#2558

@ogoffart
Contributor

ogoffart commented Feb 3, 2015

Is there anything else I can do regarding this issue?

@guruz
Contributor

guruz commented Feb 4, 2015

My only remaining idea is for @moscicki to test with OWNCLOUD_USE_LEGACY_JOBS, so we can see whether the bug is

  1. in our QNAM propagator code
  2. in his Qt 4.8

@dragotin Otherwise, is there a way to provide @moscicki with a Qt 5.4 build for his CentOS?

@ogoffart
Contributor

Please open a new issue if there is still a problem.
