Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Multiple uploads resulting in CANCELED #15841

Closed
Cyberes opened this issue Jun 26, 2023 · 20 comments
Closed

Multiple uploads resulting in CANCELED #15841

Cyberes opened this issue Jun 26, 2023 · 20 comments
Labels
A-Media-Repository Uploading, downloading images and video, thumbnailing T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@Cyberes
Copy link
Contributor

Cyberes commented Jun 26, 2023

Description

I am attempting to upload two video files (each about 30 MB) to a room at the same time. This always fails with a generic error message popup (The file '[filename]' failed to upload.). There is no error in the console and the network of the debug window shows the fetch request as (canceled). Synapse logs don't show any errors and neither does Openresty. Uploading the files one at a time works, but uploading them together always fails.

  • My upload limit in Synapse and Openresty (reverse proxy) is 40M. This has been an issue for a while and I have not discovered the root cause yet.
  • My client is desktop and web Element on Ubuntu.
  • I have a media worker with an S3 storage provider setup (see config below).

Steps to reproduce

  • ???

Homeserver

matrix.evulid.cc

Synapse Version

1.86.0

Installation Method

pip (from PyPI)

Database

PostgreSQL

Workers

Multiple workers

Platform

Ubuntu Server LXC container

Configuration

media_store_path:                                 "/mnt/matrix-storage/media"

media_storage_providers:
- module:                                         s3_storage_provider.S3StorageProviderBackend
  store_local:                                    True
  store_remote:                                   True
  store_synchronous:                              True
  config:
      bucket:                                     matrix-synapse
      endpoint_url:                               'https://xxx.r2.cloudflarestorage.com/matrix-synapse'
      access_key_id:                              'xxx'
      secret_access_key:                          'xxx'

Relevant log output

No unusual logs.

Anything else that would be useful to know?

No response

@Cyberes
Copy link
Contributor Author

Cyberes commented Jun 26, 2023

Looks like the 30MB file fails when individually uploaded.

@MadLittleMods MadLittleMods added the A-Media-Repository Uploading, downloading images and video, thumbnailing label Jun 28, 2023
@MadLittleMods
Copy link
Contributor

I know you mentioned that there aren't any relevant logs but seems like it could be related to #12937 -> #13009

Are you using Element to upload the file? Element will cancel the request if it doesn't see any progress for 30 seconds (see matrix-org/matrix-js-sdk -> src/http-api/index.ts#L102-L111)

It seems like TCP dumps from the client <-> server would be useful to debug this further.

Other related issues:

@MadLittleMods MadLittleMods added X-Needs-Info This issue is blocked awaiting information from the reporter T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jun 28, 2023
@Cyberes
Copy link
Contributor Author

Cyberes commented Jun 29, 2023

I'm no web developer, but I checked the debugger and saw that I am indeed triggering the timeout. The upload only takes a few seconds but the bar sits at 100% for a bit before the timeout kills it.

I also moved the media store to an SSD and put /tmp on RAM. I don't think it's related to the file moving issue because its only a 30MB file.

MadLittleMods added a commit that referenced this issue Jun 29, 2023
@clokep
Copy link
Member

clokep commented Jun 29, 2023

Is your reverse proxy killing the upload? I think by default they usually have small limits.

@MadLittleMods
Copy link
Contributor

MadLittleMods commented Jun 29, 2023

@Cyberes Does the media end up on your server after the dust settles? (if so, feels like it's one of the issues I've already linked)

@clokep also has an interesting angle with the reverse-proxy.

Otherwise, seems like a case of tracing or adding logs to see where Synapse is stuck at. I've added some extra instrumentation so traces provide a lot more info when using the /upload endpoint -> #15850

Are you able to enable tracing for your Synapse instance (will need wait for the PR to merge or apply those changes manually)? Please share the trace JSON files if you get this going

@Cyberes
Copy link
Contributor Author

Cyberes commented Jun 29, 2023

There isn't anything in the nginx error logs. I have client_max_body_size 40M; in my config.

And find /mnt/matrix-storage/ -name "*[ID returned by /upload]*" does show the file in local_content.

I'll pull that branch and enable tracing.

MadLittleMods added a commit that referenced this issue Jul 5, 2023
Add tracing instrumentation to media `/upload` code paths to investigate #15841
@Cyberes
Copy link
Contributor Author

Cyberes commented Jul 6, 2023

Here's the trace.

evulid.cc media_worker UploadResource.json.txt

From the clients perspective, it was canceled after 33.68s of being in-flight:

Screenshot_2023-07-06_14-41-59

@MadLittleMods MadLittleMods removed the X-Needs-Info This issue is blocked awaiting information from the reporter label Jul 6, 2023
MadLittleMods added a commit that referenced this issue Jul 6, 2023
Follow-up to #15850

Tracing instrumentation to media `/upload` code paths to investigate,
#15841
@MadLittleMods
Copy link
Contributor

MadLittleMods commented Jul 6, 2023

Thanks for trace @Cyberes!

It doesn't appear to be a bug in Synapse given that the file ends up on your server and the trace shows that it's working but just has a long period where it blocks while storing the file (probably the shutil.copyfile part). It's probably just Synapse or your system being slow to store the file (performance issue) and Element doesn't gracefully handle that case.

What does your homeserver.yaml look like (media_store_path, media_storage_providers)?

I've also made the tracing less ambiguous since there are so many shared names that make my head hurt trying to correlate to the code, #15888. Also added an extra trace around the shutil.copyfile part specifically to see if it's that. Post another trace when you get a chance to run with that code 🤞

@Cyberes
Copy link
Contributor Author

Cyberes commented Jul 6, 2023

media_store_path:                                 "/mnt/matrix-storage/media"

media_storage_providers:
- module:                                         s3_storage_provider.S3StorageProviderBackend
  store_local:                                    True
  store_remote:                                   True
  store_synchronous:                              True
  config:
      bucket:                                     matrix-synapse
      endpoint_url:                               'https://xxx.r2.cloudflarestorage.com/matrix-synapse'
      access_key_id:                              'xxx'
      secret_access_key:                          'xxx'

Ah, I forgot about S3! It might be slow to send it to S3.

evulid.cc media_worker UploadResource less ambiguous.json.txt

@MadLittleMods
Copy link
Contributor

It's definitely going to be the S3 storage provider being slow 🎯

I think you could set store_synchronous: False so it doesn't block on it.

@Cyberes
Copy link
Contributor Author

Cyberes commented Jul 6, 2023

I disabled S3 and it looks like the duration is now 312.27ms but Element still says (canceled).

evulid.cc.media_worker.UploadResource no S3.json.txt

@MadLittleMods
Copy link
Contributor

@Cyberes Please post the client-side network request alongside the trace. If the request says (canceled), it's the client who canceled it. But it would be good to see the timing from the outside.

From the trace alone, nothing seems wrong.

The PR to update the tracing also has a few updates if it's easy for you to pull.

@Cyberes
Copy link
Contributor Author

Cyberes commented Jul 6, 2023

I didn't see the new changes, here's the new trace.

evulid.cc.media_worker.UploadResource more.json.txt

vector.har.txt

@Cyberes
Copy link
Contributor Author

Cyberes commented Jul 7, 2023

I might have forgotten to do pip install . after pulling. Here are some more traces of failed uploads, let me know if you need more HARs.

evulid.cc.media_worker.UploadResource with media_storage_providers commented out.json.txt
evulid.cc.media_worker.UploadResource with s3 store_synchronous false.json.txt
evulid.cc.media_worker.UploadResource with s3.json.txt

Oddly, after restarting Synapse with the whole media_storage_providers commented out, things worked for a while. I played with store_synchronous for a bit and while things weren't uploading to the bucket the Element->Synapse uploads worked. But then things stopped working.

@MadLittleMods
Copy link
Contributor

@Cyberes The bulk amount of info isn't helpful. Please just boil it down to the simplest configuration that reproduces the problem.

It looks like the har file isn't valid JSON and I was only really concerned with the request duration as seen from the client.

Just need one Jaeger trace and corresponding request timing on the client (like earlier). The har file is also fine but I don't want to faff around with trying to make it parse.

@Cyberes
Copy link
Contributor Author

Cyberes commented Jul 9, 2023

Okay, wasn't sure what you were looking for. Just stick with evulid.cc.media_worker.UploadResource more.json.txt and the corresponding vector fixed.har.txt

Synapse trace:

Client-side:

Screenshot 2023-07-09 at 15-29-43 HTTP Archive Viewer 2 0 17

MadLittleMods added a commit that referenced this issue Jul 10, 2023
A lot of the functions have the same name in this space like `store_file`,
and we also do it multiple times for different reasons (main media repo,
other storage providers, thumbnails, etc) so it's good to differentiate
them so your head doesn't explode.

Follow-up to #15850

Tracing instrumentation to media `/upload` code paths to investigate #15841
@MadLittleMods
Copy link
Contributor

Thanks for the clarity @Cyberes!

Trace looks good again and seems like we need to move to the client-side layer (Element) to see why it's canceling. Add a bunch of logging around what the matrix-js-sdk -> uploadContent(...) is doing. And this would be better tracked in https://github.com/vector-im/element-web/issues. If it narrows back down to a Synapse specific behavior, feel free to post here again.

@Cyberes
Copy link
Contributor Author

Cyberes commented Jul 11, 2023

I really appreciate the help @MadLittleMods and you going out of your way to help! I've sent the Element team a bug report and I'm sure they'll appreciate the server-side debugging we've done so far. element-hq/element-web#25756

@Cyberes
Copy link
Contributor Author

Cyberes commented Jul 13, 2023

I ran debug logging on the /media endpoint. It seems like Nginx is still receiving data from Element when Element cancels the upload.
element-hq/element-web#25756 (comment)

@Cyberes
Copy link
Contributor Author

Cyberes commented Oct 19, 2023

I think this may be a Cloudflare issue.

I disabled Cloudflare for matrix.evulid.cc and I was able to upload a few dozen files without any (CANCELED) error. I then switched CF back on and got the error again. So it's something about their middleware that is causing something to go wrong.

I've never encountered an issue like this with CF. Do you know of any conflicts that may be occurring between Element -> Cloudflare -> Synapse?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Media-Repository Uploading, downloading images and video, thumbnailing T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

3 participants