TotalPartsExceeded: exceeded total allowed configured MaxUploadParts #4117

Closed
ghost opened this issue Sep 20, 2019 · 6 comments
@ghost

ghost commented Sep 20, 2019

What happened?:
Workload: ~100 GB datums at low frequency (maybe a few at a time). Each datum takes about 4 hours to run.

What happens: The logs show that all datums are processed, but the job progress from pachctl list job shows 0/num_datums marked as finished. The job fails after a few hours with

rpc error: code = Unknown desc = MultipartUpload: upload multipart failed
    upload id: NPlpCsvmjcOlmuk6AO7l_247qIBCMUcbpO5vRRozj_xUclDCzJlVWanx7efNly09rD106D7QfI4STmi7C5SWqImub7vjYYT15YKqqosnr6lk1A1S9OAkzUEqpK5zKhVFVSeBvKRoWPx_N1h7INBoxTJeQjlR_1tBm1tFBruANbs-
caused by: TotalPartsExceeded: exceeded total allowed configured MaxUploadParts (10000). Adjust PartSize to fit in this limit

What you expected to happen?:
I expected the files to be uploaded and merged since the datums were processed successfully.

How to reproduce it (as minimally and precisely as possible)?:
spec:

{
  "pipeline": {
    "name": "mapped_reads_g1"
  },
  "transform": {
    "image": "[redacted].dkr.ecr.us-east-2.amazonaws.com/bioinf/mapping:4",
    "cmd": ["/bin/bash"],
    "stdin": [
             "for fulldatum in /pfs/genomes/*",
             "do",
             "./read_mapping.sh BGISEQ500 /pfs/reference_prepped/human_g1k_v37.fasta.gz 72 ${fulldatum} /pfs/out",
             "done"
    ]
  },
  "datum_tries": 1,
  "input":{
    "cross": [
      {
        "pfs": {
          "repo": "reference_prepped",
          "glob": "/",
	  "branch": "g1k_v37"
        }
      },
      {
        "pfs": {
          "repo": "genomes",
          "glob": "/*",
	  "branch": "master"
        }
      }
    ]
  },
  "parallelism_spec": {
    "constant": 6
  },
  "resource_requests": {
    "cpu": 72,
    "memory": "100G"
  },
  "cache_size": "20G",
  "standby": true,
  "enable_stats": true
}

AWS servers: c5d.18xlarge

The script is just a simple shell script running this tool and
this tool.

Anything else we need to know?:

The error seems to come from the AWS Go SDK:
https://github.com/aws/aws-sdk-go/blob/v1.20.3/service/s3/s3manager/upload.go#L574

See it here and here too

Example stats from one job

{
  "downloadTime": "1824.028572018s",
  "processTime": "10204.350023253s",
  "uploadTime": "409.736244278s",
  "downloadBytes": "93634516006",
  "uploadBytes": "77819727536"
}

Environment?:

  • Kubernetes version (use kubectl version):
    1.15.0
  • Pachyderm CLI and pachd server version (use pachctl version):
    1.9.5
  • Cloud provider (e.g. aws, azure, gke) or local deployment (e.g. minikube vs dockerized k8s):
    aws ec2
  • OS (e.g. from /etc/os-release):
    kope.io kops ami
  • Others:
@gabrielgrant
Contributor

Thanks for the detailed bug filing, @Jamesthegiantpeach

To confirm: is it the input, the output, or both that's in the ~100GB range?

Regardless, it certainly looks like you're right that the AWS SDK requires upping the PartSize setting manually (the default appears to be 5 MB) for uploads of larger files to succeed. This seems relatively straightforward to handle for standard pipelines, since all data is on disk before upload begins, but I'm not sure how we should handle this for spouts. Thoughts @adelelopez ?
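
For reference, here is a minimal sketch (illustrative only, not Pachyderm's actual upload code; the bucket, key, and file path are placeholders) of raising PartSize on an aws-sdk-go v1 s3manager.Uploader:

package main

import (
    "log"
    "os"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
    sess := session.Must(session.NewSession())

    // The defaults are s3manager.DefaultUploadPartSize (5 MiB) and
    // s3manager.MaxUploadParts (10000), which caps a single object at ~52 GB.
    uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
        u.PartSize = 64 * 1024 * 1024 // 64 MiB parts -> up to ~640 GB per object
    })

    f, err := os.Open("/pfs/out/some-large-file") // placeholder path
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    _, err = uploader.Upload(&s3manager.UploadInput{
        Bucket: aws.String("my-bucket"), // placeholder bucket
        Key:    aws.String("some-large-file"),
        Body:   f,
    })
    if err != nil {
        log.Fatal(err)
    }
}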

@ghost
Author

ghost commented Sep 23, 2019

@gabrielgrant

{
  "downloadTime": "1824.028572018s",
  "processTime": "10204.350023253s",
  "uploadTime": "409.736244278s",
  "downloadBytes": "93634516006", This is the input
  "uploadBytes": "77819727536"       This is the output 
}

I am not sure why I am hitting this now. I have uploaded files larger than 100GB with this tool before.

@jdoliner
Member

This has been fixed by #4107; it'll be shipped in 1.9.6.

@xubofei1983

@jdoliner @gabrielgrant
I get this issue too.

“44fbb193200614326d680be667b6e14ec1cfa","data":[{"path":"/552660222","hash":"w2qH0p32RoSRlmU19CFm8gLBPFTDhstY3wQ3oGzzf7k="}],"ts":"2021-02-02T23:41:51.651013533Z","message":"failed processing datum: rpc error: code = Unknown desc = MultipartUpload: upload multipart failed\n\tupload id: k7wdTEYd8W8KxopAiqotOCuwUTWeHBkSWi6T4LC3QZfRsa4Kl5Q828Kr1EuoGC9EPHB4IvMEbrtpNDZFHqymtpr9lzJMyabM29icEtFplG1yzNcV73mjZO54txm6wu3k4WmJmEimZCGPf03pSDBiog--\ncaused by: TotalPartsExceeded: exceeded total allowed S3 limit MaxUploadParts (10000). Adjust PartSize to fit in this limit, retrying in 0s"}”

However, I already set "--max-upload-parts 20000" during installation, and I can see the value in the secret file correctly.

I also exec'd into the Docker container and checked the exported environment variables, and I see:
declare -x MAX_UPLOAD_PARTS="20000"

We are on 1.12.1, AWS EKS.

@dgeorg42
Contributor

dgeorg42 commented Feb 3, 2021

@xubofei1983 - The upper limit for max upload parts is 10,000. If you try to set it higher than that, AWS ignores it and limits you to 10,000. What you need to do is adjust the PartSize, which defaults to 5 MB. By setting the PartSize higher, you'll have fewer parts. You just need to figure out how high you need to set it in order to get below the 10,000 part threshold.
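
For illustration, here is a hypothetical helper (not part of Pachyderm or the SDK) that computes such a part size: at least ceil(objectSize / 10,000), clamped to S3's 5 MiB minimum:

package main

import "fmt"

const (
    maxUploadParts = 10000           // hard S3 limit on parts per multipart upload
    minPartSize    = 5 * 1024 * 1024 // S3's minimum part size (except for the last part)
)

// requiredPartSize returns the smallest part size that keeps an object of the
// given size within the 10,000-part limit, never going below the 5 MiB minimum.
func requiredPartSize(objectSize int64) int64 {
    partSize := (objectSize + maxUploadParts - 1) / maxUploadParts // ceiling division
    if partSize < minPartSize {
        partSize = minPartSize
    }
    return partSize
}

func main() {
    // e.g. the ~78 GB uploadBytes figure reported earlier in this thread
    fmt.Println(requiredPartSize(77819727536)) // 7781973 bytes, roughly 7.4 MiB
}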

@xubofei1983

Ah, thanks @dgeorg42. I think this should be better described in the docs:
--max-upload-parts int (rarely set) Set a custom maximum number of upload parts. (default 10000)

This parameter does not seem very useful, then.

Two more related questions:

  1. Is this actually caused by the total datum size being > 50G (5 MB × 10,000; see the arithmetic check below)?

  2. Would there be any negative (performance) impact from increasing PartSize? If so, maybe I should try reducing the datum size?
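
As an arithmetic check on question 1 (note the limit applies per uploaded object, and the SDK default is 5 MiB rather than 5 MB):

package main

import "fmt"

func main() {
    const defaultPartSize = 5 * 1024 * 1024 // s3manager.DefaultUploadPartSize (5 MiB)
    const maxUploadParts = 10000            // s3manager.MaxUploadParts
    // Largest single object the defaults can handle before TotalPartsExceeded:
    fmt.Println(int64(defaultPartSize) * maxUploadParts) // 52428800000 bytes, about 52.4 GB (48.8 GiB)
}

So at the default part size, any single file larger than roughly 52 GB overflows the 10,000-part cap.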
