TotalPartsExceeded: exceeded total allowed configured MaxUploadParts #4117

Closed
ghost opened this issue Sep 20, 2019 · 6 comments
@ghost

ghost commented Sep 20, 2019

What happened?:
Workload: ~100 GB datums at low frequency (maybe a few at a time). Each datum takes about 4 hours to run.

What happens: The logs show that all datums are processed, but the job progress from pachctl list job shows 0/num_datums marked as finished. The job fails after a few hours with

rpc error: code = Unknown desc = MultipartUpload: upload multipart failed
    upload id: NPlpCsvmjcOlmuk6AO7l_247qIBCMUcbpO5vRRozj_xUclDCzJlVWanx7efNly09rD106D7QfI4STmi7C5SWqImub7vjYYT15YKqqosnr6lk1A1S9OAkzUEqpK5zKhVFVSeBvKRoWPx_N1h7INBoxTJeQjlR_1tBm1tFBruANbs-
caused by: TotalPartsExceeded: exceeded total allowed configured MaxUploadParts (10000). Adjust PartSize to fit in this limit

What you expected to happen?:
I expected the files to be uploaded and merged since the datums were processed successfully.

How to reproduce it (as minimally and precisely as possible)?:
spec:

{
  "pipeline": {
    "name": "mapped_reads_g1"
  },
  "transform": {
    "image": "[redacted].dkr.ecr.us-east-2.amazonaws.com/bioinf/mapping:4",
    "cmd": ["/bin/bash"],
    "stdin": [
             "for fulldatum in /pfs/genomes/*",
             "do",
             "./read_mapping.sh BGISEQ500 /pfs/reference_prepped/human_g1k_v37.fasta.gz 72 ${fulldatum} /pfs/out",
             "done"
    ]
  },
  "datum_tries": 1,
  "input":{
    "cross": [
      {
        "pfs": {
          "repo": "reference_prepped",
          "glob": "/",
	  "branch": "g1k_v37"
        }
      },
      {
        "pfs": {
          "repo": "genomes",
          "glob": "/*",
	  "branch": "master"
        }
      }
    ]
  },
  "parallelism_spec": {
    "constant": 6
  },
  "resource_requests": {
    "cpu": 72,
    "memory": "100G"
  },
  "cache_size": "20G",
  "standby": true,
  "enable_stats": true
}

AWS servers: c5d.18xlarge

The script is just a simple shell script running this tool and
this tool.

Anything else we need to know?:

The error seems to come from the AWS Go SDK:
https://github.com/aws/aws-sdk-go/blob/v1.20.3/service/s3/s3manager/upload.go#L574

See it here and here too

Example stats from one job

{
  "downloadTime": "1824.028572018s",
  "processTime": "10204.350023253s",
  "uploadTime": "409.736244278s",
  "downloadBytes": "93634516006",
  "uploadBytes": "77819727536"
}

Environment?:

  • Kubernetes version (use kubectl version):
    1.15.0
  • Pachyderm CLI and pachd server version (use pachctl version):
    1.9.5
  • Cloud provider (e.g. aws, azure, gke) or local deployment (e.g. minikube vs dockerized k8s):
    aws ec2
  • OS (e.g. from /etc/os-release):
    kope.io kops ami
  • Others:
@gabrielgrant
Contributor

Thanks for the detailed bug filing, @Jamesthegiantpeach

To confirm: is it the input, the output, or both that's in the ~100GB range?

Regardless, it certainly looks like you're right that the AWS SDK requires upping the PartSize setting manually (the default appears to be 5 MB) for uploads of larger files to succeed. This seems relatively straightforward to handle for standard pipelines, since all data is on disk before upload begins, but I'm not sure how we should handle this for spouts. Thoughts @adelelopez ?
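
For reference, here is a minimal sketch (illustrative only, not Pachyderm's actual upload code; the bucket, key, and file path are placeholders) of raising PartSize on an aws-sdk-go v1 s3manager.Uploader:

package main

import (
    "log"
    "os"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
    sess := session.Must(session.NewSession())

    // The defaults are s3manager.DefaultUploadPartSize (5 MiB) and
    // s3manager.MaxUploadParts (10000), which caps a single object at ~52 GB.
    uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
        u.PartSize = 64 * 1024 * 1024 // 64 MiB parts -> up to ~640 GB per object
    })

    f, err := os.Open("/pfs/out/some-large-file") // placeholder path
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    _, err = uploader.Upload(&s3manager.UploadInput{
        Bucket: aws.String("my-bucket"), // placeholder bucket
        Key:    aws.String("some-large-file"),
        Body:   f,
    })
    if err != nil {
        log.Fatal(err)
    }
}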

@ghost
Author

ghost commented Sep 23, 2019

@gabrielgrant

{
  "downloadTime": "1824.028572018s",
  "processTime": "10204.350023253s",
  "uploadTime": "409.736244278s",
  "downloadBytes": "93634516006", This is the input
  "uploadBytes": "77819727536"       This is the output 
}

I am not sure why I am hitting this now. I have uploaded files larger than 100GB with this tool before.

@jdoliner
Member

This has been fixed by #4107; it'll be shipped in 1.9.6.

@xubofei1983

@jdoliner @gabrielgrant
I get this issue too.

“44fbb193200614326d680be667b6e14ec1cfa","data":[{"path":"/552660222","hash":"w2qH0p32RoSRlmU19CFm8gLBPFTDhstY3wQ3oGzzf7k="}],"ts":"2021-02-02T23:41:51.651013533Z","message":"failed processing datum: rpc error: code = Unknown desc = MultipartUpload: upload multipart failed\n\tupload id: k7wdTEYd8W8KxopAiqotOCuwUTWeHBkSWi6T4LC3QZfRsa4Kl5Q828Kr1EuoGC9EPHB4IvMEbrtpNDZFHqymtpr9lzJMyabM29icEtFplG1yzNcV73mjZO54txm6wu3k4WmJmEimZCGPf03pSDBiog--\ncaused by: TotalPartsExceeded: exceeded total allowed S3 limit MaxUploadParts (10000). Adjust PartSize to fit in this limit, retrying in 0s"}”

However, I already set "--max-upload-parts 20000" during installation, and I can see the value in the secret file correctly.

I also exec'd into the Docker container and checked the exported environment variables, and I see:
declare -x MAX_UPLOAD_PARTS="20000"

We are on 1.12.1, AWS EKS.

@dgeorg42
Contributor

dgeorg42 commented Feb 3, 2021

@xubofei1983 - The upper limit for max upload parts is 10,000. If you try to set it higher than that, AWS ignores it and limits you to 10,000. What you need to do is adjust the PartSize, which defaults to 5 MB. By setting the PartSize higher, you'll have fewer parts. You just need to figure out how high you need to set it in order to get below the 10,000 part threshold.
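
For illustration, here is a hypothetical helper (not part of Pachyderm or the SDK) that computes such a part size: at least ceil(objectSize / 10,000), clamped to S3's 5 MiB minimum:

package main

import "fmt"

const (
    maxUploadParts = 10000           // hard S3 limit on parts per multipart upload
    minPartSize    = 5 * 1024 * 1024 // S3's minimum part size (except for the last part)
)

// requiredPartSize returns the smallest part size that keeps an object of the
// given size within the 10,000-part limit, never going below the 5 MiB minimum.
func requiredPartSize(objectSize int64) int64 {
    partSize := (objectSize + maxUploadParts - 1) / maxUploadParts // ceiling division
    if partSize < minPartSize {
        partSize = minPartSize
    }
    return partSize
}

func main() {
    // e.g. the ~78 GB uploadBytes figure reported earlier in this thread
    fmt.Println(requiredPartSize(77819727536)) // 7781973 bytes, roughly 7.4 MiB
}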

@xubofei1983

Ah, thanks @dgeorg42. I think this should be better described in the docs:
--max-upload-parts int (rarely set) Set a custom maximum number of upload parts. (default 10000)

This parameter does not seem very useful, then.

Two more related questions:

  1. Is this actually caused by the total datum size being > 50G (5 MB × 10,000; see the arithmetic check below)?

  2. Would there be any negative (performance) impact from increasing PartSize? If so, maybe I should try reducing the datum size?
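
As an arithmetic check on question 1 (note the limit applies per uploaded object, and the SDK default is 5 MiB rather than 5 MB):

package main

import "fmt"

func main() {
    const defaultPartSize = 5 * 1024 * 1024 // s3manager.DefaultUploadPartSize (5 MiB)
    const maxUploadParts = 10000            // s3manager.MaxUploadParts
    // Largest single object the defaults can handle before TotalPartsExceeded:
    fmt.Println(int64(defaultPartSize) * maxUploadParts) // 52428800000 bytes, about 52.4 GB (48.8 GiB)
}

So at the default part size, any single file larger than roughly 52 GB overflows the 10,000-part cap.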
