OSS InternalError also should retry #13262

Closed
4 tasks done
shuangkun opened this issue Jun 28, 2024 · 5 comments · Fixed by #13263
Labels
area/artifacts · area/upstream · type/bug

Comments

@shuangkun (Member)

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Sometimes OSS is under heavy load and returns an internal error. In this case, retrying resolves the problem.

time="2024-06-20T04:04:59 UTC" level=info msg="Save artifact" artifactName=abcd 
duration=12.543267547s error=
"oss: service returned error: StatusCode=500, ErrorCode=InternalError, ErrorMessage=\"Please contact the server administrator, oss@service.
aliyun.com\", RequestId=6673AA5F82C3D334359A2202, Ec=0001-00000000" 

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

any workflow

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
shuangkun added a commit to shuangkun/argo-workflows that referenced this issue Jun 28, 2024
Co-authored-by: AlbeeSo <[email protected]>
Co-authored-by: shuangkun <[email protected]>
Signed-off-by: shuangkun <[email protected]>
@agilgur5 commented Jun 29, 2024

"oss: service returned error: StatusCode=500, ErrorCode=InternalError, ErrorMessage=\"Please contact the server administrator, oss@service.
aliyun.com\", RequestId=6673AA5F82C3D334359A2202, Ec=0001-00000000"

I don't think retrying this error makes sense -- it literally doesn't instruct the user to do that. If it should be retried, the error returned should be different. If your team can help get that changed upstream, I think that would make more sense

@agilgur5 added the area/artifacts, area/retryStrategy, area/upstream, problem/more information needed, and solution/invalid labels and removed the area/retryStrategy label Jun 29, 2024
@shuangkun (Member, Author)

The official documentation advises retrying on internal errors: https://help-aliyun-com.translate.goog/zh/oss/support/0001-00000000?spm=a2c4g.11186623.0.i0&_x_tr_sl=zh-CN&_x_tr_tl=en&_x_tr_hl=zh-CN&_x_tr_pto=wapp
I found that S3 also has retries for internal errors.
https://github.com/argoproj/argo-workflows/blob/main/workflow/artifacts/s3/errors.go#L19
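
For reference, here is a sketch of what an OSS counterpart to that S3 helper might look like, built on the aliyun-oss-go-sdk's oss.ServiceError type; the function name isTransientOSSErr and the exact code list are assumptions, not necessarily what the linked fix (#13263) implements:

    package oss

    import (
        "errors"

        "github.com/aliyun/aliyun-oss-go-sdk/oss"
    )

    // Error codes the Alibaba Cloud docs describe as retryable; InternalError
    // is documented as "The server is busy. Please try again." (assumed list)
    var ossTransientErrorCodes = []string{"InternalError", "RequestTimeout"}

    // isTransientOSSErr reports whether err is an OSS service error whose
    // code appears in the retryable list above.
    func isTransientOSSErr(err error) bool {
        var serr oss.ServiceError
        if errors.As(err, &serr) {
            for _, code := range ossTransientErrorCodes {
                if serr.Code == code {
                    return true
                }
            }
        }
        return false
    }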

@agilgur5 removed the problem/more information needed and solution/invalid labels Jul 1, 2024
@agilgur5 commented Jul 1, 2024

Thanks for providing the links! I guess you're right then, per the docs and S3 equivalent.

I still don't think upstream should return a 500 in this case, but if they say to retry, I guess we should. If you could get the responsible team to change that to a 503 with an error message that says to retry, that would help clear things up

@agilgur5 commented Jul 1, 2024

https://api.alibabacloud.com/error-code/Oss/7127?spm=api-workbench-intl.API%20Document.0.0.7ceecc12csPzzU

Indeed says:

The server is busy. Please try again.

Should really be a 503 though 🤔

@agilgur5 added this to the v3.5.x patches milestone Jul 1, 2024
@shuangkun (Member, Author)

Thanks, I will try to communicate with the OSS team.

agilgur5 pushed a commit that referenced this issue Jul 6, 2024
Signed-off-by: shuangkun <[email protected]>
Co-authored-by: AlbeeSo <[email protected]>
(cherry picked from commit 77b5732)