
Warning message is confusing when pod logs cannot be retrieved #3711

Closed
jiezhang opened this issue May 8, 2020 · 27 comments · Fixed by #3848
Labels
area/frontend, good first issue, help wanted (the community is welcome to contribute), kind/bug, lifecycle/stale (the issue/PR is stale; any activity removes this label), priority/p1, status/triaged (the issue has been explicitly triaged)

Comments

@jiezhang

jiezhang commented May 8, 2020

What steps did you take:

After the pod finishes successfully and is later reclaimed, the following warning is displayed. The message often confuses first-time users and suggests that they check the troubleshooting guide.

Warning: failed to retrieve pod logs. Possible reasons include cluster autoscaling or pod preemption

[screenshot: warning banner]

What happened:

In fact, the logs can be viewed in Stackdriver Kubernetes Monitoring.

What did you expect to happen:

Remove the warning message.

/kind bug
/area frontend

@Bobgy
Contributor

Bobgy commented May 15, 2020

Thanks for the suggestion!
Sounds reasonable to me.

We can show the troubleshooting guide only when there is an error, and not for a warning.

@Bobgy added the help wanted and status/triaged labels May 15, 2020
@Bobgy Bobgy self-assigned this May 15, 2020
@jonasdebeukelaer
Contributor

happy to fix this
/assign @jonasdebeukelaer

@jonasdebeukelaer
Contributor

Oh wait, is this already done? i.e. just removing the 'troubleshooting guide' link?

@Bobgy
Contributor

Bobgy commented May 25, 2020

@jonasdebeukelaer Thanks for offering help!
This still needs to be done.

Some helpful information for contribution:

  1. frontend contribution guide: https://github.com/kubeflow/pipelines/tree/master/frontend
  2. Banner component (that shows the troubleshooting link): https://github.com/kubeflow/pipelines/blob/master/frontend/src/components/Banner.tsx
  3. Run Details Page's log viewer tab's banner:

My suggested UX would be to hide the troubleshooting link when the banner is a warning (and still show it when the banner is an error), but take a look and decide whether that feels reasonable to you. The component already supports hiding the link for ad-hoc usages: https://github.com/kubeflow/pipelines/blob/master/frontend/src/components/Banner.tsx#L72, so we can also configure it dynamically where it's used.
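The suggestion above could be sketched roughly as follows. This is a hedged, minimal sketch, not the actual Banner.tsx implementation: the `BannerMode` type, `shouldShowTroubleshootingLink` helper, and the `showTroubleshootingGuideLink` prop name are illustrative assumptions.

```typescript
// Sketch only: decide whether a banner should render the troubleshooting
// link based on its severity mode, with an explicit per-usage override.
type BannerMode = 'error' | 'warning' | 'info';

interface BannerProps {
  mode: BannerMode;
  message: string;
  // Explicit override for ad-hoc usages, as the real component already allows.
  showTroubleshootingGuideLink?: boolean;
}

function shouldShowTroubleshootingLink(props: BannerProps): boolean {
  // An explicit override always wins.
  if (props.showTroubleshootingGuideLink !== undefined) {
    return props.showTroubleshootingGuideLink;
  }
  // Default: only errors link to the troubleshooting guide; warnings
  // (e.g. "failed to retrieve pod logs") stay informational.
  return props.mode === 'error';
}
```

With this default, the reclaimed-pod warning would no longer point users at the troubleshooting guide, while genuine errors still would.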

@jiezhang
Author

@Bobgy @jonasdebeukelaer I wonder if it is okay to remove the message completely, or at least lower the level to informational (without the exclamation mark and "Warning" prefix).

@Bobgy
Contributor

Bobgy commented Jul 31, 2020

/reopen
I just tested on 1.0.0 and the problem still exists.

Now it shows "failed to retrieve pod logs" with this error message:

Error response: Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].

We didn't account for the case where the workflow is also missing.
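The "[object Object]" in the message above is what you get when a non-Error value is interpolated into a string. A minimal, hypothetical sketch of the kind of guard that avoids it (`serializeError` is an invented helper name, not KFP's actual code):

```typescript
// Sketch only: turn an unknown thrown value into a readable string so the
// banner never renders "[object Object]".
function serializeError(err: unknown): string {
  if (err instanceof Error) {
    return err.message;
  }
  if (typeof err === 'string') {
    return err;
  }
  try {
    // Plain objects (e.g. a parsed error response) get JSON-serialized.
    return JSON.stringify(err);
  } catch {
    // Fall back for circular structures and other non-serializable values.
    return String(err);
  }
}
```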

@k8s-ci-robot
Contributor

@Bobgy: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Jul 31, 2020
@Bobgy
Contributor

Bobgy commented Jul 31, 2020

@jonasdebeukelaer do you want to revisit this?
Or I can follow up too

@jonasdebeukelaer
Contributor

Hey @Bobgy, this should be a quick one to fix, so I'm happy to do it. In what situations can a workflow be missing?

@Bobgy
Contributor

Bobgy commented Aug 26, 2020

When the user configures a TTL to GC workflows. (We have a default TTL of 1 day.)

In fact, the workflow status should be persisted in the run details DB rows, so the UI shouldn't need to fetch the workflow.
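That fallback could look roughly like this. All names here (`Run`, `getRun`, `getLiveWorkflow`, `getWorkflowManifest`) are invented for illustration and are not the actual KFP frontend API; the persisted-manifest field mirrors the shape of the run details response.

```typescript
// Sketch only: prefer the workflow manifest persisted with the run (which
// survives Argo's TTL-based GC) and only fall back to the live workflow.
interface Run {
  pipeline_runtime?: { workflow_manifest?: string };
}

async function getWorkflowManifest(
  getRun: (id: string) => Promise<Run>,
  getLiveWorkflow: (id: string) => Promise<string>,
  runId: string,
): Promise<string | undefined> {
  const run = await getRun(runId);
  const persisted = run.pipeline_runtime?.workflow_manifest;
  if (persisted) {
    return persisted; // persisted in the run details DB row
  }
  try {
    return await getLiveWorkflow(runId); // may fail after TTL GC
  } catch {
    return undefined; // caller shows an informational banner, not an error
  }
}
```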

@jonasdebeukelaer
Contributor

hmm makes sense 👍

@Ark-kun
Contributor

Ark-kun commented Oct 12, 2020

@Bobgy The logs are already persisted to the storage the same way as other artifacts. AFAIK, @eterna2 added support to show these logs in the UX when the pod is not available, but this option is turned off by default. Maybe you can enable this option?

@Bobgy
Contributor

Bobgy commented Oct 12, 2020

@Ark-kun this bug: #3711 (comment) must be fixed before logs can be reused from archive.

@haydnkeung

@Ark-kun How do you enable the option?

@ConverJens
Contributor

@Ark-kun @Bobgy Is there any update to this? How do you enable log persistence?

@Bobgy
Contributor

Bobgy commented Jan 14, 2021

No update yet; we need someone from the community to fix this problem.

For us, we are on GCP, and Stackdriver automatically persists all Kubernetes pod logs.

@ConverJens
Contributor

@haydnkeung I managed to enable logs persistence.

Check the configmap workflow-controller-configmap and see if archiveLogs: true is set. For me it wasn't, even though I'm on KF 1.1, and I had to set it in the config-map.yaml found in your manifest dir under argo/base.
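For reference, the setting described above lives in Argo's workflow-controller configmap. A minimal sketch, assuming the standard Argo 2.x layout (namespace and surrounding repository settings will differ per install):

```yaml
# Sketch of workflow-controller-configmap: archiveLogs tells Argo to ship
# each pod's main-container logs to the configured artifact repository.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: kubeflow
data:
  config: |
    artifactRepository:
      archiveLogs: true
```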

@stale

stale bot commented Jun 3, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the lifecycle/stale label Jun 3, 2021
@ra312

ra312 commented Jul 1, 2021

Dear @ConverJens, did you restart ml-pipeline-ui (the Kubeflow UI) after editing the configmap?

The stale bot removed the lifecycle/stale label Jul 1, 2021
@ConverJens
Contributor

@ra312 I actually redeployed all ml-pipeline components apart from MinIO and MySQL, so I don't know which restart is required. However, I don't think the UI has anything to do with this; rather, it's the API server that needs restarting. The UI simply picks up the logs like any other artifact.

@ra312

ra312 commented Jul 13, 2021

Thanks, @ConverJens! I will try to do the same.

@rohitgujral

@ConverJens @ra312 I'm also trying to persist the pipeline pod logs, so that if a pod gets deleted the logs are still available in the pipeline runs.
I added archiveLogs: true to the argo/base config-map and restarted the pipelines deployment, but after deleting the pod I'm still not seeing logs.

[screenshot: empty logs tab]

Is there any other step that needs to be done?
Kubeflow version: 1.0.2, Argo version: 2.3.0

@ConverJens
Contributor

@rohitgujral The logs tab is only populated while the pod still exists. Once the pod is removed, the complete logs are available as a tar.gz under artifacts instead, called main-logs.tar.gz I think.

Note that while the logs tab always holds the full log, the artifact can lose the final part if your component crashes. I believe this happens because the logs are not fully flushed in some instances; in that case, only the logs up to the point of the error are available.

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the lifecycle/stale label Mar 2, 2022
@ra312

ra312 commented Apr 1, 2022

Closing, since the issue has been reported as fixed in #3848.

@ra312

ra312 commented Apr 1, 2022

/close

@google-oss-prow

@ra312: Closing this issue.

In response to this:

/close

