Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gathering of resources from last migrated namespaces, gather PVs, StorageClasses (cluster-scoped), Routes, Service (just registry namespaces) #34

Closed
wants to merge 9 commits into from

Conversation

djwhatle
Copy link

@djwhatle djwhatle commented Mar 23, 2021

Closes #32

  • Provides new capability to gather content of namespaces from most recent migration attempt. This is necessary for troubleshooting things like Rsync/Stunnel pods not coming up, Stage Pod issues, PVC attach issues, etc. Secrets are scrubbed out, and only the last 100 log lines are included from Pods in these namespaces.

image

  • Adds gather of StorageClasses, PVs
  • Adds gather of Routes, Service from default and openshift-image-registry namespaces

@djwhatle djwhatle changed the title Add "gather_migrated_namespaces" script, gather PVs, StorageClasses (cluster-scoped), Routes, Service (just registry namespaces) Add gathering of resources from last migrated namespaces, gather PVs, StorageClasses (cluster-scoped), Routes, Service (just registry namespaces) Mar 24, 2021
- Prometheus metrics

**Essential-only gather**

Differences from full gather:
- Logs are only gathered from specified time window
- Skips collection of prometheus metrics, pprof. Removes duplicate logs from payload.
```
```sh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is "sh" intentional here? It won't show up on README I think

Copy link
Author

@djwhatle djwhatle Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so this bit just signals to the markdown renderer that this block of code should be rendered with syntax highlighting for shell scripts.

I believe this is the full list of supported syntax highlighting.

Copy link
Contributor

@pranavgaikwad pranavgaikwad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. Added some questions, feel free to merge away otherwise

@@ -16,6 +16,12 @@ for localns in $(/usr/bin/oc get migrationcontrollers.migration.openshift.io --a
done
echo "Will collect debug info from migclusters [${clusters[@]}]"

# Find the latest migration, plan, and associated namespaces so we can gather additional info from those
latest_migration=$(oc -n ${localns} get migmigration -o json | jq -r '.items|=sort_by(.metadata.creationTimestamp)' | jq -r '.items[-1].metadata.name')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familliar with jq. Does items[-1] fail when there are no migmigrations present?

Copy link
Author

@djwhatle djwhatle Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it results in "null" getting set to the var and a 0 return code. Will try a full must-gather in this scenario and make sure it doesn't crash.

$ oc delete migmigration --all
migmigration.migration.openshift.io "8069a3d0-8c19-11eb-b0bf-5fcc05ad4e8c" deleted
migmigration.migration.openshift.io "b5a018a0-8b1e-11eb-ac4b-3b5c0d147d47" deleted

$ latest_migration=$(oc -n openshift-migration get migmigration -o json | jq -r '.items|=sort_by(.metadata.creationTimestamp)' | jq -r '.items[-1].metadata.name')

# Check return code
$  echo $?
0

# Check var output
$ echo $latest_migration
null

Copy link
Author

@djwhatle djwhatle Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so basically the end-result is that all of the latest_* vars get set to null, and inside the loop that compares current cluster name to latest_cluster, we'll never enter the new block for gathering content of migrated namespaces.

Output from must-gather

[must-gather-n94dc] POD Error from server (NotFound): migmigrations.migration.openshift.io "null" not found
[must-gather-n94dc] POD jq: error (at <stdin>:191): Cannot iterate over null (null)
[must-gather-n94dc] POD [cluster=host][namespace=openshift-migration] Detected MTC installation
[must-gather-n94dc] POD [cluster=host] Not found within latest_migration_clusters=null
[must-gather-n94dc] POD null

So I think this is harmless and works as-is, however I'm open to input on whether we should squash the jq stderr and send it to /dev/null so that the error never shows up.

Copy link
Contributor

@pranavgaikwad pranavgaikwad Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@djwhatle I was under impression that if a script in must-gather returns non-zero exit code, the temporaray must-gather pod exits with non-zero status. But maybe that is incorrect. I think it'd be OK to have stderr printed in this way since this doesn't really break the must-gather container.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct, it's just that the RC is always 0 even when jq doesn't find something.

# Collect all resources in MTC namespaces with must-gather
for ns in ${namespaces[@]}; do
echo "[cluster=${cluster}][namespace=${ns}] Running oc adm inspect"
/usr/bin/oc adm inspect --dest-dir must-gather/clusters/${cluster} --all-namespaces ns/${ns} &
done

# Collect PV and StorageClass info for cluster
echo "[cluster=${cluster}] Running oc adm inspect storageclasses,persistentvolumes"
usr/bin/oc adm inspect --dest-dir must-gather/clusters/${cluster} storageclasses,persistentvolumes &
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, any specific reason for not including PVCs here?

Copy link
Author

@djwhatle djwhatle Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PVCs are namespace scoped resources, so they are gathered by the other invocations of oc adm inspect ns/<some-ns>

# Shorten logs from migrated-namespaces to last 100 lines
echo "Shortening logs collected from migrated namespaces to last 100 lines"
for logfile in $(find must-gather/clusters/*/migrated-namespaces/namespaces/*/*/ -name *.log); do
tail -100 "${logfile}" > tmp.log && mv tmp.log "${logfile}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand where these logs are coming from. Are these logs from the Rsync/Stunnel pods?

Copy link
Author

@djwhatle djwhatle Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is any Pod logs gathered from the migrated-namespaces (e.g. the namespaces that were on migplan.spec.namespaces), so this is Pod logs for Rsync, Stunnel, Stage Pods, but also any other apps running in those namespaces.

The amount of logs that could be in these namespaces is potentially unbounded since this is where user apps run, so I thought it would make sense to only capture the last bits which would probably contain errors for Rsync or Stunnel or whatever. Do you think 100 lines is enough?

I also considered if we should only keep the logs for Rsync, Stunnel, Stage Pods, but since oc adm inspect grabs everything I'd need to know the naming conventions (like a regex) for all of those Pod names so I could delete everything except those. I like the idea of only grabbing our stuff better but it may be more brittle since we're still moving containers around (e.g. current work to combine Rsync and Stunnel)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@djwhatle Agreed. I am unsure about gathering user application logs though. Are we supposed to be doing that at all?

Copy link
Author

@djwhatle djwhatle Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's not a great reason for us to. I think you make a good point, I should refine this further so that it only gathers logs for Pods belonging to us. oc adm inspect was the single-command solution but we can do better with some specialized logic.

This would also make me more comfortable keeping the full logs, which would be more useful.

@djwhatle djwhatle closed this Oct 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gather more information about latest migration attempt
3 participants