-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add gathering of resources from last migrated namespaces, gather PVs, StorageClasses (cluster-scoped), Routes, Service (just registry namespaces) #34
Conversation
- Prometheus metrics | ||
|
||
**Essential-only gather** | ||
|
||
Differences from full gather: | ||
- Logs are only gathered from specified time window | ||
- Skips collection of prometheus metrics, pprof. Removes duplicate logs from payload. | ||
``` | ||
```sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is "sh" intentional here? It won't show up on README I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, so this bit just signals to the markdown renderer that this block of code should be rendered with syntax highlighting for shell scripts.
I believe this is the full list of supported syntax highlighting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me. Added some questions, feel free to merge away otherwise
@@ -16,6 +16,12 @@ for localns in $(/usr/bin/oc get migrationcontrollers.migration.openshift.io --a | |||
done | |||
echo "Will collect debug info from migclusters [${clusters[@]}]" | |||
|
|||
# Find the latest migration, plan, and associated namespaces so we can gather additional info from those | |||
latest_migration=$(oc -n ${localns} get migmigration -o json | jq -r '.items|=sort_by(.metadata.creationTimestamp)' | jq -r '.items[-1].metadata.name') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familliar with jq. Does items[-1] fail when there are no migmigrations present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it results in "null" getting set to the var and a 0 return code. Will try a full must-gather in this scenario and make sure it doesn't crash.
$ oc delete migmigration --all
migmigration.migration.openshift.io "8069a3d0-8c19-11eb-b0bf-5fcc05ad4e8c" deleted
migmigration.migration.openshift.io "b5a018a0-8b1e-11eb-ac4b-3b5c0d147d47" deleted
$ latest_migration=$(oc -n openshift-migration get migmigration -o json | jq -r '.items|=sort_by(.metadata.creationTimestamp)' | jq -r '.items[-1].metadata.name')
# Check return code
$ echo $?
0
# Check var output
$ echo $latest_migration
null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so basically the end-result is that all of the latest_*
vars get set to null
, and inside the loop that compares current cluster name to latest_cluster
, we'll never enter the new block for gathering content of migrated namespaces.
Output from must-gather
[must-gather-n94dc] POD Error from server (NotFound): migmigrations.migration.openshift.io "null" not found
[must-gather-n94dc] POD jq: error (at <stdin>:191): Cannot iterate over null (null)
[must-gather-n94dc] POD [cluster=host][namespace=openshift-migration] Detected MTC installation
[must-gather-n94dc] POD [cluster=host] Not found within latest_migration_clusters=null
[must-gather-n94dc] POD null
So I think this is harmless and works as-is, however I'm open to input on whether we should squash the jq stderr and send it to /dev/null so that the error never shows up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@djwhatle I was under impression that if a script in must-gather returns non-zero exit code, the temporaray must-gather pod exits with non-zero status. But maybe that is incorrect. I think it'd be OK to have stderr printed in this way since this doesn't really break the must-gather container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct, it's just that the RC is always 0 even when jq
doesn't find something.
# Collect all resources in MTC namespaces with must-gather | ||
for ns in ${namespaces[@]}; do | ||
echo "[cluster=${cluster}][namespace=${ns}] Running oc adm inspect" | ||
/usr/bin/oc adm inspect --dest-dir must-gather/clusters/${cluster} --all-namespaces ns/${ns} & | ||
done | ||
|
||
# Collect PV and StorageClass info for cluster | ||
echo "[cluster=${cluster}] Running oc adm inspect storageclasses,persistentvolumes" | ||
usr/bin/oc adm inspect --dest-dir must-gather/clusters/${cluster} storageclasses,persistentvolumes & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, any specific reason for not including PVCs here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PVCs are namespace scoped resources, so they are gathered by the other invocations of oc adm inspect ns/<some-ns>
# Shorten logs from migrated-namespaces to last 100 lines | ||
echo "Shortening logs collected from migrated namespaces to last 100 lines" | ||
for logfile in $(find must-gather/clusters/*/migrated-namespaces/namespaces/*/*/ -name *.log); do | ||
tail -100 "${logfile}" > tmp.log && mv tmp.log "${logfile}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand where these logs are coming from. Are these logs from the Rsync/Stunnel pods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is any Pod logs gathered from the migrated-namespaces
(e.g. the namespaces that were on migplan.spec.namespaces
), so this is Pod logs for Rsync, Stunnel, Stage Pods, but also any other apps running in those namespaces.
The amount of logs that could be in these namespaces is potentially unbounded since this is where user apps run, so I thought it would make sense to only capture the last bits which would probably contain errors for Rsync or Stunnel or whatever. Do you think 100 lines is enough?
I also considered if we should only keep the logs for Rsync, Stunnel, Stage Pods, but since oc adm inspect
grabs everything I'd need to know the naming conventions (like a regex) for all of those Pod names so I could delete everything except those. I like the idea of only grabbing our stuff better but it may be more brittle since we're still moving containers around (e.g. current work to combine Rsync and Stunnel)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@djwhatle Agreed. I am unsure about gathering user application logs though. Are we supposed to be doing that at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's not a great reason for us to. I think you make a good point, I should refine this further so that it only gathers logs for Pods belonging to us. oc adm inspect
was the single-command solution but we can do better with some specialized logic.
This would also make me more comfortable keeping the full logs, which would be more useful.
Closes #32
default
andopenshift-image-registry
namespaces