Add gathering of resources from last migrated namespaces, gather PVs, StorageClasses (cluster-scoped), Routes, Service (just registry namespaces) #34

djwhatle · 2021-03-23T22:24:42Z

Closes #32

Provides new capability to gather content of namespaces from most recent migration attempt. This is necessary for troubleshooting things like Rsync/Stunnel pods not coming up, Stage Pod issues, PVC attach issues, etc. Secrets are scrubbed out, and only the last 100 log lines are included from Pods in these namespaces.

Adds gather of StorageClasses, PVs
Adds gather of Routes, Service from default and openshift-image-registry namespaces

JaydipGabani · 2021-03-24T19:14:26Z

README.md

 - Prometheus metrics

 **Essential-only gather**

 Differences from full gather:
 - Logs are only gathered from specified time window
 - Skips collection of prometheus metrics, pprof. Removes duplicate logs from payload.
-```
+```sh


is "sh" intentional here? It won't show up on README I think

Yeah, so this bit just signals to the markdown renderer that this block of code should be rendered with syntax highlighting for shell scripts.

I believe this is the full list of supported syntax highlighting.

pranavgaikwad

Overall looks good to me. Added some questions, feel free to merge away otherwise

pranavgaikwad · 2021-03-24T21:24:07Z

collection-scripts/gather

@@ -16,6 +16,12 @@ for localns in $(/usr/bin/oc get migrationcontrollers.migration.openshift.io --a
  done
  echo "Will collect debug info from migclusters [${clusters[@]}]"

+  # Find the latest migration, plan, and associated namespaces so we can gather additional info from those
+  latest_migration=$(oc -n ${localns} get migmigration -o json | jq -r '.items|=sort_by(.metadata.creationTimestamp)' | jq -r '.items[-1].metadata.name')


I am not familliar with jq. Does items[-1] fail when there are no migmigrations present?

Looks like it results in "null" getting set to the var and a 0 return code. Will try a full must-gather in this scenario and make sure it doesn't crash.

$ oc delete migmigration --all migmigration.migration.openshift.io "8069a3d0-8c19-11eb-b0bf-5fcc05ad4e8c" deleted migmigration.migration.openshift.io "b5a018a0-8b1e-11eb-ac4b-3b5c0d147d47" deleted $ latest_migration=$(oc -n openshift-migration get migmigration -o json | jq -r '.items|=sort_by(.metadata.creationTimestamp)' | jq -r '.items[-1].metadata.name') # Check return code $ echo $? 0 # Check var output $ echo $latest_migration null

Ok, so basically the end-result is that all of the latest_* vars get set to null, and inside the loop that compares current cluster name to latest_cluster, we'll never enter the new block for gathering content of migrated namespaces.

Output from must-gather

[must-gather-n94dc] POD Error from server (NotFound): migmigrations.migration.openshift.io "null" not found [must-gather-n94dc] POD jq: error (at <stdin>:191): Cannot iterate over null (null) [must-gather-n94dc] POD [cluster=host][namespace=openshift-migration] Detected MTC installation [must-gather-n94dc] POD [cluster=host] Not found within latest_migration_clusters=null [must-gather-n94dc] POD null

So I think this is harmless and works as-is, however I'm open to input on whether we should squash the jq stderr and send it to /dev/null so that the error never shows up.

@djwhatle I was under impression that if a script in must-gather returns non-zero exit code, the temporaray must-gather pod exits with non-zero status. But maybe that is incorrect. I think it'd be OK to have stderr printed in this way since this doesn't really break the must-gather container.

You're correct, it's just that the RC is always 0 even when jq doesn't find something.

pranavgaikwad · 2021-03-24T21:25:44Z

collection-scripts/gather

    # Collect all resources in MTC namespaces with must-gather
    for ns in ${namespaces[@]}; do
      echo "[cluster=${cluster}][namespace=${ns}] Running oc adm inspect"
      /usr/bin/oc adm inspect --dest-dir must-gather/clusters/${cluster} --all-namespaces ns/${ns} &
    done

+    # Collect PV and StorageClass info for cluster
+    echo "[cluster=${cluster}] Running oc adm inspect storageclasses,persistentvolumes"
+    usr/bin/oc adm inspect --dest-dir must-gather/clusters/${cluster} storageclasses,persistentvolumes &


+1, any specific reason for not including PVCs here?

PVCs are namespace scoped resources, so they are gathered by the other invocations of oc adm inspect ns/<some-ns>

pranavgaikwad · 2021-03-24T21:27:45Z

collection-scripts/gather

+# Shorten logs from migrated-namespaces to last 100 lines
+echo "Shortening logs collected from migrated namespaces to last 100 lines"
+for logfile in $(find must-gather/clusters/*/migrated-namespaces/namespaces/*/*/ -name *.log); do
+  tail -100 "${logfile}" > tmp.log && mv tmp.log "${logfile}"


I am not sure I understand where these logs are coming from. Are these logs from the Rsync/Stunnel pods?

This is any Pod logs gathered from the migrated-namespaces (e.g. the namespaces that were on migplan.spec.namespaces), so this is Pod logs for Rsync, Stunnel, Stage Pods, but also any other apps running in those namespaces.

The amount of logs that could be in these namespaces is potentially unbounded since this is where user apps run, so I thought it would make sense to only capture the last bits which would probably contain errors for Rsync or Stunnel or whatever. Do you think 100 lines is enough?

I also considered if we should only keep the logs for Rsync, Stunnel, Stage Pods, but since oc adm inspect grabs everything I'd need to know the naming conventions (like a regex) for all of those Pod names so I could delete everything except those. I like the idea of only grabbing our stuff better but it may be more brittle since we're still moving containers around (e.g. current work to combine Rsync and Stunnel)

@djwhatle Agreed. I am unsure about gathering user application logs though. Are we supposed to be doing that at all?

There's not a great reason for us to. I think you make a good point, I should refine this further so that it only gathers logs for Pods belonging to us. oc adm inspect was the single-command solution but we can do better with some specialized logic.

This would also make me more comfortable keeping the full logs, which would be more useful.

djwhatle added 7 commits March 23, 2021 16:00

Add jq to Dockerfile for parsing MigPlans

ca5a3fc

Gather YAML for registry route/svc, PV, StorageClasses for clusters

f58e298

Update

f5dc371

Remove migrated-namespace collection from main gather

aff73ac

Update README

cd9c64b

Update README

7e9c055

Update README

e380083

djwhatle requested review from JaydipGabani and pranavgaikwad March 23, 2021 22:25

Consolidate migrated-namespaces to single gather

5ee727b

JaydipGabani reviewed Mar 24, 2021

View reviewed changes

pranavgaikwad approved these changes Mar 24, 2021

View reviewed changes

JaydipGabani approved these changes Mar 24, 2021

View reviewed changes

Update gather to always gather migrated-ns content

6e8cc97

djwhatle closed this Oct 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add gathering of resources from last migrated namespaces, gather PVs, StorageClasses (cluster-scoped), Routes, Service (just registry namespaces) #34

Add gathering of resources from last migrated namespaces, gather PVs, StorageClasses (cluster-scoped), Routes, Service (just registry namespaces) #34

djwhatle commented Mar 23, 2021 •

edited

Loading

JaydipGabani Mar 24, 2021

djwhatle Mar 25, 2021 •

edited

Loading

pranavgaikwad left a comment

pranavgaikwad Mar 24, 2021

djwhatle Mar 25, 2021 •

edited

Loading

djwhatle Mar 25, 2021 •

edited

Loading

pranavgaikwad Mar 25, 2021 •

edited

Loading

djwhatle Mar 25, 2021

pranavgaikwad Mar 24, 2021

djwhatle Mar 25, 2021 •

edited

Loading

pranavgaikwad Mar 24, 2021

djwhatle Mar 25, 2021 •

edited

Loading

pranavgaikwad Mar 25, 2021

djwhatle Mar 25, 2021 •

edited

Loading

Add gathering of resources from last migrated namespaces, gather PVs, StorageClasses (cluster-scoped), Routes, Service (just registry namespaces) #34

Add gathering of resources from last migrated namespaces, gather PVs, StorageClasses (cluster-scoped), Routes, Service (just registry namespaces) #34

Conversation

djwhatle commented Mar 23, 2021 • edited Loading

JaydipGabani Mar 24, 2021

Choose a reason for hiding this comment

djwhatle Mar 25, 2021 • edited Loading

Choose a reason for hiding this comment

pranavgaikwad left a comment

Choose a reason for hiding this comment

pranavgaikwad Mar 24, 2021

Choose a reason for hiding this comment

djwhatle Mar 25, 2021 • edited Loading

Choose a reason for hiding this comment

djwhatle Mar 25, 2021 • edited Loading

Choose a reason for hiding this comment

pranavgaikwad Mar 25, 2021 • edited Loading

Choose a reason for hiding this comment

djwhatle Mar 25, 2021

Choose a reason for hiding this comment

pranavgaikwad Mar 24, 2021

Choose a reason for hiding this comment

djwhatle Mar 25, 2021 • edited Loading

Choose a reason for hiding this comment

pranavgaikwad Mar 24, 2021

Choose a reason for hiding this comment

djwhatle Mar 25, 2021 • edited Loading

Choose a reason for hiding this comment

pranavgaikwad Mar 25, 2021

Choose a reason for hiding this comment

djwhatle Mar 25, 2021 • edited Loading

Choose a reason for hiding this comment

djwhatle commented Mar 23, 2021 •

edited

Loading

djwhatle Mar 25, 2021 •

edited

Loading

djwhatle Mar 25, 2021 •

edited

Loading

djwhatle Mar 25, 2021 •

edited

Loading

pranavgaikwad Mar 25, 2021 •

edited

Loading

djwhatle Mar 25, 2021 •

edited

Loading

djwhatle Mar 25, 2021 •

edited

Loading

djwhatle Mar 25, 2021 •

edited

Loading