Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MSTransferor typo calling getName method #11031

Merged
merged 1 commit into from
Mar 10, 2022
Merged

Conversation

amaltaro
Copy link
Contributor

Fixes #11030

Status

ready

Description

The title says it all, correct the method name called for the Workflow object, which should be getName().

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

Complement to #10959

External dependencies / deployment changes

None

Copy link
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's perfect!

Thanks @amaltaro

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12865/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro merged commit d0dd92e into dmwm:master Mar 10, 2022
@amaltaro
Copy link
Contributor Author

I discussed this with Imran and he agreed to get this tested in production, before cutting a new release. It does fix the problem, but now we hit another issue when creating the alert with the alert manager tool:

2022-03-10 13:38:42,389:INFO:MSTransferor: Finding final pileup destination for request: cmsunified_task_TOP-RunIISummer20UL18pLHEGEN-00062__v1_T_210811_092821_4442
2022-03-10 13:38:42,389:ERROR:MSTransferor: Workflow has been incorrectly assigned: cmsunified_task_TOP-RunIISummer20UL18pLHEGEN-00062__v1_T_210811_092821_4442. The secondary dataset: /MinBias_TuneCP5_13TeV-pythia8/RunIISummer20UL18SIM-10
6X_upgrade2018_realistic_v11_L1v1-v2/GEN-SIM,belongs to the campaign: RunIISummer20UL18DIGI, with Secondaries location set to: {'T1_IT_CNAF'}. While the workflow has been assigned to: ['T2_CH_CERN', 'T2_CH_CERN_HLT']
2022-03-10 13:38:42,395:ERROR:MSTransferor: Failed to send alert to http://cms-monitoring.cern.ch:30093/api/v1/alerts. Error: url=http://cms-monitoring.cern.ch:30093/api/v1/alerts, code=400, reason=Bad Request, headers={'Access-Control-Al
low-Headers': 'Accept, Authorization, Content-Type, Origin', 'Access-Control-Allow-Methods': 'GET, POST, DELETE, OPTIONS', 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'Date', 'Cache-Control': 'no-cache, no-store, 
must-revalidate', 'Content-Type': 'application/json', 'Date': 'Thu, 10 Mar 2022 12:38:42 GMT', 'Content-Length': '140'}, result=b'{"status":"error","errorType":"bad_data","error":"json: cannot unmarshal number into Go struct field Alert.l
abels of type model.LabelValue"}'
Traceback (most recent call last):
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/MicroService/MSTransferor/MSTransferor.py", line 748, in sendAlert
    self.alertManagerApi.sendAlert(alertName, severity, summary, description,
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/Services/AlertManager/AlertManagerAPI.py", line 93, in sendAlert
    res = self.mgr.getdata(self.alertManagerUrl, params=params, headers=self.headers, verb='POST')
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 316, in getdata
    _, data = self.request(url=url, params=params, headers=headers, verb=verb,
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/PortForward.py", line 69, in portMangle
    return callFunc(callObj, url, *args, **kwargs)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 306, in request
    raise exc
http.client.HTTPException: url=http://cms-monitoring.cern.ch:30093/api/v1/alerts, code=400, reason=Bad Request, headers={'Access-Control-Allow-Headers': 'Accept, Authorization, Content-Type, Origin', 'Access-Control-Allow-Methods': 'GET, 
POST, DELETE, OPTIONS', 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'Date', 'Cache-Control': 'no-cache, no-store, must-revalidate', 'Content-Type': 'application/json', 'Date': 'Thu, 10 Mar 2022 12:38:42 GMT', 'Con
tent-Length': '140'}, result=b'{"status":"error","errorType":"bad_data","error":"json: cannot unmarshal number into Go struct field Alert.labels of type model.LabelValue"}'
2022-03-10 13:38:42,395:CRITICAL:MSTransferor: Workflow: cmsunified_task_TOP-RunIISummer20UL18pLHEGEN-00062__v1_T_210811_092821_4442 could not proceed due to some PU misconfiguration,so it will be skipped.

@vkuznet
Copy link
Contributor

vkuznet commented Mar 10, 2022

@amaltaro , this is typical error in Go when someone provide wrong data type. There is some data structure in alert manager which is used to marshal your data. The error message specifies it correctly: "json: cannot unmarshal number into Go struct field Alert.labels of type model.LabelValue". It means that your POST data contains a value which is not a number, e.g. you pass string as integer, etc. To identify the problem you should find, and better add to your traceback, the POST payload data. Then it would be clear to see which JSON attribute has a wrong data type.

@amaltaro
Copy link
Contributor Author

Thanks for looking into it, Valentin. I was indeed about to tag you.

I fail to see any changes related to the way we create these alerts and/or anything in the underlying layer (AlertManagerAPI and pycurl_manager). Was there any change on the other side when handling these alerts?

Just for the record, one can reproduce this problem from the ms-transferor pod, with the following recipe:

source /data/srv/current/apps/reqmgr2ms/etc/profile.d/init.sh 
python3

then in the python interpreter, something like:

from WMCore.Services.AlertManager.AlertManagerAPI import AlertManagerAPI
alertManagerApi = AlertManagerAPI("http://cms-monitoring.cern.ch:30093/api/v1/alerts")

alertServiceName = "ms-transferor"
workflowName = "cmsunified_task_TOP-RunIISummer20UL17pLHEGEN-00061__v1_T_210810_204636_3829"
alertName = "{}: PU misconfiguration error. Workflow: {}".format(alertServiceName, workflowName)
alertSeverity = "high"
alertSummary = "[MSTransferor] Workflow cannot proceed due to some PU misconfiguration."
alertDescription = "Workflow: {} could not proceed due to some PU misconfiguration,".format(workflowName)
alertDescription += "so it will be skipped."

alertManagerApi.sendAlert(alertName, alertSeverity, alertSummary, alertDescription, alertServiceName, 1 * 60 * 60)

Hmm, perhaps a minor pycurl upgrade that we had in the last weeks is affecting this... debugging.

@amaltaro
Copy link
Contributor Author

@vkuznet dumping the json passed to pycurl_manager (RequestHandler), here it is:

In [4]: json.loads(params)
Out[4]: 
[{'labels': {'alertname': 'ms-transferor: PU misconfiguration error. Workflow: cmsunified_task_TOP-RunIISummer20UL17pLHEGEN-00061__v1_T_210810_204636_3829',
   'severity': 'high',
   'tag': 3600,
   'service': 'ms-transferor'},
  'annotations': {'hostname': 'ms-transferor-977596c88-zbcmr',
   'summary': '[MSTransferor] Workflow cannot proceed due to some PU misconfiguration.',
   'description': 'Workflow: cmsunified_task_TOP-RunIISummer20UL17pLHEGEN-00061__v1_T_210810_204636_3829 could not proceed due to some PU misconfiguration,so it will be skipped.'},
  'endsAt': '2022-03-10T14:49:23.104078+01:00',
  'generatorURL': ''}]

are you able to spot any problem with this data structure?

@amaltaro
Copy link
Contributor Author

I think I know what the problem is, likely with the tag. Checking...

@amaltaro
Copy link
Contributor Author

We moved from positional + keyword arguments (in an old WMCore tag):
https://github.com/dmwm/WMCore/blob/2.0.0/src/python/WMCore/MicroService/MSTransferor/MSTransferor.py#L763-L764

to a positional-only mode:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSTransferor/MSTransferor.py#L748-L749

thus, setting the wrong parameters on the sendAlert method. Let me test this theory now and come with a fix, if that is the case.

@vkuznet
Copy link
Contributor

vkuznet commented Mar 10, 2022

@amaltaro , sorry I was resolving another disaster on k8s cluster. I'm glad you spot the issue, but the main problem is really lack of data-type checking in Python.

@vkuznet
Copy link
Contributor

vkuznet commented Mar 10, 2022

I think the proper way is to check each data type you use, but it will require additional coding and checking all attributes of data structure. If 'tag' is a string, then you should check its values to be string, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MSTransferor: typo accessing Workflow getName attribute
4 participants