Correlate entries in monit_prod_cmssw_pop_* with those in monit_prod_condor_raw_metric* #36351

Closed
joseflix opened this issue Dec 3, 2021 · 10 comments · Fixed by #36570

Comments

@joseflix

joseflix commented Dec 3, 2021

Each CMSSW job reports popularity entries in monit_prod_cmssw_pop_*.
Each job also reports utilization and other metrics in monit_prod_condor_raw_metric*.

There is no ID that can be used as a unique identifier between these two sources, and one is needed. For example, if a job accessed a file X from local or remote storage, we want to know the resulting CPU efficiency degradation.

Files are listed in monit_prod_cmssw_pop_* and CPU utilization in monit_prod_condor_raw_metric*. There is currently no way to correlate the two.

Is it very difficult to add an ID that will help people to join both sources of information?

Thanks

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2021

A new Issue was created by @joseflix Josep Flix, PhD.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

makortel commented Dec 3, 2021

assign core

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2021

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented Dec 3, 2021

If I understood correctly, monit_prod_cmssw_pop_* is reported by StatisticsSenderService, and the CMSSWChirp fields of monit_prod_condor_raw_metric* are reported by CondorStatusUpdater.

(ref

The next step would be to look if a common identifier could be easily delivered to both.

@mrceyhun

mrceyhun commented Dec 4, 2021

Hi Matti,

  • monit_prod_cmssw_pop_* data is fed by the udp-collector service in CMS Monitoring. We just parse the incoming data and send it as is to ElasticSearch (through AMQ).
  • monit_prod_condor_raw_metric* data originally comes from the Condor Schedds. We query them every 12 minutes and send the results to ES after parsing and some calculations. The important part of parsing the query responses is convert_to_json.py and, as you already mentioned, documentation of all fields can be found here

Best,
Ceyhun (CMS Monitoring)

@dan131riley

InputSource has a GUID that could be used for this, but I have not looked to see how accessible it is from the monitoring routines:

/// Accessor for global process identifier
std::string const& processGUID() const { return processGUID_; }

@makortel
Contributor

makortel commented Dec 6, 2021

Thanks @dan131riley.

Just thinking out loud, one (probably not very good) option could be to make the PoolSource constructor deliver it to StatisticsSenderService and CondorStatusService. The CondorStatusService would need some thought, as it currently does not have a header file and therefore cannot be used with an edm::Service<T> handle (i.e. we would need to add the header and think a bit about the placement of the class to avoid extending the dependencies of IOPool/Input).

Likely a better place would be

try {
  //even if we have an exception, send the signal
  std::shared_ptr<int> sentry(nullptr, [areg, &md](void*) { areg->postSourceConstructionSignal_(md); });
  convertException::wrap([&]() {
    input = std::unique_ptr<InputSource>(InputSourceFactory::get()->makeInputSource(*main_input, isdesc).release());
    input->preEventReadFromSourceSignal_.connect(std::cref(areg->preEventReadFromSourceSignal_));
    input->postEventReadFromSourceSignal_.connect(std::cref(areg->postEventReadFromSourceSignal_));
  });
} catch (cms::Exception& iException) {
  std::ostringstream ost;
  ost << "Constructing input source of type " << modtype;
  iException.addContext(ost.str());
  throw;
}
return input;

Then the GUID would at least be propagated for all Source types. I wouldn't really want FWCore to gain direct dependence on StatisticsSenderService (i.e. Utilities/StorageFactory), and considering also CondorStatusService makes me think of adding a new signal to ActivityRegistry for this. I also thought about using postSourceConstructionSignal_ for this, but this information wouldn't play well with "sending signal even with exception".
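
To make the "new signal" idea a bit more concrete, here is a rough, standalone sketch of the decoupling it would give. It does not use the real ActivityRegistry or any CMSSW types: GuidSignal, Registry, PopularityReporter, BatchStatusReporter and announceProcessGuid are all hypothetical names invented for illustration. The point is only that the code constructing the source emits the GUID once, and each service subscribes without the emitter depending on either of them.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a framework signal: a plain list of callbacks.
class GuidSignal {
public:
  using Slot = std::function<void(std::string const&)>;

  void connect(Slot slot) { slots_.push_back(std::move(slot)); }

  // Call every connected slot with the process GUID.
  void emit(std::string const& guid) const {
    for (auto const& slot : slots_) {
      slot(guid);
    }
  }

private:
  std::vector<Slot> slots_;
};

// Hypothetical registry that owns the signal (plays the role of a central hub).
struct Registry {
  GuidSignal processGuidSignal;
};

// Hypothetical service sending popularity data; it only knows the registry.
struct PopularityReporter {
  std::string guid;
  void attach(Registry& reg) {
    reg.processGuidSignal.connect([this](std::string const& g) { guid = g; });
  }
};

// Hypothetical service updating the batch system; independent of the one above.
struct BatchStatusReporter {
  std::string guid;
  void attach(Registry& reg) {
    reg.processGuidSignal.connect([this](std::string const& g) { guid = g; });
  }
};

// Would be called only after the source was constructed successfully,
// so no listener ever sees a GUID from a failed construction.
inline void announceProcessGuid(Registry const& reg, std::string const& guid) {
  assert(!guid.empty());
  reg.processGuidSignal.emit(guid);
}
```

Emitting from a separate call after successful construction, rather than from the sentry that fires even on exceptions, is what avoids the mismatch with postSourceConstructionSignal_ mentioned above.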

@makortel
Contributor

@joseflix (Mostly out of curiosity, I'd like to understand the bigger picture better) The framework job report (which I think gets uploaded into the monitoring system) reports the processed files, and the wall clock and CPU times (for the CPU efficiency calculation). Is this information not sufficient? (For example, is something missing, or would you like the information already while the jobs are running instead of after the fact?)

@makortel
Contributor

#36570 adds a "process-level GUID" that is used in InputSource, StatisticsSenderService, and CondorStatusService.

@joseflix The StatisticsSenderService (which sends the UDP packets for the popularity data) already had a field unique_id: "<guid>-<fileid>". #36570 just makes the <guid> in that field the same as a new field Guid in the Condor Chirp messages. Could you confirm whether that would be sufficient for correlating the data in downstream analysis?
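
As a sanity check of how such a join could look downstream, here is a minimal standalone sketch. The record layouts (PopularityEntry, CondorEntry, Correlated) and the cpuEfficiency field are hypothetical; only the unique_id format "<guid>-<fileid>" and the existence of a Guid field come from the description above. It also assumes the <fileid> part contains no '-' itself, which is why the split is done at the last '-' (the GUID may well contain hyphens).

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical shape of one popularity record.
struct PopularityEntry {
  std::string uniqueId;  // "<guid>-<fileid>"
  std::string fileName;
};

// Hypothetical shape of one Condor record carrying the new Guid field.
struct CondorEntry {
  std::string guid;
  double cpuEfficiency;  // assumed metric name, for illustration only
};

struct Correlated {
  std::string fileName;
  double cpuEfficiency;
};

// Extract "<guid>" from "<guid>-<fileid>". Splits at the LAST '-' because the
// GUID may contain hyphens. Returns an empty string if the input is malformed.
std::string guidFromUniqueId(std::string const& uniqueId) {
  std::string::size_type pos = uniqueId.rfind('-');
  if (pos == std::string::npos || pos == 0) {
    return std::string();
  }
  std::string guid = uniqueId.substr(0, pos);
  assert(!guid.empty());  // guaranteed by pos > 0
  return guid;
}

// Inner join of the two record sets on the GUID.
std::vector<Correlated> correlate(std::vector<PopularityEntry> const& pop,
                                  std::vector<CondorEntry> const& condor) {
  // Index Condor records by GUID; if a job was reported several times,
  // the last record seen wins.
  std::unordered_map<std::string, double> efficiencyByGuid;
  for (auto const& c : condor) {
    efficiencyByGuid[c.guid] = c.cpuEfficiency;
  }

  std::vector<Correlated> out;
  for (auto const& p : pop) {
    std::string guid = guidFromUniqueId(p.uniqueId);
    if (guid.empty()) {
      continue;  // skip malformed unique_id values
    }
    auto it = efficiencyByGuid.find(guid);
    if (it != efficiencyByGuid.end()) {
      out.push_back({p.fileName, it->second});
    }
  }
  return out;
}
```

Popularity entries without a matching Condor record are simply dropped here; a real analysis might prefer to keep them and flag the missing side.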

@joseflix
Author

Hi @makortel I guess what you are proposing in the last message is fine. We need a guid that is the same in the two views, so we can later correlate entries that appear in the two views.

Concerning the FJR, I think they don't go to Kibana. Maybe they go somewhere else and can be parsed, but I have no idea where.
