-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report all files to UDP statistics collector #34873
Comments
A new Issue was created by @nsmith- Nicholas Smith. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@nsmith- How urgently would this information be needed? |
For now it supplies only additional, monitoring information. So, within a few months would be OK |
I would appreciate if we can backport this to some of the production releases, e.g. 10.6. |
10_6_X covers the digi step for UL so that should be OK for now |
So it turns out (thanks @Dr15Jones) that #35074 (and #35180) are causing ASAN failure
On a quick inspection it is rather evident that cmssw/Utilities/StorageFactory/src/StatisticsSenderService.cc Lines 241 to 245 in 6f84b7b
and they are being set in setCurrentServer() cmssw/Utilities/StorageFactory/src/StatisticsSenderService.cc Lines 165 to 169 in 6f84b7b
that is being called from cmssw/Utilities/XrdAdaptor/src/XrdRequestManager.cc Lines 241 to 256 in 6f84b7b
But in case pileup files are also being read over xrootd, what guarantees that at the time of primary file close the m_serverhost would still point to the server providing the primary file and not a server for one of the pileup files?
Overhaul of @nsmith- (or @smuzaffar) Could you reopen this issue? |
Is this somehow not a concern for the existing PoolSource usage of StatisticsSenderService? I think even in this case we could have multiple files open thanks to secondary inputs? |
All sources do their work serially. It seems to me that the secondary input file closings are not reported to the cmssw/IOPool/Input/src/RootPrimaryFileSequence.cc Lines 117 to 126 in 40d49e2
but not here cmssw/IOPool/Input/src/RootSecondaryFileSequence.cc Lines 57 to 63 in 40d49e2
Then, given that the |
Ok, and I guess this brings up another feature request to have secondary input file close be reported as well to |
Here is another issue in addition to the server name. The size of the reported file is set here
( m_size is atomic) and that gets called in TStorageFactoryFile::Initialize cmssw/IOPool/TFileAdaptor/src/TStorageFactoryFile.cc Lines 196 to 199 in 40d49e2
that I guess gets called for every ROOT file opened in CMSSW (so also those opened for writing). |
@nsmith- Could you reopen the issue? |
I don't seem to have permission to do so |
Ok :( Maybe because it was closed because of the linked PR being merged. @smuzaffar Could you reopen this issue? |
reopened now |
@nsmith- we need further feedback from you to understand what changes could be done to still work with your use case. |
I'm a bit puzzled how it is all files, because empirically we never see any pileup files in our monitoring that receives these messages. Is it possibly related to the thread safety issues of the existing implementation? |
Although only 1 file name is reported, the statistics (e.g. bytes read) are actually the accumulated statistics for ALL ROOT files being used in the job. |
And I assume reporting the isolated statistics per file is more challenging? |
It would be a near complete rewrite since it is full of singletons/global variables to do the accounting. |
As a historical note, the code in question predates CMSSW and was taken from CERN's original 'SEAL' project which was trying to write a C++ toolkit. |
Let me add that while a "near complete rewrite" of the |
I think part of the problem is a misinterpretation of the sent information. The info is not a 'per file' message, it is instead a snapshot of the accumulated state of the job at that instant in time. For the UDP message, that state is an arbitrary time. For the statistic sender service it is when the primary file was closed. It seems now the request is to also send information when secondary/embedded files are closed. That could still send the present 'full job statistics summary' information along. |
I see, I didn't realize it was an accumulated statistic over the duration of the job. I guess since there is a |
@bbockelm what was your intent when you added the CondorStatusService and StatisticsSenderService? |
So it looks like the StatisticsSenderService has three pieces of information that would need to be handled differently if we report more than just the primary file. These are file_lfn , server_host , and server_domain . It seems like we already have a problem. Right now the XrdAdaptor::RequestManager appears to change the server_* information when ever ANY file changes its server. So right now the reports have the lfn to the primary file but might be reporting the server_* for the secondary/embedded file that last changed servers. |
@nsmith- #35505 makes the secondary and embedded file closures to cause and UDP packet to be sent. The statistics are reset only on primary file closures (like currently). Can you comment if this is ok for the downstream that interprets the statistics? (the packet has timestamps for the beginning and end of the statistics collection period) |
Is there a marker to tell whether the current file is a primary file in the packet? If that's the case then at least it is possible to reconstruct the correct statistics, if I understand correctly. |
I left the packet info exactly as it had been so as to not cause any problems with code reading those packets. Though I'd think it would be pretty easy to differentiate primary from pileup given pileup LNFs tend to have 'MinimumBias' in their name. If you want something else done, please let us know. This information is not used in anyway by the framework so you can basically say what you want. |
For pileup I agree, but if we also now catch secondary inputs (which, granted, do not ever get used in production jobs.. at least not yet) then it may be more difficult. It should be harmless to add a new field to the json, but I'll ping @vkuznet to confirm, since iirc he wrote the server daemon that parses these packets and injects them into the MONIT infrastructure. |
@nsmith- , yes, the new UDP server I wrote just reads incoming UDP packets, these packets contain JSON, and if JSON has invalid data or data-type the server drops it, and therefore we loose this information in our accounting. Each valid JSON is injected into MONIT infrastructure, while all invalid JSON are dropped. |
@vkuznet the question is, do you need to keep the same content of the JSON, or can it be modified (.e.g. adding info about if this is the primary file)? |
Since JSON is fed into MONIT infrastructure it is desired to have the same schema. But since we push data into ElasticSearch DB which is schemaless it is possible to change the schema. I suggest that we discuss and settle on a schema of the document. For clarity, here is current schema (I took a single document from ES):
If you'll add new key-value pairs it would be easier to accommodate them in ES and monitoring infrastructure (in MONIT dashboards), but if you'll change data-types or drop attributes this will lead to conflicts which should be resolved. And, we advise to keep flat JSON, i.e. keep only key-value pairs, since it simplify indexing and visualization queries. |
In this case we just propose to add a new key |
the most important thing is that you should validate your JSON. We often see that people confuse python dicts with valid JSON, and even here we're discussing C++ code which generates JSON, it is possible to supply not valid data-type which may cause issues (a typical error we saw many times, a number (int/float) is supplied as a string in JSON). Said that, if you define an attribute with some data-type you should ensure that generated JSON is fully compatible with that data-type. |
@nsmith- from the code change I'm making, it is actually easy to say if this is a primary file, a 'secondary' file (I.e. the parent file when using the two file solution) or if this is from a 'secondary source' (i.e. from a pileup file or some other module usign the EmbeddedSource objects). Do you still only want |
Specified names sound even better! |
One question, so if I understand the |
I don't know. This was not code written by the framework team. We just 'inheritted' it. From the implementation cmssw/Utilities/StorageFactory/src/StatisticsSenderService.cc Lines 110 to 113 in 2ae6e5d
read_bytes and read_bytes_at_close are identical. Going all the way back to the very first commit of that file by Brian, those two values have always been identical.
|
@nsmith- I double checked with MONIT team and there would be no problem if we'll add additional attributes to the existing schema. Therefore, you may add as many as you want. and I checked udp collector server, and it is transparent to the JSON itself, i.e. it does not care about its content. Therefore, I would assume that once you'll add one or more attributes they will simply pop-up in ES and we will only need to adjust our dashboards to use them as filters or attributes for any plots. |
Thanks Valentin, this is great news! In fact I think this gives us licence to completely redesign the reporting if we so choose (and perhaps continue to send the existing data, just add a new packet with modified data) |
For several years now, any cmsRun process at a site with
in their
site-local-config.xml
would send a small UDP packet on every primary file close, reporting the LFN and some summary statistics. The code responsible is found athttps://github.com/cms-sw/cmssw/blob/master/Utilities/StorageFactory/src/StatisticsSenderService.cc
which registers an event watcher:
However, empirically this event seems to only fire for files opened through InputSource, and if you search for
postCloseFileSignal_
, onlycmssw/FWCore/Framework/src/InputSource.cc
Lines 505 to 508 in 7e1bd4e
shows up. It seems that the mixing module uses a different class chain
EmbeddedRootSource : VectorInputSource
for file handling. Can we add apostCloseFileSignal_
to this as well?The text was updated successfully, but these errors were encountered: