-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[XrdAdaptor] Modifications for xrootd-5.3 #34700
[XrdAdaptor] Modifications for xrootd-5.3 #34700
Conversation
… as glibc version is used for RHEL7.
…appending triedrc=resel to opaque parameter list.
@osschar, CMSSW_12_0_X branch is closed for direct updates. cms-bot is going to move this PR to master branch. |
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34700/24329
Code check has found code style and quality issues which could be resolved by applying following patch(s)
|
Existing issue is a crash, presumably due to a race condition, at cmsRun startup while doing TFile::Open(). I was surprised TFile::Open() is called at all, I thought XrdAdaptor will get invoked immediately (the same happens with 11_3_2 and xrootd-4.12 -- TFile::Open, not the crash :) ). See stack trace below. I'm assuming it is some sort or race condition as when I ran this in debugger and set breakpoint at TFile::Open(), the thing managed to initialize properly and then ran successfully. @Dr15Jones @dan131riley @abh3 @simonmichal -- please take a look and tell me if anything rings a bell here.
|
@bbockelm please review the change for multi-source. There is a question for you in the code if we should handle redirect-limit differently as this will happen on XCache clusters all the time and with current implementation the manager will go for a new file every 5 seconds (which we are seeing at SoCal XCache): |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34700/24330
|
A new Pull Request was created by @osschar (Matevž Tadel) for master. It involves the following packages:
@makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test with cms-sw/cmsdist#7180 |
please test with cms-sw/cmsdist#7180 |
@osschar , I do not think you need any rebase. There are no conflicting changes. But if you then you can do something like
and this should rebase on top of existing master branch. After that you can push changes to your branch |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bc0314/17919/summary.html The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic: You can see more details here: Comparison SummarySummary:
|
That could be useful, so fine for me. There is still a question to @bbockelm cmssw/Utilities/XrdAdaptor/src/XrdRequestManager.cc Lines 1029 to 1034 in b2fbf83
but I would not let that prevent testing in DEVEL. |
so lets get it in DEVEL |
thank you ... I also think it's important to get this moving. i'll follow up with Brian and eventually open a new PR if needed. |
couple of workflows in DEVEL IBs are failed/timed out (after 9000s) and at the end I see messages like
looks like the job was finished with in few minutes but xrootd was hanging causing it to be killed after 9000s. |
I see five files were closed at the end of the job. This job then went on for another 3 hours, right? I assume this doesn't happen otherwise :) How could I run this to reproduce it and attach gdb to see what "main" is waiting for? Do we have debug on (or available) for these external/cmssw builds? [ An aside: another thing I added in XrdAdaptor in this PR is that it advertises when it is doing additional open for multi-source reading (which happens here). This is really bad for XCache, and I dare guess it is annoying EOS people a bit as well. The servers/redirectors already support limiting number of opened "replicas" if so configured ... this could be enabled on CMS EOS once this all gets through. ] |
correct. Note that this is not happening a lot. We run over 900 workflows and only 1 or 2 fail with this type of errors.
No xrootd and other externals are not build in debug mode but for DEVEL IBs I can build xrootd in debug mode. In order to re-produce you ca do some thing like
Hopefully this will run and generate
as I wrote that frequency of the failure is low so you might have to run it many time in order to get this type of failure. Note that CMSSW_12_1_DEVEL_X_2021-08-29-2300 is without debug symbols, I will build CMSSW_12_1_DEVEL_X_2021-08-30-1100 with xrootd in debug mode |
Thank you! I'm running the test in 10 terminals on various lxpluses, will keep running them over the day. |
I suddenly got the hangups in all 10 terminals at the same time, the processes that I managed to attach to were all stuck in main thread:
[ I say managed to attach to as I was running off my AFS home directory (in different tmp dirs so Now, I sort of assume IB tests are not run off AFS so this might or might not be the same thing. What kind of FS did the IB tests run on? Or was writing also done via xroot? |
IB/PR tests run on local SSD disk |
Also the printouts in #34700 (comment) came some time after the |
There are several usages of deleted Class QueryAttrHandler could in principle hold the FileSystem object ... but the whole thing looks over-engineered 10-times over: https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_DEVEL_X/Utilities/XrdAdaptor/src/XrdSource.cc#L82 XrdCl::FileSystem provides a sync version of Query() where one can pass in timeout directly. @simonmichal @bbockelm Can you please comment?
|
CMSSW_12_1_DEVEL_X_2021-09-05-2300 has a timeout with a stack trace that looks a lot like a deadlock.
|
Here is another crash from CMSSW_12_1_DEVEL_X_2021-09-13-2300
|
Given the frequent crashes in DEVEL, should we consider reverting these changes until we have a fix for those? (assuming repeating the crashes has not value given the extracted stack traces above) |
Sorry, I dropped the ball on this :( Let me fix the thing I've already figured out is wrong, another local variable XrdCl::FileSystem that gets deleted and later accessed from response handler. |
There is one issue and a possible improvement ... I'll add those as separate comments.