-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AXOL1TL related triggers fire on every event in CMSSW_14_1_X #44435
Comments
assign l1, hlt |
New categories assigned: l1,hlt @Martin-Grunewald,@mmusich,@epalencia,@aloeliger you have been requested to review this Pull request/Issue and eventually sign? Thanks |
cms-bot internal usage |
A new Issue was created by @mmusich. @sextonkennedy, @smuzaffar, @rappoccio, @makortel, @Dr15Jones, @antoniovilela can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Posted in #44397 and #44433, copied here by request I'm catching up after other work yesterday afternoon. Melissa has seen that removing the CICADA emulator changes AXO's answers. The CICADA team (myself actually) caught this one earlier this week looking at CICADA firmware/emulator matching. We could run multiple CICADA emulators at the same time and get different answers from a model than we would running only one emulator. I couldn't figure it out and reverted to only doing one at a time because that's how the emulator works anyways and I suspected it was some strange internal CICADA version that could be sorted out at some point later. What I suspect now now is that this issue is fundamentally symbol collision. Strictly, there isn't anything wrong with CICADA or AXO, the issue is that they share c++ code/symbols that are not specific to themselves in the dynamic linking case used by the emulator technique. My theory on what is happening is this: n 14_1 we updated the CICADA model to 1.1.1. One of the layers of CICADA 1.1.1 is weights w2, seen here: https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v1p1p1/cicada.cpp#L36 AXO, by extension, uses w2 (and b2) as well: https://github.com/cms-hls4ml/AXOL1TL/blob/f21d72bc08ab6e7a86292db9ff0637532a311c2c/AXOL1TL_v3/NN/GTADModel_v3.cpp#L65 https://github.com/cms-hls4ml/AXOL1TL/blob/f21d72bc08ab6e7a86292db9ff0637532a311c2c/AXOL1TL_v3/NN/weights.h#L5 (notably, they are also sized such that AXO could comfortably use CICADA's weights). These weights are defined globally for both models. As the CMSSW job runs, CICADA gets loaded and created, and these weights loaded into the symbol table here by the dlopen here: https://github.com/cms-hls4ml/hls4mlEmulatorExtras/blob/17790de8f2f2892dfd8ff20fead8eedd3cf59b49/src/hls4ml/emulator.cc#L21. Later, the AXO model gets created, and the load also attempted. I suspect that behind the scenes, dlopen sees that it would be loading things into the symbol table that are already defined, and quietly drops it in favor of things that are already defined. In essence, when we upgraded CICADA to 1.1.1, by no real fault of our own, we ended up accidentally switching AXO to CICADA's weights. I suspect a quick and dirty solution would be to do to CICADA what AXOL1TL has accidentally been doing (and we've all been asking AXO to stop ironically), which is create and destroy itself in one swoop, with no real persistence for model objects beyond creating a result. In the longer run, the usage of this emulator technique needs to be fixed or changed. |
Also adding my comment from the Test PR here: Right, so this AXO fix is what I expected. But what I also was anticipating and seemingly got confirmed is that other hls4ml emulators might also be affected. Processed events: 9 out of 10 (90%)
this is a PHASE2 workflow that @aloeliger added recently for the p2GT emulator (thanks for this!!) so this is indeed something very hls4ml specific |
This phase2 thing also brings up another issue I was long wondering about: why do we run the phase1 L1 emulation in the phase2? From the L1 configuration it seems the SimL1 emulator always runs both phases.. |
This is exactly what should happen if the
(from https://man7.org/linux/man-pages/man3/dlopen.3.html) Well, quick search indicates that even |
I don't think Having two different definition of the same global symbol is a violation of the C++ One Definition Rule. If those symbols are not supposed to be visible outside of each model, possible solutions are:
If those symbols must be visible, the easiest approach to avoid conflicts is to move all the symbols within a a different namespace, for each model and for each version of the model. |
I don't think even this is guaranteed to work properly. What if two different models are being loaded and used concurrently in two threads? I second @fwyzard's comment on using namespaces. In any case they are the standard way to deal disambiguate otherwise identical symbols, and thus easiest to guarantee a properly working setup. In addition, global variables such as |
Actually, why do they need to be global variables ? A model update that involves only a change in weights can reuse the same C++ type, and just use a different instance. A model update that involves a change in the model structure would result in a different C++ type. |
These are questions for the HLS4ML developers, I can't speak to the viability of any of them. This c++ code was originally intended for synthesis on FPGAs, and it is just by convenience of it being c++ that it was adapted for emulators. I really don't know if there are any of a dozen reasons that the HLS may not work for things like this. I can attempt to try to recreate the issue, and insert namespaces around delicate pieces of the machinery and see if it fixes the issue, but no guarantees on that being a fast process. |
Okay. I just want to make sure I have documented here some investigations I've done into this. I looked at an instance of just running the CICADA emulator for version 2.1.1, it ran on a single thread, no other model should be running. In case it matters, it was running on a 2023 zero bias file: I ran for 10 events, and got this as a set of baseline scores out of CICADA:
I then setup a duplicate emulator to run, this time running an earlier model version v2.1.0. To be explicit, v2.1.0 runs in the official emulator path, and v2.1.1 was running in a duplicate emulator path both running in the same configuration. In any case, the new results from the original 2.1.1 model were then changed:
I then tested recompiling the v2.1.0 model, this time, with a namespace around the weight files (i.e. putting
It does seem like weight issues in this sort of trivial example are causing the model interference problem. CICADA and AXOL1TL could try to isolate these weights from each other as a first solution, but it will be a bit tedious. And I guess I can't promise that other elements aren't going to conflict somewhere either. |
More documentation of investigation. I have spoken with the HLS4ML developers and they are also surprised and about all of this, so I'm in the middle of doing more tests. Two ideas have come from the HLS4ML developers. The first is that there is aready an HLS4ML model in CMSSW before we came up with this emulator technique, a NN Taus model, here: https://github.com/cms-sw/cmssw/tree/master/L1Trigger/Phase2L1ParticleFlow/interface/taus. They want to understand if having this around and already in CMSSW is itself responsible for interference behavior. The second is they want to see the effect of trying to namespace typedefs because that seems a likely spot for this issue to originate. I had a few things I wanted to check as well, so I reintroduced the CICADA-CICADA model conflict, and put some debugging statements into the 2.1.1 model (currently being interfered with by a separate path running the 2.1.0 model). With the interference present, this is it's summary of it's inputs, what it thinks one of it's smaller weight layers looks like, and the output it thinks it's producing:
If I remove the interference, I get this:
The inputs themselves are not being interfered with (although both 2.1.0 and 2.1.1. use a 10 bit fixed int format for input), however, the weights of the layer are obviously different. Curiously, the first few weights in the interfering models case seem to be almost exactly half of the weights in the no interference case. I think it's worth noting that none of the no interference case weights are exactly what is present in the CICADA 2.1.1 weight specification: But those weights are not expressible in the reduced precision specified for FPGA types, while the quoted weights seem to be. Also worth noting, given the approximate power of two difference, is that the interfering model, 2.1.0, defines these weights to be 10 bits long, all decimal values right of the decimal point, and model being interfered with, 2.1.1 defines these weights to be 16 bits long, 11 of those right of the decimal point. 2.1.0: https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v2p1/defines.h#L26 Which lead me to believe the namespaced typedefs may be the most "accurate" solution to this issue. Inserting those namespaces around the typedef'ed defines (https://github.com/cms-hls4ml/CICADA/blob/main/CICADA_v2p1p1/defines.h), and structs(https://github.com/cms-hls4ml/CICADA/blob/main/CICADA_v2p1p1/parameters.h), (and other requisite model usage areas) recompiling, and rerunning with the "interference" still present:
Same interference as before. Note I also added a statement about how big it thinks the type it is storing the weight in is (https://docs.amd.com/r/en-US/ug1399-vitis-hls/Other-Class-Methods-Operators-and-Data-Members). It is correctly assessing this type as the 16 bit weight type from 2.1.1, and not as a 10 bit type from 2.1.0. That part's a little confusing to me. Inserting the namespace around the weights again:
Again drops the interference seen. |
Suggested readings:
That model is surely affected by the same problem, since it uses plenty of unscoped variables in the global namespace. Maybe #43639 should have been reviewed more accurately.
What do you mean by "namespace typedefs" ? |
Yet more documentation of investigation into the issue: I pulled the namespaces back out of the weights to re-introduce the interference again, and this time cracked open the internals of the 2.1.0 model. 2.1.0 and 2.1.1 are just bug-fix versions of each other, but I was somewhat surprised to discover their weights are actually the same for w3 layer (https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v2p1/weights/w3.h#L12), so I decided to test whose weights are actually being used when I run 2.1.1 I forcibly set the first three weights of the w3 layer to 0, and reran the output of the 2.1.1 model with the interference present:
Note the first 3 weights of 2.1.1 have now been set to 0, which was only edited into the 2.1.0 model. I think then what is happening here, in this scenario is this: 2.1.0 gets loaded first, it's types defined, and weights populated. Later, when 2.1.1 gets loaded, the types are redefined, and the weights are not repopulated, but repurposed, and I think reinterpreted/cast to the type the new model uses. In our case, what was 10 bits worth of decimal has an extra bit of decimal shoved in front, and integer bits appended to the left of that. In short, other models weights are used while being cast into new types. The fact that this hasn't crashed something yet is likely just a coincidence. |
I have prepared updates to the CICADA external (cms-hls4ml/CICADA#3), and made PRs to the externals (cms-sw/cmsdist#9087, cms-sw/cmsdist#9088), to add namespaces to weights and types to prevent interference from or with CICADA to other hls4ml triggers. Some discussion with hls4ml developers is ongoing about this solution. |
hi @aloeliger, from a very quick look at the changes, I noticed that the weights are now in different namespaces, but the Is it safe to keep that in a global namespace, with the possibility of conflicts ? |
@fwyzard Yeah, that should be namespace-d away too for the models where it is generically defined. Probably should be namespace-d away even where it isn't just for the sake of it. Thanks, good catch. |
Okay. The recent updates to the externals PR (cms-hls4ml/CICADA#4 and https://github.com/cms-hls4ml/CICADA/releases/tag/v1.3.1) namespace away the main function for CICADA as well, which should prevent this issue from happening in the future with CICADA. |
FYI there is also a similar PR adding protections to the AXOL1TL models:
(copying from #44510 (comment) ) p.s. The CICADA update alone fixes both Phase-1 and Phase-2 trigger results: cms-sw/cmsdist#9087 (comment), while the AXO update only affected the Phase-1 workflows as expected: cms-sw/cmsdist#9091 (comment) |
The phase 2 seen from CICADA is likely because the phase 2 modifications do not explicitly replace/remove the CICADA the calo summary card emulator. This is solvable, but we should understand how entangled Phase 1 and Phase 2 are. |
As of the merging of cms-sw/cmsdist#9087 and cms-sw/cmsdist#9088 (and cms-sw/cmsdist#9091 and cms-sw/cmsdist#9092), I believe the immediate crisis of this issue has been handled. The L1T community has discussed somewhat yesterday however the need for vigilance on the problems that caused the issue going forward. |
+l1 |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
See #44397 (comment) for details.
The text was updated successfully, but these errors were encountered: