This repository has been archived by the owner on Dec 9, 2024. It is now read-only.

Make 0.6 GeV Default pT Cut #397

Closed
wants to merge 14 commits into from

Conversation

GNiendorf
Member

PR for discussion of moving the current pT cut from 0.8 to 0.6 GeV.

@GNiendorf
Member Author

/run standalone
/run CMSSW

@slava77
Contributor

slava77 commented May 3, 2024

I did not remember during the meeting that there was no toggle to go to 0.6 from 0.8.

Perhaps we can first test what @VourMa proposed, to check if 0.6 GeV bin files can be used safely (no physics change and no significant slowdown) with the old default 0.8 GeV.


github-actions bot commented May 3, 2024

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     43.1    324.7    124.0     68.7     93.5    546.3    128.4    156.3    104.3      2.2    1591.5    1002.1+/- 266.5     433.5   explicit_cache[s=4] (master)
   avg     50.6    329.8    373.5    209.2    412.9   1215.4    326.8    747.8    244.0      2.3    3912.3    2646.3+/- 836.5    1020.0   explicit_cache[s=4] (this PR)

@GNiendorf
Member Author

Perhaps we can first test what @VourMa proposed, to check if 0.6 GeV bin files can be used safely (no physics change and no significant slowdown) with the old default 0.8 GeV.

Sure, sounds like a good check. I'll make another commit after the CI finishes running and rerun the CI with only the files changed.

@slava77
Contributor

slava77 commented May 3, 2024

Fake rate vs pT comparison

This confirms that the fakes are not localized to pT < 0.8 GeV. The effect is apparently from the if (pt > 5) passThrough; logic.
I don't think this should block/prevent the low-pT variant from becoming the default.
But this reiterates the need to review the pass-through selection logic.
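(For context, a minimal sketch of the pass-through idea being discussed; the names are illustrative and this is not the actual LST/TrackLooper selection code. Candidates above 5 GeV bypass the quality requirement entirely, so high-pT fakes survive regardless of the low-pT building cut.)

// Illustrative sketch only, not the actual implementation.
// Candidates with pt > 5 GeV skip the quality cut ("pass-through"),
// which is why the fake-rate increase is not confined to pt < 0.8 GeV.
bool acceptTrackCandidate(float pt, float qualityScore, float qualityCut) {
  if (pt > 5.f)
    return true;                     // pass-through above 5 GeV: no quality requirement
  return qualityScore > qualityCut;  // normal selection below 5 GeV
}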

@GNiendorf
Member Author

Yes, that's what I saw before. pT3s contribute most to the fake-rate increase at high pT.

[screenshot: fake-rate breakdown plot]

@slava77
Contributor

slava77 commented May 3, 2024

Yes, that's what I saw before. pT3s contribute most to the fake-rate increase at high pT.

Indeed. The main update since then is that pT is more consistently defined and the step at 5 GeV is now explicit.

@GNiendorf
Member Author

GNiendorf commented May 3, 2024

Breakdown plot from above; I see what you're saying now, @slava77. @ariostas, any chance you could change the CI to also include the breakdown plots for the master branch in the tar file? Right now it seems to contain just the comparison plots and the breakdown plots for the PR.

[screenshot: breakdown plot]


github-actions bot commented May 3, 2024

There was a problem while building and running with CMSSW. The logs can be found here.

@GNiendorf
Member Author

/run standalone
/run CMSSW


github-actions bot commented May 3, 2024

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     43.3    322.7    121.8     68.9     93.4    544.2    128.0    158.7    103.6      2.5    1587.1     999.6+/- 266.4     431.1   explicit_cache[s=4] (master)
   avg     43.6    321.5    154.1     69.8     92.6    541.6    122.5    179.0    101.9      1.8    1628.2    1043.1+/- 282.4     441.1   explicit_cache[s=4] (this PR)

@ariostas
Member

ariostas commented May 3, 2024

There was a problem while building and running with CMSSW. The logs can be found here.

I just updated the CI to use the new version of CMSSW. @GNiendorf I'll restart the cmssw check.

Any chance you could change the CI to also include the breakdown plots for the master branch in the tar file?

I did it this way to keep the size of the archive as small as possible. You could look at the commit history and check the breakdown plots of the previous PR. But if there are strong reasons to keep both the PR and master plots for each PR, I'm happy to implement that.

@GNiendorf
Member Author

I did it this way to keep the size of the archive as small as possible. You could look at the commit history and check the breakdown plots of the previous PR. But if there are strong reasons to keep both the PR and master plots for each PR, I'm happy to implement that.

Oh yeah fair point.


github-actions bot commented May 3, 2024

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@GNiendorf
Member Author

@slava77 Timing with the 0.6 GeV maps:
[screenshot: timing breakdown, 0.6 GeV maps]

Current master:
[screenshot: timing breakdown, master]

@GNiendorf
Member Author

/run CMSSW


github-actions bot commented May 3, 2024

There was a problem while building and running with CMSSW. The logs can be found here.

@slava77
Contributor

slava77 commented May 3, 2024

There was a problem while building and running with CMSSW. The logs can be found here.

From

Module: LSTOutputConverter:highPtTripletStepTrackCandidates (crashed)

This may need some manual debugging.

It looks like the 0.6 GeV maps are mostly OK; the LS kernel in the CPU variant is apparently slower (the effect is less visible/significant in the GPU case).

@slava77
Contributor

slava77 commented May 7, 2024

There was a problem while building and running with CMSSW. The logs can be found here.

From

Module: LSTOutputConverter:highPtTripletStepTrackCandidates (crashed)

This may need some manual debugging.

It looks like the 0.6 GeV maps are mostly OK; the LS kernel in the CPU variant is apparently slower (the effect is less visible/significant in the GPU case).

@GNiendorf perhaps you can reproduce this locally.
@ariostas since you know the CI setup, you may get there faster.
After it crashes locally, just run under GDB.
USER_CXXFLAGS="-g" scram b ... can be used to add the debug symbols.

@GNiendorf
Member Author

GNiendorf commented May 7, 2024

Given that this is probably an issue within the CMSSW setup, maybe @ariostas or @VourMa could take a look?

@ariostas
Member

ariostas commented May 7, 2024

@GNiendorf yeah, I'll take a look

@ariostas
Member

ariostas commented May 8, 2024

I opened a PR to fix the issue in SegmentLinking/cmssw#24.

Also, looking at the logs, you can see that for one of the events you get this warning:

*********************************************************
* Warning: Pixel line segments will be truncated.       *
* You need to increase N_MAX_PIXEL_SEGMENTS_PER_MODULE. *
*********************************************************

So it seems like it's pretty close to the edge and it might be worth increasing that a bit.
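(For readers unfamiliar with the warning, here is a rough sketch of the truncation pattern it refers to; the types and the limit value below are hypothetical, only the constant's name comes from the warning. Segments are written into a fixed-size per-module buffer, and anything past the limit is silently dropped.)

#include <atomic>
#include <vector>

// Hypothetical illustration of the truncation behind the CI warning;
// the value below is made up, only the concept matches.
constexpr unsigned kMaxPixelSegmentsPerModule = 50000;  // stand-in for N_MAX_PIXEL_SEGMENTS_PER_MODULE

struct PixelSegment {
  int innerHitIdx;
  int outerHitIdx;
};

// Returns false when the segment is dropped, i.e. the situation the warning reports.
bool addPixelSegment(std::atomic<unsigned>& nSegments,
                     std::vector<PixelSegment>& buffer,  // pre-sized to kMaxPixelSegmentsPerModule
                     const PixelSegment& seg) {
  const unsigned idx = nSegments.fetch_add(1);
  if (idx >= kMaxPixelSegmentsPerModule)
    return false;  // truncated; raising the limit avoids this
  buffer[idx] = seg;
  return true;
}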

I'll run the CI with the above PR to make sure that it works.

/run cmssw 24


github-actions bot commented May 8, 2024

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@slava77
Contributor

slava77 commented May 8, 2024

The PR was built and ran successfully with CMSSW. Here are some plots.

@ariostas please remind me why the reference is not shown in the TrackLooper+cmssw PR test. Can we show master as-is as a reference?

@ariostas
Member

ariostas commented May 8, 2024

Currently it doesn't do the comparison when using a PR or a different branch, but yeah, I should change it so that it does the comparison as long as the CMSSW version is the same.

GNiendorf added 2 commits May 9, 2024 12:20
@GNiendorf
Member Author

/run standalone
/run CMSSW

SDL/Quintuplet.h (outdated)
  else if (category_number == 0 && eta_number == 1)
-   occupancy = 414;
+   occupancy = 86;
Contributor

Are these numbers lower due to a more restrictive target?
We had 99.99% for T5, 99.9% for T3, 99% for segments, and 99.99% for minidoublets.


The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     45.3    326.0    124.2     50.1     97.2    504.6    133.8    159.1    100.7      2.0    1543.0     993.1+/- 259.4     424.7   explicit_cache[s=4] (master)
   avg     55.5    333.0    381.6    178.7    556.3   1134.7    362.0    793.0    236.9      9.0    4040.7    2850.5+/- 972.8    1077.1   explicit_cache[s=4] (this PR)

@GNiendorf
Member Author

It seems like the T5 occupancies were set incredibly high. @YonsiG do you know what percentile they were set to? Even at 99.99%, I find that the low-pT occupancies are lower than the current occupancies.

@slava77
Contributor

slava77 commented May 28, 2024

@GNiendorf
Is your stats analysis based on non-zero occupancies, or does it also include empty modules?

@GNiendorf
Member Author

GNiendorf commented May 28, 2024

@GNiendorf Is your stats analysis based on non-zero occupancies, or does it also include empty modules?

I changed it to only consider non-zero occupancies.
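(To make the derivation concrete, here is a minimal sketch of how a target like 99.99% can be turned into an occupancy cutoff over the non-zero per-module occupancies; this is assumed logic for illustration, not the compute_occupancies.ipynb notebook itself.)

#include <algorithm>
#include <cmath>
#include <vector>

// Return the occupancy value at the given quantile (e.g. 0.9999 for T5)
// of the non-zero per-module occupancies observed in the input sample.
int occupancyCutoff(std::vector<int> occupancies, double quantile) {
  occupancies.erase(std::remove(occupancies.begin(), occupancies.end(), 0),
                    occupancies.end());  // drop empty modules
  if (occupancies.empty())
    return 0;
  std::sort(occupancies.begin(), occupancies.end());
  std::size_t idx = static_cast<std::size_t>(std::ceil(quantile * occupancies.size()));
  if (idx > 0) --idx;  // convert a count to a 0-based index
  return occupancies[std::min(idx, occupancies.size() - 1)];
}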

SegmentLinking deleted a comment from github-actions bot May 28, 2024
@GNiendorf
Member Author

/run standalone
/run CMSSW


The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     45.3    322.8    123.9     48.6     96.4    502.0    134.7    157.0     99.6      1.6    1532.0     984.7+/- 262.4     421.4   explicit_cache[s=4] (master)
   avg     57.8    333.7    398.5    185.9    576.2   1232.9    365.1    794.2    251.1     16.2    4211.6    2920.9+/- 985.2    1133.8   explicit_cache[s=4] (this PR)


The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@GNiendorf
Member Author

Plots for 1000 events: https://www.classe.cornell.edu/~gsn27/www/www/PRlowpT/

It seems like the efficiency difference is pretty minimal now at high pT, @slava77.

  else if (category_number == 1 && eta_number == 1)
-   occupancy = 128;
+   occupancy = 653;
Contributor

Some updates are going up quite a bit, like here. On the other hand, some others, like in T5 (https://github.com/SegmentLinking/TrackLooper/pull/397/files#diff-d9d3d7c519ab64f27b6e501281a0eb282be964d2802ddca40387348a68bb6cb0R3182), are going down.

Cat1 is layers 4, 5, 6. It looks like something like layer 6 or even 5 already opens up a lot.
On one hand, an additional category may be useful.
On the other, it would be good to see some plots of the segment angles (is it the phi change?); there was a discussion about truncating this so that it doesn't hit pi/2 or even larger.

Contributor

Cat1 is layers 4, 5, 6. It looks like something like layer 6 or even 5 already opens up a lot.

Given this is a segment, the possible layers are 4 and 5.

@slava77
Contributor

slava77 commented May 29, 2024

@GNiendorf
do you have plots showing how the new occupancy numbers were derived?

@GNiendorf
Member Author

@GNiendorf do you have plots showing how the new occupancy numbers were derived?

https://github.com/SegmentLinking/TrackLooper/blob/changept/scripts/compute_occupancies.ipynb

@slava77
Contributor

slava77 commented May 29, 2024

@GNiendorf
Member Author

https://github.com/SegmentLinking/TrackLooper/blob/changept/scripts/compute_occupancies.ipynb

file_path = "occ_1000_p06.root"

is this 1K events?

Yes, from PU200RelVal

@GNiendorf
Member Author

If I increase the occupancies by 10x from where they are in this PR, I get the following:

So it seems like the efficiency at high pT is recovered fully if you put the occupancies high enough.

@slava77
Contributor

slava77 commented May 29, 2024

If I increase the occupancies by 10x from where they are in this PR, I get the following:
So it seems like the efficiency at high pT is recovered fully if you put the occupancies high enough.

If the increase goes far above what we have in master, it could be that what we observe is in part just a recovery of the inefficiency originally introduced by the truncation.

How does master compare with just the increase of the occupancy cutoffs (without the change in the pT cut)?

@GNiendorf
Member Author

If I just increase the occupancies on master by 10x, I get the following:

[plot: TC efficiency vs pT (zoomed)]

@GNiendorf
Member Author

GNiendorf commented May 31, 2024

This PR timing:
[screenshot: timing breakdown, this PR]

Master timing:
[screenshot: timing breakdown, master]

Max Memory Usage, Master (L40, 175 events, 8 streams, caching allocator on)
5621MiB / 46068MiB

Max Memory Usage, This PR (same setup)
9419MiB / 46068MiB

@slava77
Contributor

slava77 commented May 31, 2024

Does 46068 MiB correspond to the total memory of the card minus some reserved amount (the specs appear to say the L40 has 48 GB)?

How well is the caching allocator shared between modules of the same process?

What happens in the case of multiple processes? I think we can expect 8- to 16-thread jobs populating a 256-core node with at best 2 cards, so from 8 to as many as 32 (or more?) processes will talk to the same card.

@dan131riley
Perhaps you know this already, in case memory is more restricted than what we usually see in our single-job-per-node tests. What's a good way to test?

@dan131riley

48 GB for the L40 is correct, but the CUDA device driver allocates some of that. We actually see the same thing in the host system specs: the maximum available memory is always less than the actual physical memory.

The caching allocator works a lot like malloc(). Within a process, memory available to the caching allocator is shared between modules as long as modules free caching-allocator memory in a timely fashion. Memory allocated by one process is held onto by the caching allocator, so it isn't normally available to other processes. I don't remember offhand if there is a call to force releasing caching-allocator memory; if there is, I wouldn't expect it to be very effective because the caching allocator is likely to fragment its address space.

Last I heard I thought the preferred operating mode with GPUs (at least for HLT) was one process per socket with a GPU bound to each socket. I don't know how that's expected to scale up to 96 cores per socket, but I can't imagine it going well. Nvidia does have the CUDA Multi-Process Service (MPS) for sharing GPU resources between processes; I've used it in the past for benchmarking, but not recently. I don't know if it figures in the HLT plans, and Nvidia seems to keep tweaking how it works with each new architecture generation.
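(A rough host-side sketch of the caching-allocator behavior described above, not the actual CMSSW/CUDA implementation: freed blocks are kept in per-size free lists inside the process and reused by later allocations, which is why the memory stays reserved from the point of view of other processes.)

#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Simplified illustration of a caching allocator; std::malloc stands in for
// the device allocation (cudaMalloc) to keep the example self-contained.
class CachingAllocator {
public:
  void* allocate(std::size_t bytes) {
    auto& pool = freeBlocks_[bytes];
    if (!pool.empty()) {  // reuse a previously freed block of the same size
      void* p = pool.back();
      pool.pop_back();
      return p;
    }
    return std::malloc(bytes);  // otherwise take new memory from the system
  }

  void deallocate(void* p, std::size_t bytes) {
    // Cached for reuse within this process; not returned to the system,
    // so other processes cannot use it until the allocator is destroyed.
    freeBlocks_[bytes].push_back(p);
  }

  ~CachingAllocator() {
    for (auto& entry : freeBlocks_)
      for (void* p : entry.second)
        std::free(p);
  }

private:
  std::map<std::size_t, std::vector<void*>> freeBlocks_;
};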

@slava77
Contributor

slava77 commented May 31, 2024

Last I heard I thought the preferred operating mode with GPUs (at least for HLT) was one process per socket with a GPU bound to each socket.

Ah, indeed, it's not 8 or 16 threads.
I got 32 threads and 24 streams for the current HLT from Andrea in https://mattermost.web.cern.ch/cms-tsg/pl/pfc9i5nt8pf788btfg7y8ybowy in January.

@GNiendorf
Member Author

Closing, see new PR: SegmentLinking/cmssw#39

GNiendorf closed this Jun 10, 2024