Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SONIC updates for site support #45182

Merged
merged 16 commits into from
Aug 6, 2024

Conversation

kpedro88
Copy link
Contributor

PR description:

This PR includes several updates to support production-style tests at several sites, including T2_US_Purdue, other T2s, and NERSC. There are also some important fixes for bugs that were found during the course of testing.

New features/changes:

  • A failed response to a repository index query is now interpreted to indicate that the server should be skipped. (The new load balancer developed at Purdue will intercept the repository index query and return an error if it detects that a GPU is saturated.)
  • A site can provide a local GPU server using environment variables $SONIC_LOCAL_BALANCER_HOST and $SONIC_LOCAL_BALANCER_PORT, set via cmsset_local.sh in SITECONF. (This is a temporary solution that is eventually expected to migrate to something more standardized, similar to storage.json. It is introduced here to facilitate tests at other T2 sites to gain operational experience before determining the final form of this configuration.)
  • Separate message category called TritonDiscovery to enable tracking/debugging of production-style jobs without requiring full verbosity. (This can be seen as an interim step toward full provenance tracking.)
    • Along these lines, improve handling of verbose output options in test script.
  • Support is added for podman and podman-hpc to launch Triton containers (useful at NERSC).
  • Failures of optional server requests are no longer treated as exceptions, to avoid unnecessary job failures.

Bug fixes:

  • Modify some apptainer workarounds (remove an old one, add a new one...)
  • Check for the presence of a GPU in the cmsTriton server launching script, rather than in Python. (Production-style jobs can dump the Python config on a local node and then reuse it on a worker node, so any system calls happen on the local node and may not give the right answer for the worker node as to the presence of a GPU.)
  • Fix a bug in the implementation of thread controls for models on the CPU fallback server.

There are some interface changes (related to supporting more container engines and checking for local GPUs), which are documented.

PR validation:

Unit tests succeed. Site-specific options have been tested successfully at the appropriate sites.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Not a backport and not intended to be backported.

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 10, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45182/40540

  • This PR adds an extra 60KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @kpedro88 for master.

It involves the following packages:

  • HeterogeneousCore/SonicTriton (heterogeneous)

@fwyzard, @makortel, @cmsbuild can you please review it and eventually sign? Thanks.
@riga, @missirol, @rovere, @makortel this is something you requested to watch as well.
@rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@kpedro88
Copy link
Contributor Author

test parameters:
workflows = 10805.31,11634.9001,24834.9001
relvals_opt = --what cleanedupgrade,standard,highstats,pileup,generator,extendedgen,production,identity,ged,machine,premix,nano,gpu,2017,2026

@kpedro88
Copy link
Contributor Author

please test

@cmsbuild
Copy link
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bac96d/39816/summary.html
COMMIT: 80c0a5e
CMSSW: CMSSW_14_1_X_2024-06-10-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45182/39816/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test DRNTest had ERRORS

Comparison Summary

Summary:

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 1, 2024

Pull request #45182 was updated. @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen can you please check and sign again.

@kpedro88
Copy link
Contributor Author

kpedro88 commented Aug 1, 2024

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 2, 2024

+1

Size: This PR adds an extra 24KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bac96d/40728/summary.html
COMMIT: e24f362
CMSSW: CMSSW_14_1_X_2024-08-01-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/45182/40728/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

@makortel
Copy link
Contributor

makortel commented Aug 6, 2024

Comparison failures are related to #39803 and #45505

@makortel
Copy link
Contributor

makortel commented Aug 6, 2024

+heterogeneous

@kpedro88
Copy link
Contributor Author

kpedro88 commented Aug 6, 2024

@cms-sw/reconstruction-l2 please check - the only change in your area is propagating some interface changes to a unit test

@makortel
Copy link
Contributor

makortel commented Aug 6, 2024

@kpedro88 Could you backport the slc7 workaround (516de1d) to 14_0_X? ORP would like to get the slc7 unit test failures (#45454) fixed there too.

@mandrenguyen
Copy link
Contributor

+reconstruction
@kpedro88 Are the comments from Matti still relevant? If not can you resolve them?

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 6, 2024

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @mandrenguyen, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@kpedro88
Copy link
Contributor Author

kpedro88 commented Aug 6, 2024

@mandrenguyen the comments were resolved in e24f362, now marked as such.

@mandrenguyen
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 5e1e380 into cms-sw:master Aug 6, 2024
12 checks passed
@kpedro88
Copy link
Contributor Author

kpedro88 commented Aug 6, 2024

@makortel the use of OSG apptainer was introduced in 14_1 (#43814). Any failures in 14_0 are therefore related to other apptainer incompatibilities. I will make a backport PR with the relevant commits.

@makortel
Copy link
Contributor

makortel commented Aug 6, 2024

@makortel the use of OSG apptainer was introduced in 14_1 (#43814). Any failures in 14_0 are therefore related to other apptainer incompatibilities. I will make a backport PR with the relevant commits.

Ah right, the error in 14_0_X is indeed different (

INFO:    Disabling cgroups because systemd is unavailable
ERROR:   container cleanup failed: no instance found with name triton_server_instance_CPU_4cb5181e-0d20-4231-b181-0e53d40c7b4d
FATAL:   container creation failed: while applying cgroups config: while creating cgroup manager: systemd not running on this host, cannot use systemd cgroups manager

FATAL:   while executing starter: failed to start instance: while running /usr/libexec/apptainer/bin/starter: exit status 255

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc12/CMSSW_14_0_MULTIARCHS_X_2024-08-05-2300/unitTestLogs/HeterogeneousCore/SonicTriton#/74-74 )

Thanks!

@kpedro88
Copy link
Contributor Author

kpedro88 commented Aug 6, 2024

@makortel see #45652 and #45653 (the workaround is actually now needed in this release cycle also... it must be a relatively new problem that happened to pop up after I last ran the unit tests locally when testing this PR)

@makortel
Copy link
Contributor

makortel commented Aug 6, 2024

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants