
Trilinos: Enable build statistics #7376

Open · 9 of 14 tasks
jjellio opened this issue May 15, 2020 · 26 comments
Labels

  • ATDM DevOps (Issues that will be worked by the Coordinated ATDM DevOps teams)
  • client: ATDM (Any issue primarily impacting the ATDM project)
  • DO_NOT_AUTOCLOSE (This issue should be exempt from auto-closing by the GitHub Actions bot.)
  • type: enhancement (Issue is an enhancement, not a bug)

Comments

jjellio (Contributor) commented May 15, 2020

Enhancement

This issue is for tracking any effects from a PR I am submitting with tools for collecting very detailed build statistics that seem to have near-zero cost. A goal of this work is to enable package/product owners to understand how their packages impact compile time, memory usage, and file size (among other metrics).

The impact of using the tool seems to be zero; that is, the tool's overhead sits entirely inside the noise of the build. I did two builds on Rzansel, both using CUDA + Serial, which are the standard ATDM Trilinos settings (I actually built EMPIRE using the tool as well):

Without build_stats: 53:34.98 (3214.98s)
With build_stats: 53:22.07 (3202.07s) -  0.5%

Updated data (for the more complicated Python wrappers + NM usage); times are in seconds and the percentages are the overhead of enabling build_stats:

NM ON

                  configure   build (elapsed)   build (user time)
build_stats ON    45.08       1216.21 (2.5%)    58257.09 (2%)
build_stats OFF   43.51       1186              57089.23

NM OFF

                  configure   build (elapsed)   build (user time)
build_stats ON    45.82       1203.11 (1.2%)    57687.75 (1.1%)
build_stats OFF   44.55       1189.23           57062.09

Clearly using Python has a price... but the overhead is still pretty tiny.
A pass through the code for efficiency is planned (maybe moving to in-memory files for the temporaries).

Path forward

The scripts work by wrapping '$MPICC' inside the ATDM Trilinos environment. CMake then uses these 'wrapped' compilers. The wrapped compilers emit copious data in the build tree alongside the object file/library/executable being created. After the build is complete, these 'timing' files are aggregated into one large CSV file. On Rzansel, the CSV is about 1.8 MB, and it has one line per thing built.

To prevent the wrappers from tampering with CMake's configure phase, I've added a single line to CMakeLists.txt which sets an environment variable, CMAKE_IS_IN_CONFIGURE_MODE; this allows the wrappers to toggle themselves on/off based on whether a real build is happening versus the configure phase.
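For illustration, here is a minimal sketch of what such a wrapper could look like (the WRAPPED_CC variable and the .time temporary file are made up here, and only a few of the CSV columns listed later in this issue are shown; this is not the actual magic_wrapper.py implementation):

#!/usr/bin/env python3
# Minimal sketch of a build-stats compiler wrapper (illustrative only).
# During CMake's configure phase it acts as a transparent pass-through;
# otherwise it runs the real compiler under /usr/bin/time and writes a small
# CSV (*.timing) next to the output file.
import os, subprocess, sys

REAL_COMPILER = os.environ.get("WRAPPED_CC", "mpicc")  # assumed env var

def main(argv):
    if os.environ.get("CMAKE_IS_IN_CONFIGURE_MODE"):
        return subprocess.call([REAL_COMPILER] + argv)

    # Locate the output file so the .timing file sits alongside it.
    output = argv[argv.index("-o") + 1] if "-o" in argv else "a.out"

    # GNU time format: %M = max resident set size (KB), %e = elapsed seconds.
    time_file = output + ".time"
    rc = subprocess.call(["/usr/bin/time", "-f", "%M,%e", "-o", time_file,
                          REAL_COMPILER] + argv)

    if rc == 0 and os.path.exists(output):
        max_rss_kb, elapsed_sec = open(time_file).read().strip().split(",")
        with open(output + ".timing", "w") as f:
            f.write("FileName,FileSize,max_resident_size_Kb,elapsed_real_time_sec\n")
            f.write(f"{output},{os.path.getsize(output)},{max_rss_kb},{elapsed_sec}\n")
    return rc

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))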

One idea for making this work is to have CTest post the resulting build statistics file directly to CDash along with any testing data. For customers not posting to CDash, I can help provide a script that will aggregate the data manually.

Once the CSV data is posted, others can develop tools for tracking it over time. I also have some tools that operate directly on the files (using JavaScript).

The tool tracks:

FileSize

# data from /usr/bin/time regarding memory use and build time
# this is the memory highwater mark
max_resident_size_Kb
# this is the actual time the file took to compile
elapsed_real_time_sec

# more data from /usr/bin/time
avg_total_memory_used_Kb
num_major_page_faults
num_filesystem_inputs
exit_status
perc_cpu_used
avg_size_unshared_data_area_Kb
num_waits
avg_size_unshared_text_area_Kb
cpu_sec_user_mode
num_swapped
num_signals
num_involuntary_context_switch
num_minor_page_faults
num_socket_msg_sent
cpu_sec_kernel_mode
num_socket_msg_recv
num_filesystem_outputs

# data from nm -aS about symbols
symbol_stack_unwind
symbol_ro_data_local
symbol_unique_global
symbol_ro_data_global
symbol_text_global
symbol_text_local
symbol_debug
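As an illustration of how the symbol_* columns could come from nm, here is a rough sketch that buckets nm -aS type letters into those column names (the letter-to-column mapping below is my reading of the nm man page, not necessarily what the wrapper actually does):

import subprocess
from collections import Counter

# Assumed nm type-letter -> column-name mapping (T/t = global/local text,
# R/r = global/local read-only data, u = unique global, p = stack unwind,
# N = debugging symbol).
BUCKETS = {
    "T": "symbol_text_global",
    "t": "symbol_text_local",
    "R": "symbol_ro_data_global",
    "r": "symbol_ro_data_local",
    "u": "symbol_unique_global",
    "p": "symbol_stack_unwind",
    "N": "symbol_debug",
}

def count_symbols(obj_file):
    counts = Counter()
    out = subprocess.run(["nm", "-aS", obj_file],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        sym_type = fields[-2]  # lines look like "[address [size]] type name"
        if sym_type in BUCKETS:
            counts[BUCKETS[sym_type]] += 1
    return counts

# Example (hypothetical object file): count_symbols("Tpetra_CrsMatrix.cpp.o")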

This issue is also tracked in CDOFA-119.

@bartlettroscoe @jwillenbring

Links

Related to

  • SEPW-203

Tasks:

jjellio added the 'type: enhancement' label May 15, 2020
jjellio self-assigned this May 15, 2020
bartlettroscoe added the 'ATDM Config', 'ATDM DevOps', and 'client: ATDM' labels May 15, 2020
bartlettroscoe (Member) commented:

This is going to be quite nice.

csiefer2 (Member) commented:

Nice!

jjellio (Contributor, Author) commented May 16, 2020

PR #7377 has merged, so the tools used to collect the stats are in the repo.

The next step is sorting out how to get CMake to use the statistic gathering compiler wrappers.

If anyone has comments, this is what I've outlined to go over w/Ross.

  1. Wrapper creation.

    • We discussed using an ENV variable to enable/disable the feature
    • If done via CMake, then whatever changes are made need to be propagatable to downstream clients (Sierra, EMPIRE, SPARC, …)
  2. Wrapper preservation

    • The wrappers make sense at build time, but Trilinos is a library consumed by customers - we potentially want to enable this capability for those customers as well
    • If the wrappers get installed, then the installed CMake packages need to have their associated variables reset to match the installed wrappers.
      That is, we told CMake that CXX is ./build_dir/build_stat_wrappers/wrapper_cxx,
      but now we need the installed CMake files to have CXX as $install_dir/bin/build_stat_wrappers/wrapper_cxx
      (because the installed Trilinos should never depend on the build dir)
  3. Iron out how downstream customers can interact with this (or defer this, and focus on showing the capability with Trilinos)

  4. Data aggregation

    • The wrappers leave *.timing files for everything a compiler creates (libraries, object code, executables)
    • After building (make all finishes), we need to aggregate all *.timing files into a single file. (The data are just CSV files; see the sketch after this list)
    • The aggregation can be done outside CMake (what I’ve done), by just adding 2 lines of bash.
      a. A more elegant solution would be to use a rule of some sort that fires after make all by default (or explicitly, make gather_build_stats)
  5. Sort out what to do with aggregated data

    • I think a good idea is to simply install this CSV file along with Trilinos
      a. If installing, this would fit naturally with the ‘rule’ to aggregate the data. That rule provides the stats.csv. (this rule just works w/install)
      b. Make sure this doesn’t break things if stats aren’t enabled!
    • Another excellent idea is to have it posted to CDash. Ideally as a CSV file (not lumped inside Stdout, but as some file that is web-accessible, e.g., http://..../stats.csv).
      Optionally, we can also compress it:
      s992398:html jjellio$ du -hs trilinos.csv*
      1.7M trilinos.csv
      256K trilinos.csv.tar.bz2
      336K trilinos.csv.tar.gz
      256K trilinos.csv.tar.xz
    • CDash posting is a dark art to me; I have no idea what this entails
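As a rough sketch of the aggregation step in item 4 (the real gather step may need to handle mismatched headers and partially written files more carefully), the idea is just:

# Walk the build tree and concatenate every *.timing file into one CSV.
import csv, glob, os

def gather(build_dir=".", out_csv="build_stats.csv"):
    rows, header = [], []
    for path in glob.glob(os.path.join(build_dir, "**", "*.timing"),
                          recursive=True):
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            for row in reader:
                rows.append(row)
            for col in (reader.fieldnames or []):
                if col not in header:
                    header.append(col)
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=header, restval="")
        writer.writeheader()
        writer.writerows(rows)

# Run from the top of the build tree after 'make all':
# gather(".")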

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 9, 2020
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 9, 2020
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 9, 2020
…nos#7376)

I also changed the logic in how the compiler wrappers are generated a little.
bartlettroscoe (Member) commented:

@jjellio, I posted WIP PR #7508 that gets the basic build wrappers in place. See the task list in that PR for next steps. I think with a few hours of work, we will have a very nice Minimal Viable Product that we can deploy and start using in a bunch of builds that post to CDash.

bartlettroscoe (Member) commented:

@jjellio, I talked with Zack Galbreath at Kitware today and he mentioned that there is another option for uploading files to and downloading files from CDash: the ctest_upload() command. That command also allows you to define URLs that will be listed for the build. With that, I think we could provide the data and the hooks that are needed for your tool with the prototype at:

For example, you could define a URL like:

(the ??? part yet to be filled in).

Zack is going to add a strong automated test to CDash to make sure that you can download those files, one at a time.

To support this, I could add a hook to tribits_ctest_driver() to call ctest_upload() with a custom list of files (and URLs).

We will need to do some testing to work this out, but if this works, then any build that is run with the tribits_ctest_driver() function would automatically support uploading the build stats file and the associated URL links to look at it in more detail. So we could easily support this for all of the ATDM Trilinos builds and all other builds that use the tribits_ctest_driver() function. But this would not work for Trilinos PR builds, since those don't use the tribits_ctest_driver() function, so we could not get them to post build stats this way. (But they could implement a call to ctest_upload().)

But even if we use ctest_upload() to upload the build_stats.csv file, I think we still want a runtime test (TrilinosBuildStats_Results) that will summarize the most important stats in text form (like shown in #7508 (comment)) so we can search them with the "Test Output" filter field on the cdash/queryTests.php page, and so we can put strong checks on these max values and fail the test if the numbers get too high. With that latter part, you could even fail PR builds if the numbers get too high.

Anyway, I have some more work to do before PR #7508 is ready to merge so I will get to it.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 10, 2020
…s disable logic (trilinos#7376)

Now the package TrilinosBuildStats will get forced set to OFF if
<Project>_USE_BUILD_STATS_WRAPPERS=ON is not set.
jjellio (Contributor, Author) commented Jun 11, 2020

I actually tried to pass the cdash file to my github.io page:

https://jjellio.github.io/build_stats/index.html?csv_file=https://testing-dev.sandia.gov/cdash/api/v1/testDetails.php?buildtestid=18733695&fileid=1

It fails due to cross-origin security policies that block JavaScript from loading files from a domain other than the one serving the script... I'm not browser-savvy enough to know what to do about it; it would be nice if I could work around that. But I guess if all else fails, the webpage could be hosted inside SNL (maybe that would avoid the security issue).

Even if I work around that security issue, I'll still need to figure out how to decode a tarball (that should be doable; I see JavaScript libraries for it).

bartlettroscoe (Member) commented:

@jjellio, worst-case scenario, developers could just download the 'build_stats.csv' file off of CDash and then upload it to your site when they are doing deeper analysis. Otherwise, we can ask Kitware for help with the web issues.

But developers are not going to bother looking at any data unless they think there is a problem. That is what we can address by filling out the test TrilinosBuildStats_Results to run a tool that summarizes the critical build stats. I suggested that in #7508 (comment). What I propose is to write a Python tool called summarize_build_stats.py that will read in the 'build_stats.csv' file and then produce, to STDOUT, a report like:

Full Project: max max_resident_size = <max_resident_size> (<file-name>)
Full Project: max elapsed_time = <elapsed_time> (<file-name>)
Full Project: max file_size = <file_size> (<file-name>)

Kokkos: max max_resident_size = <max_resident_size> (<file-name>)
Kokkos: max elapsed_time = <elapsed_time> (<file-name>)
Kokkos: max file_size = <file_size> (<file-name>)

Teuchos: max max_resident_size = <max_resident_size> (<file-name>)
Teuchos: max elapsed_time = <elapsed_time> (<file-name>)
Teuchos: max file_size = <file_size> (<file-name>)

...

Panzer: max max_resident_size = <max_resident_size> (<file-name>)
Panzer: max elapsed_time = <elapsed_time> (<file-name>)
Panzer: max file_size = <file_size> (<file-name>)

...

Such a tool needs to know how to map file names to TriBITS packages. There is already code in TriBITS that can do that.
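For example (a naive sketch; the real mapping would come from the TriBITS package metadata rather than from guessing at paths), the file-to-package mapping could look like:

def guess_package(file_name):
    # Guess the Trilinos package from the path component after 'packages/'.
    parts = file_name.replace("\\", "/").split("/")
    if "packages" in parts and parts.index("packages") + 1 < len(parts):
        return parts[parts.index("packages") + 1].capitalize()
    return "Full Project"

# guess_package("packages/panzer/disc-fe/src/Panzer_Workset.cpp.o")  # -> 'Panzer'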

Are you okay with me taking a crack at writing an initial version of summarize_build_stats.py? It would be better to write that as a TriBITS utility because then I could use MockTrilinos to write strong unit tests for it.

What do you think?

bartlettroscoe (Member) commented:

@jjellio

So it turns out that CDash does not currently support downloading files from CDash uploaded using the ctest_upload() command like you see here:

with the files (and URL) viewed at:

However, it does look like CDash supports downloading files uploaded to a test using the ATTACH_FILES ctest property. For example, for the trial build and submit shown at:

if you get the JSON from:

you see (pretty printed):

{
   ...
   test: {
      id: 8313,
      buildid: 5522160,
      build: "Linux-gnu-openmp-shared-dbg-pt",
      buildstarttime: "2020-06-09 15:45:48",
      site: "crf450.srn.sandia.gov",
      siteid: "187",
      test: "TrilinosBuildStats_Results",
      time: " 50ms",
      ...
      measurements: [
         {
            name: "Pass Reason",
            type: "text/string",
            value: "Required regular expression found.Regex=[OVERALL FINAL RESULT: TEST PASSED .TrilinosBuildStats_Results.<br />\n]"
         },
         {
            name: "Processors",
            type: "numeric/double",
            value: "1"
         },
         {
            name: "build_stats.csv",
            type: "file",
            fileid: 1,
            value: ""
         }
      ]
   },
   generationtime: 0.04
}

So it looks like you can get that data in Python (converted to a recursive list/dict data structure) and you can loop over the dicts in data['test']['measurements'] and find the file as:

         {
            name: "build_stats.csv",
            type: "file",
            fileid: 1,
            value: ""
         }

That dict is data['test']['measurements'][2] in this case.
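A small sketch of that lookup in Python (the URL below is a placeholder; the real testDetails.php endpoint and buildtestid depend on the CDash instance, as shown above):

import json, urllib.request

def find_attached_fileid(test_details_url):
    # Fetch the testDetails JSON and return the 'fileid' of the attached file.
    with urllib.request.urlopen(test_details_url) as resp:
        data = json.load(resp)
    for measurement in data["test"]["measurements"]:
        if measurement.get("type") == "file":
            return measurement["fileid"]
    return None

# fileid = find_attached_fileid(
#     "https://some-cdash-site/api/v1/testDetails.php?buildtestid=<id>")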

Given that 'fileid' field value of '1', you can then download the data using the URL:

You can find your way to this test, for example, by knowing the CDash Group, Site, Build Name, and Build Start Time and plugging those into this query:

The JSON for that is shown at:

which has the element:

{
   ...
   builds: [
      {
         testname: "TrilinosBuildStats_Results",
         site: "crf450.srn.sandia.gov",
         buildName: "Linux-gnu-openmp-shared-dbg-pt",
         buildstarttime: "2020-06-09T15:47:22 MDT",
         time: 0.05,
         prettyTime: " 50ms",
         details: "Completed\n",
         siteLink: "viewSite.php?siteid=187",
         buildSummaryLink: "build/5522163",
         testDetailsLink: "test/18733864",
         status: "Passed",
         statusclass: "normal",
         nprocs: 1,
         procTime: 0.05,
         prettyProcTime: " 50ms"
      }
   ],
   ...
}

which has testDetailsLink: "test/18733864".

So there you have it. If you know the following fields:

  • Group
  • Site
  • Build Name
  • Build Start Time

you can find the test TrilinosBuildStats_Results for that build and download its attached 'build_stats.csv' file (as a tarred and gzipped file 'build_stats.csv.tgz').

So we can do what we need by attaching the file to a test. But it is a bit of a run-around to find what we need.

It would be more straightforward to upload with ctest_upload() and then directly download the file from CDash. But, again, CDash does not currently support that and the Trilinos PR ctest -S driver does not support that.

So for now, I would suggest that we just go with uploading the 'build_stats.csv' file to the test TrilinosBuildStats_Results and then downloading it from there for any automated tools.

bartlettroscoe (Member) commented:

FYI: Further discussion about CDash upload and download options should occur in newly created issue:

so we can get some direct help/advice from Kitware.

jjellio (Contributor, Author) commented Jun 11, 2020

Ross, I think the summarize step would be better (for maintenance/extensibility) implemented as a script that CMake calls (optionally promising to generate a file if needed).

If there were a dummy script, commonTools/build_stats/summarize_build_stats.py (or wherever), then you could build the CMake stuff now, and later changes would just need to fiddle with that file.

I can conceive how to implement summarize_build_stats.py as just plain bash. Since the file is CSV, you'd head -n1 to get the header and store that as an array variable. Then search the header for the indexes of the metrics you want. Next, you'd grep the file for FileName = packages/foo/. From that subset of the file, cut -fN, where N is the number from the header array. Pipe that through awk or bc to sum it up. Additionally, you could sort the subset matching the package and select the top file for each metric. This could be a fairly simple /bin/bash script. Python + CSV makes sense if you want complex analysis, but for simple package summaries it might be easier via bash.

I do think there would be value in showing package-level aggregates:

Panzer: max max_resident_size = <max_resident_size> (<file-name>)
Panzer: max elapsed_time = <elapsed_time> (<file-name>)
Panzer: max file_size = <file_size> (<file-name>)

Panzer: Total Time:
Panzer: Total Memory:
Panzer: Total Size:

All Files: Total Time (this is effectively total build time)
All Files: Total Memory (to be consistent)
All Files: Total Size (roughly how much storage this build required)

Size in particular is helpful, as it indicates how much storage the filesystem servers need.

All of the above can be implemented via bash I think, just a few loops + cut/grep/awk (standard tools that will always be present on these machines).

bartlettroscoe (Member) commented Jun 11, 2020

Ross, I think the summarize step would be better (for maintenance/extensibility) implemented as a script that CMake calls (optionally promising to generate a file if needed).

@jjellio, yes, that is exactly what I was suggesting.

I can conceive how to implement summarize_build_stats.py as just plain bash

Such a tool would be very hard to write, test, and maintain in bash. Do you have something against Python?

Just to get this started, I will add simple Python script:

Trilinos/commonTools/build_stats/summarize_build_stats.py

that will just provide project-level stats:

Full Project: sum(max_resident_size_mb) = <sum_max_resident_size_mb> (<num-entries> entries)
Full Project: max(max_resident_size_mb) = <max_max_resident_size_mb> (<file-name>)
Full Project: max(elapsed_real_time_sec) = <max_elapsed_time_sec> (<file-name>)
Full Project: sum(elapsed_real_time_sec) = <sum_elapsed_time_sec> (<num-entries> entries)
Full Project: sum(file_size_mb) = <sum_file_size_mb> (<num-entries> entries)
Full Project: max(file_size_mb) = <max_file_size_mb> (<file-name>)

That will avoid needing to deal with the package logic for now. We can always add package-level stats later when we have the time (and that will require using some TriBITS utilities to convert from file paths to package names). That way, we can turn this on for PR testing now and merge PR #7508.
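A minimal sketch of what those project-level stats could look like in code (the column names follow the proposal above; the actual summarize_build_stats.py may differ):

import csv

FIELDS = ["max_resident_size_mb", "elapsed_real_time_sec", "file_size_mb"]

def summarize(csv_path="build_stats.csv"):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    for field in FIELDS:
        vals = [(float(r[field]), r.get("FileName", "?"))
                for r in rows if r.get(field)]
        if not vals:
            continue
        total = sum(v for v, _ in vals)
        max_val, max_file = max(vals)
        print(f"Full Project: sum({field}) = {total:.2f} ({len(vals)} entries)")
        print(f"Full Project: max({field}) = {max_val:.2f} ({max_file})")

# summarize("build_stats.csv")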

Okay?

jjellio (Contributor, Author) commented Jun 11, 2020

I have no issues with Python other than you have to be aware of 2.x vs 3.x stuff.

I love Python's regex library. Python 3.x with f-strings is awesome, e.g., f'A variable in scope: {some_var}'.

Tangents below:

Another issue to consider is how to interact with developers. I'll need to improve the webpage (better explanations, and styling for sure).

Yet another issue: can you use the info here to feed back into Ninja or CMake to improve our build system? This could be an interesting question for Kitware. E.g., if we could provide a list of targets (file.o things) plus a weight, could Kitware use that to orchestrate a good Ninja file? Or perhaps we could do that ourselves (I have already done something similar). Given weights + num_parallel_procs, coerce the existing build.ninja such that a certain memory high-water mark is not exceeded (it's effectively a variant of the knapsack packing problem, I believe). A toy sketch of the idea is below.
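Here is that toy sketch of the scheduling constraint (nothing here touches Ninja; the job names, memory weights, and cap are made up, and a real scheduler would also have to respect build dependencies):

import heapq

def schedule(jobs, mem_cap_kb, nprocs):
    """jobs: list of (name, mem_kb, duration_sec); returns (finish_time, start order)."""
    running, clock, in_use, order = [], 0.0, 0.0, []  # running heap: (end_time, mem)
    pending = sorted(jobs, key=lambda j: -j[1])       # big-memory jobs first
    while pending or running:
        # Start jobs that fit under the memory cap and the process count;
        # if nothing is running, force-start one job so we always make progress.
        i = 0
        while i < len(pending) and len(running) < nprocs:
            name, mem, dur = pending[i]
            if in_use + mem <= mem_cap_kb or not running:
                heapq.heappush(running, (clock + dur, mem))
                in_use += mem
                order.append((clock, name))
                del pending[i]
            else:
                i += 1
        # Advance time to the next completion and release its memory.
        end_time, mem = heapq.heappop(running)
        clock, in_use = end_time, in_use - mem
    return clock, order

# schedule([("big.o", 8e6, 300.0), ("small.o", 5e5, 20.0)], mem_cap_kb=16e6, nprocs=8)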

@rmmilewi (CC Reed, this may be something he'd like to be abreast of)

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 11, 2020
…sable logic (trilinos#7376)

Now, the package TrilinosBuildStats will get enabled by default if
<Project>_ENABLE_BUILD_STATS=ON is set, but the package TrilinosBuildStats
will not get disabled if <Project>_ENABLE_BUILD_STATS=ON is not set.  But the
test TrilinosBuildStats_Results will only get enabled if
<Project>_ENABLE_BUILD_STATS=ON.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 11, 2020
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 19, 2021
Also factored out small file BuildStatsSharedVars.cmake to avoid duplication.

I did this for two reasons:

  1. This code is really quite independent from the code that creates the
     wrappers.

  2. Future projects that use these build stats support code (once this gets
     pulled out of Trilinos and put into its own repo) may want to generate
     the build stats but not bother with a gather-build-stats target.
bartlettroscoe added a commit to jjellio/Trilinos that referenced this issue May 20, 2021
…nos#7376)

This makes the 'gather-build-stats' target completely quiet.  This responds to
feedback from @jjellio that the make command was a bit verbose (and I agree).

But I added the -v option and I used that in the TrilinosBuildStats_Results
test so that you can see the statistics and they will be shown on CDash.
bartlettroscoe added a commit to jjellio/Trilinos that referenced this issue May 20, 2021
…y default in PR builds (trilinos#7376)

Now that gather_build_stats.py is super robust, it should be fine to pick up
old *.timing files with different sets of headers and be in all types of
messed up states.
bartlettroscoe added a commit to jjellio/Trilinos that referenced this issue Jun 8, 2021
…ilds (trilinos#7376)

With the updated magic_wrapper.py, this should be safe to do.  Individual
drivers can still set Trilinos_ENABLE_BUILD_STATS=OFF if they want.  This is
just the default.

I also remove an obsolete Trilinos_CTEST_DO_ALL_AT_ONCE=TRUE since that has
been the default for many years.
bartlettroscoe added a commit to jjellio/Trilinos that referenced this issue Jun 8, 2021
)

In commit 6f2afd5 Matt Bettencourt <[email protected]> tried to update this
to clang-10 but that does not magically update what is actually being tested
on CDash and therefore does not make this a supported build.  Someone needs to
actually add the driver scripts and update the Jenkins jobs (and clean up any
failing Trilinos tests).

Making this change allows one to run 'ctest-s-local-test-driver.sh all'
properly (as I was trying to do with trilinos#7376).
bartlettroscoe added a commit to jjellio/Trilinos that referenced this issue Jun 8, 2021
)

In commit 36b53f6 jmgate tried to update this to clang-10 but that does not
automatically update what is actually being run on Jenkins and displayed on
CDash, and therefore listing builds in this file alone does not make them
supported builds.  Someone needs to actually add the driver scripts for these
builds under cmake/ctest/drivers/atdm/sems-rhel7/drivers/ and update the
Jenkins jobs, and triage any new failing Trilinos tests.

Making this change allows one to run 'ctest-s-local-test-driver.sh all'
properly (as I was trying to do with trilinos#7376).
bartlettroscoe added a commit to jjellio/Trilinos that referenced this issue Jun 9, 2021
…7376)

This should result in the full test results to be uploaded and displayed on
CDash, even for Trilinos PR testing that does not use tribits_ctest_driver().
bartlettroscoe added a commit to jjellio/Trilinos that referenced this issue Jun 11, 2021
This merges in the state of Trilinos 'develop' from 'atdm-nightly' from
testing day 2021-06-10.
bartlettroscoe added a commit to jjellio/Trilinos that referenced this issue Jun 11, 2021
…ilds (trilinos#7376)

With the updated magic_wrapper.py, this should be safe to do.  Individual
drivers can still set Trilinos_ENABLE_BUILD_STATS=OFF if they want.  This is
just the default.

I also remove an obsolete Trilinos_CTEST_DO_ALL_AT_ONCE=TRUE since that has
been the default for many years.
bartlettroscoe added a commit to jjellio/Trilinos that referenced this issue Jun 11, 2021
)

In commit 36b53f6 jmgate tried to update this to clang-10 but that does not
automatically update what is actually being run on Jenkins and displayed on
CDash, and therefore listing builds in this file alone does not make them
supported builds.  Someone needs to actually add the driver scripts for these
builds under cmake/ctest/drivers/atdm/sems-rhel7/drivers/ and update the
Jenkins jobs, and triage any new failing Trilinos tests.

Making this change allows one to run 'ctest-s-local-test-driver.sh all'
properly (as I was trying to do with trilinos#7376).
bartlettroscoe added a commit to jjellio/Trilinos that referenced this issue Jun 11, 2021
…7376)

This should result in the full test results to be uploaded and displayed on
CDash, even for Trilinos PR testing that does not use tribits_ctest_driver().
bartlettroscoe added a commit to jjellio/Trilinos that referenced this issue Jun 11, 2021
This merges in the state of Trilinos 'develop' from 'atdm-nightly' from
testing day 2021-06-10.
bartlettroscoe added a commit to jjellio/Trilinos that referenced this issue Jun 14, 2021
…ilinos#7376)

Commenting out these builds just avoids people running them with:

  ./ctest-s-local-test-driver.sh all

It does not impact what currently runs on jenkins and submits to CDash.  (No
sense beating a dead horse.)
bartlettroscoe added a commit that referenced this issue Jun 14, 2021
Update build stats and turn on in all ATDM Trilinos builds!
bartlettroscoe (Member) commented:

CC: @jjellio

A glorious day. PR #8638 has finally been merged! This turns on the build stats wrappers in all of the ATDM Trilinos builds (when running the ctest -S driver) and in all of the Trilinos PR builds.

We can see 141 submissions of the test TrilinosBuildStats_Results in the ATDM Trilinos builds showing the new gather script gather_build_stats.py in this query.

And, looking at this query, we are starting to see new PRs running this as well.

We need to keep an eye on the PR builds for a few days.

It would be nice to break the build-stats summary reported in the TrilinosBuildStats_Results test into libraries, executables and object files separately before we close this. But that could really be a separate story.

jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Jun 19, 2021
…s:develop' (16c177b).

* trilinos-develop: (71 commits)
  Tpetra: Remove some output from the Bug7758 test
  MueLu Stratimikos adapter: Enable half precision for factory-style PLs
  Tpetra: remove some deprecated usage
  ROL: implement the apply function for Thyra Vector
  Piro: changes to ROL adapters comply with ROL changes
  Piro: bug-fix in Piro::NOX_Solver
  Ifpack2: disabling tests causing build errors with extended scalar types (see issue trilinos#9280).
  Ifpack2: cleaning up unused variables in tests.
  Ctest: Adding Amesos2/Belos tests
  Ctest: Stuff failing on ride that worked on ascicgpu
  Ctest: Enabling non-UVM Ifpack2 tests
  Ifpack2: changing GO to the one in Tpetra_Details_DefaultTypes.hpp.
  Disable support for Makefile.export.* files (trilinos#8498)
  Tpetra: remove unused variable (copied too many times when breaking up a function)
  ats2: Comment out listing of long-broken XL builds (trilinos#9270, trilinos#7376)
  Ifpack2: adding missing logic for new tests.
  Belos: writing tests for 'long double' and 'float128' ScalarType.
  STK: Snapshot 06-11-21 17:50
  Tpetra: remove comments that don't apply to HIP
  Tpetra: Use HIPSpace for HIPWrapperNode
  ...
bartlettroscoe removed the 'AT: WIP' and 'ATDM Config' labels Jun 23, 2021
bartlettroscoe (Member) commented Jul 15, 2021

CC: @prwolfe, @jwillenbring

@jjellio, it occurred to me that adding begin and end time stamp fields to the *.timing file generated by magic_wrapper.py could help to debug out-of-memory problems like the one reported in #9432. If you know the start and end time for when each target is getting built and you know the max RAM usage for each target, you can compute, at any moment in time, the max possible RAM being used on the machine, and you will know which targets are involved. That will tell you where you need to put in effort to reduce the RAM usage of building specific targets and get around a build bottleneck that consumes all the RAM.

Having the build start and end time stamps has other uses as well. For example, when doing a rebuild with old targets lying around, if you only want to report build stats for targets that got built in the rebuild, you could add an argument summarize_build_stats.py --after=<start-time> that filters to targets built only after the start of the last rebuild <start-time> (which you know at configure time and can put into the definition of the test). This would also automatically filter out build stats for targets that no longer exist in the build system, left over from rebuilds months (or years) old. This may also be useful for other purposes that I don't even realize yet, but these are the obvious ones. A quick sketch of the high-watermark computation is below.
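As a quick illustration of that high-watermark computation (the per-target start/end time stamps assumed here do not exist in the *.timing files yet; that is exactly what this comment proposes adding):

def max_concurrent_rss_kb(targets):
    """targets: list of (start_time_sec, end_time_sec, max_resident_size_kb)."""
    events = []
    for start, end, rss in targets:
        events.append((start, +rss))  # target begins: its memory comes into play
        events.append((end, -rss))    # target finishes: its memory is released
    high_water = in_use = 0
    # Sort by time; at equal times, process releases (negative) before starts.
    for _, delta in sorted(events, key=lambda e: (e[0], e[1])):
        in_use += delta
        high_water = max(high_water, in_use)
    return high_water

# max_concurrent_rss_kb([(0, 120, 8_000_000), (30, 200, 6_000_000)])  # -> 14_000_000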

seamill pushed a commit to seamill/Trilinos that referenced this issue Jul 28, 2021
…develop' (7591b32).

* trilinos/develop: (77 commits)
  zoltan2:  fix memory leak when sizeof(SCOTCH_Num) == sizeof(lno_t) trilinos#9312
  Tpetra: Remove some output from the Bug7758 test
  MueLu Stratimikos adapter: Enable half precision for factory-style PLs
  Tpetra: remove some deprecated usage
  Fixed some deprecated code
  MueLu Thyra adapter: Allow construction of half precision operator
  ROL: implement the apply function for Thyra Vector
  Piro: changes to ROL adapters comply with ROL changes
  Piro: bug-fix in Piro::NOX_Solver
  MueLu: Print Scalar in MG Summary for high and extreme verbosity
  Ifpack2: disabling tests causing build errors with extended scalar types (see issue trilinos#9280).
  Ifpack2: cleaning up unused variables in tests.
  Ctest: Adding Amesos2/Belos tests
  Ctest: Stuff failing on ride that worked on ascicgpu
  Ctest: Enabling non-UVM Ifpack2 tests
  Ifpack2: changing GO to the one in Tpetra_Details_DefaultTypes.hpp.
  Disable support for Makefile.export.* files (trilinos#8498)
  Tpetra: remove unused variable (copied too many times when breaking up a function)
  ats2: Comment out listing of long-broken XL builds (trilinos#9270, trilinos#7376)
  Ifpack2: adding missing logic for new tests.
  ...
bartlettroscoe (Member) commented Nov 4, 2021

CC: @jjellio, @jwillenbring

So I have a Trilinos PR #9894 that is stuck in a loop of failed builds due to the compiler crashing after running out of memory. Following on from the discussion above, it occurred to me that if you store the beginning and end time stamps for each target in the *.timing file, then the summarize_build_stats.py tool can sort the build stats by start time and end time and compute the memory high watermark on the machine due to building at any point in time. For example, if 10 object files are currently being built, then you just add up max_resident_size_mb for each of these targets and that gives you the max high-water mark at that time for that build, as the build stat:

Full Project: max_sum_over_active_targets(max_resident_size_mb)

This would show how close a build is to running out of memory on a given machine and we could plot that number as a function of time. In fact, we could have the CTest test that runs summarize_build_stats.py create CTest test measurements for:

Full Project: max_sum_over_active_targets(max_resident_size_mb)
Full Project: sum(max_resident_size_mb)
Full Project: max(max_resident_size_mb)
Full Project: sum(elapsed_real_time_sec)
Full Project: max(elapsed_real_time_sec)
Full Project: sum(file_size_mb)
Full Project: max(file_size_mb)

using XML in the STDOUT like:

<DartMeasurement type="numeric/double" name="Full Project: sum(max_resident_size_mb)">4667989.73</DartMeasurement>

Then you could see a graph of these measurements over time right on CDash!
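A short sketch of how the test could print those measurements, following the <DartMeasurement> example above (the values here are placeholders; the first one reuses the example number above):

def report_measurement(name, value):
    # CTest picks these XML fragments up from the test's STDOUT and CDash plots them.
    print(f'<DartMeasurement type="numeric/double" name="{name}">{value}</DartMeasurement>')

# Placeholder values for illustration only:
report_measurement("Full Project: sum(max_resident_size_mb)", 4667989.73)
report_measurement("Full Project: max(max_resident_size_mb)", 14225.57)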

github-actions bot commented Nov 5, 2022

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

github-actions bot added the 'MARKED_FOR_CLOSURE' label Nov 5, 2022
bartlettroscoe added the 'DO_NOT_AUTOCLOSE' label and removed the 'MARKED_FOR_CLOSURE' label Nov 5, 2022
bartlettroscoe (Member) commented:

FYI: Kitware is adding build stats support to native CMake, CTest, and CDash. See:

Therefore, I think there will be no need for a separate compiler wrapper tool to gather build stats or scripts to manage that and submit it to CDash.
