0.5.0 Performance, Reworked UI, New formats, Deno
This release of CPUpro brings significant updates: performance enhancements, a redesigned user interface, and expanded format and runtime support. It substantially reduces the time to load and process extremely large profiles, making CPUpro well suited to analyzing complex long-running scripts. The user interface has been thoroughly revamped to offer a more intuitive and responsive experience across features and views. New profile formats and support for the Deno runtime have been added, expanding the tool's versatility and adaptability to modern development environments.
Performance
CPUpro has been entirely re-engineered to optimize the preprocessing of profiles upon loading and for subsequent computations. This redesign enables it to handle massive profiles (exceeding 100MB) significantly faster than other tools. CPUpro is currently the best option for analyzing intense long-running scripts that generate extensive CPU profiles, such as webpack build profiles or prolonged browser sessions (that can last minutes or even tens of minutes).
The table below shows the time to load and first render profiles of varying sizes across different tools:
Profile size | Profile type | CPUpro v0.5 | CPUpro v0.4 | Chromium DevTools | speedscope |
---|---|---|---|---|---|
33MB (215k samples / 120k call tree) | V8 cpuprofile | 0.5s | 0.8s | 4.6s | 6.5s |
113MB (625k samples / 62k call tree) | Chromium Profile | 1.3s | 1.6s | 10.6s | 12.4s |
114MB (739k samples / 446k call tree) | V8 cpuprofile | 1.3s | 2.6s | 12.3s | 18.5s |
239MB (11.6M samples / 489k call tree) | V8 cpuprofile | 2.8s | 11.3s | 48s | Out of memory (after 23s) |
277MB (127k samples / 35k call tree) | Chromium Profile | 1.9s | 2.2s | 4.2s | Out of memory (after 30s) |
418MB (897k samples / 1.86M call tree) | V8 cpuprofile | 4.6s | 8.7s | Out of memory (after 36s) | Out of memory (after 49s) |
2GB (7.3M samples / 7.28M call tree) | V8 cpuprofile | 27.1s | Out of memory (after 57s) | Invalid string length (after 20s) | Out of memory (after 43s) |
Chrome 124 / MacBook Pro 13-inch, M1, 2020
As the table shows, the time is affected not only by the profile size but also by its format, the number of samples, and the size of the call tree (note that some profiles contain millions of samples and nodes). Notably, the Chromium Profile, which includes extensive additional data besides the CPU profile, tends to load faster than .cpuprofile files of the same size. Some tools struggle with large profiles: they hit the heap size limit (4GB) and crash with "Out of memory" errors, which is particularly frustrating when a lengthy load yields no result. CPUpro avoids such pitfalls thanks to the new optimizations and can now load and process even 2GB profiles.
When comparing the loading time between CPUpro versions 0.4 and 0.5, the difference may not look that impressive. The reason is that a significant portion of the time is spent loading and parsing JSON, which remains unchanged. However, if we isolate the processing time and initial rendering, where the main optimization efforts were concentrated, the new version shows performance improvements ranging from about 1.7 to 11.7 times:
Profile size | Profile type | Load data & parse | CPUpro v0.5 (computations + render) | CPUpro v0.4 (computations + render) | Delta |
---|---|---|---|---|---|
33MB (215k samples / 120k call tree) | V8 cpuprofile | 0.3s | 0.16s | 0.52s | 3.1x |
113MB (625k samples / 62k call tree) | Chromium Profile | 1.1s | 0.21s | 0.64s | 3.0x |
114MB (739k samples / 446k call tree) | V8 cpuprofile | 0.9s | 0.37s | 1.48s | 4.0x |
239MB (11.6M samples / 489k call tree) | V8 cpuprofile | 2.2s | 0.79s | 9.21s | 11.7x |
277MB (127k samples / 35k call tree) | Chromium Profile | 1.9s | 0.15s | 0.24s | 1.7x |
418MB (897k samples / 1.86M call tree) | V8 cpuprofile | 3.6s | 1.12s | 4.26s | 3.6x |
2GB (7.3M samples / 7.28M call tree) | V8 cpuprofile | 22.1s | 4.98s | – | – |
Chrome 124 / MacBook Pro 13-inch, M1, 2020
The acceleration was achieved by switching to linear memory (TypedArrays) for representing the tree and storing calculation results, despite the increased number and complexity of computations added since v0.4. Most of the calculation algorithms are implemented as simple loops without recursion or complex branching. Experiments with WebAssembly for some calculations resulted in up to a 2x speed increase in JavaScriptCore (Safari) and SpiderMonkey (Firefox), bringing their execution times in line with V8, where performance did not change. Remarkably, the new algorithms allow V8 to optimize JavaScript execution to match the efficiency of WebAssembly, which was unexpected.
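To illustrate the idea of a tree kept in linear memory and processed with plain loops, here is a minimal sketch. It is not CPUpro's actual code: the `FlatCallTree` layout, the `computeTotalTimes` name, and the assumption that every parent is stored before its children are all hypothetical and chosen only for this example.

```typescript
// Hypothetical flat call tree: node i's parent is parent[i], parents always
// precede their children (parent[i] < i), and node 0 is the root.
interface FlatCallTree {
    parent: Uint32Array;    // parent[i] = index of the parent of node i
    selfTime: Float64Array; // time spent in node i itself
}

// Compute total (self + nested) time for every node with one backward loop.
// When node i is visited, all of its descendants have already been folded
// into totalTime[i], so the value can be added to its parent. No recursion
// and no per-node objects are needed.
function computeTotalTimes(tree: FlatCallTree): Float64Array {
    const { parent, selfTime } = tree;
    const totalTime = Float64Array.from(selfTime);

    for (let i = totalTime.length - 1; i > 0; i--) {
        totalTime[parent[i]] += totalTime[i];
    }

    return totalTime;
}

// Example tree: 0 (root) -> 1 -> 3, and 0 -> 2
const tree: FlatCallTree = {
    parent: new Uint32Array([0, 0, 0, 1]),
    selfTime: new Float64Array([1, 2, 3, 4])
};
const totals = computeTotalTimes(tree);
// totals is [10, 6, 3, 4]
```

With this kind of layout the whole tree lives in a few contiguous buffers, which is the sort of access pattern that JavaScript engines tend to optimize well.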
Adopting TypedArrays has drastically reduced heap memory usage. While modern browsers typically offer up to 4GB of heap space, exceeding this limit can crash the browser tab (and, with it, the app). CPUpro uses the heap primarily for loading and parsing JSON and during the initial stages of data processing; after that, most data is managed in TypedArrays. These buffers, stored in what is termed "external memory", are limited only by the system's available memory, which significantly lowers the risk of "Out of memory" crashes. Even so, there is little reason to worry, since CPUpro consumes memory sparingly:
Profile size | Profile type | CPUpro v0.5 | CPUpro v0.4 | Chromium DevTools | speedscope |
---|---|---|---|---|---|
33MB (215k samples / 120k call tree) | V8 cpuprofile | 8MB (External: 20MB) | 97MB | 752MB | 916MB |
113MB (625k samples / 62k call tree) | Chromium Profile | 7MB (External: 17MB) | 61MB | 1063MB | 466MB |
114MB (739k samples / 446k call tree) | V8 cpuprofile | 8MB (External: 155MB) | 324MB | 1803MB | 2001MB |
239MB (11.6M samples / 489k call tree) | V8 cpuprofile | 12MB (External: 92MB) | 463MB | 3877MB | Out of memory |
277MB (127k samples / 35k call tree) | Chromium Profile | 8MB (External: 9MB) | 34MB | 488MB | Out of memory |
418MB (897k samples / 1.86M call tree) | V8 cpuprofile | 18MB (External: 233MB) | 1387MB | Out of memory | Out of memory |
2GB (7.3M samples / 7.28M call tree) | V8 cpuprofile | 22MB (External: 866MB) | Out of memory | Invalid string length | Out of memory |
Data collected after loading the profile and calling the garbage collector
After loading the profile and performing the initial calculations, CPUpro is ready to quickly recalculate timings and sample data on demand, e.g. when filters change. This made it possible to introduce new complex views that were previously impractical because prolonged calculations (many seconds) froze the UI and broke the user experience. Most views have also been optimized to react almost instantaneously to filter changes, ensuring a seamless experience even with large profiles.
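As a rough illustration of what an on-demand recalculation for a changed filter can look like, the sketch below recomputes self time per category for a selected time range in a single pass over the samples. The function name, its parameters, and the array layout are assumptions made for this example and do not reflect CPUpro's internal API.

```typescript
// Hypothetical sample stream: sample i belongs to category sampleCategory[i],
// starts at sampleStart[i] and lasts sampleDuration[i] time units.
function selfTimeByCategory(
    sampleStart: Float64Array,
    sampleDuration: Float64Array,
    sampleCategory: Uint8Array,
    categoryCount: number,
    rangeStart: number,
    rangeEnd: number
): Float64Array {
    const result = new Float64Array(categoryCount);

    for (let i = 0; i < sampleStart.length; i++) {
        // Clip the sample to the selected range and count only the overlap
        const from = Math.max(sampleStart[i], rangeStart);
        const to = Math.min(sampleStart[i] + sampleDuration[i], rangeEnd);

        if (to > from) {
            result[sampleCategory[i]] += to - from;
        }
    }

    return result;
}

const starts = new Float64Array([0, 10, 20, 30]);
const durations = new Float64Array([10, 10, 10, 10]);
const categories = new Uint8Array([0, 1, 0, 1]);

// Whole profile: both categories get 20 time units
const full = selfTimeByCategory(starts, durations, categories, 2, 0, 40);
// Range 5..25: both categories get 10 time units
const narrowed = selfTimeByCategory(starts, durations, categories, 2, 5, 25);
```

Because the pass is one linear loop over typed arrays, rerunning it on every range or filter change stays cheap even for profiles with millions of samples.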
The optimizations in speed and memory efficiency are not just about improving profile loading and UI responsiveness; they also unlock new capabilities. They are crucial for features such as profile comparison, which requires loading at least two profiles and can double both the computation time and the memory usage. These challenges have been addressed, setting the stage for future enhancements, including profile comparison.
User interface
The user interface has undergone a significant redesign. The start page now appears more compact and provides a clearer overview of how the V8 engine operates. It features a timeline categorized by work type and function clustering tables, followed by a flamechart.
Other pages have also been reworked to be more informative. Each page now includes:
- A timeline that not only displays self time but also nested time, with the distribution of nested time by categories.
- A new section titled "Nested Time Distribution" that offers insights into the distribution of nested time in a hierarchical format, from a package to a function.
- A basic flamechart displaying all frames related to the current subject (category, package, module, or function) as root frames.
The timeline has been enhanced with a tooltip that provides expanded details and with the capability to select a range, which was previously missing when focusing on specific segments of work.
The Flamechart is now faster and smoother. It includes new selection capabilities and a detailed information block for the selected or zoomed frame.
The welcome page has been redesigned as well, and now offers example profiles in various formats to try:
New formats, runtimes, and registries
Support for new formats has been introduced:
- V8 log converted into JSON with the `--preprocess` option (`node --prof-process --preprocess v8.log > v8log.json`). Although the filtered version of a V8 log file loses many details, it remains more informative than `.cpuprofile` files. This release includes basic support for V8 logs, with plans for expanded support in future releases.
- Edge Enhanced Performance Traces (`.devtools`). Currently, using `.devtools` offers little benefit over Chromium Performance Profiles but eliminates the need to convert to other supported formats. Future releases may utilize additional data provided by this format.
Adding support for the V8 log format has enabled the analysis of profiles captured in Deno, where capturing `.cpuprofile` is less convenient (if possible at all) than in Node.js or browsers. However, Deno supports V8 flags, allowing the capture and conversion of V8 logs into JSON (see the Deno manual):
> deno run --v8-flags=--prof script.ts
> node --prof-process --preprocess isolate-0x000000000-v8.log > v8log.json
Detection for Deno, Electron, and Edge runtimes has been added where feasible. Runtime icons were added as well:
Because Deno programs often load their sources directly from CDNs, detection for the most popular CDNs and registries has also been included. The following screenshot demonstrates before and after CDN detection was added:
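As a rough illustration of the CDN and registry detection described above, a script URL can be mapped to a registry by its host name. This is only a sketch under stated assumptions: the registry names are taken from the changelog, while the host names in the table, the `registryByHost` map, and the `detectRegistry` function are hypothetical and are not CPUpro's actual implementation.

```typescript
// Hypothetical host-to-registry table; the host names are assumptions
// made for this sketch and may not match what CPUpro actually checks.
const registryByHost = new Map<string, string>([
    ["jsr.io", "JSR"],
    ["deno.land", "deno.land"],
    ["cdn.jsdelivr.net", "jsdelivr"],
    ["unpkg.com", "unpkg"],
    ["esm.sh", "esm.sh"],
    ["esm.run", "esm.run"],
    ["jspm.dev", "jspm"],
    ["cdn.skypack.dev", "skypack"]
]);

// Returns the registry name for an http(s) script URL, or null when the
// URL is not http(s) or its host is not in the table.
function detectRegistry(scriptUrl: string): string | null {
    const match = /^https?:\/\/([^\/:?#]+)/i.exec(scriptUrl);

    if (match === null) {
        return null;
    }

    return registryByHost.get(match[1].toLowerCase()) ?? null;
}

const fromEsmSh = detectRegistry("https://esm.sh/preact@10.19.2");
const fromDenoLand = detectRegistry("https://deno.land/std@0.224.0/path/mod.ts");
const localFile = detectRegistry("file:///home/user/script.ts");
```

Here `fromEsmSh` is `"esm.sh"`, `fromDenoLand` is `"deno.land"`, and `localFile` is `null`, so local scripts fall through to whatever grouping is used for non-registry modules.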
Changelog
- Changed the terminology: replaced "area" with "category"
- Formats
    - Added support for Edge Enhanced Performance Traces
    - Added support for V8 log preprocessed with `--preprocess`
    - Fixed the extraction of a CPU profile from Chrome tracing when it contains several profiles
- Computations
    - Reworked the computations on profile loading from scratch with performance and memory usage in mind, achieving a 4-8 times speed increase and reduced memory consumption
    - Implemented GC nodes reparenting to the script node
    - Fixed the placement of bundle modules to be placed in the "script" category instead of the "bundle" category
    - Changed the handling of negative time deltas, they are now corrected by rearranging instead of being ignored
    - Resolved the issue with shortening paths to scripts when `webpack/runtime` is present in the CPU profile
    - Adjusted call frame reference computation by omitting line and column when they are not specified or less than zero
- Runtimes & registries
    - Added Deno detection
    - Added Electron detection
    - Added detection for CDNs and registries: JSR, deno.land, jsdelivr, unpkg, esm.sh, esm.run, jspm, and skypack
- Redesigned welcome page, added "Try an example" buttons
- Reworked the layout and UX of the main page
- Implemented permanent colors and a fixed timeline order for areas and module types
- Improved the display of regular expressions, particularly long ones
- Reworked subject pages, each page now includes:
    - A timeline that not only displays self time but also nested time, with the distribution of nested time by categories
    - A section "Nested time distribution"
    - A basic flamechart displaying all frames related to the current subject as root frames
- Timeline
    - Added the capability to select a range
    - Added a tooltip that provides expanded details on a range
- Flamechart
    - Added vertical scrolling locking when not activated
    - Added a detailed information block for the selected or zoomed frame
    - Added the capability to select frames
    - Improved performance and reliability
    - Changed colors to match category colors and module types