Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retain URL hash cache, cache file hashes #2940

Merged
merged 2 commits into from
Dec 9, 2019

Conversation

HebaruSan
Copy link
Member

Motivation

We'd like to reduce the CPU usage of the Inflator component of the infrastructure. Toward that end, we added some more debug statements and observed netkan.exe's output carefully, which turned up some surprising issues...

Background

NetFileCache uses a Dictionary to map from the URL hash (first 8 characters of the download URL's SHA1 interpreted in hexadecimal) to the file path in cache. This is intended to find cached files quickly without checking every single file. To try to protect this mapping from unexpected changes, we listen to events from FileSystemWatcher and clear this dictionary whenever they fire, which results in it being regenerated the next time it's needed.

The DownloadAttributeTransformer is in charge of populating file hashes and other properties based on the contents of each ZIP, some of which are pretty large:

  "download_size": 273244617,
  "download_hash": {
    "sha1": "316E1DCA70A9544055FFA3C02EE60DB633531C0D",
    "sha256": "8D240471165D302A3ACDAFBB2278362881112AC58F0317AE04853799E20D1BD2"
  },
  "download_content_type": "application/zip",

The Inflator's download cache persists across passes, and in a typical pass every download is already cached. If modders have been very active, the number of new or changed downloads may approach the high single digits.

Problems

  1. FileSystemWatcher.Changed seems to fire even if you just read info about a file in the cache. In one test using my 50 GB download cache to inflate a cached module, it triggered 186499 times!! This means that the Dictionary mapping was almost never used more than once; it would be null at the start of virtually any cache access, which would force it to be regenerated, then it would be used once, and then it would be almost immediately wiped out again. This amounts to searching every file for every access, a waste of CPU and disk.
  2. Calculating the SHA1 and SHA256 can take a long-ish time for large files, since every byte has to be read from disk and processed through both hashing algorithms. This work is not necessary if the file has not changed on disk since the last time we calculated the hash.

Changes

  1. Several new log.Debug statements are added to netkan.exe to make it easier to see what it's doing
  2. Now we don't listen to FileSystemWatcher.Changed anymore. This will avoid repeated unnecessary purges of the file cache data, hopefully amounting to a significant savings for a long-running Inflator process. Similarly, other operations that previously cleared the cache object are now changed to merely make an appropriate small update to it, such as Store and Remove.
  3. Now after we calculate an SHA1 or SHA256 for a cached file, we store it in a Dictionary. If we need the same hash for the same file again, we check the Dictionary first and use it if found, skipping the computationally expensive recalculation of the hash from the file on disk. The Dictionary objects are also updated or cleared anytime we perform a similar operation for the main cache data.

Overall this should result in a significant savings in the Inflator's CPU consumption.

@HebaruSan HebaruSan added Bug Something is not working as intended Enhancement New features or functionality Core (ckan.dll) Issues affecting the core part of CKAN Pull request Netkan Issues affecting the netkan data Infrastructure Issues affecting everything around CKAN (the GitHub repos, build process, CI, ...) Network Issues affecting internet connections of CKAN labels Dec 7, 2019
@HebaruSan HebaruSan requested a review from DasSkelett December 7, 2019 22:59
@DasSkelett
Copy link
Member

DasSkelett commented Dec 9, 2019

Similarly, other operations that previously cleared the cache object are now changed to merely make an appropriate small update to it, such as Store and Remove.

Store() and Remove() still trigger the FSW and thus index rebuilds though.
I tried suppressing the watcher by removing the event handler from the watcher, but this doesn't work because they are called from thread pool workers.

However setting watcher.EnableRaisingEvents to false just before and resetting it to true after the file system actions (tx_file.Move()/tx_file.Copy() and tx_file.Delete()) seems to work.

@HebaruSan
Copy link
Member Author

However toggling watcher.EnableRaisingEvents just before and after the file system actions seems to work.

That would create a race condition, other processes could sneak changes into the cache in that interval, as small as it would be.

@techman83
Copy link
Member

This is also likely the cause a huge amount of IOPs, which when using magnetic storage is charged per million (pretty cheap though), where as ssd they're part of the hourly. I was considering rebuilding with ssd, but I'll hold off to see what the impact of this change is. Likely we're going to be able to increase our indexing rate as well!

It's likely the FS updating the Access Time is firing the notify. We could mount our cache using noatime if not listening for changes is having knock on effects. I'm not super across the impacts of the changes, but LGTM from my quick skim.

Copy link
Member

@DasSkelett DasSkelett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay then, let's keep it as is. This is a big improvement, the few times a new download is actually saved in the cache are negligible.
Really curious how much this affects performance in the end.

@HebaruSan HebaruSan merged commit 22775fc into KSP-CKAN:master Dec 9, 2019
@HebaruSan HebaruSan deleted the fix/cache-caching branch December 9, 2019 18:42
@techman83
Copy link
Member

All I can say is. Holy fork that's made a difference!
2019-12-10_10-31-15

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something is not working as intended Core (ckan.dll) Issues affecting the core part of CKAN Enhancement New features or functionality Infrastructure Issues affecting everything around CKAN (the GitHub repos, build process, CI, ...) Netkan Issues affecting the netkan data Network Issues affecting internet connections of CKAN
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants