Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved "GetPublishedAtUtcAsync" method efficiency #436

Merged
merged 4 commits into from
Jul 6, 2023

Conversation

morended
Copy link
Contributor

Current implementation of "GetPublishedAtUtcAsync" is fetching all the metadata for a given package version, which is causing significant delay. To improve the latency, optimized "GetPublishedAtUtcAsync" method to fetch the uploadtime alone.

@morended
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 436 in repo microsoft/OSSGadget

@gfs
Copy link
Contributor

gfs commented Jun 28, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@morended
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 436 in repo microsoft/OSSGadget

@gfs
Copy link
Contributor

gfs commented Jun 28, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@morended morended changed the title Improved "GetPublishedAtUtcAsync" method's efficiency Improved "GetPublishedAtUtcAsync" method efficiency Jun 28, 2023
Copy link
Contributor

@pmalmsten pmalmsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is generally straightforward - it gets straight to the point, and should run very fast. I'm curious whether you have measured the speed of this change before and after for a package having many thousands of versions, like say https://www.npmjs.com/package/@graphql-codegen/cli?

As a side note, I would point out that the implementations of GetPackageMetadataAsync are left without any improvement with this approach. That's not necesssarily a problem, but there is a certain amount of elegance in the approach taken by the prior implentation by leaning on GetPackageMetadataAsync to be the one thing that parses package metadata (whereas this approach duplicates some of that parsing logic a bit). That's not a dealbreaker - the approach in this PR simplifies things - but it's a tradeoff that might need to be revisited later.

However, there are other more significant points that we must address when we start thinking about caching:

  1. The memory cache object employed by GetJsonCache has a size limit of about 8 MB, whereas the size of the JSON for graphql-codegen/cli is ~25 MB. So we likely need to increase the size limit of the cache, perhaps to say 100 MB. To go with that, we would likely need to bump the memory allocation for Terrapin containers up 50 to 100 additional MB.
  2. When adding entries to the cache, we do not set an expiration time. As a result, objects may be cached indefinitely. This does not matter all that much for OSS Gadget CLIs that terminate after a few moments, but will matter a lot for Terrapin processes which will keep the running for hours or days at a time. We should probably specify a sliding expiration of something like 30 minutes and an absolute expiration of something like 6 hours to ensure that data in Terrapin stays fresh. It would be ideal to do this for all places where we set values in the cache.

Those are what come to mind first - getting caching right can be tricky, so I'll let you know if I think of anything else later. Let me know if you have any questions on the above.

@pmalmsten
Copy link
Contributor

We also discussed offline how NPM does not set the Content-Length header on HTTP responses (which causes GetJsonCache to signifigcantly underestimate the size of objects that are cached), and that we should fix the response size estimation in OSSGadget so that we just count how many bytes are in response bodies instead of trying to use an optional header.

Increased cache size.
Added CacheInvalidation and CacheExpiration.
@morended
Copy link
Contributor Author

This is generally straightforward - it gets straight to the point, and should run very fast. I'm curious whether you have measured the speed of this change before and after for a package having many thousands of versions, like say https://www.npmjs.com/package/@graphql-codegen/cli?

As a side note, I would point out that the implementations of GetPackageMetadataAsync are left without any improvement with this approach. That's not necesssarily a problem, but there is a certain amount of elegance in the approach taken by the prior implentation by leaning on GetPackageMetadataAsync to be the one thing that parses package metadata (whereas this approach duplicates some of that parsing logic a bit). That's not a dealbreaker - the approach in this PR simplifies things - but it's a tradeoff that might need to be revisited later.

However, there are other more significant points that we must address when we start thinking about caching:

  1. The memory cache object employed by GetJsonCache has a size limit of about 8 MB, whereas the size of the JSON for graphql-codegen/cli is ~25 MB. So we likely need to increase the size limit of the cache, perhaps to say 100 MB. To go with that, we would likely need to bump the memory allocation for Terrapin containers up 50 to 100 additional MB.
  2. When adding entries to the cache, we do not set an expiration time. As a result, objects may be cached indefinitely. This does not matter all that much for OSS Gadget CLIs that terminate after a few moments, but will matter a lot for Terrapin processes which will keep the running for hours or days at a time. We should probably specify a sliding expiration of something like 30 minutes and an absolute expiration of something like 6 hours to ensure that data in Terrapin stays fresh. It would be ideal to do this for all places where we set values in the cache.

Those are what come to mind first - getting caching right can be tricky, so I'll let you know if I think of anything else later. Let me know if you have any questions on the above.

As discussed offline, increased the cache limit and added cache expiration.

@morended
Copy link
Contributor Author

We also discussed offline how NPM does not set the Content-Length header on HTTP responses (which causes GetJsonCache to signifigcantly underestimate the size of objects that are cached), and that we should fix the response size estimation in OSSGadget so that we just count how many bytes are in response bodies instead of trying to use an optional header.

I have changed this to get contentlength from the Http response body.

Copy link
Contributor

@pmalmsten pmalmsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good progress, but a few things could still use tweaking.

src/Shared/PackageManagers/BaseProjectManager.cs Outdated Show resolved Hide resolved
src/Shared/PackageManagers/BaseProjectManager.cs Outdated Show resolved Hide resolved
@morended morended force-pushed the morended/cache-jsondoc branch from 1821588 to ee22d66 Compare June 30, 2023 23:18
@morended morended force-pushed the morended/cache-jsondoc branch from ee22d66 to 2e7609a Compare July 5, 2023 16:39
Copy link
Contributor

@pmalmsten pmalmsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest changes look great! Thanks Mounika.

@jpinz
Copy link
Member

jpinz commented Jul 6, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jpinz jpinz merged commit 8da1fcc into microsoft:main Jul 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants