Use 4-tier system for storing package metadata #1609

Merged: 3 commits, Oct 11, 2024

Member

@keshav-space keshav-space commented Oct 10, 2024

The tiers are as follows:

  1. Super Large Ecosystem (~5M packages): 2^10 = 1,024 git repositories
  2. Large Ecosystem (~500K packages): 2^7 = 128 git repositories
  3. Medium Ecosystem (~50K packages): 2^5 = 32 git repositories
  4. Small Ecosystem (~2K packages): 2^0 = 1 git repository

- The four tiers are super large, large, medium, and small, which
  correspond to 1,024, 128, 32, and 1 repositories, respectively.
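The tier arithmetic above can be sketched as follows. This is only an illustration: the tier names, the `TIERS` table, and both helper functions are hypothetical and not taken from the PR; the bit counts and approximate package counts come from the list above.

```python
# Hypothetical tier table: (bit count, approximate package count) per tier,
# using the figures quoted in the PR description.
TIERS = {
    "super_large": (10, 5_000_000),
    "large": (7, 500_000),
    "medium": (5, 50_000),
    "small": (0, 2_000),
}


def repo_count(bit_count: int) -> int:
    """Number of git repositories addressable with `bit_count` bits (2**bit_count)."""
    return 1 << bit_count


def packages_per_repo(tier: str) -> int:
    """Rough average number of packages per repository for a tier."""
    bit_count, package_count = TIERS[tier]
    return package_count // repo_count(bit_count)
```

Under these figures, each repository in the three larger tiers would hold a few thousand packages on average, assuming packages are spread evenly across repositories.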

Signed-off-by: Keshav Priyadarshi <[email protected]>
@keshav-space keshav-space force-pushed the 3-update-disk-storage branch from 328e6f0 to d139fc7 Compare October 10, 2024 10:56
Signed-off-by: Keshav Priyadarshi <[email protected]>
    """
    if isinstance(purl, str):
        purl = PackageURL.from_string(purl)

    purl_hash = get_purl_hash(purl)
    bit_count = BIT_COUNT_BY_ECOSYSTEM.get(purl.type, 0)
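For readers unfamiliar with the idea, here is a minimal sketch of how a bit count could select one of 2^n repositories from a hash. It is an assumption-laden illustration, not the PR's implementation: `repo_index` is a hypothetical helper, and the use of SHA-256 and of the top bits of the digest is assumed, since the diff excerpt does not show how `get_purl_hash` or the bit selection actually work.

```python
import hashlib


def repo_index(purl: str, bit_count: int) -> int:
    """Map a purl string to a repository index in range(2**bit_count).

    Hypothetical helper: hashes the purl with SHA-256 (an assumption) and
    keeps only the top `bit_count` bits of the 256-bit digest.
    """
    digest = hashlib.sha256(purl.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    # Shifting right by (256 - bit_count) leaves the top `bit_count` bits;
    # with bit_count == 0 the result is always 0 (a single repository).
    return value >> (256 - bit_count)
```

With a bit count of 0, every purl maps to index 0, which matches the single-repository behaviour for ecosystems outside the known list.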
Contributor

IMO, if the ecosystem is unknown, we should not silently treat the bit count as 0. Should we log it?

Member Author

We can throw an exception here. In any case, if we have not provisioned a git repo for that unknown ecosystem, the aboutcode-org/scancode.io#1400 pipeline will fail. Earlier we used 13 bits as the default, but after discussion with @pombredanne we agreed to use a single repository (0 bits) for ecosystems not covered in our exhaustive list.
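The two options raised in this thread (log the fallback, or throw an exception) could look roughly like the sketch below. Everything here is hypothetical: the `get_bit_count` helper, the `UnknownEcosystemError` class, the `strict` flag, and the placeholder ecosystem names and values in the mapping are illustrative only and do not reflect the real `BIT_COUNT_BY_ECOSYSTEM` contents.

```python
import logging

logger = logging.getLogger(__name__)

# Placeholder mapping for illustration; the real table is defined in the PR.
BIT_COUNT_BY_ECOSYSTEM = {
    "big-ecosystem": 10,
    "mid-ecosystem": 5,
}


class UnknownEcosystemError(KeyError):
    """Raised when a purl type has no configured bit count (hypothetical)."""


def get_bit_count(purl_type: str, strict: bool = False) -> int:
    """Return the bit count for a purl type.

    With strict=False, unknown types fall back to 0 bits (one repository)
    and a warning is logged. With strict=True, an exception is raised.
    """
    try:
        return BIT_COUNT_BY_ECOSYSTEM[purl_type]
    except KeyError:
        if strict:
            raise UnknownEcosystemError(purl_type) from None
        logger.warning(
            "No bit count configured for ecosystem %r; using a single repository.",
            purl_type,
        )
        return 0
```

Logging keeps the 0-bit default agreed on in the discussion while making the fallback visible; the strict mode fails early instead of waiting for the downstream pipeline to fail.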

Contributor

@TG1999 TG1999 left a comment

LGTM! Minor nit for your consideration

@keshav-space keshav-space merged commit d1f4c74 into main Oct 11, 2024
9 checks passed
@keshav-space keshav-space deleted the 3-update-disk-storage branch October 11, 2024 11:14
Development

Successfully merging this pull request may close these issues.

Design on disk storage structure for packages and vulnerabilities data
2 participants