Add new content ID function #1766
base: main
Thanks @TG1999, see some suggestions below.
vulnerabilities/utils.py (Outdated)

```python
# Normalize fields
normalized_data = {
    "summary": normalize_text(advisory_data.summary),
    "affected_packages": normalize_list(advisory_data.affected_packages),
```
There is no certainty that this will work, since we don't have a proper implementation for handling comparisons between `AffectedPackage` objects.
For example, this will not be able to normalize the list of affected_packages below:
```python
affected_packages = [
    AffectedPackage(
        package=PackageURL(
            type="alpine",
            namespace=None,
            name="linux-lts",
            version=None,
            qualifiers={
                "arch": "aarch64",
                "distroversion": "v3.20",
                "reponame": "main",
            },
            subpath=None,
        ),
        affected_version_range=None,
        fixed_version="6.6.13-r1",
    ),
    AffectedPackage(
        package=PackageURL(
            type="alpine",
            namespace=None,
            name="linux-lts",
            version=None,
            qualifiers={"arch": "armhf", "distroversion": "v3.21", "reponame": "main"},
            subpath=None,
        ),
        affected_version_range=None,
        fixed_version="6.6.13-r1",
    ),
]
```
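One way to make such lists comparable could be to sort on a canonical serialization of each package. A rough, untested sketch; the `canonical_key` and `normalize_affected_packages` helpers are hypothetical, and it assumes `AffectedPackage` exposes a `to_dict()` serializer:

```python
import json


def canonical_key(affected_package):
    # Dump the package to JSON with sorted keys so that equal packages
    # always serialize to the same string, regardless of dict ordering.
    return json.dumps(affected_package.to_dict(), sort_keys=True)


def normalize_affected_packages(affected_packages):
    # Sorting on the canonical string makes the list order-insensitive.
    return sorted(affected_packages or [], key=canonical_key)
```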
vulnerabilities/utils.py (Outdated)

```python
}

if include_metadata:
    normalized_data["created_by"] = advisory_data.created_by
```
`created_by` is a model field, not an attribute of AdvisoryData.
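A rough sketch of how the helper could guard this, assuming it may receive either an `Advisory` model instance or an `AdvisoryData` dataclass (hypothetical; only the model carries `created_by`):

```python
from vulnerabilities.models import Advisory

if include_metadata and isinstance(advisory_data, Advisory):
    # created_by exists on the Advisory model, not on AdvisoryData.
    normalized_data["created_by"] = advisory_data.created_by
```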
Thanks @TG1999, some nits for your consideration.
""" | ||
Find advisories with the same content and keep only the latest one. | ||
""" | ||
# Get all advisories that have duplicates based on content ID | ||
duplicate_content_ids = ( | ||
Advisory.objects.values("unique_content_id") | ||
.annotate(count=Count("id")) | ||
.filter(count__gt=1) | ||
.values_list("unique_content_id", flat=True) | ||
) | ||
|
||
self.log( | ||
f"Found {len(duplicate_content_ids)} content IDs with duplicates", level=logging.INFO | ||
) | ||
|
||
for content_id in duplicate_content_ids: | ||
# Get all advisories with this content ID | ||
advisories = Advisory.objects.filter(unique_content_id=content_id) | ||
|
||
# Find the latest advisory | ||
latest = advisories.latest("date_imported") | ||
|
||
# Delete all except the latest | ||
advisories.exclude(id=latest.id).delete() | ||
|
||
if self.log: | ||
self.log( | ||
f"Kept advisory {latest.id} and removed " | ||
f"{advisories.count() - 1} duplicates for content ID {content_id}", | ||
level=logging.INFO, | ||
) | ||
|
||
def update_content_ids(self): | ||
""" | ||
Update content IDs for all advisories that don't have one. | ||
""" | ||
advisories = Advisory.objects.filter( | ||
Q(unique_content_id="") | Q(unique_content_id__isnull=True) | ||
) | ||
|
||
self.log(f"Found {advisories.count()} advisories without content ID", level=logging.INFO) | ||
|
||
for advisory in advisories: | ||
advisory.unique_content_id = compute_content_id(advisory) | ||
advisory.save() | ||
|
||
if self.log: | ||
self.log(f"Updated content ID for advisory {advisory.id}", level=logging.DEBUG) |
Maybe we can do something simpler like this.
""" | |
Find advisories with the same content and keep only the latest one. | |
""" | |
# Get all advisories that have duplicates based on content ID | |
duplicate_content_ids = ( | |
Advisory.objects.values("unique_content_id") | |
.annotate(count=Count("id")) | |
.filter(count__gt=1) | |
.values_list("unique_content_id", flat=True) | |
) | |
self.log( | |
f"Found {len(duplicate_content_ids)} content IDs with duplicates", level=logging.INFO | |
) | |
for content_id in duplicate_content_ids: | |
# Get all advisories with this content ID | |
advisories = Advisory.objects.filter(unique_content_id=content_id) | |
# Find the latest advisory | |
latest = advisories.latest("date_imported") | |
# Delete all except the latest | |
advisories.exclude(id=latest.id).delete() | |
if self.log: | |
self.log( | |
f"Kept advisory {latest.id} and removed " | |
f"{advisories.count() - 1} duplicates for content ID {content_id}", | |
level=logging.INFO, | |
) | |
def update_content_ids(self): | |
""" | |
Update content IDs for all advisories that don't have one. | |
""" | |
advisories = Advisory.objects.filter( | |
Q(unique_content_id="") | Q(unique_content_id__isnull=True) | |
) | |
self.log(f"Found {advisories.count()} advisories without content ID", level=logging.INFO) | |
for advisory in advisories: | |
advisory.unique_content_id = compute_content_id(advisory) | |
advisory.save() | |
if self.log: | |
self.log(f"Updated content ID for advisory {advisory.id}", level=logging.DEBUG) | |
""" | |
Recompute content id and remove advisories with the same content and keep only the latest one. | |
""" | |
advisories = Advisory.objects.all().order_by("-id").paginated() | |
advisories_count = Advisory.objects.all().count() | |
self.log(f"Computing new content id for {advisories_count} and removing duplicates.") | |
batch_size = 10000 | |
deleted_advisory_count = 0 | |
updated_advisory_count = 0 | |
duplicate_advisory_id = [] | |
updated_advisory = [] | |
content_ids = set() | |
progress = LoopProgress( | |
total_iterations=advisories_count, | |
logger=self.log, | |
progress_step=1, | |
) | |
for advisory in progress.iter(advisories): | |
content_id = compute_content_id(advisory) | |
if content_id in content_ids: | |
duplicate_advisory_id.append(advisory.id) | |
else: | |
advisory.unique_content_id = content_id | |
updated_advisory.append(advisory) | |
content_ids.add(content_id) | |
if len(duplicate_advisory_id) > batch_size: | |
deleted_advisory_count += delete_advisories( | |
advisory_ids=duplicate_advisory_id, | |
logger=self.log, | |
) | |
if len(updated_advisory) > batch_size: | |
updated_advisory_count += bulk_update_advisory( | |
items=updated_advisory, | |
fields=["unique_content_id"], | |
logger=self.log, | |
) | |
deleted_advisory_count += delete_advisories( | |
advisory_ids=duplicate_advisory_id, | |
logger=self.log, | |
) | |
updated_advisory_count += bulk_update_advisory( | |
items=updated_advisory, | |
fields=["unique_content_id"], | |
logger=self.log, | |
) | |
self.log(f"Removed {deleted_advisory_count} duplicates advisories.") | |
self.log(f"Updated content id for {deleted_advisory_count} advisories.") | |
def bulk_update_advisory(items, fields, logger): | |
item_count = 0 | |
if items: | |
try: | |
Advisory.objects.bulk_update(objs=items, fields=fields) | |
item_count += len(items) | |
except Exception as e: | |
logger(f"Error updating Advisory: {e}") | |
items.clear() | |
return item_count | |
def delete_advisories(advisory_ids, logger): | |
item_count = 0 | |
if advisory_ids: | |
try: | |
Advisory.objects.filter(id__in=advisory_ids).delete() | |
item_count += len(advisory_ids) | |
except Exception as e: | |
logger(f"Error deleting Advisory: {e}") | |
advisory_ids.clear() | |
return item_count |
Here is some more feedback.
Thanks @TG1999, some feedback.
```python
duplicated_advisories = groupby(
    Advisory.objects.order_by("unique_content_id").all().paginated(),
    key=lambda x: x.unique_content_id,
)
```
I honestly doubt that this will work smoothly in production: ordering 118 million advisories by `unique_content_id`, which is an unindexed field, is not a good idea.
#1766 (comment) would be a much more practical approach, since `id` is an autogenerated primary key field and hence already indexed; we can use it to select the latest or oldest advisory.
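A rough sketch of what I mean, keeping the survivor per content ID via the indexed primary key (untested, purely illustrative):

```python
from django.db.models import Count, Max

# For each duplicated content ID, pick the highest id as the latest
# advisory, then delete the rest of the group.
duplicated = (
    Advisory.objects.values("unique_content_id")
    .annotate(count=Count("id"), latest_id=Max("id"))
    .filter(count__gt=1)
)
for row in duplicated.iterator():
    Advisory.objects.filter(
        unique_content_id=row["unique_content_id"]
    ).exclude(id=row["latest_id"]).delete()
```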
`db_index` is in the models for that! And since we believe there are lots of dupes, I believe the index will work fine.
We're adding this `db_index` now. I am not sure how long the index creation itself is going to take.
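If we go this route, a concurrent index build would at least avoid locking the table while it runs. A minimal migration sketch, assuming a PostgreSQL backend (the migration dependency and index name are hypothetical):

```python
from django.contrib.postgres.operations import AddIndexConcurrently
from django.db import migrations, models


class Migration(migrations.Migration):
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction.
    atomic = False

    dependencies = [
        ("vulnerabilities", "0001_initial"),  # hypothetical dependency
    ]

    operations = [
        AddIndexConcurrently(
            model_name="advisory",
            index=models.Index(
                fields=["unique_content_id"], name="advisory_content_idx"
            ),
        ),
    ]
```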
```python
    advisory.unique_content_id = compute_content_id(advisory)
    advisories.append(advisory)

Advisory.objects.bulk_update(advisories, ["unique_content_id"], batch_size=1000)
```
Keeping all those advisories in memory is going to be very expensive!
Assuming 1 Advisory object ≈ 1000 bytes, that would mean for 118 million advisories: (118 * 10^6) * 10^3 bytes ≈ 118 GB of memory!
We should not keep all these advisories in memory. Instead, we should bulk update as soon as we reach the batch size.
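A rough sketch of that streaming pattern (batch and chunk sizes are illustrative):

```python
BATCH_SIZE = 1000

batch = []
# iterator() streams rows from the database instead of materializing
# the entire queryset in memory.
for advisory in Advisory.objects.iterator(chunk_size=5000):
    advisory.unique_content_id = compute_content_id(advisory)
    batch.append(advisory)
    if len(batch) >= BATCH_SIZE:
        Advisory.objects.bulk_update(batch, ["unique_content_id"])
        batch.clear()

# Flush the final partial batch.
if batch:
    Advisory.objects.bulk_update(batch, ["unique_content_id"])
```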
```python
from vulnerabilities.models import Advisory

if isinstance(advisory_data, Advisory):
    normalized_data = {
```
Content ID should also include the `aliases` field.
Let's look at this data source https://github.com/nodejs/security-wg/blob/75c78bbbd2ef86d289c16818bb487a70e315dc43/vuln/npm/7.json. This single advisory contains 2 CVEs, and for this we create 2 different advisory records. If we do not consider aliases while computing the content ID, then we will delete one of the two advisories in our dedupe pipeline.
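A one-line sketch of what that could look like in the normalized payload, assuming `aliases` is a list of strings as in the snippets above:

```python
# Sort so that ["CVE-2023-0002", "CVE-2023-0001"] and its permutations
# always hash to the same content ID.
normalized_data["aliases"] = sorted(advisory_data.aliases or [])
```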
Makes sense, we should do it!
Reference: #1583