Move dependency tree storage from the database to S3 (#48)
* Move dependency tree storage from the database to S3

Storing the SBOM dependency tree in the database turned out not to be the
right decision due to performance issues at scale. Previous changes to
improve performance had already reduced the dependency table's usage to
just generating SBOM reports. This change moves storage of the dependency
tree from the database to S3, removing the need to deconstruct and
reconstruct the tree along with the overhead that entails. The S3 key is
structured so that other SBOM file formats, such as SPDX or CycloneDX,
could also be stored alongside it (a rough sketch of the layout follows).
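
As a rough sketch of that layout (the literal values below are assumptions;
the real constants are defined in artemislib.consts and are not shown on
this page):

    # Hypothetical key layout, for illustration only -- the actual values of
    # these artemislib.consts constants are not part of this excerpt.
    SCANS_S3_KEY = "scans/"                                # prefix for all scan data
    SCAN_DATA_S3_KEY = "scans/%s/"                         # per-scan prefix (scan ID)
    SBOM_JSON_S3_KEY = "scans/%s/sbom/dependencies.json"   # internal dependency tree
    # Other formats could later sit alongside, e.g. scans/<scan_id>/sbom/spdx.json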

- Update engine SBOM processing to write the dependency tree to a JSON
  file in S3 instead of the database. The dependency tree is still
  processed in order to store component and license information in the
  database.
- Update sbom_report Lambda to pull the dependency tree JSON file from
  S3. If the file is not found in S3 it falls back to pulling the tree
  from the database. This allows for the gradual migration of the
  dependency tree data from the database to S3 as new scans are run and
  old scans are purged by the db_cleanup Lambda.
- Update the db_cleanup Lambda to identify and remove dependency files
  that were orphaned when their associated scans were deleted. Deleting
  scans via the ORM cleans up the dependency files from S3; this task is
  a backstop in case a scan is deleted directly or something else
  prevents the cleanup at deletion time from succeeding.
- Update the localstack config to add an S3 bucket that can store
  dependency tree files during local testing, and update AWSConnect in
  artemislib so that it can be configured to use this S3 bucket for scan
  data.
- Update IAM permissions in the Terraform configuration so that the
  engine and the relevant Lambdas can read and write the scans/ portion
  of the S3 bucket.
- Add an sbom_dependency_migration utility to migrate the dependency
  trees of existing scans from the database to S3. This is useful for
  testing, and also for key scans that need the performance improvement
  and can't wait for the scan replacement and cleanup process (a rough
  sketch of the flow follows this list).
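
A minimal sketch of that migration flow, assuming the ORM models and the
write_sbom_json helper that appear in the diffs below; the actual
sbom_dependency_migration utility is not shown on this page and may differ:

    # Hedged sketch, not the real utility. Field names on Component/Dependency
    # are assumptions based on how the engine and report Lambda use them;
    # license and type data are omitted for brevity.
    from artemisdb.artemisdb.models import Dependency, Scan
    from processor.sbom import write_sbom_json  # engine helper added in this change (import path assumed)

    def dep_to_dict(dep: Dependency) -> dict:
        # Rebuild the nested tree shape the engine would have written to S3
        return {
            "name": dep.component.name,
            "version": dep.component.version,
            "source": dep.source,
            "deps": [dep_to_dict(child) for child in Dependency.objects.filter(parent=dep)],
        }

    def migrate_scan(scan: Scan) -> None:
        # Roots are the direct dependencies (no parent), one entry per graph source
        roots = scan.dependency_set.filter(parent__isnull=True)
        write_sbom_json(scan.scan_id, [dep_to_dict(root) for root in roots])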

Unrelated to the SBOM dependency changes but included out of necessity:
- Pin the urllib3 version to 1.x because of a compatibility issue with
  botocore: boto/botocore#2926
pizen authored May 10, 2023
1 parent a587bc7 commit cdcebee
Showing 38 changed files with 513 additions and 176 deletions.
2 changes: 1 addition & 1 deletion backend/Makefile
@@ -420,7 +420,7 @@ dist/lambdas/layers/artemislib.zip: ${SHARED_LIB_SRC}
@echo "${INFO}Building $@"
mkdir -p ${LAMBDA_LAYERS_BUILD_DIR}/artemislib/python
${PIP} install --upgrade --target ${LAMBDA_LAYERS_BUILD_DIR}/artemislib/python --python-version ${LAMBDA_PYTHON_VER} --no-deps ${ARTEMISLIB}
${PIP} install --upgrade --target ${LAMBDA_LAYERS_BUILD_DIR}/artemislib/python --python-version ${LAMBDA_PYTHON_VER} --only-binary=:all: pyjwt requests
${PIP} install --upgrade --target ${LAMBDA_LAYERS_BUILD_DIR}/artemislib/python --python-version ${LAMBDA_PYTHON_VER} --only-binary=:all: pyjwt requests "urllib3<2"
${PIP} install --upgrade --target ${LAMBDA_LAYERS_BUILD_DIR}/artemislib/python --python-version ${LAMBDA_PYTHON_VER} --only-binary=:all: ${LAMBDA_PLATFORM_FLAGS} cryptography
mkdir -p ${DIST_DIR}/lambdas/layers/artemislib/python
cd ${LAMBDA_LAYERS_BUILD_DIR}/artemislib; zip -r ${DIST_DIR}/lambdas/layers/artemislib.zip *
1 change: 1 addition & 0 deletions backend/Pipfile
@@ -17,6 +17,7 @@ cwe = "*"
pyjwt = "*"
cryptography = "*"
packaging = "==21.3"
urllib3 = "<2"

[dev-packages]
pytest = "==6.0.1"
218 changes: 109 additions & 109 deletions backend/Pipfile.lock

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions backend/docker-compose.local.yml
@@ -63,3 +63,5 @@ services:
ARTEMIS_METADATA_SCHEME_MODULES: ${ARTEMIS_METADATA_SCHEME_MODULES}
ARTEMIS_FEATURE_SNYK_ENABLED: ${ARTEMIS_FEATURE_SNYK_ENABLED}
ARTEMIS_FEATURE_GHAS_ENABLED: ${ARTEMIS_FEATURE_GHAS_ENABLED}
ARTEMIS_SCAN_DATA_S3_BUCKET: ${ARTEMIS_SCAN_DATA_S3_BUCKET}
ARTEMIS_SCAN_DATA_S3_ENDPOINT: ${INTERNAL_ARTEMIS_SCAN_DATA_S3_ENDPOINT}
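
The two new variables presumably surface in artemislib.env as the
SCAN_DATA_S3_BUCKET and SCAN_DATA_S3_ENDPOINT constants imported throughout
the diffs below; a minimal sketch of how that mapping likely works (the
actual artemislib/env.py is not part of this excerpt):

    # Hedged sketch -- assumed handling of the new settings in artemislib/env.py.
    import os

    # Bucket that holds per-scan data under the scans/ prefix
    SCAN_DATA_S3_BUCKET = os.environ.get("ARTEMIS_SCAN_DATA_S3_BUCKET")

    # Optional endpoint override so local runs can point at the Localstack S3 container
    SCAN_DATA_S3_ENDPOINT = os.environ.get("ARTEMIS_SCAN_DATA_S3_ENDPOINT") or None
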
42 changes: 28 additions & 14 deletions backend/engine/processor/sbom.py
@@ -1,22 +1,35 @@
import uuid

from django.db.utils import IntegrityError
import simplejson as json

from artemisdb.artemisdb.consts import ComponentType
from artemisdb.artemisdb.models import Component, Dependency, License, RepoComponentScan, Scan
from artemisdb.artemisdb.models import Component, License, RepoComponentScan, Scan
from artemislib.aws import AWSConnect
from artemislib.consts import SBOM_JSON_S3_KEY
from artemislib.env import SCAN_DATA_S3_BUCKET, SCAN_DATA_S3_ENDPOINT
from artemislib.logging import Logger
from utils.plugin import Result

logger = Logger(__name__)


def process_sbom(result: Result, scan: Scan):
# The graphs are all moved into one list instead of lists of lists so that all of the tree roots are in this list.
# The "source" field identifies the different graphs from each other.
flattened = []

# Go through the graphs
for graph in result.details:
# Process all the direct dependencies of this graph
for direct in graph:
process_dependency(direct, scan, None)
process_dependency(direct, scan)
flattened.append(direct) # Add the direct to the flattened list

# Write the dependency information to S3
write_sbom_json(scan.scan_id, flattened)


def process_dependency(dep: dict, scan: Scan, parent: Dependency):
def process_dependency(dep: dict, scan: Scan):
component = get_component(dep["name"], dep["version"], scan, dep.get("type"))

# Keep a copy of the license objects so they only have to be retrieved from the DB once
@@ -38,17 +51,8 @@ def process_dependency(dep: dict, scan: Scan, parent: Dependency):
if licenses:
component.licenses.set(licenses)

try:
dependency = Dependency(
label=component.label, component=component, scan=scan, source=dep["source"], parent=parent
)
dependency.save()
except IntegrityError as e:
logger.error("Unable to create dependency record %s (error: %s)", dependency, str(e))
return

for child in dep["deps"]:
process_dependency(dep=child, scan=scan, parent=dependency)
process_dependency(dep=child, scan=scan)


def get_component(name: str, version: str, scan: Scan, component_type: str = None) -> Component:
@@ -76,3 +80,13 @@ def get_component(name: str, version: str, scan: Scan, component_type: str = Non
component_repo.save()

return component


def write_sbom_json(scan_id: str, sbom: str) -> None:
aws = AWSConnect()
aws.write_s3_file(
path=(SBOM_JSON_S3_KEY % scan_id),
body=json.dumps(sbom),
s3_bucket=SCAN_DATA_S3_BUCKET,
endpoint_url=SCAN_DATA_S3_ENDPOINT,
)
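
For reference, the flattened list passed to write_sbom_json serializes to
something like the following (field names come from how process_dependency
reads each dep dict above; nested license data is omitted and the values
are made up):

    # Illustrative payload only -- two roots (direct dependencies) from different
    # graphs, distinguished by "source", with children nested under "deps".
    flattened = [
        {
            "name": "requests",
            "version": "2.28.2",
            "type": "pypi",                # optional; read via dep.get("type")
            "source": "backend/Pipfile",
            "deps": [
                {
                    "name": "urllib3",
                    "version": "1.26.15",
                    "type": "pypi",
                    "source": "backend/Pipfile",
                    "deps": [],
                },
            ],
        },
        {
            "name": "lodash",
            "version": "4.17.21",
            "type": "npm",
            "source": "ui/package.json",
            "deps": [],
        },
    ]
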
5 changes: 4 additions & 1 deletion backend/lambdas/api/repo/setup.py
@@ -21,7 +21,10 @@
url=("https://github.com/warnermedia/artemis/backend/lambdas/api/repo"),
packages=find_packages(),
setup_requires=["pytest-runner"],
install_requires=["requests"],
install_requires=[
"requests",
"urllib3<2", # https://github.com/boto/botocore/issues/2926
],
tests_require=["pytest"],
classifiers=[
"Programming Language :: Python :: 3.9",
5 changes: 4 additions & 1 deletion backend/lambdas/api/signin/setup.py
@@ -21,7 +21,10 @@
url=("https://github.com/warnermedia/artemis/lambdas/api/signin"),
packages=find_packages(),
setup_requires=["pytest-runner"],
install_requires=["requests"],
install_requires=[
"requests",
"urllib3<2", # https://github.com/boto/botocore/issues/2926
],
tests_require=["pytest"],
classifiers=[
"Programming Language :: Python :: 3.9",
5 changes: 4 additions & 1 deletion backend/lambdas/events/secrets_handler/setup.py
@@ -21,7 +21,10 @@
url=("https://github.com/warnermedia/artemis/backend/lambdas/events/secrets_handler"),
packages=find_packages(),
setup_requires=["pytest-runner"],
install_requires=["requests"],
install_requires=[
"requests",
"urllib3<2", # https://github.com/boto/botocore/issues/2926
],
tests_require=["pytest"],
classifiers=[
"Programming Language :: Python :: 3.9",
5 changes: 4 additions & 1 deletion backend/lambdas/events/splunk_handler/setup.py
@@ -21,7 +21,10 @@
url=("https://github.com/warnermedia/artemis/backend/lambdas/events/splunk_handler"),
packages=find_packages(),
setup_requires=["pytest-runner"],
install_requires=["requests"],
install_requires=[
"requests",
"urllib3<2", # https://github.com/boto/botocore/issues/2926
],
tests_require=["pytest"],
classifiers=[
"Programming Language :: Python :: 3.9",
@@ -1 +1 @@
__version__ = "2021.11"
__version__ = "2023.5"
50 changes: 46 additions & 4 deletions backend/lambdas/generators/sbom_report/sbom_report/report.py
@@ -1,16 +1,41 @@
import simplejson as json
from typing import Union

from artemisdb.artemisdb.models import Dependency, Scan
from artemislib.aws import AWSConnect
from artemislib.consts import SBOM_JSON_S3_KEY
from artemislib.env import SCAN_DATA_S3_BUCKET, SCAN_DATA_S3_ENDPOINT
from artemislib.logging import Logger

LOG = Logger(__name__)


def get_report(scan_id):
def get_report(scan_id, skip_s3: bool = False):
scan = Scan.objects.filter(scan_id=scan_id).first()
if not scan:
LOG.error("Scan %s does not exist", scan_id)
return None

LOG.info("Generating SBOM report for scan %s", scan.scan_id)

report = scan.to_dict()
report["sbom"] = []

for dep in scan.dependency_set.filter(parent__isnull=True):
report["sbom"].append(process_dep(dep))
# Get the SBOM JSON file from S3, if it exists
sbom = None
if not skip_s3: # Bypass the S3 version to pull directly from the database
sbom = get_sbom_json(scan.scan_id)

if sbom:
# File exists, use it for the SBOM contents
report["sbom"] = sbom
else:
# File doesn't exist, fall back to pulling the dependencies from the DB
LOG.info("SBOM file not retrieved from S3, falling back to database")

report["sbom"] = []

for dep in scan.dependency_set.filter(parent__isnull=True):
report["sbom"].append(process_dep(dep))

return report

@@ -26,3 +51,20 @@ def get_deps(parent) -> list:
for dep in Dependency.objects.filter(parent=parent):
ret.append(process_dep(dep))
return ret


def get_sbom_json(scan_id: str) -> Union[list, None]:
filename = SBOM_JSON_S3_KEY % scan_id
LOG.info("Retrieving %s from S3", filename)

aws = AWSConnect()
sbom_file = aws.get_s3_file(filename, SCAN_DATA_S3_BUCKET, SCAN_DATA_S3_ENDPOINT)
if sbom_file:
try:
return json.loads(sbom_file)
except json.JSONDecodeError:
LOG.error("Unable to load JSON file")
return None

LOG.error("Unable to retrieve file from S3")
return None
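
get_s3_file is an artemislib helper that is not shown in this diff; it most
likely wraps a boto3 object fetch along these lines (a sketch under that
assumption, not the actual implementation):

    # Hedged sketch of what AWSConnect.get_s3_file likely does with boto3.
    import boto3
    from botocore.exceptions import ClientError

    def get_s3_file(path: str, s3_bucket: str, endpoint_url: str = None):
        s3 = boto3.resource("s3", endpoint_url=endpoint_url)
        try:
            obj = s3.Object(s3_bucket, path).get()
        except ClientError:
            # Covers NoSuchKey and access errors; the caller treats None as
            # "fall back to the database"
            return None
        return obj["Body"].read().decode("utf-8")
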
2 changes: 1 addition & 1 deletion backend/lambdas/generators/sbom_report/setup.py
@@ -21,7 +21,7 @@
url=("https://github.com/warnermedia/artemis/backend/lambdas/generators/sbom_report"),
packages=find_packages(),
setup_requires=["pytest-runner"],
install_requires=[],
install_requires=["simplejson"],
tests_require=["pytest"],
classifiers=[
"Programming Language :: Python :: 3.9",
@@ -1 +1 @@
__version__ = "2022.5"
__version__ = "2023.5"
@@ -2,7 +2,7 @@
from db_cleanup.tasks.component import obsolete_components
from db_cleanup.tasks.engine import old_engines, unterminated_engines
from db_cleanup.tasks.repo import orphan_repos
from db_cleanup.tasks.scan import old_scans, sbom_scans, secrets_scans
from db_cleanup.tasks.scan import old_scans, sbom_scans, secrets_scans, orphaned_s3_scan_data

LOG = Logger("db_cleanup")

@@ -14,6 +14,7 @@
orphan_repos,
sbom_scans,
obsolete_components,
orphaned_s3_scan_data,
]


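
Each task in TASKS is a callable that takes a Logger (as orphaned_s3_scan_data
does in the next file), so the handler most likely just iterates the list; a
sketch under that assumption:

    # Hedged sketch of the db_cleanup Lambda entry point -- the real handler
    # body is outside the lines shown in this diff.
    def handler(event=None, context=None):
        for task in TASKS:
            task(LOG)
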
18 changes: 18 additions & 0 deletions backend/lambdas/maintenance/db_cleanup/db_cleanup/tasks/scan.py
@@ -1,6 +1,9 @@
from artemisdb.artemisdb.consts import ScanStatus
from artemisdb.artemisdb.models import Scan
from artemislib.aws import AWSConnect
from artemislib.consts import SCANS_S3_KEY
from artemislib.datetime import format_timestamp, get_utc_datetime
from artemislib.env import SCAN_DATA_S3_BUCKET, SCAN_DATA_S3_ENDPOINT
from artemislib.logging import Logger
from db_cleanup.util.delete import sequential_delete
from db_cleanup.util.env import MAX_SCAN_AGE, MAX_SECRET_SCAN_AGE
@@ -36,3 +39,18 @@ def _sbom_delete_check(scan: Scan) -> bool:
).count()
> 0
)


def orphaned_s3_scan_data(log: Logger) -> None:
log.info("Cleaning up orphaned scan data from S3")
aws = AWSConnect()
files = aws.get_s3_file_list(prefix=SCANS_S3_KEY, s3_bucket=SCAN_DATA_S3_BUCKET, endpoint_url=SCAN_DATA_S3_ENDPOINT)
count = 0
for f in files:
scan_id = f.key.split("/")[1] # Extract the scan ID from the S3 key: scans/<SCAN_ID>/...
if not Scan.objects.filter(scan_id=scan_id).exists():
# Scan data is for a scan that no longer exists in the database
f.delete()
count += 1
log.debug("Deleted %s", f.key)
log.info("%s total files deleted", count)
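
get_s3_file_list is another artemislib helper not included in this diff;
given that the task calls .key and .delete() on each item, it presumably
yields boto3 ObjectSummary objects from a prefix listing, roughly like this
(an assumption, not the actual code):

    # Hedged sketch of AWSConnect.get_s3_file_list built on boto3's bucket listing.
    import boto3

    def get_s3_file_list(prefix: str, s3_bucket: str, endpoint_url: str = None):
        s3 = boto3.resource("s3", endpoint_url=endpoint_url)
        bucket = s3.Bucket(s3_bucket)
        # ObjectSummary instances expose .key and .delete(), matching the usage above
        return bucket.objects.filter(Prefix=prefix)
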
5 changes: 4 additions & 1 deletion backend/lambdas/maintenance/license_retriever/setup.py
@@ -21,7 +21,10 @@
url=("https://github.com/warnermedia/artemis/backend/lambdas/maintenance/license_retriever"),
packages=find_packages(),
setup_requires=["pytest-runner"],
install_requires=["requests"],
install_requires=[
"requests",
"urllib3<2", # https://github.com/boto/botocore/issues/2926
],
tests_require=["pytest"],
entry_points={"console_scripts": ["artemis_license_retriever=license_retriever.handlers:handler"]},
classifiers=[
5 changes: 4 additions & 1 deletion backend/lambdas/scans/callback/setup.py
@@ -21,7 +21,10 @@
url=("https://github.com/warnermedia/artemis/backend/lambdas/scans/callback"),
packages=find_packages(),
setup_requires=["pytest-runner"],
install_requires=["requests"],
install_requires=[
"requests",
"urllib3<2", # https://github.com/boto/botocore/issues/2926
],
tests_require=["pytest"],
classifiers=[
"Programming Language :: Python :: 3.9",
5 changes: 4 additions & 1 deletion backend/lambdas/scheduled/scheduled_scan_handler/setup.py
@@ -21,7 +21,10 @@
url=("https://github.com/warnermedia/artemis/backend/lambdas/scheduled/scheduled_scan_handler"),
packages=find_packages(),
setup_requires=["pytest-runner"],
install_requires=["requests"],
install_requires=[
"requests",
"urllib3<2", # https://github.com/boto/botocore/issues/2926
],
tests_require=["pytest"],
classifiers=[
"Programming Language :: Python :: 3.9",
2 changes: 1 addition & 1 deletion backend/libs/artemisdb/artemisdb/__version__.py
@@ -1 +1 @@
__version__ = "2023.3"
__version__ = "2023.5"
19 changes: 17 additions & 2 deletions backend/libs/artemisdb/artemisdb/artemisdb/models.py
@@ -25,8 +25,11 @@
from artemisdb.artemisdb.fields.ltree import LtreeField
from artemisdb.artemisdb.util.auth import group_chain_filter
from artemisdb.artemisdb.util.severity import ComparableSeverity
from artemislib.aws import AWSConnect
from artemislib.consts import SCAN_DATA_S3_KEY
from artemislib.datetime import format_timestamp
from artemislib.db_cache import DBLookupCache
from artemislib.env import SCAN_DATA_S3_BUCKET, SCAN_DATA_S3_ENDPOINT
from artemislib.logging import Logger
from artemislib.services import get_services_and_orgs_for_scope

@@ -585,7 +588,7 @@ class Meta:
def __str__(self):
return str(self.scan_id)

def delete(self):
def delete_dependency_set(self):
# A scan can potentially have hundreds of thousands or even millions of associated rows in the
# dependency table. The dependency table contains a tree structure where the rows have foreign
# key relationships to other rows. The result is that when Django performs the cascade deletion
@@ -595,11 +598,23 @@ def delete(self):
# This is a performance enhancement to bypass the cascade deletion of dependencies by instead
# having the database delete the rows associated with the scan being deleted directly instead of
# having Django do it.
LOG.debug("Deleting %s", self.scan_id)
start = datetime.utcnow()
with connection.cursor() as cursor:
cursor.execute(f"DELETE FROM {Dependency._meta.db_table} WHERE scan_id = %s", [self.pk])
LOG.debug("Bulk dependency deletion of scan %s completed in %s", self.scan_id, str(datetime.utcnow() - start))

def delete(self):
LOG.debug("Deleting %s", self.scan_id)

self.delete_dependency_set()

# A scan may have data stored in S3 that needs to be deleted
aws = AWSConnect()
deleted = aws.delete_s3_files(
prefix=(SCAN_DATA_S3_KEY % self.scan_id), s3_bucket=SCAN_DATA_S3_BUCKET, endpoint_url=SCAN_DATA_S3_ENDPOINT
)
LOG.debug("Deleted %s scan items from S3", deleted)

return super(Scan, self).delete()

def to_dict(self, history_format: bool = False):
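
delete_s3_files, like the other AWSConnect helpers referenced here, is not
part of this diff; since it returns a count that gets logged, it plausibly
performs a prefix-scoped batch delete along these lines (a sketch under that
assumption):

    # Hedged sketch of AWSConnect.delete_s3_files -- prefix batch delete with boto3.
    import boto3

    def delete_s3_files(prefix: str, s3_bucket: str, endpoint_url: str = None) -> int:
        s3 = boto3.resource("s3", endpoint_url=endpoint_url)
        bucket = s3.Bucket(s3_bucket)
        # delete() batches up to 1000 keys per request and returns one response per batch
        responses = bucket.objects.filter(Prefix=prefix).delete()
        return sum(len(r.get("Deleted", [])) for r in responses)
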
2 changes: 1 addition & 1 deletion backend/libs/artemislib/artemislib/__version__.py
@@ -1 +1 @@
__version__ = "2022.4"
__version__ = "2023.5"