Database Migration to add security relevance to DB #405

Merged 6 commits on Aug 16, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -68,6 +68,7 @@ prospector/output.pstats
prospector/kaybee-new-statements
prospector/run.sh
prospector/cve_data
prospector/evaluation
.DS_Store
kaybee/internal/reconcile/debug.test
prospector/client/web/node-app/build
1 change: 1 addition & 0 deletions prospector/.env-sample
@@ -16,6 +16,7 @@ POSTGRES_PORT=5432
POSTGRES_HOST=localhost
POSTGRES_DBNAME=postgres
POSTGRES_PASSWORD=example
POSTGRES_DATA=a/real/path/to/a/folder/to/save/postgres/data
REDIS_URL=redis://localhost:6379/0
NVD_API_KEY=APIkey
PYTHONPATH=.
32 changes: 27 additions & 5 deletions prospector/README.md
@@ -41,6 +41,7 @@ To quickly set up Prospector, follow these steps. This will run Prospector in it
```
mv config-sample.yaml config.yaml
```
Note: If you want to use the backend, set the `POSTGRES_DATA` variable in your `.env` to a local directory where the backend data should be saved. This ensures that the database persists even if the Docker containers are stopped.

4. Execute the bash script *run_prospector.sh* specifying the *-h* flag. <br> This will display a list of options that you can use to customize the execution of Prospector.
```
@@ -121,7 +122,11 @@ You can set the `use_llm_<...>` parameters in *config.yaml* for fine-grained con

Following these steps allows you to run Prospector's components individually: [Backend database and worker containers](#starting-the-backend-database-and-the-job-workers), [RESTful Server](#starting-the-restful-server) for API endpoints, [Prospector CLI](#running-the-cli-version) and [Tests](#testing).

Prerequisites:
If you have issues with these steps, please open a GitHub issue and
explain in detail what you did and what unexpected behaviour you observed
(also indicate your operating system and Python version).

**Prerequisites:**

* Python 3.10
* postgreSQL
@@ -146,12 +151,13 @@ set -a; source .env; set +a

You can configure Prospector from the CLI or from the *config.yaml* file. The (recommended) API keys for GitHub and the NVD can be configured in the `.env` file (which must then be sourced with `set -a; source .env; set +a`).

#### Requirements

If at any time you wish to use a different version of the Python interpreter, beware that the `requirements.txt` file contains the exact package versions for Python 3.10.6.

If you need to update the requirements, add the packages to `requirements.in`, then recompile `requirements.txt` with `pip-compile --no-annotate --strip-extras` (you'll need pip-tools installed: `python3 -m pip install pip-tools`). If `requirements.txt` gets generated with pip extras at the top, remove them before you push (otherwise the build will try to fetch them for hours).
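As a sketch of this workflow (the package name below is a placeholder; the compile step needs pip-tools and network access, so it is shown commented out):

```shell
# 1. Declare the new top-level dependency in requirements.in
#    ("some-package" is a placeholder name):
echo "some-package" >> requirements.in

# 2. Recompile the fully pinned requirements.txt:
#    python3 -m pip install pip-tools
#    pip-compile --no-annotate --strip-extras
```

Only `requirements.in` is edited by hand; `requirements.txt` is always regenerated.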

#### Code Formatting

:exclamation: **IMPORTANT**: this project adopts `black` for code formatting. You may want to configure
your editor so that autoformatting is enforced "on save". The pre-commit hook ensures that
@@ -177,7 +183,23 @@ You can then start the necessary containers with the following command:
make docker-setup
```

This also starts a convenient DB administration tool at http://localhost:8080. Also, make sure you have set the `POSTGRES_DATA` environment
variable in `.env`: it should point to a local folder where the database data will be saved, so that the database persists
even if the containers are stopped. If you want to delete the existing database (e.g. because changes to the schema have been made), attach
to the `db` Docker container in interactive mode by running:

```bash
docker exec -it db bash
```

Then navigate to the folder containing the database data: `/var/lib/postgresql/data/` and empty it with:

```bash
rm -rf /var/lib/postgresql/data/*
```

This needs to be done before stopping and restarting the containers: if the `/var/lib/postgresql/data/` folder is not empty, the `db`
container will not execute any initialization scripts and will therefore not create a new database, even if you clean up Docker with the command below.
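The interactive wipe described above can also be condensed into a single command (assuming the container is named `db`, as in the compose setup; this is destructive):

```shell
# Empty the Postgres data directory inside the running "db" container.
# Equivalent to attaching with `docker exec -it db bash` and running
# rm -rf manually. Destructive: all database data is lost.
docker exec db sh -c 'rm -rf /var/lib/postgresql/data/*'
```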

If you wish to cleanup docker to run a fresh version of the backend you can run:

149 changes: 149 additions & 0 deletions prospector/commitdb/postgres.py
@@ -0,0 +1,149 @@
"""
This module implements an abstraction layer on top of
the underlying database where pre-processed commits are stored
"""

import os
from typing import Any, Dict, List

import psycopg2
from psycopg2.extensions import parse_dsn
from psycopg2.extras import DictCursor, DictRow, Json

from commitdb import CommitDB
from log.logger import logger

# DB_CONNECT_STRING = "postgresql://{}:{}@{}:{}/{}".format(
# os.getenv("POSTGRES_USER", "postgres"),
# os.getenv("POSTGRES_PASSWORD", "example"),
# os.getenv("POSTGRES_HOST", "localhost"),
# os.getenv("POSTGRES_PORT", "5432"),
# os.getenv("POSTGRES_DBNAME", "postgres"),
# ).lower()


class PostgresCommitDB(CommitDB):
"""
This class implements the database abstraction layer
for PostgreSQL
"""

def __init__(self, user, password, host, port, dbname):
self.user = user
self.password = password
self.host = host
self.port = port
self.dbname = dbname
self.connection = None

def connect(self):
try:
self.connection = psycopg2.connect(
database=self.dbname,
user=self.user,
password=self.password,
host=self.host,
port=self.port,
)
except Exception:
self.host = "localhost"
self.connection = psycopg2.connect(
database=self.dbname,
user=self.user,
password=self.password,
host=self.host,
port=self.port,
)

def lookup(
self, repository: str, commit_id: str = None
) -> List[Dict[str, Any]]:
if not self.connection:
raise Exception("Invalid connection")

results = list()
try:
cur = self.connection.cursor(cursor_factory=DictCursor)

if commit_id is None:
cur.execute(
"SELECT * FROM commits WHERE repository = %s", (repository,)
)
if cur.rowcount > 0:
results = cur.fetchall()
else:
for id in commit_id.split(","):
cur.execute(
"SELECT * FROM commits WHERE repository = %s AND commit_id = %s",
(repository, id),
)
if cur.rowcount > 0:
results.append(cur.fetchone())
return [dict(row) for row in results] # parse_commit_from_db
except Exception:
logger.error(
"Could not lookup commit vector in database", exc_info=True
)
return []
finally:
cur.close()

def save(self, commit: Dict[str, Any]):
if not self.connection:
raise Exception("Invalid connection")

try:
cur = self.connection.cursor()
statement = build_statement(commit)
args = get_args(commit)
cur.execute(statement, args)
self.connection.commit()
cur.close()
except Exception:
logger.error(
"Could not save commit vector to database", exc_info=True
)
cur.close()

def reset(self):
self.run_sql_script("ddl/10_commit.sql")
self.run_sql_script("ddl/20_users.sql")

def run_sql_script(self, script_file):
if not self.connection:
raise Exception("Invalid connection")

with open(script_file, "r") as file:
ddl = file.read()

cursor = self.connection.cursor()
cursor.execute(ddl)
self.connection.commit()

cursor.close()


def parse_connect_string(connect_string):
try:
return parse_dsn(connect_string)
except Exception:
raise Exception(f"Invalid connect string: {connect_string}")


def build_statement(data: Dict[str, Any]):
data = data.to_dict() # LASCHA: check if this is correct
columns = ",".join(data.keys())
on_conflict = ",".join([f"EXCLUDED.{key}" for key in data.keys()])
return f"INSERT INTO commits ({columns}) VALUES ({','.join(['%s'] * len(data))}) ON CONFLICT ON CONSTRAINT commits_pkey DO UPDATE SET ({columns}) = ({on_conflict})"


def get_args(data: Dict[str, Any]):
return tuple(
[Json(val) if isinstance(val, dict) else val for val in data.values()]
)


def parse_commit_from_db(raw_data: DictRow) -> Dict[str, Any]:
out = dict(raw_data)
out["hunks"] = [(int(x[1]), int(x[3])) for x in raw_data["hunks"]]
return out
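The upsert generated by `build_statement` can be illustrated with a self-contained sketch (no database or psycopg2 needed). Unlike `build_statement`, which calls `.to_dict()` on a commit object, this hypothetical `build_upsert` takes a plain dict, and the column names below are made-up examples:

```python
def build_upsert(data: dict) -> str:
    """Build an INSERT ... ON CONFLICT (upsert) statement for the
    commits table, mirroring the f-string logic in build_statement."""
    columns = ",".join(data.keys())
    placeholders = ",".join(["%s"] * len(data))
    on_conflict = ",".join(f"EXCLUDED.{key}" for key in data.keys())
    return (
        f"INSERT INTO commits ({columns}) VALUES ({placeholders}) "
        f"ON CONFLICT ON CONSTRAINT commits_pkey DO UPDATE "
        f"SET ({columns}) = ({on_conflict})"
    )


commit = {"commit_id": "abc123", "repository": "https://example.org/repo"}
print(build_upsert(commit))
# INSERT INTO commits (commit_id,repository) VALUES (%s,%s) ON CONFLICT ...
```

Only column names are interpolated into the SQL string; the values themselves are passed separately (as in `get_args`), so psycopg2 handles quoting and `Json` wrapping.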
62 changes: 37 additions & 25 deletions prospector/core/prospector.py
@@ -19,7 +19,7 @@
from git.version_to_tag import get_possible_tags
from llm.llm_service import LLMService
from log.logger import get_level, logger, pretty_log
from rules.rules import RULES_PHASE_1, apply_rules
from rules.rules import NUM_COMMITS_PHASE_2, RULES_PHASE_1, apply_rules
from stats.execution import (
Counter,
ExecutionTimer,
@@ -45,6 +45,7 @@


core_statistics = execution_statistics.sub_collection("core")
llm_statistics = execution_statistics.sub_collection("LLM")


# @profile
@@ -91,25 +92,26 @@ def prospector( # noqa: C901
return None, -1

if use_llm_repository_url:
with ConsoleWriter("LLM Usage (Repo URL)") as console:
try:
repository_url = LLMService().get_repository_url(
advisory_record.description, advisory_record.references
)
console.print(
f"\n Repository URL: {repository_url}",
status=MessageStatus.OK,
)
except Exception as e:
logger.error(
e,
exc_info=get_level() < logging.INFO,
)
console.print(
e,
status=MessageStatus.ERROR,
)
sys.exit(1)
with ExecutionTimer(llm_statistics.sub_collection("repository_url")):
with ConsoleWriter("LLM Usage (Repo URL)") as console:
try:
repository_url = LLMService().get_repository_url(
advisory_record.description, advisory_record.references
)
console.print(
f"\n Repository URL: {repository_url}",
status=MessageStatus.OK,
)
except Exception as e:
logger.error(
e,
exc_info=get_level() < logging.INFO,
)
console.print(
e,
status=MessageStatus.ERROR,
)
sys.exit(1)

fixing_commit = advisory_record.get_fixing_commit()
# print(advisory_record.references)
@@ -240,14 +242,18 @@
and use_backend != USE_BACKEND_NEVER
and len(missing) > 0
):
save_preprocessed_commits(backend_address, payload)
save_or_update_processed_commits(backend_address, payload)
else:
logger.warning("Preprocessed commits are not being sent to backend")

ranked_candidates = evaluate_commits(
preprocessed_commits, advisory_record, enabled_rules
preprocessed_commits, advisory_record, backend_address, enabled_rules
)

# Save outcome of security relevance to DB (Phase 2 Rule)
payload = [c.to_dict() for c in ranked_candidates[:NUM_COMMITS_PHASE_2]]
save_or_update_processed_commits(backend_address, payload)

# ConsoleWriter.print("Commit ranking and aggregation...")
ranked_candidates = remove_twins(ranked_candidates)
# ranked_candidates = tag_and_aggregate_commits(ranked_candidates, next_tag)
@@ -288,7 +294,10 @@ def filter(commits: Dict[str, RawCommit]) -> Dict[str, RawCommit]:


def evaluate_commits(
commits: List[Commit], advisory: AdvisoryRecord, enabled_rules: List[str]
commits: List[Commit],
advisory: AdvisoryRecord,
backend_address: str,
enabled_rules: List[str],
) -> List[Commit]:
"""This method applies the rule phases. Each phase is associated with a set of rules:
- Phase 1: Original rules
@@ -308,7 +317,7 @@ def evaluate_commits(
with ExecutionTimer(core_statistics.sub_collection("candidates analysis")):
with ConsoleWriter("Candidate analysis") as _:
ranked_commits = apply_rules(
commits, advisory, enabled_rules=enabled_rules
commits, advisory, backend_address, enabled_rules=enabled_rules
)

return ranked_commits
@@ -393,13 +402,16 @@ def retrieve_preprocessed_commits(
return (missing, commits)


def save_preprocessed_commits(backend_address, payload):
def save_or_update_processed_commits(backend_address, payload):
with ExecutionTimer(
core_statistics.sub_collection(name="save commits to backend")
):
with ConsoleWriter("Saving processed commits to backend") as writer:
logger.debug("Sending processing commits to backend...")
try:
# logger.debug(
# f"the address: {backend_address}, the payload: {payload}"
# ) # Sanity Check
r = requests.post(
backend_address + "/commits/",
json=payload,
6 changes: 2 additions & 4 deletions prospector/core/report.py
@@ -36,10 +36,8 @@ def json_(
data = {
"parameters": params,
"advisory_record": advisory_record.__dict__,
"commits": [
r.as_dict(no_hash=True, no_rules=False, no_diff=no_diff)
for r in results
],
"commits": [r.as_dict(no_hash=True, no_rules=False) for r in results],
"processing_statistics": execution_statistics,
}
logger.info(f"Writing results to {fn}")
file = Path(fn)