Increase query performance and change env var names
ahosgood committed Nov 27, 2024
1 parent aa6d5e9 commit 2b67659
Showing 4 changed files with 87 additions and 54 deletions.
48 changes: 26 additions & 22 deletions README.md
@@ -41,27 +41,31 @@ docker compose exec dev format

 In addition to the [base Docker image variables](https://github.com/nationalarchives/docker/blob/main/docker/tna-python/README.md#environment-variables), this application has support for:

-| Variable | Purpose | Default |
-| -------------------------------- | --------------------------------------------------------------------------- | --------------------------------------------------------- |
-| `CONFIG` | The configuration to use | `config.Production` |
-| `DEBUG` | If true, allow debugging[^1] | `False` |
-| `COOKIE_DOMAIN` | The domain to save cookie preferences against | _none_ |
-| `CSP_IMG_SRC` | A comma separated list of CSP rules for `img-src` | `'self'` |
-| `CSP_SCRIPT_SRC` | A comma separated list of CSP rules for `script-src` | `'self'` |
-| `CSP_SCRIPT_SRC_ELEM` | A comma separated list of CSP rules for `script-src-elem` | `'self'` |
-| `CSP_STYLE_SRC` | A comma separated list of CSP rules for `style-src` | `'self'` |
-| `CSP_STYLE_SRC_ELEM` | A comma separated list of CSP rules for `style-src-elem` | `'self'` |
-| `CSP_FONT_SRC` | A comma separated list of CSP rules for `font-src` | `'self'` |
-| `CSP_CONNECT_SRC` | A comma separated list of CSP rules for `connect-src` | `'self'` |
-| `CSP_MEDIA_SRC` | A comma separated list of CSP rules for `media-src` | `'self'` |
-| `CSP_WORKER_SRC` | A comma separated list of CSP rules for `worker-src` | `'self'` |
-| `CSP_FRAME_SRC` | A comma separated list of CSP rules for `frame-src` | `'self'` |
-| `CSP_FEATURE_FULLSCREEN` | A comma separated list of rules for the `fullscreen` feature policy | `'self'` |
-| `CSP_FEATURE_PICTURE_IN_PICTURE` | A comma separated list of rules for the `picture-in-picture` feature policy | `'self'` |
-| `FORCE_HTTPS` | Redirect requests to HTTPS as part of the CSP | _none_ |
-| `CACHE_TYPE` | https://flask-caching.readthedocs.io/en/latest/#configuring-flask-caching | _none_ |
-| `CACHE_DEFAULT_TIMEOUT` | The number of seconds to cache pages for | production: `300`, staging: `60`, develop: `0`, test: `0` |
-| `CACHE_DIR` | Directory for storing cached responses when using `FileSystemCache` | `/tmp` |
-| `GA4_ID` | The Google Analytics 4 ID | _none_ |
+| Variable | Purpose | Default |
+| -------------------------------- | ----------------------------------------------------------------------------- | --------------------------------------------------------- |
+| `CONFIG` | The configuration to use | `config.Production` |
+| `DEBUG` | If true, allow debugging[^1] | `False` |
+| `COOKIE_DOMAIN` | The domain to save cookie preferences against | _none_ |
+| `CSP_IMG_SRC` | A comma separated list of CSP rules for `img-src` | `'self'` |
+| `CSP_SCRIPT_SRC` | A comma separated list of CSP rules for `script-src` | `'self'` |
+| `CSP_SCRIPT_SRC_ELEM` | A comma separated list of CSP rules for `script-src-elem` | `'self'` |
+| `CSP_STYLE_SRC` | A comma separated list of CSP rules for `style-src` | `'self'` |
+| `CSP_STYLE_SRC_ELEM` | A comma separated list of CSP rules for `style-src-elem` | `'self'` |
+| `CSP_FONT_SRC` | A comma separated list of CSP rules for `font-src` | `'self'` |
+| `CSP_CONNECT_SRC` | A comma separated list of CSP rules for `connect-src` | `'self'` |
+| `CSP_MEDIA_SRC` | A comma separated list of CSP rules for `media-src` | `'self'` |
+| `CSP_WORKER_SRC` | A comma separated list of CSP rules for `worker-src` | `'self'` |
+| `CSP_FRAME_SRC` | A comma separated list of CSP rules for `frame-src` | `'self'` |
+| `CSP_FEATURE_FULLSCREEN` | A comma separated list of rules for the `fullscreen` feature policy | `'self'` |
+| `CSP_FEATURE_PICTURE_IN_PICTURE` | A comma separated list of rules for the `picture-in-picture` feature policy | `'self'` |
+| `FORCE_HTTPS` | Redirect requests to HTTPS as part of the CSP | _none_ |
+| `CACHE_TYPE` | https://flask-caching.readthedocs.io/en/latest/#configuring-flask-caching | _none_ |
+| `CACHE_DEFAULT_TIMEOUT` | The number of seconds to cache pages for | production: `300`, staging: `60`, develop: `0`, test: `0` |
+| `CACHE_DIR` | Directory for storing cached responses when using `FileSystemCache` | `/tmp` |
+| `GA4_ID` | The Google Analytics 4 ID | _none_ |
+| `WEBARCHIVE_REWRITE_DOMAINS` | A CSV list of domains to consider archived | _none_ |
+| `RELEVANCE_TITLE_MATCH_WEIGHT` | The multiplier to use for every query match in the title | `5` |
+| `RELEVANCE_BODY_MATCH_WEIGHT` | The multiplier to use for every query match in the body | `1` |
+| `RELEVANCE_ARCHIVED_WEIGHT` | The multiplier to use for a result with a URL in `WEBARCHIVE_REWRITE_DOMAINS` | `0.5` |

 [^1] [Debugging in Flask](https://flask.palletsprojects.com/en/2.3.x/debugging/)
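
The three `RELEVANCE_*` weights documented in the table combine into a single score per page. The following is a minimal Python sketch of that arithmetic, written to mirror the SQL expression added to `app/sitemap_search/routes.py` in this commit. The function name `relevance`, its arguments, and the `example.org` URLs are illustrative only and do not exist in the application. Note that the committed SQL multiplies each weight by the number of matched characters (text length minus length with the query removed) rather than by a match count, so the sketch does the same.

```python
def relevance(title, body, url, query, archived_domains,
              title_weight=5.0, body_weight=1.0, archived_weight=0.5):
    """Illustrative re-implementation of the SQL relevance expression."""
    needle = query.lower()

    def matched_chars(text):
        # Characters removed when every occurrence of the query is deleted.
        return len(text) - len(text.lower().replace(needle, ""))

    score = matched_chars(title) * title_weight + matched_chars(body) * body_weight
    # Down-weight results whose URL contains one of the archived domains.
    if any(domain in url for domain in archived_domains):
        score *= archived_weight
    return score


live = relevance("Census records", "census guide, census data",
                 "https://example.org/census", "census", ["blog.example.org"])
archived = relevance("Census records", "census guide, census data",
                     "https://blog.example.org/census", "census", ["blog.example.org"])
print(live, archived)  # 42.0 21.0
```

With the default weights, one title match of a six-character query contributes 30 and two body matches contribute 12; the archived copy of the same page scores half of the total.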
67 changes: 40 additions & 27 deletions app/sitemap_search/routes.py
@@ -14,16 +14,14 @@
 @bp.route("/")
 @cache.cached(key_prefix=cache_key_prefix)
 def index():
-    query = unquote(request.args.get("q", "")).strip(" ").lower()
+    query = unquote(request.args.get("q", "")).strip(" ")
     page = (
         int(request.args.get("page"))
         if request.args.get("page") and request.args.get("page").isnumeric()
         else 1
     )
     results_per_page = 12
-    webarchive_domains = current_app.config.get(
-        "FEATURE_WEBARCHIVE_REWRITE_DOMAINS"
-    )
+    webarchive_domains = current_app.config.get("WEBARCHIVE_REWRITE_DOMAINS")
     conn = psycopg2.connect(
         host=os.environ.get("DB_HOST"),
         database=os.environ.get("DB_NAME"),
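
A side note on the unchanged `page` handling in this hunk: `str.isnumeric()` also returns true for characters such as `²`, which `int()` rejects with a `ValueError`, and `page=0` passes the check, which leads to a negative `OFFSET` that PostgreSQL refuses. A hedged sketch of a stricter parser follows; `parse_page` is a hypothetical helper and is not part of this commit.

```python
def parse_page(raw, default=1):
    """Return a 1-based page number, falling back to `default` on bad input."""
    try:
        page = int(raw)
    except (TypeError, ValueError):
        # None, "", "abc", "²" and similar values all land here.
        return default
    # Reject zero and negative pages so the computed offset is never negative.
    return page if page >= 1 else default


print(parse_page("3"), parse_page(None), parse_page("0"), parse_page("²"))
```

In the route this would be called as `parse_page(request.args.get("page"))`.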
@@ -32,54 +30,69 @@ def index():
     )
     cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
     if query:
-        title_score = 5
-        description_score = 3
-        url_score = 2
-        body_instance_score = 1
+        title_match_weight = current_app.config.get(
+            "RELEVANCE_TITLE_MATCH_WEIGHT"
+        )
+        body_match_weight = current_app.config.get(
+            "RELEVANCE_BODY_MATCH_WEIGHT"
+        )
+        archived_weight = current_app.config.get("RELEVANCE_ARCHIVED_WEIGHT")
         cur.execute(
             """WITH scored_results AS (
                 SELECT
                     id,
                     title,
                     url,
                     description,
-                    (%(title_score)s * ((CHAR_LENGTH(title) - CHAR_LENGTH(REPLACE(LOWER(title), %(query)s, ''))) / CHAR_LENGTH(%(query)s))) +
-                    /*(%(description_score)s * ((CHAR_LENGTH(description) - CHAR_LENGTH(REPLACE(LOWER(description), %(query)s, ''))) / CHAR_LENGTH(%(query)s))) +*/
-                    /*(%(url_score)s * ((CHAR_LENGTH(url) - CHAR_LENGTH(REPLACE(LOWER(url), %(query)s, ''))) / CHAR_LENGTH(%(query)s))) +*/
-                    (%(body_instance_score)s * ((CHAR_LENGTH(body) - CHAR_LENGTH(REPLACE(LOWER(body), %(query)s, ''))) / CHAR_LENGTH(%(query)s))) AS relevance
+                    (
+                        (
+                            (
+                                CHAR_LENGTH(title) -
+                                CHAR_LENGTH(REPLACE(LOWER(title), %(query)s, ''))
+                            ) * %(title_match_weight)s
+                        ) +
+                        (
+                            (
+                                CHAR_LENGTH(body) -
+                                CHAR_LENGTH(REPLACE(LOWER(body), %(query)s, ''))
+                            ) * %(body_match_weight)s
+                        )
+                    ) *
+                    (
+                        CASE
+                            WHEN url LIKE %(webarchive_domains)s THEN %(archived_weight)s
+                            ELSE 1
+                        END
+                    ) AS relevance
                 FROM sitemap_urls
                 WHERE title IS NOT NULL
-            ), filtered_scored_results AS (
-                SELECT
-                    id,
-                    title,
-                    url,
-                    description,
-                    relevance
-                FROM scored_results
             )
             SELECT
                 id,
                 title,
                 url,
                 description,
                 relevance,
-                (SELECT COUNT(*) FROM filtered_scored_results WHERE relevance > 0) AS total_results
-            FROM filtered_scored_results
+                (SELECT COUNT(*) FROM scored_results WHERE relevance > 0) AS total_results
+            FROM scored_results
             WHERE relevance > 0
             ORDER by relevance DESC
             LIMIT %(limit)s
             OFFSET %(offset)s;""",
             {
-                "query": query,
-                "title_score": title_score,
-                "url_score": url_score,
-                "description_score": description_score,
-                "body_instance_score": body_instance_score,
+                "query": query.lower(),
+                "query_length": len(query),
+                "title_match_weight": title_match_weight,
+                "body_match_weight": body_match_weight,
+                "archived_weight": archived_weight,
                 "limit": results_per_page,
                 "offset": (page - 1) * results_per_page,
+                "webarchive_domains": "|".join(
+                    [f"%{domain}%" for domain in webarchive_domains]
+                ),
             },
         )
+        # return cur.query
         results = cur.fetchall()
         total_results = results[0]["total_results"] if len(results) else 0
         pages = math.ceil(total_results / results_per_page)
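
One thing to watch in the query above: `LIKE` has no alternation operator, so joining the patterns with `|` produces a single pattern in which `|` is a literal character. That works for the single domain configured in `docker-compose.yml`, but with two or more domains a URL would only match if it contained both domains separated by a literal `|`. A hedged alternative, assuming PostgreSQL's `LIKE ANY (array)` form and psycopg2's adaptation of Python lists to SQL arrays, is to pass the patterns as a list. The names `like_patterns`, `ARCHIVED_CASE_SQL`, and `webarchive_patterns`, and the `example.org` domains, are illustrative and not part of this commit.

```python
def like_patterns(domains):
    """Build one "contains" LIKE pattern per non-empty domain."""
    return [f"%{domain}%" for domain in domains if domain]


# SQL fragment for the CASE expression. The ::text[] cast is there so that an
# empty list is still a typed array, in which case no URL counts as archived.
ARCHIVED_CASE_SQL = """
CASE
    WHEN url LIKE ANY (%(webarchive_patterns)s::text[]) THEN %(archived_weight)s
    ELSE 1
END
"""

params = {
    "webarchive_patterns": like_patterns(["blog.example.org", "old.example.org"]),
    "archived_weight": 0.5,
}
print(params["webarchive_patterns"])
```

This sketch does not escape `%` or `_` inside domain names; a domain containing an underscore would match slightly more broadly than intended.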
24 changes: 20 additions & 4 deletions config.py
@@ -5,9 +5,7 @@


 class Features(object):
-    FEATURE_WEBARCHIVE_REWRITE_DOMAINS: list[str] = os.environ.get(
-        "FEATURE_WEBARCHIVE_REWRITE_DOMAINS", ""
-    ).split(",")
+    pass


 class Base(object):
@@ -80,7 +78,25 @@ class Base(object):
     CACHE_IGNORE_ERRORS: bool = True
     CACHE_DIR: str = os.environ.get("CACHE_DIR", "/tmp")

-    GA4_ID = os.environ.get("GA4_ID", "")
+    GA4_ID: str = os.environ.get("GA4_ID", "")
+
+    WEBARCHIVE_REWRITE_DOMAINS: list[str] = [
+        domain
+        for domain in os.environ.get("WEBARCHIVE_REWRITE_DOMAINS", "").split(
+            ","
+        )
+        if domain
+    ]
+
+    RELEVANCE_TITLE_MATCH_WEIGHT: float = float(
+        os.environ.get("RELEVANCE_TITLE_MATCH_WEIGHT", "5")
+    )
+    RELEVANCE_BODY_MATCH_WEIGHT: float = float(
+        os.environ.get("RELEVANCE_BODY_MATCH_WEIGHT", "1")
+    )
+    RELEVANCE_ARCHIVED_WEIGHT: float = float(
+        os.environ.get("RELEVANCE_ARCHIVED_WEIGHT", "0.5")
+    )


 class Production(Base, Features):
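
The `if domain` filter in the new `WEBARCHIVE_REWRITE_DOMAINS` setting matters because `"".split(",")` returns `[""]` rather than an empty list, so the previous version produced one empty domain whenever the variable was unset. A small sketch of that behaviour follows; `csv_env` is a hypothetical helper, the `example.org` domains are placeholders, and the `.strip()` call is an addition that the commit does not make.

```python
import os


def csv_env(name, default=""):
    """Read a comma-separated environment variable as a list of non-empty strings."""
    raw = os.environ.get(name, default)
    return [item.strip() for item in raw.split(",") if item.strip()]


# An empty string still splits into one (empty) element.
print("".split(","))

os.environ["WEBARCHIVE_REWRITE_DOMAINS"] = "blog.example.org, old.example.org,"
print(csv_env("WEBARCHIVE_REWRITE_DOMAINS"))
```

Stripping whitespace is a judgment call: it tolerates values written as `a, b` at the cost of differing slightly from the committed behaviour.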
2 changes: 1 addition & 1 deletion docker-compose.yml
@@ -19,7 +19,7 @@ services:
       - DB_PASSWORD=postgres
       - SITEMAPS=https://www.nationalarchives.gov.uk/sitemap.xml,https://develop.tna.dblclk.dev/sitemap.xml,https://blog.nationalarchives.gov.uk/sitemap_index.xml,https://nationalarchives.github.io/design-system/sitemap.xml
       - POPULATE_ON_STARTUP=False
-      - FEATURE_WEBARCHIVE_REWRITE_DOMAINS=blog.nationalarchives.gov.uk
+      - WEBARCHIVE_REWRITE_DOMAINS=blog.nationalarchives.gov.uk
     ports:
       - 65525:8080
     depends_on: