Elasticsearch storage backend #25

Draft
wants to merge 430 commits into
base: main
8895fb6
Explicitly decompress Gzip, fix encoding
janheinrichmerker Nov 3, 2023
fb1bb41
Fix test
janheinrichmerker Nov 3, 2023
8fd5030
Fix workflow permissions
janheinrichmerker Nov 5, 2023
cb42aff
Move web archive APIs and WARC S3 store to separate libs, move compat…
janheinrichmerker Nov 5, 2023
05095fc
Add S3 config
janheinrichmerker Nov 13, 2023
a0045b6
Remove outdated parser
janheinrichmerker Nov 13, 2023
8e8a489
Disable captures import
janheinrichmerker Nov 13, 2023
e38256b
Add monitoring progress
janheinrichmerker Nov 13, 2023
2eaa96e
Disable tests
janheinrichmerker Nov 13, 2023
30631d1
Simplify Helm commands
janheinrichmerker Nov 13, 2023
c072b59
Update Helm chart
janheinrichmerker Nov 13, 2023
54d2886
Update Helm chart
janheinrichmerker Nov 13, 2023
6770ea4
Fix namespaces
janheinrichmerker Nov 13, 2023
a920c25
Disable restriction
janheinrichmerker Nov 13, 2023
d33c2de
Fix code format
janheinrichmerker Nov 13, 2023
794b732
Add URL query parsers command
janheinrichmerker Nov 13, 2023
85f6e4e
Update ORM
janheinrichmerker Nov 13, 2023
56f091d
Add URL page parsers
janheinrichmerker Nov 13, 2023
4a8bfd6
Add URL offset parsers
janheinrichmerker Nov 13, 2023
e107972
Fix Helm config
janheinrichmerker Nov 13, 2023
28fe96e
Fix template
janheinrichmerker Nov 13, 2023
1369c25
Add stats
janheinrichmerker Nov 13, 2023
bd55cf0
Remove invalid parser
janheinrichmerker Nov 13, 2023
648b3aa
Fix monitoring
janheinrichmerker Nov 13, 2023
3501078
Update versions
janheinrichmerker Nov 13, 2023
0987fd7
Fix UI
janheinrichmerker Nov 14, 2023
60f5d99
Improve UI
janheinrichmerker Nov 14, 2023
60b3f20
Make AQL-22 import a cronjob
janheinrichmerker Nov 14, 2023
ee7394c
Update dependency
janheinrichmerker Nov 14, 2023
3d336d9
Remove invalid parsers
janheinrichmerker Nov 14, 2023
5014e4e
Add parsing utils
janheinrichmerker Nov 14, 2023
020c491
Add SERP URL query parsing CLI
janheinrichmerker Nov 14, 2023
0275cec
Rename jobs
janheinrichmerker Nov 14, 2023
901d7c6
Add SERP URL query parsing to Helm chart
janheinrichmerker Nov 14, 2023
5005023
Add last modified timestamp
janheinrichmerker Nov 14, 2023
2adcb7b
Update dependency
janheinrichmerker Nov 14, 2023
ed71346
Add namespace
janheinrichmerker Nov 14, 2023
f3faaa1
Add length to WARC location ORM
janheinrichmerker Nov 14, 2023
351a6a2
Fix start time
janheinrichmerker Nov 14, 2023
1a33fde
Add SERP WARC download CLI
janheinrichmerker Nov 14, 2023
25fede6
Update dependency
janheinrichmerker Nov 15, 2023
47b59c9
Disable WARC S3 logging
janheinrichmerker Nov 15, 2023
387c81c
Fix WARC downloading
janheinrichmerker Nov 15, 2023
4f49572
Fix mapping
janheinrichmerker Nov 15, 2023
d2ecb8d
Simplify SERP URL query parsing
janheinrichmerker Nov 15, 2023
ed5b58c
Add stats
janheinrichmerker Nov 15, 2023
e774e6c
Add parsers modification date
janheinrichmerker Nov 15, 2023
0e1bba5
Update dependency
janheinrichmerker Nov 15, 2023
9f64fe9
Remove invalid parser
janheinrichmerker Nov 15, 2023
861a106
Limit number of records per S3 WARC file
janheinrichmerker Nov 15, 2023
b868269
Update chart
janheinrichmerker Nov 15, 2023
de34b8e
Rename variable
janheinrichmerker Nov 15, 2023
0f5eca3
Rename annotation
janheinrichmerker Nov 15, 2023
8565d86
Downscale cron job completions, enable cron job
janheinrichmerker Nov 15, 2023
8619320
Add WARC download cron job
janheinrichmerker Nov 15, 2023
3d5e8e1
Update chart
janheinrichmerker Nov 15, 2023
56c82e5
Refactor parsers
janheinrichmerker Nov 15, 2023
1451d3d
Add URL page and offset parsers
janheinrichmerker Nov 15, 2023
6b6c070
Fix mappings
janheinrichmerker Nov 15, 2023
125eae0
Fix parser sorting
janheinrichmerker Nov 15, 2023
d466e5b
Add parsing for URL page and offset
janheinrichmerker Nov 15, 2023
c612541
Simplify index initialization
janheinrichmerker Nov 15, 2023
11e0b56
Improve documentation
janheinrichmerker Nov 15, 2023
4d367ce
Fix code format
janheinrichmerker Nov 15, 2023
1350b6f
Merge remote-tracking branch 'origin/main' into elastic
janheinrichmerker Nov 15, 2023
f259655
Update Helm chart
janheinrichmerker Nov 15, 2023
12c9bc4
Add page and offset parsing to Helm chart
janheinrichmerker Nov 15, 2023
866dfe5
Fix parsing
janheinrichmerker Nov 16, 2023
b74b94f
Fix monitoring cache
janheinrichmerker Nov 16, 2023
6100cd7
Add parsing timestamp even if no parser worked
janheinrichmerker Nov 16, 2023
5e14c90
Update Helm chart
janheinrichmerker Nov 16, 2023
2b86769
Update dependency
janheinrichmerker Nov 16, 2023
934c3f6
Fix parsers
janheinrichmerker Nov 16, 2023
c54214c
Update cron job schedules
janheinrichmerker Nov 16, 2023
c007c29
Improve scan efficiency
janheinrichmerker Nov 16, 2023
7e9648b
Decouple CLI and implementations
janheinrichmerker Nov 16, 2023
cda692d
Fix code format and LINT errors
janheinrichmerker Nov 16, 2023
9925ff2
Fix ES helper
janheinrichmerker Nov 18, 2023
f4b41b4
Improve ES bulk helpers
janheinrichmerker Nov 18, 2023
3f9b80c
Add ES helpers
janheinrichmerker Nov 19, 2023
0b4040c
Improve monitoring
janheinrichmerker Nov 19, 2023
c898a18
Remove unnecessary forced index refreshes
janheinrichmerker Nov 19, 2023
ea8f7c7
Use ES bulk API where possible
janheinrichmerker Nov 19, 2023
9c681a9
Fix code
janheinrichmerker Nov 19, 2023
3ae3a74
Fix parsers
janheinrichmerker Nov 19, 2023
668fb0e
Fix update action
janheinrichmerker Nov 19, 2023
e311824
Fix code
janheinrichmerker Nov 19, 2023
ee61f88
Update Helm chart
janheinrichmerker Nov 19, 2023
899359e
Fix monitoring
janheinrichmerker Nov 19, 2023
20162a3
Add WARC parser ORM
janheinrichmerker Nov 19, 2023
f2c4a2b
Add WARC parser namespaces
janheinrichmerker Nov 19, 2023
f8d7f7d
Add WARC parser XML utils
janheinrichmerker Nov 19, 2023
b3188ad
Add WARC parsers
janheinrichmerker Nov 19, 2023
6c5c79b
Add WARC parser import
janheinrichmerker Nov 19, 2023
e5d048f
Add WARC parser index init
janheinrichmerker Nov 19, 2023
8743083
Add SERP WARC parsing CLI
janheinrichmerker Nov 19, 2023
db89ac6
Add dependencies
janheinrichmerker Nov 19, 2023
0063d2d
Add dependencies
janheinrichmerker Nov 19, 2023
cd05ca0
Improve parsers
janheinrichmerker Nov 20, 2023
b92bd74
Fix WARC query parsing
janheinrichmerker Nov 20, 2023
349227c
Fix WARC query parsing
janheinrichmerker Nov 20, 2023
07c5c12
Simplify doc counts
janheinrichmerker Nov 20, 2023
9cb9559
Improve monitoring
janheinrichmerker Nov 20, 2023
e064125
Fix code
janheinrichmerker Nov 20, 2023
e3202ed
Merge remote-tracking branch 'origin/main' into elastic
janheinrichmerker Nov 20, 2023
62b3093
Add WARC query parsing to Helm chart
janheinrichmerker Nov 20, 2023
e1a7464
Increase retries
janheinrichmerker Nov 20, 2023
e97f29a
Count rate limits per host
janheinrichmerker Nov 20, 2023
3eb13ba
Only download WARCs from captures with status 200
janheinrichmerker Nov 20, 2023
d3b2288
Fix building sources
janheinrichmerker Nov 20, 2023
1004485
Fix monitoring
janheinrichmerker Nov 20, 2023
9dd5f41
Add missing filters in monitoring
janheinrichmerker Nov 20, 2023
e1a2e62
Revert "Update urllib3 requirement from ~=1.26 to ~=2.1"
janheinrichmerker Nov 20, 2023
717d379
Update chart
janheinrichmerker Nov 20, 2023
1018489
Fix mapping
janheinrichmerker Nov 20, 2023
a3fd4a7
Fix parser IDs
janheinrichmerker Nov 20, 2023
536b7fd
Decouple warc reading
janheinrichmerker Nov 20, 2023
939deb9
Add text cleaning utils
janheinrichmerker Nov 20, 2023
3cc7498
Simplify WARC query parsing
janheinrichmerker Nov 20, 2023
0443d2c
Add type safe XPath wrapper
janheinrichmerker Nov 20, 2023
871853f
Add WARC snippet parsing
janheinrichmerker Nov 20, 2023
cdddc8a
Improve XPath utils
janheinrichmerker Nov 20, 2023
5eec733
Add WARC snippets parser import
janheinrichmerker Nov 20, 2023
9f5df1a
Remove unsupported parsers
janheinrichmerker Nov 20, 2023
17522f7
Improve CSS selector conversion
janheinrichmerker Nov 20, 2023
d271f60
Add WARC snippets parser CLI
janheinrichmerker Nov 20, 2023
ff091f1
Fix code
janheinrichmerker Nov 20, 2023
fbbbab0
Add WARC snippets parsing CLI
janheinrichmerker Nov 20, 2023
0579449
Add WARC snippets parsing to chart
janheinrichmerker Nov 20, 2023
93817d6
Do not fail on CDX timeouts
janheinrichmerker Nov 21, 2023
c3b5d75
Make absolute XPaths explicit
janheinrichmerker Nov 21, 2023
8141119
Decrease retries
janheinrichmerker Nov 21, 2023
668b157
Fix CLI
janheinrichmerker Nov 21, 2023
d8a95fc
Fix WARC snippets parsing
janheinrichmerker Nov 21, 2023
0696d11
Fix code
janheinrichmerker Nov 21, 2023
d67b9b9
Fix monitoring
janheinrichmerker Nov 21, 2023
83677bd
Update chart
janheinrichmerker Nov 21, 2023
c26951a
Increase scroll duration
janheinrichmerker Nov 21, 2023
140dd17
Fix SERP snippet parsing
janheinrichmerker Nov 21, 2023
3be1dff
Add result WARC location ORM
janheinrichmerker Nov 21, 2023
10b532e
Update dependency
janheinrichmerker Nov 22, 2023
192e84b
Change mapping
janheinrichmerker Nov 22, 2023
ad5afee
Rename variable
janheinrichmerker Nov 22, 2023
283a3c9
Update provider ORM
janheinrichmerker Nov 22, 2023
78cad54
Update provider import and CLI
janheinrichmerker Nov 22, 2023
5a34a3a
Update parser import and CLI
janheinrichmerker Nov 22, 2023
31dc9cf
Add archive priority
janheinrichmerker Nov 23, 2023
5da398f
Propagate archive and provider priorities
janheinrichmerker Nov 23, 2023
ab6a0fd
Prioritize captures fetching
janheinrichmerker Nov 23, 2023
c22adb5
Fix updating documents with score
janheinrichmerker Nov 23, 2023
e7fe5a3
Simplify out-of-date handling
janheinrichmerker Nov 23, 2023
89f901d
Add entries to monitoring, fix progress
janheinrichmerker Nov 23, 2023
f40d0e7
Simplify out-of-date handling
janheinrichmerker Nov 23, 2023
793581e
Prioritize WARC download
janheinrichmerker Nov 23, 2023
d8670c8
Fix WARC parsing
janheinrichmerker Nov 23, 2023
5afd9aa
Update chart
janheinrichmerker Nov 23, 2023
4e74817
Fix code
janheinrichmerker Nov 23, 2023
f3a2170
Fix initializations
janheinrichmerker Nov 24, 2023
34e0cf0
Add results WARC download
janheinrichmerker Nov 24, 2023
27ffc43
Fix code
janheinrichmerker Nov 24, 2023
7485c73
Update documentation
janheinrichmerker Nov 24, 2023
318b692
Update documentation
janheinrichmerker Nov 24, 2023
219684b
Fix parser sorting
janheinrichmerker Nov 24, 2023
a9dcc2e
Update documentation
janheinrichmerker Nov 24, 2023
e77bfb9
Update Helm chart
janheinrichmerker Nov 24, 2023
f84d417
Make parser provider ID optional
janheinrichmerker Nov 24, 2023
2a6e447
Fix offset parser
janheinrichmerker Nov 27, 2023
3c537a7
Prepare main content extraction
janheinrichmerker Nov 27, 2023
130c6f8
Decrease stats cache duration
janheinrichmerker Nov 27, 2023
22ffb26
Add HTML utils
janheinrichmerker Nov 27, 2023
d43a466
Fix parser
janheinrichmerker Nov 27, 2023
48678da
Keep WARC download running on individual connection errors
janheinrichmerker Nov 27, 2023
0c1e68e
Update Helm chart
janheinrichmerker Nov 27, 2023
cc86410
Fix captures fetching
janheinrichmerker Nov 27, 2023
a896479
Fix captures fetching
janheinrichmerker Nov 27, 2023
a642a2d
Fix snippets parser
janheinrichmerker Nov 27, 2023
bc7bf0b
Increase timeout
janheinrichmerker Nov 27, 2023
eeed979
Fix captures fetching
janheinrichmerker Nov 27, 2023
5bd573a
Increase monitoring cache duration
janheinrichmerker Nov 30, 2023
242ad49
Update Helm chart
janheinrichmerker Nov 30, 2023
8dada02
Rename Helm folder
janheinrichmerker Jan 26, 2024
a60cc13
Improve install instructions
janheinrichmerker Feb 6, 2024
fb3d745
Remove legacy code
janheinrichmerker Feb 6, 2024
3b9e27a
Merge branch 'elastic' of github.com:webis-de/archive-query-log into …
janheinrichmerker Feb 6, 2024
dbc527b
Add Flask dependencies
janheinrichmerker Feb 6, 2024
bc28000
Add main content parser CLI
janheinrichmerker Feb 6, 2024
3fefc0b
Increase shards
janheinrichmerker Feb 6, 2024
130a4d8
Update readme
janheinrichmerker Feb 13, 2024
e156ca4
Add MarkDownLINT config
janheinrichmerker Feb 13, 2024
4174181
Review code security
janheinrichmerker Feb 13, 2024
e029679
Update dependencies
janheinrichmerker Feb 13, 2024
a815982
Fix approvals
janheinrichmerker Feb 13, 2024
b6baac9
Fix approvals
janheinrichmerker Feb 13, 2024
06e8e76
Fix approvals and parsers
janheinrichmerker Feb 13, 2024
a138b1e
Fix approvals and parsers
janheinrichmerker Feb 13, 2024
d8435cc
Improve code checking
janheinrichmerker Feb 13, 2024
6cc0151
Fix CI workflow config
janheinrichmerker Feb 13, 2024
db8802c
Fix type checking
janheinrichmerker Feb 13, 2024
6d28a7a
Simplify Dockerfile
janheinrichmerker Mar 5, 2024
2bfb832
Update Dockerfile
janheinrichmerker Mar 5, 2024
0d61370
Update Dockerfile
janheinrichmerker Mar 5, 2024
83ec503
Merge pull request #40 from webis-de/heinrichreimer-patch-1
janheinrichmerker Mar 5, 2024
d1bbe27
Merge branch 'elastic' into patch-1
janheinrichmerker Mar 5, 2024
85b84bd
Merge pull request #41 from webis-de/patch-1
janheinrichmerker Mar 5, 2024
cf8300e
Cache Docker CI
janheinrichmerker Mar 5, 2024
2cf94c7
build new direct answer parser
JKlueber Mar 23, 2024
615d0f4
added direct answer
JKlueber Mar 23, 2024
d98e59b
Prepare docs
janheinrichmerker Apr 2, 2024
f3f11e8
- removed rank
JKlueber Apr 2, 2024
45ee69d
changed xpaths List[str] to xpath str
JKlueber Apr 4, 2024
dabe86e
changed direct_answer to plural
JKlueber Apr 10, 2024
2da7caa
added direct answers
JKlueber Apr 10, 2024
a7af589
debugged import
JKlueber Apr 10, 2024
9e62837
added priority
JKlueber Apr 11, 2024
14ddc9a
removed "import_warc_direct_answers_parsers" from CLI
JKlueber Apr 15, 2024
218ae44
Merge pull request #45 from webis-de/direct-answers
JKlueber Apr 15, 2024
948a55e
Merge branch 'elastic' of github.com:webis-de/archive-query-log into …
janheinrichmerker Apr 15, 2024
5eb084c
Fix CLI
janheinrichmerker Apr 15, 2024
6d05c24
build fast api for aql dashboard
JKlueber Jun 11, 2024
918b251
added comment how to start the api
JKlueber Aug 19, 2024
6eeeae9
Added Vue Dashboard
JKlueber Aug 21, 2024
8368a6e
Update name and email
janheinrichmerker Oct 14, 2024
ca6d807
Merge branch 'elastic' of github.com:webis-de/archive-query-log into …
janheinrichmerker Oct 14, 2024
668de7e
Remove unused code
janheinrichmerker Oct 14, 2024
b37791d
Add dependencies
janheinrichmerker Jan 29, 2025
bc63bad
Update gitignore
janheinrichmerker Jan 29, 2025
3159fc0
Use env vars for config, specify index explicitly
janheinrichmerker Jan 29, 2025
225ca1c
Fix index parameters
janheinrichmerker Jan 29, 2025
a1875c6
Add config field for WARC cache
janheinrichmerker Feb 25, 2025
798703b
Add local WARC cache
janheinrichmerker Feb 25, 2025
c53fd49
Fix local WARC cache
janheinrichmerker Feb 26, 2025
028bc73
Migrate WARC download to local WARC cache (WIP)
janheinrichmerker Feb 26, 2025
88e9b1c
Refactor WARC cache config
janheinrichmerker Feb 27, 2025
f06eee2
Defer default arg in CLI
janheinrichmerker Feb 27, 2025
73be0e2
Workaround for https://github.com/boto/boto3/issues/3738
janheinrichmerker Feb 27, 2025
e0f30a2
Add WARC cache read all function
janheinrichmerker Feb 27, 2025
0772a0f
Implement WARC download to cache and upload from cache (WIP)
janheinrichmerker Feb 27, 2025
f78313e
Add CLI to upload WARCs from cache
janheinrichmerker Feb 27, 2025
a09a32c
Update config
janheinrichmerker Feb 27, 2025
c9a8f27
Update dependency
janheinrichmerker Feb 27, 2025
2bbea5a
Update config
janheinrichmerker Feb 27, 2025
ce51b44
Fix unwrapping
janheinrichmerker Feb 27, 2025
ed76bdc
Finalize WARC upload
janheinrichmerker Feb 27, 2025
ee45ec7
Remove migrated-out code
janheinrichmerker Feb 27, 2025
1fcae09
Fix config
janheinrichmerker Feb 27, 2025
994a651
Fix CI
janheinrichmerker Feb 27, 2025
7d957d1
Fix code
janheinrichmerker Feb 27, 2025
c56597e
Replace named tuple with data class
janheinrichmerker Feb 28, 2025
c2bd430
Update Helm deployment
janheinrichmerker Feb 28, 2025
548d40b
Remove unused import
janheinrichmerker Feb 28, 2025
Added Vue Dashboard
JKlueber committed Aug 21, 2024

commit 6eeeae90b7bc48b56f1b080a09c9b4062ff9c1be
4 changes: 4 additions & 0 deletions archive_query_log/dashboard/.browserslistrc
@@ -0,0 +1,4 @@
> 1%
last 2 versions
not dead
not ie 11
5 changes: 5 additions & 0 deletions archive_query_log/dashboard/.editorconfig
@@ -0,0 +1,5 @@
[*.{js,jsx,ts,tsx,vue}]
indent_style = space
indent_size = 2
trim_trailing_whitespace = true
insert_final_newline = true
22 changes: 22 additions & 0 deletions archive_query_log/dashboard/.gitignore
@@ -0,0 +1,22 @@
.DS_Store
node_modules
/dist

# local env files
.env.local
.env.*.local

# Log files
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*

# Editor directories and files
.idea
.vscode
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?
17 changes: 17 additions & 0 deletions archive_query_log/dashboard/README.md
@@ -0,0 +1,17 @@
# 📈 aql-monitoring

Monitor and manage the crawling of the [Archive Query Log](https://github.com/webis-de/archive-query-log).

## Installation

```shell
npm install
```

TODO: Add better instructions.

## Usage

To start the website, navigate to the `archive_query_log/dashboard` directory and run `npm run dev-vite` in your terminal.
16 changes: 16 additions & 0 deletions archive_query_log/dashboard/components.d.ts
@@ -0,0 +1,16 @@
/* eslint-disable */
/* prettier-ignore */
// @ts-nocheck
// Generated by unplugin-vue-components
// Read more: https://github.com/vuejs/core/pull/3399
export {}

declare module 'vue' {
export interface GlobalComponents {
Footer: typeof import('./src/components/Footer.vue')['default']
Header: typeof import('./src/components/Header.vue')['default']
Home: typeof import('./src/components/Home.vue')['default']
ProgressTable: typeof import('./src/components/ProgressTable.vue')['default']
StatisticsTable: typeof import('./src/components/StatisticsTable.vue')['default']
}
}
14 changes: 14 additions & 0 deletions archive_query_log/dashboard/index.html
@@ -0,0 +1,14 @@
<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="UTF-8" />
<title>Archive Query Log</title>
</head>

<body>
<div id="app"></div>
<script type="module" src="/src/main.ts"></script>
</body>

</html>
3,633 changes: 3,633 additions & 0 deletions archive_query_log/dashboard/package-lock.json

Large diffs are not rendered by default.

39 changes: 39 additions & 0 deletions archive_query_log/dashboard/package.json
@@ -0,0 +1,39 @@
{
"name": "aql-dashboard",
"version": "0.0.0",
"scripts": {
"dev": "concurrently --kill-others \"npm run dev-vite\" \"npm run dev-proxy\"",
"dev-vite": "vite",
"dev-proxy": "node proxy.js",
"build": "vue-tsc --noEmit && vite build",
"preview": "vite preview"
},
"dependencies": {
"@mdi/font": "6.2.95",
"axios": "^1.7.2",
"elastic-tiny-client": "^0.1.4",
"http-proxy": "^1.18.1",
"roboto-fontface": "*",
"vue": "^3.4.21",
"vuetify": "^3.5.8"
},
"devDependencies": {
"@babel/types": "^7.24.0",
"@types/node": "^20.11.25",
"@vitejs/plugin-vue": "^5.0.4",
"concurrently": "^8.2.2",
"sass": "^1.71.1",
"typescript": "^5.4.2",
"unplugin-fonts": "^1.1.1",
"unplugin-vue-components": "^0.26.0",
"vite": "^5.1.5",
"vite-plugin-node-polyfills": "^0.21.0",
"vite-plugin-vuetify": "^2.0.3",
"vue-tsc": "^2.0.6"
},
"description": "Monitor and manage the crawling of the [Archive Query Log](https://github.com/webis-de/archive-query-log).",
"main": "index.js",
"keywords": [],
"author": "",
"license": "ISC"
}
81 changes: 81 additions & 0 deletions archive_query_log/dashboard/src/App.vue
@@ -0,0 +1,81 @@
<template>
<v-container class="fill-height d-flex flex-column">
<v-responsive class="flex-grow-1">
<div class="content">
<Header />
<div class="py-10"></div>
<StatisticsTable :statistics="statistics" />
<div class="py-10"></div>
<ProgressTable :progress="progress" />
</div>
</v-responsive>
<Footer />
</v-container>
</template>

<script>
import Header from './components/Header.vue';
import StatisticsTable from './components/StatisticsTable.vue';
import ProgressTable from './components/ProgressTable.vue';
import Footer from './components/Footer.vue';
import { fetchData } from './client.js';
export default {
components: {
Header,
StatisticsTable,
ProgressTable,
Footer,
},
data() {
return {
statistics: [],
progress: [],
};
},
async mounted() {
try {
this.statistics = await fetchData('/statistics');
this.progress = await fetchData('/progress');
} catch (error) {
console.error(error.message);
}
},
};
</script>

<style>
.fill-height {
height: 100%;
}
.d-flex {
display: flex;
}
.flex-column {
flex-direction: column;
}
.flex-grow-1 {
flex-grow: 1;
}
.align-center {
align-items: center;
}
.justify-center {
justify-content: center;
}
.mx-auto {
margin-left: auto;
margin-right: auto;
}
.content {
padding-bottom: 80px;
}
</style>
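
The `mounted` hook in `App.vue` awaits the two requests one after another inside a single `try`, so a failure on `/statistics` also skips `/progress`. A minimal sketch of loading both independently, assuming a `fetchData(endpoint)` function like the one in `client.js`; `loadDashboardData` is a hypothetical helper and not part of this commit:

```javascript
// Hypothetical helper: fetch both dashboard endpoints concurrently.
// `fetchData` is injected so the sketch does not depend on axios.
async function loadDashboardData(fetchData) {
  const endpoints = { statistics: '/statistics', progress: '/progress' };
  const names = Object.keys(endpoints);
  // allSettled never rejects: each request succeeds or fails on its own.
  const results = await Promise.allSettled(
    names.map((name) => fetchData(endpoints[name]))
  );
  const data = { statistics: [], progress: [], errors: [] };
  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      data[names[i]] = result.value;
    } else {
      // Keep the empty default and record which endpoint failed.
      const reason = result.reason?.message ?? String(result.reason);
      data.errors.push(`${endpoints[names[i]]}: ${reason}`);
    }
  });
  return data;
}
```

In the component, `mounted` could then assign `this.statistics` and `this.progress` from the returned object and log or display `errors`, so one table still renders when the other request fails.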

13 changes: 13 additions & 0 deletions archive_query_log/dashboard/src/client.js
@@ -0,0 +1,13 @@
import axios from 'axios';

const BASE_URL = 'http://localhost:8000';

export async function fetchData(endpoint) {
try {
const response = await axios.get(`${BASE_URL}${endpoint}`);
return response.data;
} catch (error) {
console.error(`Error fetching data from ${endpoint}:`, error.message);
throw error;
}
}
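
`BASE_URL` is hard-coded to `http://localhost:8000`, which only works when the API runs on the same machine as the browser. A sketch of one way to make it configurable, assuming a hypothetical `createClient` factory and an injected `get` function standing in for `axios.get`; neither name appears in the commit:

```javascript
// Hypothetical factory: builds a fetchData function for a given base URL.
// `get` must behave like axios.get, i.e. resolve to an object with `data`.
function createClient(baseUrl, get) {
  // Drop trailing slashes so joining with "/endpoint" yields no "//".
  const base = baseUrl.replace(/\/+$/, '');
  return async function fetchData(endpoint) {
    // Accept endpoints with or without a leading slash.
    const path = endpoint.startsWith('/') ? endpoint : `/${endpoint}`;
    const response = await get(`${base}${path}`);
    return response.data;
  };
}
```

In the dashboard this could be wired as `createClient(import.meta.env.VITE_API_BASE_URL ?? 'http://localhost:8000', (url) => axios.get(url))`. The variable name `VITE_API_BASE_URL` is an assumption; Vite exposes environment variables prefixed with `VITE_` on `import.meta.env`.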
18 changes: 18 additions & 0 deletions archive_query_log/dashboard/src/components/Footer.vue
@@ -0,0 +1,18 @@
<template>
<div class="footer">
&copy; 2024 Webis Group |
<a href="https://webis.de/people.html" class="white--text">Contact</a> |
<a href="https://webis.de/legal.html" class="white--text">Imprint</a> |
<a href="https://webis.de/legal.html" class="white--text">Privacy</a>
</div>
</template>

<style>
.footer {
text-align: center;
padding: 16px;
width: 100%;
background: #fff;
}
</style>

32 changes: 32 additions & 0 deletions archive_query_log/dashboard/src/components/Header.vue
@@ -0,0 +1,32 @@
<template>
<div class="text-center">
<div class="text-body-2 font-weight mb-n1">Welcome to</div>
<h1 class="text-h2 font-weight-bold">Archive Query Log</h1>
<div class="text-body-2 font-weight-light">
The Archive Query Log monitoring interface.
</div>

<div class="py-4"></div>

<v-btn
variant="text"
href="https://github.com/webis-de/archive-query-log"
class="button-margin"
>GitHub</v-btn
>
<v-btn
variant="text"
href="https://webis.de/publications.html?q=archive#reimer_2023"
class="button-margin"
>Paper</v-btn
>
<v-btn variant="text" href="https://webis.de/">Webis</v-btn>
</div>
</template>

<style>
.button-margin {
margin-right: 100px;
}
</style>

27 changes: 27 additions & 0 deletions archive_query_log/dashboard/src/components/ProgressTable.vue
@@ -0,0 +1,27 @@
<template>
<div>
<h2>Progress</h2>
<div class="py-4"></div>
<v-data-table :items="progress" item-key="input_name" items-per-page="10" class="table-shadow mx-auto">
</v-data-table>
</div>
</template>

<script>
export default {
props: {
progress: {
type: Array,
required: true,
},
},
};
</script>

<style>
.table-shadow {
box-shadow: 0 4px 8px rgba(0, 0, 0, 0.5);
margin: 0 16px;
}
</style>

26 changes: 26 additions & 0 deletions archive_query_log/dashboard/src/components/StatisticsTable.vue
@@ -0,0 +1,26 @@
<template>
<div>
<h2>Statistics</h2>
<div class="py-4"></div>
<v-data-table :items="statistics" item-key="name" items-per-page="10" class="table-shadow mx-auto">
</v-data-table>
</div>
</template>

<script>
export default {
props: {
statistics: {
type: Array,
required: true,
},
},
};
</script>

<style>
.table-shadow {
box-shadow: 0 4px 8px rgba(0, 0, 0, 0.5);
margin: 0 16px;
}
</style>
20 changes: 20 additions & 0 deletions archive_query_log/dashboard/src/main.ts
@@ -0,0 +1,20 @@
/**
* main.ts
*
* Bootstraps Vuetify and other plugins, then mounts the App
*/

// Plugins
import { registerPlugins } from '@/plugins'

// Components
import App from './App.vue'

// Composables
import { createApp } from 'vue'

const app = createApp(App)

registerPlugins(app)

app.mount('#app')
3 changes: 3 additions & 0 deletions archive_query_log/dashboard/src/plugins/README.md
@@ -0,0 +1,3 @@
# Plugins

Plugins are a way to extend the functionality of your Vue application. Use this folder for registering plugins that you want to use globally.
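
As a sketch of what registering a global plugin involves: a Vue 3 plugin is an object with an `install(app, options)` method (or a plain function), and `app.use()` calls it. The `apiConfigPlugin` name, the `$apiBaseUrl` property, and the `apiBaseUrl` injection key below are hypothetical and only illustrate the shape:

```javascript
// Hypothetical plugin exposing the API base URL to every component.
const apiConfigPlugin = {
  install(app, options = {}) {
    // Fall back to the URL used in client.js when no option is given.
    const baseUrl = options.baseUrl ?? 'http://localhost:8000';
    // Available in Options API components as this.$apiBaseUrl.
    app.config.globalProperties.$apiBaseUrl = baseUrl;
    // Available in Composition API components via inject('apiBaseUrl').
    app.provide('apiBaseUrl', baseUrl);
  },
};
```

It would be registered next to Vuetify in `registerPlugins`, for example with `app.use(apiConfigPlugin, { baseUrl: 'http://localhost:8000' })`.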
15 changes: 15 additions & 0 deletions archive_query_log/dashboard/src/plugins/index.ts
@@ -0,0 +1,15 @@
/**
* plugins/index.ts
*
* Automatically included in `./src/main.ts`
*/

// Plugins
import vuetify from './vuetify'

// Types
import type { App } from 'vue'

export function registerPlugins (app: App) {
app.use(vuetify)
}
19 changes: 19 additions & 0 deletions archive_query_log/dashboard/src/plugins/vuetify.ts
@@ -0,0 +1,19 @@
/**
* plugins/vuetify.ts
*
* Framework documentation: https://vuetifyjs.com
*/

// Styles
import '@mdi/font/css/materialdesignicons.css'
import 'vuetify/styles'

// Composables
import { createVuetify } from 'vuetify'

// https://vuetifyjs.com/en/introduction/why-vuetify/#feature-guides
export default createVuetify({
theme: {
defaultTheme: 'light',
},
})
7 changes: 7 additions & 0 deletions archive_query_log/dashboard/src/vite-env.d.ts
@@ -0,0 +1,7 @@
/// <reference types="vite/client" />

declare module '*.vue' {
import type { DefineComponent } from 'vue'
const component: DefineComponent<{}, {}, any>
export default component
}
32 changes: 32 additions & 0 deletions archive_query_log/dashboard/tsconfig.json
@@ -0,0 +1,32 @@
{
"compilerOptions": {
"target": "ESNext",
"jsx": "preserve",
"lib": ["DOM", "ESNext"],
"baseUrl": ".",
"module": "ESNext",
"moduleResolution": "bundler",
"paths": {
"@/*": ["src/*"]
},
"resolveJsonModule": true,
"types": [
"vite/client",
"vite-plugin-vue-layouts/client",
"unplugin-vue-router/client"
],
"allowJs": true,
"strict": true,
"strictNullChecks": true,
"noUnusedLocals": true,
"esModuleInterop": true,
"forceConsistentCasingInFileNames": true,
"isolatedModules": true,
"skipLibCheck": true
},
"include": [
"./src/typed-router.d.ts"
],
"exclude": ["dist", "node_modules", "cypress"],
"references": [{ "path": "./tsconfig.node.json" }],
}
9 changes: 9 additions & 0 deletions archive_query_log/dashboard/tsconfig.node.json
@@ -0,0 +1,9 @@
{
"compilerOptions": {
"composite": true,
"module": "ESNext",
"moduleResolution": "bundler",
"allowSyntheticDefaultImports": true
},
"include": ["vite.config.mts"]
}
42 changes: 42 additions & 0 deletions archive_query_log/dashboard/vite.config.mts
@@ -0,0 +1,42 @@
// Plugins
import Components from "unplugin-vue-components/vite";
import Vue from "@vitejs/plugin-vue";
import Vuetify, { transformAssetUrls } from "vite-plugin-vuetify";
import ViteFonts from "unplugin-fonts/vite";
import { nodePolyfills } from "vite-plugin-node-polyfills";
// Utilities
import { defineConfig } from "vite";
import { fileURLToPath, URL } from "node:url";

// https://vitejs.dev/config/
export default defineConfig({
plugins: [
Vue({
template: { transformAssetUrls },
}),
// https://github.com/vuetifyjs/vuetify-loader/tree/master/packages/vite-plugin#readme
Vuetify(),
Components(),
ViteFonts({
google: {
families: [
{
name: "Roboto",
styles: "wght@100;300;400;500;700;900",
},
],
},
}),
nodePolyfills(),
],
define: { "process.env": {} },
resolve: {
alias: {
"@": fileURLToPath(new URL("./src", import.meta.url)),
},
extensions: [".js", ".json", ".jsx", ".mjs", ".ts", ".tsx", ".vue"],
},
server: {
port: 3000,
},
});
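
The `dev` script in `package.json` starts a separate `proxy.js` that is not part of this commit. As an alternative sketch, Vite's built-in `server.proxy` option can forward API calls during development. The `/api` prefix is an assumption, and the target is taken from `BASE_URL` in `client.js`:

```javascript
// Sketch of a `server` section for vite.config.mts (the /api prefix is assumed).
const server = {
  port: 3000,
  proxy: {
    '/api': {
      // Backend address as hard-coded in client.js.
      target: 'http://localhost:8000',
      changeOrigin: true,
      // Strip the prefix so /api/statistics reaches the backend as /statistics.
      rewrite: (path) => path.replace(/^\/api/, ''),
    },
  },
};
```

With this in place the client could request relative URLs such as `/api/statistics`, which avoids cross-origin requests from port 3000 to port 8000 during development.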

Unchanged files with check annotations

LOGGER: Logger = getLogger(__name__)
if __name__ == "__main__":
from archive_query_log.legacy.cli import main

Check warning on line 12 in archive_query_log/legacy/__init__.py (Codecov / codecov/patch): added line #L12 was not covered by tests.
main()

Check warning on line 14 in archive_query_log/legacy/__init__.py (Codecov / codecov/patch): added line #L14 was not covered by tests.
record_url_header
)
except JSONDecodeError:
LOGGER.warning(

Check warning on line 65 in archive_query_log/legacy/download/iterable.py (Codecov / codecov/patch): added line #L65 was not covered by tests.
f"Could not index {record_url_header} from record "
f"{record.rec_headers.get_header('WARC-Record-ID')}."
)
return None
encoding = record.http_headers.get_header("Content-Type")
if encoding is None:
encoding = ""

Check warning on line 72 in archive_query_log/legacy/download/iterable.py (Codecov / codecov/patch): added line #L72 was not covered by tests.
encoding = encoding.split(";")[-1].split("=")[-1].strip().lower()
if encoding == "" or "/" in encoding:
encoding = "utf8"
"""
MD5 hash of the original URL.
"""
return md5(self.url.encode(), usedforsecurity=False).hexdigest()

Check warning on line 64 in archive_query_log/legacy/model/__init__.py (Codecov / codecov/patch): added line #L64 was not covered by tests.
@cached_property
def datetime(self) -> datetime:
),
)
elif parser_type == "fragment_segment":
from archive_query_log.legacy.queries.parse import \

Check warning on line 90 in archive_query_log/legacy/model/parse.py (Codecov / codecov/patch): added line #L90 was not covered by tests.
FragmentSegmentQueryParser
return FragmentSegmentQueryParser(
url_pattern=pattern(value["url_pattern"], IGNORECASE),
),
)
elif parser_type == "fragment_segment":
from archive_query_log.legacy.queries.parse import \

Check warning on line 156 in archive_query_log/legacy/model/parse.py (Codecov / codecov/patch): added line #L156 was not covered by tests.
FragmentSegmentPageOffsetParser
return FragmentSegmentPageOffsetParser(
url_pattern=pattern(value["url_pattern"], IGNORECASE),
),
)
elif parser_type == "chatnoir":
from archive_query_log.legacy.results.chatnoir import \

Check warning on line 234 in archive_query_log/legacy/model/parse.py (Codecov / codecov/patch): added line #L234 was not covered by tests.
ChatNoirResultsParser
return ChatNoirResultsParser(
url_pattern=pattern(value["url_pattern"], IGNORECASE),
if output_path.exists() and not self.overwrite:
return
output_path.parent.mkdir(parents=True, exist_ok=True)
archived_urls: Iterable[ArchivedUrl] = ArchivedUrls(input_path)

Check warning on line 183 in archive_query_log/legacy/queries/parse.py

archive_query_log/legacy/queries/parse.py#L183

Added line #L183 was not covered by tests
if self.verbose:
# noinspection PyTypeChecker
archived_urls = tqdm(
desc="Parse SERP URLs",
unit="URL",
)
archived_serp_urls_nullable = (

Check warning on line 191 in archive_query_log/legacy/queries/parse.py

archive_query_log/legacy/queries/parse.py#L191

Added line #L191 was not covered by tests
self._parse_single(archived_url, focused)
for archived_url in archived_urls
)
if archived_serp_url is not None
)
output_schema = ArchivedQueryUrl.schema()
with (output_path.open("wb") as file,

Check warning on line 201 in archive_query_log/legacy/queries/parse.py

archive_query_log/legacy/queries/parse.py#L201

Added line #L201 was not covered by tests
GzipFile(fileobj=file, mode="wb") as gzip_file,
text_io_wrapper(gzip_file) as text_file):
for archived_serp_url in archived_serp_urls:
focused: bool,
) -> ArchivedQueryUrl | None:
query: str | None = None
for query_parser in self.query_parsers:
query = query_parser.parse(archived_url)

Check warning on line 215 in archive_query_log/legacy/queries/parse.py

archive_query_log/legacy/queries/parse.py#L214-L215

Added lines #L214 - L215 were not covered by tests
if query is not None:
break
return None
page: int | None = None
for page_parser in self.page_parsers:
page = page_parser.parse(archived_url)

Check warning on line 224 in archive_query_log/legacy/queries/parse.py

archive_query_log/legacy/queries/parse.py#L223-L224

Added lines #L223 - L224 were not covered by tests
if page is not None:
break
return None
offset: int | None = None
for offset_parser in self.offset_parsers:
offset = offset_parser.parse(archived_url)

Check warning on line 233 in archive_query_log/legacy/queries/parse.py

archive_query_log/legacy/queries/parse.py#L232-L233

Added lines #L232 - L233 were not covered by tests
if offset is not None:
break
]
if cdx_page is not None:
if domain is None:
raise RuntimeError(

Check warning on line 284 in archive_query_log/legacy/queries/parse.py

archive_query_log/legacy/queries/parse.py#L283-L284

Added lines #L283 - L284 were not covered by tests
"Domain must be specified when page is specified.")
if len(domain_paths) < 1:
raise RuntimeError(

Check warning on line 287 in archive_query_log/legacy/queries/parse.py

archive_query_log/legacy/queries/parse.py#L286-L287

Added lines #L286 - L287 were not covered by tests
"There must be exactly one domain path.")
cdx_page_paths = [domain_paths[0] / f"{cdx_page:010}.jsonl.gz"]
else:
domain: str | None = None,
cdx_page: int | None = None,
):
pages_list: Sequence[_CdxPage] = self._service_pages(

Check warning on line 320 in archive_query_log/legacy/queries/parse.py

archive_query_log/legacy/queries/parse.py#L320

Added line #L320 was not covered by tests
data_directory=data_directory,
focused=focused,
service=service,
cdx_page=cdx_page,
)
if len(pages_list) == 0:

Check warning on line 328 in archive_query_log/legacy/queries/parse.py

archive_query_log/legacy/queries/parse.py#L328

Added line #L328 was not covered by tests
return
pages: Iterable[_CdxPage] = pages_list
if len(pages_list) > 1:

Check warning on line 332 in archive_query_log/legacy/queries/parse.py

archive_query_log/legacy/queries/parse.py#L331-L332

Added lines #L331 - L332 were not covered by tests
# noinspection PyTypeChecker
pages = tqdm(
pages,
if output_path.exists() and not self.overwrite:
return
output_path.parent.mkdir(parents=True, exist_ok=True)
archived_serp_contents: Iterable[ArchivedRawSerp] = (

Check warning on line 177 in archive_query_log/legacy/results/parse.py

archive_query_log/legacy/results/parse.py#L177

Added line #L177 was not covered by tests
ArchivedRawSerps(input_path))
if self.verbose:
# noinspection PyTypeChecker
desc="Parse SERP WARC records",
unit="record",
)
archived_parsed_serps_nullable = (

Check warning on line 186 in archive_query_log/legacy/results/parse.py

archive_query_log/legacy/results/parse.py#L186

Added line #L186 was not covered by tests
self.parse_single(archived_serp_content)
for archived_serp_content in archived_serp_contents
)
]
if cdx_page is not None:
if domain is None:
raise RuntimeError(

Check warning on line 271 in archive_query_log/legacy/results/parse.py

archive_query_log/legacy/results/parse.py#L270-L271

Added lines #L270 - L271 were not covered by tests
"Domain must be specified when page is specified.")
if len(domain_paths) < 1:
raise RuntimeError(

Check warning on line 274 in archive_query_log/legacy/results/parse.py

archive_query_log/legacy/results/parse.py#L273-L274

Added lines #L273 - L274 were not covered by tests
"There must be exactly one domain path.")
cdx_page_paths = [domain_paths[0] / f"{cdx_page:010}"]
else:
domain: str | None = None,
cdx_page: int | None = None,
):
pages_list: Sequence[_CdxPage] = self._service_pages(

Check warning on line 307 in archive_query_log/legacy/results/parse.py

archive_query_log/legacy/results/parse.py#L307

Added line #L307 was not covered by tests
data_directory=data_directory,
focused=focused,
service=service,
cdx_page=cdx_page,
)
if len(pages_list) == 0:

Check warning on line 315 in archive_query_log/legacy/results/parse.py

archive_query_log/legacy/results/parse.py#L315

Added line #L315 was not covered by tests
return
pages: Iterable[_CdxPage] = pages_list
if len(pages_list) > 1:

Check warning on line 319 in archive_query_log/legacy/results/parse.py

archive_query_log/legacy/results/parse.py#L318-L319

Added lines #L318 - L319 were not covered by tests
# noinspection PyTypeChecker
pages = tqdm(
pages,
try:
service = Service.schema(unknown="exclude").load(service_dict)
services += [(service.name, service)]
except ValidationError as e:
if not ignore_parsing_errors:

Check warning on line 22 in archive_query_log/legacy/services/__init__.py

archive_query_log/legacy/services/__init__.py#L21-L22

Added lines #L21 - L22 were not covered by tests
raise ValueError(
f"Could not parse service {service_dict}") from e
if not ignore_parsing_errors:
self._check_urls_path()
def _check_urls_path(self):
if not self.path.exists() or not self.path.is_file():
raise ValueError(

Check warning on line 26 in archive_query_log/legacy/urls/iterable.py

archive_query_log/legacy/urls/iterable.py#L25-L26

Added lines #L25 - L26 were not covered by tests
f"URLs path must be a file: {self.path}"
)
def __len__(self) -> int:
with (self.path.open("rb") as file,

Check warning on line 31 in archive_query_log/legacy/urls/iterable.py

archive_query_log/legacy/urls/iterable.py#L31

Added line #L31 was not covered by tests
GzipFile(fileobj=file, mode="rb") as gzip_file):
return count_lines(gzip_file)

Check warning on line 33 in archive_query_log/legacy/urls/iterable.py

archive_query_log/legacy/urls/iterable.py#L33

Added line #L33 was not covered by tests
def __iter__(self) -> Iterator[ArchivedUrl]:
schema = ArchivedUrl.schema()
with (self.path.open("rb") as file,

Check warning on line 37 in archive_query_log/legacy/urls/iterable.py

archive_query_log/legacy/urls/iterable.py#L36-L37

Added lines #L36 - L37 were not covered by tests
GzipFile(fileobj=file, mode="rb") as gzip_file,
text_io_wrapper(gzip_file) as text_file):
for line in text_file:
url = schema.loads(line)
if isinstance(url, list):
raise ValueError(f"Expected one URL per line: {line}")
yield url

Check warning on line 44 in archive_query_log/legacy/urls/iterable.py

archive_query_log/legacy/urls/iterable.py#L40-L44

Added lines #L40 - L44 were not covered by tests
def text_io_wrapper(file: IO[bytes] | IOBase) -> IO[str]:
return TextIOWrapper(file) # type: ignore

Check warning on line 19 in archive_query_log/legacy/util/text.py

archive_query_log/legacy/util/text.py#L19

Added line #L19 was not covered by tests
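
Several of the uncovered lines above (`text_io_wrapper` and the `GzipFile` reading path in `urls/iterable.py`) could likely be covered by one small round-trip test. A minimal sketch, assuming in-memory data rather than the project's fixtures; `roundtrip_lines` is a hypothetical test helper, and `text_io_wrapper` is copied from the hunk above only to keep the sketch self-contained:

```python
from __future__ import annotations

from gzip import GzipFile
from io import BytesIO, IOBase, TextIOWrapper
from typing import IO


def text_io_wrapper(file: IO[bytes] | IOBase) -> IO[str]:
    # Copied from the hunk above so the sketch runs on its own.
    return TextIOWrapper(file)  # type: ignore


def roundtrip_lines(lines: list[str]) -> list[str]:
    """Hypothetical helper: gzip lines in memory, then read them back."""
    buffer = BytesIO()
    with GzipFile(fileobj=buffer, mode="wb") as gzip_file:
        for line in lines:
            gzip_file.write((line + "\n").encode("utf8"))
    buffer.seek(0)
    # Same nesting as in the PR: binary stream, gzip layer, text layer.
    with (GzipFile(fileobj=buffer, mode="rb") as gzip_file,
          text_io_wrapper(gzip_file) as text_file):
        return [line.rstrip("\n") for line in text_file]


if __name__ == "__main__":
    print(roundtrip_lines(['{"url": "https://example.com"}']))
```

Because `TextIOWrapper(file)` is called without an explicit `encoding`, it uses the platform default; the sketch sticks to ASCII test data for that reason.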