feat: add architecture for record linkage #160

Closed

alejandrosame
Contributor

No description provided.

RaitoBezarius and others added 4 commits December 19, 2023 03:08
This enables searching through the available CVE containers in the database
using a very simple PostgreSQL TS vector search.
According to the specs of the VM.
With the fields and indices added, searching for `python` in /triage
takes ~1s (previously, it would take ~7.5s).

The triggers added will make sure that the vector searches get updated
as rows of data get added, deleted or updated.
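
As a rough illustration of the trigger mechanism described in the commit message above, the sketch below uses SQLite from Python's standard library as a stand-in for PostgreSQL. The table `container`, its columns, and the helpers `make_db` and `search` are hypothetical names chosen for this example; the actual change relies on PostgreSQL search vectors, which this sketch does not reproduce.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table whose search column is maintained by triggers.

    Assumption: `container`, `title`, `description` and `search_text` are
    illustrative names only. In PostgreSQL the derived column would be a
    tsvector; here it is plain lowercased text.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE container (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            description TEXT NOT NULL,
            search_text TEXT NOT NULL DEFAULT ''
        );

        -- Fill the derived column whenever a row is added.
        CREATE TRIGGER container_search_insert
        AFTER INSERT ON container
        BEGIN
            UPDATE container
            SET search_text = lower(NEW.title || ' ' || NEW.description)
            WHERE id = NEW.id;
        END;

        -- Refresh the derived column whenever a source column changes.
        CREATE TRIGGER container_search_update
        AFTER UPDATE OF title, description ON container
        BEGIN
            UPDATE container
            SET search_text = lower(NEW.title || ' ' || NEW.description)
            WHERE id = NEW.id;
        END;
        """
    )
    return conn


def search(conn: sqlite3.Connection, term: str) -> list[int]:
    """Return ids of rows whose derived search column contains `term`."""
    rows = conn.execute(
        "SELECT id FROM container WHERE search_text LIKE ? ORDER BY id",
        (f"%{term.lower()}%",),
    )
    return [row[0] for row in rows]
```

Because the derived column lives in the same row in this sketch, deletions need no trigger of their own; a design that stores the search data elsewhere would need one for deletes as well.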
@alejandrosame
Contributor Author

I took #84 as starting point. The rough roadmap is:

  • provide a functional manual triage view
  • show candidate linked records in the triage view
  • feed candidates from recordlinkage heuristics.

@alejandrosame
Contributor Author

I just noticed I mistakenly took away the query filter in commit 7d58699, so it indeed was not filtering and the query is not hitting the indices.

I'll move on with UI and architectural tasks and I'll come back to performance later.

The previous query failed to hit the GIN indices created for the
dedicated SearchVector fields introduced. Now the query makes use of
them and returns results for a `python` search in ~1.5s.
The general workflow is introduced by generating random
matches with the record linkage toolkit.
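
To make the idea of random matches concrete, here is a minimal sketch that draws random candidate pairs between two sets of record identifiers. It uses only Python's standard library and is not the record linkage toolkit's API; the function name `random_candidate_pairs` and the example identifiers are assumptions made for illustration.

```python
import random
from itertools import product


def random_candidate_pairs(left_ids, right_ids, n, seed=None):
    """Pick up to `n` distinct (left, right) pairs uniformly at random.

    Hypothetical helper: a placeholder for a real linkage heuristic, useful
    only for exercising a triage workflow with made-up candidates.
    """
    rng = random.Random(seed)  # local generator, so a seed gives repeatable output
    all_pairs = list(product(left_ids, right_ids))
    # Never ask for more pairs than exist.
    return rng.sample(all_pairs, min(n, len(all_pairs)))
```

Enumerating the full cross product is fine for small development datasets; for large tables one would sample indices instead of materializing every pair.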
This increases its utility for inserting test data during development.
@fricklerhandwerk
Collaborator

@alejandrosame can we salvage anything from here except for more self-descriptive names? Seems like the business logic was ported a while ago.

@alejandrosame
Contributor Author

V1 already implemented in #254

@alejandrosame
Contributor Author

alejandrosame commented Dec 8, 2024

> @alejandrosame can we salvage anything from here except for more self-descriptive names? Seems like the business logic was ported a while ago.

I didn't notice this question before. I don't think so. The original discussions were nudging the implementation to use, or at least consider, the recordlinkage Python library, so if anything #254 should have taken these changes into account.

Since the code clearly wasn't reused there, anything going forward should just build on #254. The vision for the record linkage wasn't completely clear before attempting this anyway.

3 participants