Experiment/Draft: Search engine #1986
Draft: SmallCoccinelle wants to merge 48 commits into stashapp:develop from SmallCoccinelle:search-engine.
Conversation
Introduce package event with a dispatcher subsystem. Add this to the manager and to the plugin subsystem. Whenever the plugin subsystem executes a PostHook, we dispatch a Change event on the event dispatcher bus. This currently has no effect, but it allows us to register subsystems on the event bus for further processing, in particular search. By design, we opt to hook the plugin system and pass to the event bus for now: one, it makes it easier to remove again, and two, the context handling inside the plugin subsystem doesn't want to live on the other side of an event bus. While here, write a test for the dispatcher code.
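A minimal sketch of that shape, with hypothetical names (the real `pkg/event` dispatcher may differ):

```go
package event

import "sync"

// Change describes a modification to an entity, e.g. a scene update.
type Change struct {
	ID   int
	Type string
}

// Dispatcher is a simple fan-out event bus.
type Dispatcher struct {
	mu   sync.Mutex
	subs []chan Change
}

// Register adds a subscriber and returns the channel it will receive on.
func (d *Dispatcher) Register() <-chan Change {
	d.mu.Lock()
	defer d.mu.Unlock()
	ch := make(chan Change, 16)
	d.subs = append(d.subs, ch)
	return ch
}

// Dispatch delivers a change to every registered subscriber.
func (d *Dispatcher) Dispatch(c Change) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for _, ch := range d.subs {
		ch <- c
	}
}
```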
The rollup service turns events into batches for processing. The search engine wraps the rollup engine, making it private to the search subsystem.
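A sketch of the rollup idea, reusing the hypothetical `Change` type from the dispatcher sketch above: a goroutine drains events into a pending batch and surrenders the batch whenever the engine asks for it.

```go
// rollup accumulates changes and hands the batch over on request.
// events feeds changes in; requests carries one reply channel per request.
func rollup(events <-chan Change, requests <-chan chan []Change) {
	var pending []Change
	for {
		select {
		case c, ok := <-events:
			if !ok {
				return
			}
			pending = append(pending, c)
		case reply := <-requests:
			reply <- pending
			pending = nil
		}
	}
}
```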
Add a search experiment to the code: Schema in GraphQL is extended with an early search system. Engine is extended with search, and gets passed through the resolver. Some conversion is currently done to glue things together, but that structure is ugly and needs some improvementificationism.
In Go 1.18, strings.Cut becomes a reality. However, since it is such a useful tool, add it to the utils package for now. Once we are on Go 1.18, we can replace utils.Cut with strings.Cut.
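A sketch of what utils.Cut presumably looks like, mirroring the documented strings.Cut contract so the later swap is mechanical:

```go
package utils

import "strings"

// Cut slices s around the first instance of sep, returning the text
// before and after sep. found reports whether sep appears in s.
func Cut(s, sep string) (before, after string, found bool) {
	if i := strings.Index(s, sep); i >= 0 {
		return s[:i], s[i+len(sep):], true
	}
	return s, "", false
}
```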
This is almost not needed, but to be safe, add the ability to protect changes to the engine, and lock most usage via an RLock().
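A minimal sketch of that locking discipline, assuming (hypothetically) the engine holds a Bleve index handle: the rare write path takes the exclusive lock, the common read paths take RLock().

```go
package search

import (
	"sync"

	"github.com/blevesearch/bleve/v2"
)

// Engine guards its mutable state with a sync.RWMutex.
type Engine struct {
	mu  sync.RWMutex
	idx bleve.Index
}

// swapIndex replaces the index under the full write lock.
func (e *Engine) swapIndex(idx bleve.Index) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.idx = idx
}

// index returns the current index; readers only need RLock.
func (e *Engine) index() bleve.Index {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.idx
}
```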
Search results are Connection objects. Wrap each result in a contextual object. This can be used for scoring/highlighting/facets later. Introduce the interface SearchResultItem. Implement the interface for models.Scene. Add hydration code for scenes.
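A hypothetical sketch of the interface, the scene implementation, and the contextual wrapper; the actual definitions may differ (gqlgen-style GraphQL interfaces typically use a marker method like this):

```go
package models

// SearchResultItem marks types that can appear in search results.
type SearchResultItem interface {
	IsSearchResultItem()
}

// Scene stands in for models.Scene here.
type Scene struct {
	ID    int
	Title string
}

func (Scene) IsSearchResultItem() {}

// SearchResult wraps each hit in a contextual object, leaving room
// for scoring, highlighting, and facets later.
type SearchResult struct {
	Item SearchResultItem
}
```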
Add scores into search results. Move Search-internal NodeIDs into the search system. Introduce search.Item which protects the rest of the system against search-specific structures. Simplify hydration since it can now use search.Item.
This experiment tells us facets want to be an input type rather than the current enum of predefined facets.
Reindexing covers scenes at the moment, because that's what we have. The core idea is fairly simple: batch-process a table, 1000 entries at a time, and index them. Replace the data loader every 10 rounds (10k entries) so it doesn't grow too big. While reindexing is ongoing, the online changemap is still being built in the background. If reindexing takes longer than the ticker interval, the next run fires immediately after it finishes. If reindexing takes more than twice the interval, the ticker protects against a backlog and only fires once.
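A sketch of the loop those last two sentences rely on (names hypothetical). Go's time.Ticker buffers a single tick and drops the rest for slow receivers, which gives exactly the described behavior for free:

```go
package search

import (
	"context"
	"time"
)

// reindexLoop runs reindex on every tick. An overrunning reindex is
// followed by exactly one immediate re-run, never a backlog, because
// the ticker's channel holds at most one pending tick.
func reindexLoop(ctx context.Context, interval time.Duration, reindex func()) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			reindex()
		}
	}
}
```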
It is really a set of changes. The map used to implement the set is an implementation detail that shouldn't be part of the name.
Pull stat tracking outward. Set up a reporting ticker and use it for reporting progress. This rolls up the log lines into something a bit more comprehensible.
Change the schema to support performer searches. Performers are SearchResultItems. Make the search type optional, default to searching everything. Enable hydration of performers. Add performers to the data loader code. Introduce a performer document for the search index. Load performers before loading scenes, to utilize the dataloader cache maximally. When considering scenes, find the needed performers, and prime the cache with them. When processing scenes, denormalize the performer into the scene.
If we update performers, all scenes those performers are in should also change. Push this in. Currently, we over-apply on a full reindex, which can be fixed later, perhaps by moving preprocessing upward, or by having a flag on the batch processing layer. It's plenty fast right now though.
Plug a hole with scenes that can be nil.
This change anticipates far better batch processing in the future. By explicitly preprocessing, we can do this in the online processing loop, but avoid it in the offline processing loop. This will avoid processing elements twice.
Things which are easily salvaged out of this:

A schema of the form

```graphql
interface Node {
  id: ID!
}

type Scene implements Node {
  ...
}

type Query {
  nodes(ids: [ID]): [Node]
  ...
}
```

isn't really possible.
Early tag support setup.
People will expect a tag to be fairly easy to grab. So prefer a direct encoding over a nested subdocument. This allows a search for `tag:woodworking` rather than `tag.name:woodworking`. While here, add the tag ids into the scene document as well. This will help with deletion.
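A hypothetical sketch of the flattened encoding on the scene document; a flat `tag` field is what lets `tag:woodworking` work, where a nested subdocument would force `tag.name:woodworking`:

```go
package documents

// Scene is the document stored in the search index (sketch only; the
// real mappings live in documents/documents.go).
type Scene struct {
	Title string `json:"title"`
	// Tag holds denormalized tag names, flat on the scene document.
	Tag []string `json:"tag"`
	// TagID keeps the tag ids around; useful when a tag is deleted.
	TagID []string `json:"tag_id"`
}
```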
Changesets will keep growing.
Implement Stringer formatting for event.Change. Introduce engine_preprocess.go. Move preprocessing code into the engine itself. Use the engine to pull data which needs a change on a performer deletion. Rework changeset into changeset code only.
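A sketch of the Stringer implementation, assuming the hypothetical Change shape from the dispatcher sketch above:

```go
package event

import "fmt"

// String implements fmt.Stringer so changes read well in log lines.
func (c Change) String() string {
	return fmt.Sprintf("change(%s %d)", c.Type, c.ID)
}
```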
The code is a bit spammy with logging at the moment, but that will be fixed at some point.
Introduce studios:

* In data loading
* In search documents
* In changesets
* In the search path
* In the GraphQL schema

No functional indexing yet.
The strategy is to fold reindexing into a worklist which we process through systematically. This reduces the full reindexer into a single loop, which then collapses the code to a far simpler code path, where the only variance is a switch on the document type. Use this new strategy to handle studios as well for full reindexing.
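A sketch of the collapsed loop, with hypothetical names; the only variance left is the switch on the document type:

```go
package search

// DocType stands in for documents.DocType.
type DocType int

const (
	TypeScene DocType = iota
	TypePerformer
	TypeStudio
)

type workItem struct {
	ty DocType
	id int
}

type indexer interface {
	indexScene(id int)
	indexPerformer(id int)
	indexStudio(id int)
}

// reindexAll processes the whole worklist in a single loop.
func reindexAll(items []workItem, ix indexer) {
	for _, it := range items {
		switch it.ty {
		case TypeScene:
			ix.indexScene(it.id)
		case TypePerformer:
			ix.indexPerformer(it.id)
		case TypeStudio:
			ix.indexStudio(it.id)
		}
	}
}
```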
Rather than having a single large function, split the work into smaller functions and let the function names describe what is being done. This should make the code more local and easier to read.
Introduce indexing of studios in scenes. Introduce documents.DocType to properly type the documents as an enum.
Facets are going to be a thing we add later on. An MVP doesn't need facets, and we can remove lots of complexity if we don't have to worry about them right now.
If a merge is called, we should process all sources and the destination. Create an event for each of these.
@SmallCoccinelle Is there any interest in finishing this, or would it need to be started from scratch?
This branch implements a search engine in the stash backend. It is currently very much a draft, as there are numerous unclosed cases which need tracking. But in the spirit of making the experiment a bit more visible, I'll make a draft PR for it.
Status
Currently we implement `Query.search(..)`. We can query for scenes and performers. Scoring is TF-IDF.

Query string support is what Bleve supports. I.e., a query `kitty redhead` searches for documents with `kitty` OR `redhead`, but scores objects where both match higher than documents with a partial match. You can do either `"kitty redhead"` for an exact phrase match, `+kitty +redhead` for a conjunctive match (i.e., AND), or `+kitty /red(head)?/` for the corresponding regex match, and so on.

We also support dates. I.e., `+date:>=2013 +date:<2020` searches for objects in that date range.

We support a simple date range facet separating recent stuff (less than 30 days old) from older stuff. But clearly, this needs to be a GraphQL input type so the front-end can manipulate the facets.
Performance
Currently, indexing 20k scenes and 300 performers takes about 5-6 seconds. This is fast, but as we flesh out the documents, it's going to take a lot longer because we need to analyze more and more data. I have a ~4TB stash and the current index size is 7 megabytes. Again, this is going to climb. My SQLite db is 148 megabytes.
Things missing
Rough code overview
The code is mostly orthogonal to the rest of stash. Our interfacing points are: `pkg/models`.

The Go files are as follows:

* `documents/documents.go` - Implements the documents which get stored in the search index, together with the index mappings of those documents.
* `loaders.go` - A collection of data loaders used by the search engine. Can be pushed into the resolver code as well.
* `changeset.go` - Implements changesets. Changesets can be turned into index batches and applied to the index.
* `rollup.go` - A rollup goroutine tracks event changes into a changeset and hands them off to the search engine upon request.
* `search.go` - The main API. Implements `search.Search(..)`, the main API entry.
* `engine.go` - Implements the meat of the search engine backend. An engine is governed by a managing goroutine, communicating with a rollup goroutine to maintain the index. The engine handles reindexing.
* `engine_indexing.go` - Indexing code for the engine.
* `engine_preprocess.go` - Preprocessing code for the index. This analyzes a changeset to figure out collateral updates.

Playground
Add the search engine settings into your stash config. Start stash. It should generate an index and start a full reindex in 5s. Every 15s, it should write out stats for the reindexing. If you trash the `index.bleve` folder, it will reindex. Then go to your GraphQL playground. A typical query would look something like the sketch below.
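This is a purely hypothetical reconstruction; field and argument names are guesses based on the schema described earlier, not the actual API:

```graphql
# Hypothetical: names guessed from the description above.
{
  search(term: "+kitty +redhead") {
    edges {
      score
      node {
        ... on Scene {
          id
          title
        }
      }
    }
  }
}
```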