-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PHPORM-273 Add schema helpers to create search and vector indexes #3230
Conversation
tests/SchemaTest.php
Outdated
@@ -514,4 +558,16 @@ protected function getIndex(string $collection, string $name) | |||
|
|||
return false; | |||
} | |||
|
|||
protected function getSearchIndex(string $collection, string $name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense for this to return a nullable BSONDocument instance? IIRC, Collection::listSearchIndexes()
inherits the default type map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's true that I simply copied the getIndex
method. Should be named hasIndex
and hasSearchIndex
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is supposed to be hasSearchIndex()
, shouldn't it then just return a bool
?
If there is actually a need to return the index definition, I'd suggest always returning a nullable BSONDocument and working with the return value accordingly in the calling context. But it looks like something should still be changed with this method.
src/Schema/Blueprint.php
Outdated
* @phpstan-type SearchIndexAnalyser array{name: string, charFilters?: list<array<string, mixed>>, tokenizer: array{type: string}, tokenFilters?: list<array<string, mixed>>} | ||
* @phpstan-type SearchIndexStoredSource bool | array{includes: array<string>} | array{excludes: array<string>} | ||
* @phpstan-type SearchIndexDefinition array{analyser?: string, analyzers?: SearchIndexAnalyser[], searchAnalyzer?: string, mappings: array{dynamic: true} | array{dynamic?: bool, fields: array<string, SearchIndexField>}, storedSource?: SearchIndexStoredSource} | ||
* @phpstan-type VectorSearchIndexField array{type: 'vector', path: string, numDimensions: int, similarity: 'euclidean'|'cosine'|'dotProduct', quantization?: 'none'|'scalar'|'binary'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/ suggests that type
may be either "vector" or "filter".
src/Schema/Blueprint.php
Outdated
@@ -16,6 +16,14 @@ | |||
use function is_string; | |||
use function key; | |||
|
|||
/** | |||
* @phpstan-type SearchIndexField array{type: 'boolean'|'date'|'dateFacet'|'objectId'|'stringFacet'|'uuid'} | array{type: 'autocomplete', analyzer?: string, maxGrams?: int, minGrams?: int, tokenization?: 'edgeGram'|'rightEdgeGram'|'nGram', foldDiacritics?: bool} | array{type: 'document'|'embeddedDocuments', dynamic?:bool, fields: array<string, array>} | array{type: 'geo', indexShapes?: bool} | array{type: 'number'|'numberFacet', representation?: 'int64'|'double', indexIntegers?: bool, indexDoubles?: bool} | array{type: 'token', normalizer?: 'lowercase'|'none'} | array{type: 'string', analyzer?: string, searchAnalyzer?: string, indexOptions?: 'docs'|'freqs'|'positions'|'offsets', store?: bool, ignoreAbove?: int, multi?: array<string, array>, norms?: 'include'|'omit'} | |||
* @phpstan-type SearchIndexAnalyser array{name: string, charFilters?: list<array<string, mixed>>, tokenizer: array{type: string}, tokenFilters?: list<array<string, mixed>>} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest using "SearchIndexAnalyzer". American English spelling should be preferred for consistency with the MongoDB documentation.
I'll note you had analyser?: string, analyzers?: SearchIndexAnalyser[]
below, which is even internally inconsistent.
charFilters
seems correct based on https://www.mongodb.com/docs/atlas/atlas-search/analyzers/character-filters/#std-label-char-filters-ref, although you could break that out into a separate definition to mandate type
if you wish. Doing so would also give you a space to directly link to that documentation page. But considering how varied the structures are, mixed
is also entirely reasonable.
Same feedback on tokenFilters
(https://www.mongodb.com/docs/atlas/atlas-search/analyzers/token-filters/#std-label-token-filters-ref). Actually, it looks like both of those could utilize the same common definition (e.g. "charOrTokenFilter"), although that might send the wrong signal to users. I'm OK to keep this as-is.
src/Schema/Blueprint.php
Outdated
* @phpstan-type SearchIndexField array{type: 'boolean'|'date'|'dateFacet'|'objectId'|'stringFacet'|'uuid'} | array{type: 'autocomplete', analyzer?: string, maxGrams?: int, minGrams?: int, tokenization?: 'edgeGram'|'rightEdgeGram'|'nGram', foldDiacritics?: bool} | array{type: 'document'|'embeddedDocuments', dynamic?:bool, fields: array<string, array>} | array{type: 'geo', indexShapes?: bool} | array{type: 'number'|'numberFacet', representation?: 'int64'|'double', indexIntegers?: bool, indexDoubles?: bool} | array{type: 'token', normalizer?: 'lowercase'|'none'} | array{type: 'string', analyzer?: string, searchAnalyzer?: string, indexOptions?: 'docs'|'freqs'|'positions'|'offsets', store?: bool, ignoreAbove?: int, multi?: array<string, array>, norms?: 'include'|'omit'} | ||
* @phpstan-type SearchIndexAnalyser array{name: string, charFilters?: list<array<string, mixed>>, tokenizer: array{type: string}, tokenFilters?: list<array<string, mixed>>} | ||
* @phpstan-type SearchIndexStoredSource bool | array{includes: array<string>} | array{excludes: array<string>} | ||
* @phpstan-type SearchIndexDefinition array{analyser?: string, analyzers?: SearchIndexAnalyser[], searchAnalyzer?: string, mappings: array{dynamic: true} | array{dynamic?: bool, fields: array<string, SearchIndexField>}, storedSource?: SearchIndexStoredSource} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quoting the docs, mappings.fields
is "required only if dynamic mapping is disabled." In that case, wouldn't it make sense to use dynamic?: false
for that branch instead of a generic bool
? I don't think there's any case where you'd specify dynamic: true
and define fields
.
It looks like synonyms
is missing here. See: https://www.mongodb.com/docs/atlas/atlas-search/synonyms/#std-label-synonyms-ref
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's allowed by the server, I think it's a bad idea to be more restrictive. In fact, dynamic: true
and a list of fields is exactly what I do for Laravel Scout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what file/line you were linking to above, as I have a lot of condensed files in that view. Was it the DEFAULT_DEFINITION
using dynamic: true
?
Even if you leave this as-is, I would inquire with the Atlas team about this as it's not clear from the documentation what dynamic: true
with field mappings would actually imply. That it's accepted by the server might have been an oversight, or something they neglected to address in the docs.
src/Schema/Blueprint.php
Outdated
@@ -16,6 +16,14 @@ | |||
use function is_string; | |||
use function key; | |||
|
|||
/** | |||
* @phpstan-type SearchIndexField array{type: 'boolean'|'date'|'dateFacet'|'objectId'|'stringFacet'|'uuid'} | array{type: 'autocomplete', analyzer?: string, maxGrams?: int, minGrams?: int, tokenization?: 'edgeGram'|'rightEdgeGram'|'nGram', foldDiacritics?: bool} | array{type: 'document'|'embeddedDocuments', dynamic?:bool, fields: array<string, array>} | array{type: 'geo', indexShapes?: bool} | array{type: 'number'|'numberFacet', representation?: 'int64'|'double', indexIntegers?: bool, indexDoubles?: bool} | array{type: 'token', normalizer?: 'lowercase'|'none'} | array{type: 'string', analyzer?: string, searchAnalyzer?: string, indexOptions?: 'docs'|'freqs'|'positions'|'offsets', store?: bool, ignoreAbove?: int, multi?: array<string, array>, norms?: 'include'|'omit'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corresponding docs for this type would be: https://www.mongodb.com/docs/atlas/atlas-search/define-field-mappings/#std-label-fts-field-mappings
I'm holding off reviewing this in case you're able to break it apart into multiple lines for readability.
src/Schema/Blueprint.php
Outdated
* indexShapes?: bool, | ||
* } | array{ | ||
* type: 'number'|'numberFacet', | ||
* representation?: 'int64'|'double', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there's a reason the docs used string
instead of enum
for this type: https://www.mongodb.com/docs/atlas/atlas-search/field-types/number-facet-type/#configure-fts-field-type-field-properties
Seems inconsistent with the autocomplete
tokenization
definition. Maybe something to alert the Atlas docs teams about. I'll defer to you if you'd rather relax this to string
or restrict the values here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave this thread open for the time being but I'm going to close my other comments about string
and enum
consistency.
tests/SchemaTest.php
Outdated
@@ -514,4 +558,16 @@ protected function getIndex(string $collection, string $name) | |||
|
|||
return false; | |||
} | |||
|
|||
protected function getSearchIndex(string $collection, string $name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is supposed to be hasSearchIndex()
, shouldn't it then just return a bool
?
If there is actually a need to return the index definition, I'd suggest always returning a nullable BSONDocument and working with the return value accordingly in the calling context. But it looks like something should still be changed with this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like all of my previous comments have been addressed.
Fix PHPORM-273
The new helpers have advanced array-shape types for the index definition. The support for autocompletion of the array-shaped arguments in PHPStorm is not good for use-case 😞 (even if it was improved in 2022)
Checklist