Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHPORM-273 Add schema helpers to create search and vector indexes #3230

Merged
merged 10 commits into from
Jan 2, 2025

Conversation

GromNaN
Copy link
Member

@GromNaN GromNaN commented Dec 18, 2024

Fix PHPORM-273

The new helpers have advanced array-shape types for the index definition. The support for autocompletion of the array-shaped arguments in PHPStorm is not good for use-case 😞 (even if it was improved in 2022)

Checklist

  • Add tests and ensure they pass

@GromNaN GromNaN requested a review from jmikola December 18, 2024 12:34
@GromNaN GromNaN requested a review from a team as a code owner December 18, 2024 12:34
@@ -514,4 +558,16 @@ protected function getIndex(string $collection, string $name)

return false;
}

protected function getSearchIndex(string $collection, string $name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense for this to return a nullable BSONDocument instance? IIRC, Collection::listSearchIndexes() inherits the default type map.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that I simply copied the getIndex method. Should be named hasIndex and hasSearchIndex.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is supposed to be hasSearchIndex(), shouldn't it then just return a bool?

If there is actually a need to return the index definition, I'd suggest always returning a nullable BSONDocument and working with the return value accordingly in the calling context. But it looks like something should still be changed with this method.

tests/SchemaTest.php Outdated Show resolved Hide resolved
tests/SchemaTest.php Show resolved Hide resolved
* @phpstan-type SearchIndexAnalyser array{name: string, charFilters?: list<array<string, mixed>>, tokenizer: array{type: string}, tokenFilters?: list<array<string, mixed>>}
* @phpstan-type SearchIndexStoredSource bool | array{includes: array<string>} | array{excludes: array<string>}
* @phpstan-type SearchIndexDefinition array{analyser?: string, analyzers?: SearchIndexAnalyser[], searchAnalyzer?: string, mappings: array{dynamic: true} | array{dynamic?: bool, fields: array<string, SearchIndexField>}, storedSource?: SearchIndexStoredSource}
* @phpstan-type VectorSearchIndexField array{type: 'vector', path: string, numDimensions: int, similarity: 'euclidean'|'cosine'|'dotProduct', quantization?: 'none'|'scalar'|'binary'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/Schema/Blueprint.php Outdated Show resolved Hide resolved
@@ -16,6 +16,14 @@
use function is_string;
use function key;

/**
* @phpstan-type SearchIndexField array{type: 'boolean'|'date'|'dateFacet'|'objectId'|'stringFacet'|'uuid'} | array{type: 'autocomplete', analyzer?: string, maxGrams?: int, minGrams?: int, tokenization?: 'edgeGram'|'rightEdgeGram'|'nGram', foldDiacritics?: bool} | array{type: 'document'|'embeddedDocuments', dynamic?:bool, fields: array<string, array>} | array{type: 'geo', indexShapes?: bool} | array{type: 'number'|'numberFacet', representation?: 'int64'|'double', indexIntegers?: bool, indexDoubles?: bool} | array{type: 'token', normalizer?: 'lowercase'|'none'} | array{type: 'string', analyzer?: string, searchAnalyzer?: string, indexOptions?: 'docs'|'freqs'|'positions'|'offsets', store?: bool, ignoreAbove?: int, multi?: array<string, array>, norms?: 'include'|'omit'}
* @phpstan-type SearchIndexAnalyser array{name: string, charFilters?: list<array<string, mixed>>, tokenizer: array{type: string}, tokenFilters?: list<array<string, mixed>>}
Copy link
Member

@jmikola jmikola Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using "SearchIndexAnalyzer". American English spelling should be preferred for consistency with the MongoDB documentation.

I'll note you had analyser?: string, analyzers?: SearchIndexAnalyser[] below, which is even internally inconsistent.


charFilters seems correct based on https://www.mongodb.com/docs/atlas/atlas-search/analyzers/character-filters/#std-label-char-filters-ref, although you could break that out into a separate definition to mandate type if you wish. Doing so would also give you a space to directly link to that documentation page. But considering how varied the structures are, mixed is also entirely reasonable.

Same feedback on tokenFilters (https://www.mongodb.com/docs/atlas/atlas-search/analyzers/token-filters/#std-label-token-filters-ref). Actually, it looks like both of those could utilize the same common definition (e.g. "charOrTokenFilter"), although that might send the wrong signal to users. I'm OK to keep this as-is.

src/Schema/Blueprint.php Outdated Show resolved Hide resolved
* @phpstan-type SearchIndexField array{type: 'boolean'|'date'|'dateFacet'|'objectId'|'stringFacet'|'uuid'} | array{type: 'autocomplete', analyzer?: string, maxGrams?: int, minGrams?: int, tokenization?: 'edgeGram'|'rightEdgeGram'|'nGram', foldDiacritics?: bool} | array{type: 'document'|'embeddedDocuments', dynamic?:bool, fields: array<string, array>} | array{type: 'geo', indexShapes?: bool} | array{type: 'number'|'numberFacet', representation?: 'int64'|'double', indexIntegers?: bool, indexDoubles?: bool} | array{type: 'token', normalizer?: 'lowercase'|'none'} | array{type: 'string', analyzer?: string, searchAnalyzer?: string, indexOptions?: 'docs'|'freqs'|'positions'|'offsets', store?: bool, ignoreAbove?: int, multi?: array<string, array>, norms?: 'include'|'omit'}
* @phpstan-type SearchIndexAnalyser array{name: string, charFilters?: list<array<string, mixed>>, tokenizer: array{type: string}, tokenFilters?: list<array<string, mixed>>}
* @phpstan-type SearchIndexStoredSource bool | array{includes: array<string>} | array{excludes: array<string>}
* @phpstan-type SearchIndexDefinition array{analyser?: string, analyzers?: SearchIndexAnalyser[], searchAnalyzer?: string, mappings: array{dynamic: true} | array{dynamic?: bool, fields: array<string, SearchIndexField>}, storedSource?: SearchIndexStoredSource}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quoting the docs, mappings.fields is "required only if dynamic mapping is disabled." In that case, wouldn't it make sense to use dynamic?: false for that branch instead of a generic bool? I don't think there's any case where you'd specify dynamic: true and define fields.


It looks like synonyms is missing here. See: https://www.mongodb.com/docs/atlas/atlas-search/synonyms/#std-label-synonyms-ref

Copy link
Member Author

@GromNaN GromNaN Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's allowed by the server, I think it's a bad idea to be more restrictive. In fact, dynamic: true and a list of fields is exactly what I do for Laravel Scout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what file/line you were linking to above, as I have a lot of condensed files in that view. Was it the DEFAULT_DEFINITION using dynamic: true?

Even if you leave this as-is, I would inquire with the Atlas team about this as it's not clear from the documentation what dynamic: true with field mappings would actually imply. That it's accepted by the server might have been an oversight, or something they neglected to address in the docs.

src/Schema/Blueprint.php Outdated Show resolved Hide resolved
@@ -16,6 +16,14 @@
use function is_string;
use function key;

/**
* @phpstan-type SearchIndexField array{type: 'boolean'|'date'|'dateFacet'|'objectId'|'stringFacet'|'uuid'} | array{type: 'autocomplete', analyzer?: string, maxGrams?: int, minGrams?: int, tokenization?: 'edgeGram'|'rightEdgeGram'|'nGram', foldDiacritics?: bool} | array{type: 'document'|'embeddedDocuments', dynamic?:bool, fields: array<string, array>} | array{type: 'geo', indexShapes?: bool} | array{type: 'number'|'numberFacet', representation?: 'int64'|'double', indexIntegers?: bool, indexDoubles?: bool} | array{type: 'token', normalizer?: 'lowercase'|'none'} | array{type: 'string', analyzer?: string, searchAnalyzer?: string, indexOptions?: 'docs'|'freqs'|'positions'|'offsets', store?: bool, ignoreAbove?: int, multi?: array<string, array>, norms?: 'include'|'omit'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corresponding docs for this type would be: https://www.mongodb.com/docs/atlas/atlas-search/define-field-mappings/#std-label-fts-field-mappings

I'm holding off reviewing this in case you're able to break it apart into multiple lines for readability.

tests/SchemaTest.php Outdated Show resolved Hide resolved
src/Schema/Blueprint.php Outdated Show resolved Hide resolved
src/Schema/Blueprint.php Outdated Show resolved Hide resolved
src/Schema/Blueprint.php Outdated Show resolved Hide resolved
* indexShapes?: bool,
* } | array{
* type: 'number'|'numberFacet',
* representation?: 'int64'|'double',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there's a reason the docs used string instead of enum for this type: https://www.mongodb.com/docs/atlas/atlas-search/field-types/number-facet-type/#configure-fts-field-type-field-properties

Seems inconsistent with the autocomplete tokenization definition. Maybe something to alert the Atlas docs teams about. I'll defer to you if you'd rather relax this to string or restrict the values here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave this thread open for the time being but I'm going to close my other comments about string and enum consistency.

src/Schema/Blueprint.php Outdated Show resolved Hide resolved
src/Schema/Blueprint.php Outdated Show resolved Hide resolved
src/Schema/Blueprint.php Outdated Show resolved Hide resolved
src/Schema/Blueprint.php Outdated Show resolved Hide resolved
src/Schema/Blueprint.php Outdated Show resolved Hide resolved
@@ -514,4 +558,16 @@ protected function getIndex(string $collection, string $name)

return false;
}

protected function getSearchIndex(string $collection, string $name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is supposed to be hasSearchIndex(), shouldn't it then just return a bool?

If there is actually a need to return the index definition, I'd suggest always returning a nullable BSONDocument and working with the return value accordingly in the calling context. But it looks like something should still be changed with this method.

tests/TestCase.php Outdated Show resolved Hide resolved
Copy link
Member

@jmikola jmikola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like all of my previous comments have been addressed.

@GromNaN GromNaN merged commit 6cb3838 into mongodb:5.x Jan 2, 2025
27 checks passed
@GromNaN GromNaN deleted the PHPORM-273 branch January 2, 2025 17:13
@GromNaN GromNaN added the feature label Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants