feat: Add constraints and embedding directives #3405

Open
wants to merge 4 commits into
base: develop
from

Conversation

fredcarle
Collaborator

Relevant issue(s)

Resolves #3350
Resolves #3351

Description

Apologies for bundling three technically separate features into one PR. I started with embeddings, and the size constraint was initially part of the embedding directive. Since we discussed that it could apply to all array fields (and later even to String and Blob), I extracted it into a constraints directive with a size parameter (more constraints can be added in the future). In addition, embeddings returned from ML models are arrays of float32. This caused precision issues because we only supported float64: after saving a float32 array, querying it returned a float64 array with slightly different values. To address this, I added the Float32 type.
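The precision issue can be reproduced outside the database. The following is a minimal sketch, not code from this PR: `widen` is a hypothetical helper that mimics what a float64-only store has to do with a float32 array.

```go
package main

import "fmt"

// widen is a hypothetical helper (not part of this PR) that converts a
// float32 slice to float64, as a float64-only store would have to do.
func widen(in []float32) []float64 {
	out := make([]float64, len(in))
	for i, v := range in {
		out[i] = float64(v)
	}
	return out
}

func main() {
	embedding := []float32{0.1, 0.2, 0.3}
	widened := widen(embedding)

	// The widened value is the exact float32 value, which is not the
	// nearest float64 to 0.1, so the comparison fails.
	fmt.Println(widened[0] == 0.1) // prints false

	// Narrowing back to float32 recovers the original value exactly.
	fmt.Println(float32(widened[0]) == embedding[0]) // prints true
}
```

The conversion itself is lossless, but the widened number prints with extra digits and no longer compares equal to the float64 literal a user would expect, which is why a native Float32 type avoids the surprise.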

You can review the first commit for the constraint and embedding code and the second commit for the float changes. Some float changes might have leaked into the first commit; sorry about that. I tried hard to keep the Float32 changes separate.

Note that the gql.Float type is now Float64 internally.

type User {
  points: Float
}

is the same as

type User {
  points: Float64
}

The embedding generation relies on a third-party package, chromem-go, to call the model provider's API. As long as one of the supported provider APIs is configured and accessible, embeddings are generated when new documents are added. I've added a step to the test workflow that runs the embedding-specific tests on Linux only (installation on Windows and macOS is less straightforward) using Ollama (because it runs locally).

If you're interested in running it locally, install Ollama and define a schema like so:

type User {
    name: String
    about: String
    # The @constraints directive is optional; localhost:11434 is the default address for Ollama.
    name_v: [Float32!] @constraints(size: 768) @embedding(fields: ["name", "about"], provider: "ollama", model: "nomic-embed-text", url: "http://localhost:11434/api")
}

Next steps:

  • Support templates for the content sent to the model.
  • Add the _similarity operation to calculate the cosine similarity between two arrays.

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

make test and manual testing

Specify the platform(s) on which this was tested:

  • MacOS

@fredcarle fredcarle added feature New feature or request area/query Related to the query component area/schema Related to the schema system labels Jan 25, 2025
@fredcarle fredcarle added this to the DefraDB v0.16 milestone Jan 25, 2025
@fredcarle fredcarle requested a review from a team January 25, 2025 00:35
@fredcarle fredcarle self-assigned this Jan 25, 2025
@fredcarle fredcarle changed the title feat: Add contraints and embedding directives and Float32 type support feat: Add contraints and embedding directives Jan 25, 2025
Labels
area/query Related to the query component area/schema Related to the schema system feature New feature or request
Development

Successfully merging this pull request may close these issues.

Add size constraint to array types Create embeddings based on document fields
1 participant