Team/hypothesis tests #474

Merged on May 6, 2023 (197 commits)
0f80148
initial cut of hypothesis-based property tests
levand Apr 6, 2023
d3e17ea
Allow capital letters in collection names
levand Apr 7, 2023
b45e2cf
WIP on collection state machine test
levand Apr 10, 2023
217502b
add clean failing minimal examples
levand Apr 10, 2023
86888f9
fix incorrect test logic
levand Apr 11, 2023
785d3c1
Fix collection name validation
levand Apr 10, 2023
f48a07a
only construct default embedding function once
levand Apr 11, 2023
69c6822
update metadata when doing 'upsert' on collection
levand Apr 11, 2023
9e9e97c
re-enable all test api fixtures
levand Apr 11, 2023
470c0cb
Merge pull request #302 from chroma-core/lukev/fix-collection-name-va…
levand Apr 11, 2023
02bb481
Update docstrings to reflect metadata upsert behavior
levand Apr 11, 2023
f23ba4b
Revert "only construct default embedding function once".
levand Apr 12, 2023
34006bc
Use class var to store SentenceTransformer instances
levand Apr 12, 2023
9b4f003
Minimal ANN Accuracy Invariant and Collection.add() (#329)
atroyn Apr 12, 2023
2558481
Merge pull request #328 from chroma-core/lukev/fix-collections
levand Apr 13, 2023
d6a308a
Merge branch 'team/hypothesis-tests' into lukev/collection-state-mach…
levand Apr 13, 2023
214c06c
Merge pull request #324 from chroma-core/lukev/collection-state-machi…
levand Apr 13, 2023
6e20759
state machine tests for embeddings
levand Apr 15, 2023
7fd4233
remember to reset before each unit test
levand Apr 15, 2023
3707a35
if creation fails, finish step
levand Apr 15, 2023
7b213ac
temporarily generate IDs that we know won't cause SQL issues
levand Apr 15, 2023
682445d
Merge branch 'lukev/hypothesis-test-fixes' into lukev/embeddings-stat…
levand Apr 15, 2023
5e7940d
add failing tests for duplicate embeddings
levand Apr 16, 2023
f7e3874
add update to embedding stateful tests
levand Apr 17, 2023
d036595
valiation to prevent dup ID inserts
levand Apr 16, 2023
1a46040
add JS validation & tests
levand Apr 16, 2023
8671504
use unique IDs in unit tests
levand Apr 17, 2023
c5b096e
fix js test to handle local validation
levand Apr 17, 2023
25d451f
ensure that documents are populated for updates
levand Apr 17, 2023
87802f0
clean unused code
levand Apr 17, 2023
e31d240
Revert "fix js test to handle local validation"
levand Apr 17, 2023
b136933
Revert "add JS validation & tests"
levand Apr 17, 2023
c43051a
don't convert existing IDs to set
levand Apr 17, 2023
d6fd0c6
use and check for specific error types
levand Apr 17, 2023
693a53a
avoid operator overloading
levand Apr 17, 2023
252e56d
Merge pull request #363 from chroma-core/lukev/validate_add
levand Apr 17, 2023
e02deb5
avoid shadowing name from strategies module
levand Apr 17, 2023
ff66ce3
remove extra type annotations to avoid confusion
levand Apr 17, 2023
6b54ca9
simplify code as discussed in review
levand Apr 17, 2023
5139c39
remove precondition; start updating right away
levand Apr 17, 2023
4b338fd
all invariants in their own module
levand Apr 17, 2023
554c003
Added xfail overrides on tests expected to be failing.
atroyn Apr 18, 2023
eb57b56
Merge pull request #355 from chroma-core/lukev/embeddings-stateful-tests
levand Apr 18, 2023
a2e219f
fix updates by ensuring correct ID ordering
levand Apr 18, 2023
b377d7f
Revert "Revert "add JS validation & tests""
levand Apr 18, 2023
a5cef8a
Revert "Revert "fix js test to handle local validation""
levand Apr 18, 2023
aa305c9
remove stub test we don't plan on implementing
levand Apr 18, 2023
0733afd
enable CI for team/hypothesis-tests branch
levand Apr 18, 2023
12f83da
Whitespace change to test CI.
levand Apr 18, 2023
bd1a871
Whitespace change to provoke CI
levand Apr 18, 2023
e5d6090
Upsert test
atroyn Mar 29, 2023
498b5bb
More info about the test
atroyn Mar 29, 2023
ddc4aef
Cleaned up what I actually meant in the test
atroyn Mar 29, 2023
d2d454c
API and tests
atroyn Apr 1, 2023
b4c74b7
Collection and APIs
atroyn Apr 1, 2023
cc91113
Pytest on by default in vscode
atroyn Apr 1, 2023
93d9dde
Updated tests
atroyn Apr 5, 2023
615a5b0
Removed prints
atroyn Apr 5, 2023
1bbc2b2
factor out dup code in add/update/upsert
levand Apr 17, 2023
e7e4fab
clean up docstrings
levand Apr 17, 2023
a0d017a
fix invalid regex
levand Apr 17, 2023
f75abf1
fix argument order
levand Apr 17, 2023
f7a6636
updates do not require embeddings
levand Apr 17, 2023
5109895
add upsert to js client
levand Apr 17, 2023
391c4b2
fix occasional test failure
levand Apr 18, 2023
088fa88
remove stub test we don't plan on implementing
levand Apr 18, 2023
2acf3cd
enable CI for team/hypothesis-tests branch
levand Apr 18, 2023
4ffeee2
Upsert test
atroyn Mar 29, 2023
8f456a3
More info about the test
atroyn Mar 29, 2023
667c595
Cleaned up what I actually meant in the test
atroyn Mar 29, 2023
9394cab
API and tests
atroyn Apr 1, 2023
eaf8de2
Collection and APIs
atroyn Apr 1, 2023
2c6878f
Pytest on by default in vscode
atroyn Apr 1, 2023
1af761b
Updated tests
atroyn Apr 5, 2023
12d666a
Removed prints
atroyn Apr 5, 2023
6f6f697
factor out dup code in add/update/upsert
levand Apr 17, 2023
be7ed6d
clean up docstrings
levand Apr 17, 2023
a236b73
fix invalid regex
levand Apr 17, 2023
da8d6b4
fix argument order
levand Apr 17, 2023
bbbe737
updates do not require embeddings
levand Apr 17, 2023
25870d7
Basic Persistence Tests (#372)
HammadB Apr 19, 2023
58e88fd
state machine tests for upsert
levand Apr 19, 2023
dbd7358
Merge branch 'team/hypothesis-tests' into lukev/anton-upsert
levand Apr 19, 2023
1af0f09
Add explanatory comment.
levand Apr 19, 2023
0ad2b40
Merge pull request #375 from chroma-core/lukev/fix-update-ids
levand Apr 19, 2023
be1ee89
add unit test to cover case removed from state machine
levand Apr 19, 2023
653184f
unit test duplicate checks in upsert
levand Apr 19, 2023
21a8722
fix: call right method
levand Apr 19, 2023
867ca06
rename argument for clarity
levand Apr 19, 2023
29dff1a
Merge pull request #379 from chroma-core/lukev/fix-coll-upsert-failure
levand Apr 19, 2023
869996b
use generators to allow cleanup for fixtures
levand Apr 19, 2023
d1d7c4c
fixtures for local fastapi server
levand Apr 19, 2023
48d7296
propagate HTTP errors as correct type
levand Apr 19, 2023
b8f4db4
Merge branch 'team/hypothesis-tests' into lukev/anton-upsert
levand Apr 20, 2023
d461c9b
update .dockerignore to improve build times
levand Apr 20, 2023
aed8fde
run integration tests only from bin/integration-tests
levand Apr 20, 2023
de31abe
parameterize SQL
levand Apr 20, 2023
e0c1fd6
add integration tests to hypothesis fixtures
levand Apr 20, 2023
d205f87
Merge pull request #385 from chroma-core/lukev/anton-upsert
levand Apr 20, 2023
196959d
Merge branch 'team/hypothesis-tests' into lukev/upsert-js
levand Apr 20, 2023
81cb8a3
Delete clickhouse-run
levand Apr 20, 2023
4301535
remove unused function
levand Apr 21, 2023
32b3957
clean up and simplify strategies
levand Apr 22, 2023
dd3beb6
update tests to use new strategies
levand Apr 22, 2023
a42fefd
Cross-version persistence tests (#386)
HammadB Apr 24, 2023
abff57f
remove not doing TODOs
HammadB Apr 24, 2023
887d466
Generalized ANN Tests (#414)
atroyn Apr 25, 2023
2468acc
WIP unsatisfiable errors
levand Apr 25, 2023
c088086
cleanup & tweaks to avoid UnsatisfiableErrors
levand Apr 25, 2023
1265a6d
Merge branch 'team/hypothesis-tests' into lukev/collection-contents-s…
levand Apr 25, 2023
c9faacb
update persist tests to use new strategies
levand Apr 25, 2023
0fc6775
WIP filtering
levand Apr 21, 2023
d6d4a35
where-clause filtering working
levand Apr 25, 2023
4317c85
combo with id-based filter
levand Apr 25, 2023
86c4680
add doc generation and keyword filtering
levand Apr 25, 2023
4336f3d
add flag to omit filter data unless needed
levand Apr 25, 2023
2f2f579
move module fixtures to conftest level
levand Apr 26, 2023
d183872
Persist state machine (#401)
HammadB Apr 26, 2023
512d3f8
use common fixtures for all tests
levand Apr 26, 2023
84cf64c
type hints on fixture generators
levand Apr 26, 2023
b2fc5a6
Merge branch 'team/hypothesis-tests' into lukev/hypothesis-integratio…
levand Apr 26, 2023
683b601
cleanup whitespace
levand Apr 26, 2023
0f062e7
restrict tests & enable full logging
levand Apr 26, 2023
c866e55
Seperate integration tests into their own github actions (#427)
HammadB Apr 26, 2023
4f6b82b
split out test matrix
levand Apr 26, 2023
65e8cf7
Merge branch 'team/hypothesis-tests' into lukev/hypothesis-integratio…
levand Apr 26, 2023
309edf4
add explicit timeout to avoid timeout cache bug
levand Apr 27, 2023
820bb5d
cleanup assertion messages
levand Apr 27, 2023
096f273
updates in response to PR feedback
levand Apr 27, 2023
57c8695
only query for a fraction of results
levand Apr 27, 2023
bf83b07
Incorporate tweaks from PR feedback
levand Apr 27, 2023
84802d7
Merge pull request #425 from chroma-core/lukev/hypothesis-filtering
levand Apr 27, 2023
0e1cbf6
cleanup based on PR feadback
levand Apr 27, 2023
2cf7363
Merge branch 'team/hypothesis-tests' into lukev/collection-contents-s…
levand Apr 27, 2023
b3acc46
small change to test CI
levand Apr 27, 2023
9459baf
demonstrate a bug with ANN accuracy
levand Apr 27, 2023
816a3ad
Combine fixtures (#431)
HammadB Apr 27, 2023
5d229bd
Merge branch 'team/hypothesis-tests' into lukev/hypothesis-integratio…
levand Apr 27, 2023
a429db3
Merge branch 'team/hypothesis-tests' into lukev/collection-contents-s…
levand Apr 27, 2023
3bb54aa
ensure ann results are sorted (#434)
HammadB Apr 28, 2023
3fec500
Increase teset hnsw settings
HammadB Apr 28, 2023
db36086
merge
HammadB Apr 28, 2023
2c6cee5
Merge branch 'lukev/ann_invariant_bug_demo' into lukev/collection-con…
levand Apr 28, 2023
25096ed
Merge branch 'lukev/ann_invariant_bug_demo' into lukev/collection-con…
levand Apr 28, 2023
8b5f62a
Merge branch 'lukev/ann_invariant_bug_demo' into lukev/hypothesis-int…
levand Apr 28, 2023
551f69a
Ann invariant increase hnsw params (#446)
HammadB Apr 28, 2023
e05e7b5
clean up configurations
levand Apr 28, 2023
2d23fa7
remove verbose logging
levand Apr 28, 2023
89aeab7
more type hints
levand Apr 28, 2023
f07abc0
Merge branch 'team/hypothesis-tests' into lukev/collection-contents-s…
levand Apr 28, 2023
b042279
Merge pull request #423 from chroma-core/lukev/collection-contents-st…
levand Apr 28, 2023
48b6143
Update test_persist.py
levand Apr 28, 2023
df1cc76
Merge branch 'team/hypothesis-tests' into lukev/hypothesis-integratio…
levand Apr 28, 2023
932567c
add filtering tests to CI
levand Apr 28, 2023
feda987
fix merge errors
levand Apr 28, 2023
748c9cc
add workaround for FastAPI quirk
levand Apr 30, 2023
7ed736e
fix filter semantics in ClickHouse
levand May 1, 2023
f28fcc2
update strategies and invariants to handle unwrapped values
levand May 1, 2023
dd350b8
don't convert empty dicts to None
levand May 1, 2023
ced7357
expose invariant wrapping for consistency
levand May 1, 2023
1161a7e
add standardized mechanism for cross-version changes
levand May 1, 2023
4f4e600
normalize embeddings before adding IDs to bundle
levand May 1, 2023
a8afde6
require key to be present for all operators
levand May 1, 2023
20583c0
require key to be present for implicit ops too
levand May 1, 2023
d7cab42
make guard logic compose within OR clauses
levand May 1, 2023
5906935
cleanup simple duplicative merge
HammadB May 1, 2023
e3ba284
constrain JSONd values to float32 for Clickhouse compatibility
levand May 2, 2023
0a0f66e
Merge branch 'team/hypothesis-tests' into lukev/hypothesis-integratio…
levand May 2, 2023
53e93d2
Revert "constrain JSONd values to float32 for Clickhouse compatibility"
levand May 2, 2023
2e43c5f
prevent generation of subnormal flaots
levand May 2, 2023
f3e818b
parallelize integration tests using same approach as unit tests
levand May 2, 2023
5cd22fb
Split apart tests to match what's currently in main
levand May 2, 2023
016315c
factor out upsert tests to their own file
levand May 2, 2023
f732ede
Merge pull request #398 from chroma-core/lukev/hypothesis-integration…
levand May 2, 2023
746ce9e
Merge branch 'team/hypothesis-tests' into lukev/test-unwrapped-values
levand May 2, 2023
7ca1e7b
Merge pull request #451 from chroma-core/lukev/test-unwrapped-values
levand May 2, 2023
a1b9347
Merge branch 'team/hypothesis-tests' into lukev/validate-add-js
levand May 2, 2023
f39f049
Merge branch 'team/hypothesis-tests' into lukev/upsert-js
levand May 2, 2023
caa03d0
fix bug with intended test partition; actually exclude prop tests
levand May 2, 2023
824a406
poke CI
levand May 2, 2023
5c929c9
Merge pull request #454 from chroma-core/lukev/fix-glob-in-ci-config
levand May 3, 2023
e1de81f
Merge branch 'team/hypothesis-tests' into lukev/upsert-js
levand May 3, 2023
5904549
Merge branch 'team/hypothesis-tests' into lukev/validate-add-js
levand May 3, 2023
487a48e
python version matrix (#448)
HammadB May 3, 2023
9500e2a
Query filtering (#453)
HammadB May 3, 2023
3fdd908
Merge pull request #377 from chroma-core/lukev/validate-add-js
levand May 3, 2023
dec24e7
Merge branch 'team/hypothesis-tests' into lukev/upsert-js
levand May 3, 2023
be04dca
Merge pull request #399 from chroma-core/lukev/upsert-js
levand May 3, 2023
000a9d3
Add support for multiple spaces (#457)
HammadB May 4, 2023
833b89a
PR checklist (#459)
HammadB May 5, 2023
cfdf89c
Fix PR review checklist
HammadB May 5, 2023
891f637
Test embedding functions (#466)
atroyn May 5, 2023
8dfb223
merge main into team hypothesis test
HammadB May 5, 2023
2f52675
Merge branch 'team/merge_team_hypothesis' into team/hypothesis-tests
HammadB May 5, 2023
a19af5c
merge main
HammadB May 6, 2023
fefab56
inaccurate log
HammadB May 6, 2023
08d4fc0
Add epsilon for norms in cosine per hnswli
HammadB May 6, 2023
8 changes: 7 additions & 1 deletion .dockerignore
@@ -1,3 +1,9 @@
venv
.git
examples
examples
clients
.hypothesis
__pycache__
.vscode
*.egg-info
.pytest_cache
37 changes: 37 additions & 0 deletions .github/workflows/chroma-integration-test.yml
@@ -0,0 +1,37 @@
name: Chroma Integration Tests

on:
push:
branches:
- main
- team/hypothesis-tests
pull_request:
branches:
- main
- team/hypothesis-tests

jobs:
test:
strategy:
matrix:
python: ['3.7']
platform: [ubuntu-latest]
testfile: ["--ignore-glob 'chromadb/test/property/*'",
"chromadb/test/property/test_add.py",
"chromadb/test/property/test_collections.py",
"chromadb/test/property/test_cross_version_persist.py",
"chromadb/test/property/test_embeddings.py",
"chromadb/test/property/test_filtering.py",
"chromadb/test/property/test_persist.py"]
runs-on: ${{ matrix.platform }}
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python }}
- name: Install test dependencies
run: python -m pip install -r requirements.txt && python -m pip install -r requirements_dev.txt
- name: Integration Test
run: bin/integration-test ${{ matrix.testfile }}
16 changes: 12 additions & 4 deletions .github/workflows/chroma-test.yml
@@ -4,16 +4,26 @@ on:
push:
branches:
- main
- team/hypothesis-tests
pull_request:
branches:
- main
- team/hypothesis-tests

jobs:
test:
timeout-minutes: 90
strategy:
matrix:
python: ['3.10']
python: ['3.7', '3.8', '3.9', '3.10']
platform: [ubuntu-latest]
testfile: ["--ignore-glob 'chromadb/test/property/*'",
"chromadb/test/property/test_add.py",
"chromadb/test/property/test_collections.py",
"chromadb/test/property/test_cross_version_persist.py",
"chromadb/test/property/test_embeddings.py",
"chromadb/test/property/test_filtering.py",
"chromadb/test/property/test_persist.py"]
runs-on: ${{ matrix.platform }}
steps:
- name: Checkout
@@ -25,6 +35,4 @@ jobs:
- name: Install test dependencies
run: python -m pip install -r requirements.txt && python -m pip install -r requirements_dev.txt
- name: Test
run: python -m pytest
- name: Integration Test
run: bin/integration-test
run: python -m pytest ${{ matrix.testfile }}
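
The `matrix` in the updated workflow fans the unit-test job out over every combination of Python version, platform, and test partition. As a rough illustration (not part of this PR), the expansion can be enumerated in Python. The axis values are copied from the new `chroma-test.yml`; the helper name `expand_matrix` is hypothetical.

```python
from itertools import product

# Axis values copied from the updated chroma-test.yml matrix.
PYTHON_VERSIONS = ["3.7", "3.8", "3.9", "3.10"]
PLATFORMS = ["ubuntu-latest"]
TESTFILES = [
    "--ignore-glob 'chromadb/test/property/*'",  # everything except the property tests
    "chromadb/test/property/test_add.py",
    "chromadb/test/property/test_collections.py",
    "chromadb/test/property/test_cross_version_persist.py",
    "chromadb/test/property/test_embeddings.py",
    "chromadb/test/property/test_filtering.py",
    "chromadb/test/property/test_persist.py",
]


def expand_matrix(pythons, platforms, testfiles):
    """Return one dict per job: the cross product of the matrix axes."""
    return [
        {"python": py, "platform": plat, "testfile": tf}
        for py, plat, tf in product(pythons, platforms, testfiles)
    ]


jobs = expand_matrix(PYTHON_VERSIONS, PLATFORMS, TESTFILES)
print(len(jobs))  # 4 versions x 1 platform x 7 partitions
```

Each property-test file gets its own job, while the first entry runs the remaining tests with the property directory excluded.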
37 changes: 37 additions & 0 deletions .github/workflows/pr-review-checklist.yml
@@ -0,0 +1,37 @@
name: PR Review Checklist

on:
pull_request_target:
types:
- opened

jobs:
PR-Comment:
runs-on: ubuntu-latest
steps:
- name: PR Comment
uses: actions/github-script@v2
with:
github-token: ${{secrets.GITHUB_TOKEN}}
script: |
github.issues.createComment({
issue_number: ${{ github.event.number }},
owner: context.repo.owner,
repo: context.repo.repo,
body: `# Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving
## Testing, Bugs, Errors, Logs, Documentation
- [ ] Can you think of any use case in which the code does not behave as intended? Have they been tested?
- [ ] Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
- [ ] If appropriate, are there adequate property based tests?
- [ ] If appropriate, are there adequate unit tests?
- [ ] Should any logging, debugging, tracing information be added or removed?
- [ ] Are error messages user-friendly?
- [ ] Have all documentation changes needed been made?
- [ ] Have all non-obvious changes been commented?
## System Compatibility
- [ ] Are there any potential impacts on other parts of the system or backward compatibility?
- [ ] Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?
## Quality
- [ ] Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)`
})
2 changes: 1 addition & 1 deletion .gitignore
@@ -21,4 +21,4 @@ dist
.terraform.lock.hcl
terraform.tfstate
.hypothesis/
.idea
.idea
49 changes: 27 additions & 22 deletions .vscode/settings.json
@@ -1,23 +1,28 @@
{
"git.ignoreLimitWarning": true,
"editor.rulers": [
120
],
"editor.formatOnSave": true,
"python.formatting.provider": "black",
"files.exclude": {
"**/__pycache__": true,
"**/.ipynb_checkpoints": true,
"**/.pytest_cache": true,
"**/chroma.egg-info": true
},
"python.analysis.typeCheckingMode": "basic",
"python.linting.flake8Enabled": true,
"python.linting.enabled": true,
"python.linting.flake8Args": [
"--extend-ignore=E203",
"--extend-ignore=E501",
"--extend-ignore=E503",
"--max-line-length=88",
],
}
"git.ignoreLimitWarning": true,
"editor.rulers": [
120
],
"editor.formatOnSave": true,
"python.formatting.provider": "black",
"files.exclude": {
"**/__pycache__": true,
"**/.ipynb_checkpoints": true,
"**/.pytest_cache": true,
"**/chroma.egg-info": true
},
"python.analysis.typeCheckingMode": "basic",
"python.linting.flake8Enabled": true,
"python.linting.enabled": true,
"python.linting.flake8Args": [
"--extend-ignore=E203",
"--extend-ignore=E501",
"--extend-ignore=E503",
"--max-line-length=88"
],
"python.testing.pytestArgs": [
"."
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true
}
18 changes: 9 additions & 9 deletions README.md
@@ -13,10 +13,10 @@
</a> |
<a href="https://github.com/chroma-core/chroma/blob/master/LICENSE" target="_blank">
<img src="https://img.shields.io/static/v1?label=license&message=Apache 2.0&color=white" alt="License">
</a> |
</a> |
<a href="https://docs.trychroma.com/" target="_blank">
Docs
</a> |
</a> |
<a href="https://www.trychroma.com/" target="_blank">
Homepage
</a>
@@ -30,19 +30,19 @@ pip install chromadb # python client

The core API is only 4 functions (run our [💡 Google Colab](https://colab.research.google.com/drive/1QEzFyqnoFxq7LUGyP1vzR4iLt9PpCDXv?usp=sharing) or [Replit template](https://replit.com/@swyx/BasicChromaStarter?v=1)):

```python
```python
import chromadb
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()

# Create collection. get_collection, get_or_create_collection, delete_collection also available!
collection = client.create_collection("all-my-documents")
collection = client.create_collection("all-my-documents")

# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
documents=["This is document1", "This is document2"], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
metadatas=[{"source": "notion"}, {"source": "google-docs"}], # filter on these!
ids=["doc1", "doc2"], # unique for each doc
ids=["doc1", "doc2"], # unique for each doc
)

# Query/search 2 most similar results. You can also .get by id
@@ -66,23 +66,23 @@ results = collection.query(
For example, the `"Chat your data"` use case:
1. Add documents to your database. You can pass in your own embeddings, embedding function, or let Chroma embed them for you.
2. Query relevant documents with natural language.
3. Compose documents into the context window of an LLM like `GPT3` for additional summarization or analysis.
3. Compose documents into the context window of an LLM like `GPT3` for additional summarization or analysis.

## Embeddings?

What are embeddings?

- [Read the guide from OpenAI](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)
- __Literal__: Embedding something turns it from image/text/audio into a list of numbers. 🖼️ or 📄 => `[1.2, 2.1, ....]`. This process makes documents "understandable" to a machine learning model.
- __By analogy__: An embedding represents the essence of a document. This enables documents and queries with the same essence to be "near" each other and therefore easy to find.
- __Literal__: Embedding something turns it from image/text/audio into a list of numbers. 🖼️ or 📄 => `[1.2, 2.1, ....]`. This process makes documents "understandable" to a machine learning model.
- __By analogy__: An embedding represents the essence of a document. This enables documents and queries with the same essence to be "near" each other and therefore easy to find.
- __Technical__: An embedding is the latent-space position of a document at a layer of a deep neural network. For models trained specifically to embed data, this is the last layer.
- __A small example__: If you search your photos for "famous bridge in San Francisco", embedding this query and comparing it to the embeddings of your photos and their metadata should return photos of the Golden Gate Bridge.

Embeddings databases (also known as **vector databases**) store embeddings and allow you to search by nearest neighbors rather than by substrings like a traditional database. By default, Chroma uses [Sentence Transformers](https://docs.trychroma.com/embeddings#default-sentence-transformers) to embed for you but you can also use OpenAI embeddings, Cohere (multilingual) embeddings, or your own.
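
To make "search by nearest neighbors" concrete, here is a minimal brute-force sketch using cosine similarity. It is illustrative only: the three-dimensional vectors and file names are made up, and the helpers `cosine_similarity` and `nearest` are hypothetical, not Chroma functions. A real embeddings database would typically use an approximate index rather than scanning every item.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, items, k=1):
    """Return the names of the k items most similar to the query."""
    ranked = sorted(
        items, key=lambda kv: cosine_similarity(query, kv[1]), reverse=True
    )
    return [name for name, _ in ranked[:k]]


# Toy "embeddings" (assumed values, chosen by hand for the example).
EMBEDDINGS = {
    "golden_gate.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.0, 0.2, 0.9],
    "bay_bridge.jpg": [0.7, 0.6, 0.1],
}

query = [1.0, 0.2, 0.0]  # stand-in for an embedded text query
print(nearest(query, EMBEDDINGS.items(), k=2))
```

The query never has to share a substring with the stored items; ranking depends only on how close the vectors are.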

## Get involved

Chroma is a rapidly developing project. We welcome PR contributors and ideas for how to improve the project.
Chroma is a rapidly developing project. We welcome PR contributors and ideas for how to improve the project.
- [Join the conversation on Discord](https://discord.gg/MMeYNTmh3x)
- [Review the roadmap and contribute your ideas](https://docs.trychroma.com/roadmap)
- [Grab an issue and open a PR](https://github.com/chroma-core/chroma/issues)
8 changes: 4 additions & 4 deletions bin/integration-test
@@ -12,15 +12,15 @@ trap cleanup EXIT

docker compose -f docker-compose.test.yml up --build -d

export CHROMA_INTEGRATION_TEST=1
export CHROMA_INTEGRATION_TEST_ONLY=1
export CHROMA_API_IMPL=rest
export CHROMA_SERVER_HOST=localhost
export CHROMA_SERVER_HTTP_PORT=8000

python -m pytest
echo testing: python -m pytest "$@"
python -m pytest "$@"

cd clients/js
yarn
yarn test:run
cd ../..

cd ../..
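
For readers more comfortable with Python than shell, the environment setup and argument pass-through in the updated script can be sketched as below. This is an assumption-laden paraphrase, not code from the PR: the helper names `integration_env` and `pytest_command` are hypothetical, and the Docker Compose and JS client steps are omitted.

```python
import os
import sys


def integration_env(base=None):
    """Copy the environment and add the variables the script exports."""
    env = dict(os.environ if base is None else base)
    env.update(
        {
            "CHROMA_INTEGRATION_TEST_ONLY": "1",
            "CHROMA_API_IMPL": "rest",
            "CHROMA_SERVER_HOST": "localhost",
            "CHROMA_SERVER_HTTP_PORT": "8000",
        }
    )
    return env


def pytest_command(extra_args):
    """Build the pytest invocation, forwarding arguments like "$@" does."""
    return [sys.executable, "-m", "pytest", *extra_args]


# Example: the command a CI job would build for one test partition.
print(pytest_command(["chromadb/test/property/test_add.py"]))
```

The pair could be passed to `subprocess.run(pytest_command(sys.argv[1:]), env=integration_env(), check=True)` to run a single partition against a server that is already listening on port 8000.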
48 changes: 40 additions & 8 deletions chromadb/api/__init__.py
@@ -61,7 +61,8 @@ def create_collection(
Args:
name (str): The name of the collection to create. The name must be unique.
metadata (Optional[Dict], optional): A dictionary of metadata to associate with the collection. Defaults to None.
get_or_create (bool, optional): If True, will return the collection if it already exists. Defaults to False.
get_or_create (bool, optional): If True, will return the collection if it already exists,
and update the metadata (if applicable). Defaults to False.
embedding_function (Optional[Callable], optional): A function that takes documents and returns an embedding. Defaults to None.

Returns:
@@ -82,8 +83,11 @@ def delete_collection(
"""

@abstractmethod
def get_or_create_collection(self, name: str, metadata: Optional[Dict] = None) -> Collection:
"""Calls create_collection with get_or_create=True
def get_or_create_collection(
self, name: str, metadata: Optional[Dict] = None
) -> Collection:
"""Calls create_collection with get_or_create=True.
If the collection exists, but with different metadata, the metadata will be replaced.

Args:
name (str): The name of the collection to create. The name must be unique.
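
The documented `get_or_create` behavior can be illustrated with a toy in-memory client. This is a sketch under stated assumptions, not Chroma's implementation: `ToyClient` is hypothetical, it returns plain `(name, metadata)` tuples instead of `Collection` objects, it raises `ValueError` on a name clash, and it reads "if applicable" as "when metadata is passed".

```python
from typing import Dict, Optional, Tuple


class ToyClient:
    """Hypothetical stand-in that only tracks collection metadata."""

    def __init__(self) -> None:
        self._collections: Dict[str, Optional[dict]] = {}

    def create_collection(
        self,
        name: str,
        metadata: Optional[dict] = None,
        get_or_create: bool = False,
    ) -> Tuple[str, Optional[dict]]:
        if name in self._collections:
            if not get_or_create:
                # Assumed error type; the real client may raise something else.
                raise ValueError(f"Collection {name} already exists")
            if metadata is not None:
                # Existing collection, new metadata: replace it.
                self._collections[name] = metadata
            return name, self._collections[name]
        self._collections[name] = metadata
        return name, metadata

    def get_or_create_collection(
        self, name: str, metadata: Optional[dict] = None
    ) -> Tuple[str, Optional[dict]]:
        """Delegates to create_collection with get_or_create=True."""
        return self.create_collection(name, metadata, get_or_create=True)


client = ToyClient()
client.create_collection("docs", {"version": 1})
print(client.get_or_create_collection("docs", {"version": 2}))
```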
@@ -141,7 +145,7 @@ def _add(
⚠️ It is recommended to use the more specific methods below when possible.

Args:
collection_name (Union[str, Sequence[str]]): The model space(s) to add the embeddings to
collection_name (Union[str, Sequence[str]]): The collection(s) to add the embeddings to
embedding (Sequence[Sequence[float]]): The sequence of embeddings to add
metadata (Optional[Union[Dict, Sequence[Dict]]], optional): The metadata to associate with the embeddings. Defaults to None.
documents (Optional[Union[str, Sequence[str]]], optional): The documents to associate with the embeddings. Defaults to None.
@@ -162,17 +166,40 @@ def _update(
⚠️ It is recommended to use the more specific methods below when possible.

Args:
collection_name (Union[str, Sequence[str]]): The model space(s) to add the embeddings to
collection_name (Union[str, Sequence[str]]): The collection(s) to add the embeddings to
embedding (Sequence[Sequence[float]]): The sequence of embeddings to add
"""
pass

@abstractmethod
def _upsert(
self,
collection_name: str,
ids: IDs,
embeddings: Optional[Embeddings] = None,
metadatas: Optional[Metadatas] = None,
documents: Optional[Documents] = None,
increment_index: bool = True,
):
"""Add or update entries in the embedding store.
If an entry with the same id already exists, it will be updated, otherwise it will be added.

Args:
collection_name (str): The collection to add the embeddings to
ids (IDs): The ids to associate with the embeddings.
embeddings (Optional[Embeddings], optional): The sequence of embeddings to add. Defaults to None.
metadatas (Optional[Union[Dict, Sequence[Dict]]], optional): The metadata to associate with the embeddings. Defaults to None.
documents (Optional[Union[str, Sequence[str]]], optional): The documents to associate with the embeddings. Defaults to None.
increment_index (bool, optional): If True, will incrementally add to the ANN index of the collection. Defaults to True.
"""
pass

@abstractmethod
def _count(self, collection_name: str) -> int:
"""Returns the number of embeddings in the database

Args:
collection_name (str): The model space to count the embeddings in.
collection_name (str): The collection to count the embeddings in.

Returns:
int: The number of embeddings in the collection
@@ -278,14 +305,19 @@ def raw_sql(self, sql: str) -> pd.DataFrame:

@abstractmethod
def create_index(self, collection_name: Optional[str] = None) -> bool:
"""Creates an index for the given model space
"""Creates an index for the given collection
⚠️ This method should not be used directly.

Args:
collection_name (Optional[str], optional): The model space to create the index for. Uses the client's model space if None. Defaults to None.
collection_name (Optional[str], optional): The collection to create the index for. Uses the client's collection if None. Defaults to None.

Returns:
bool: True if the index was created successfully

"""
pass

@abstractmethod
def persist(self):
"""Persist the database to disk"""
pass
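
The upsert contract described in the `_upsert` docstring (update the entry if the id exists, otherwise add it) can be sketched with a dictionary-backed store. This is illustrative only: `ToyStore` is hypothetical, it ignores `increment_index`, and leaving omitted fields untouched on update is an assumption rather than documented behavior.

```python
from typing import Dict, List, Optional, Sequence


class ToyStore:
    """Hypothetical in-memory store keyed by id."""

    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}

    def upsert(
        self,
        ids: Sequence[str],
        embeddings: Optional[Sequence[Sequence[float]]] = None,
        metadatas: Optional[Sequence[dict]] = None,
        documents: Optional[Sequence[str]] = None,
    ) -> None:
        for i, id_ in enumerate(ids):
            # setdefault adds a blank record for new ids and reuses existing ones.
            record = self.records.setdefault(
                id_, {"embedding": None, "metadata": None, "document": None}
            )
            # Assumption: only the fields that were passed are overwritten.
            if embeddings is not None:
                record["embedding"] = list(embeddings[i])
            if metadatas is not None:
                record["metadata"] = metadatas[i]
            if documents is not None:
                record["document"] = documents[i]

    def count(self) -> int:
        return len(self.records)


store = ToyStore()
store.upsert(["a", "b"], embeddings=[[1.0], [2.0]], documents=["first", "second"])
store.upsert(["b", "c"], embeddings=[[3.0], [4.0]])  # "b" is updated, "c" is added
print(store.count(), store.records["b"])
```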