Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Training step crashes with out any error for FAISS IVF with quantisation step #465

Closed
layavadi opened this issue Oct 24, 2024 · 1 comment
Labels
bug Something isn't working untriaged

Comments

@layavadi
Copy link

What is the bug?

While performing training on with Cohere 100k corpus and 30k training vectors and encorder pq ( m=8 , codesize=8), training step fails silently with out any error

Running train-knn-model [ 0% done]
[ERROR] Cannot execute-test. Error in load generator [0]
Cannot run task [train-knn-model]: Failed to create model: test-model within 100 retries

How can one reproduce the bug?

PARAM file used in the benchmark
cat train-faiss-cohere-100k-768-ip.json
{
"target_index_name": "target_index",
"target_field_name": "target_field",
"target_index_body": "indices/faiss-index.json",
"target_index_primary_shards": 1,
"target_index_replica_shards": 0,
"target_index_dimension": 768,
"target_index_space_type": "innerproduct",
"target_index_bulk_size": 100,
"target_index_bulk_index_data_set_format": "hdf5",
"target_index_bulk_index_data_set_corpus": "cohere-100k",
"target_index_bulk_indexing_clients": 10,

"train_index_name": "train_index",
"train_field_name": "train_field",
"train_method_engine": "faiss",
"train_index_body": "indices/train-index.json",
"train_index_primary_shards": 1,
"train_index_replica_shards": 0,

"train_index_bulk_size": 100,
"train_index_bulk_index_data_set_format": "hdf5",
"train_index_bulk_index_data_set_corpus": "cohere-100k",
"train_index_bulk_indexing_clients": 5,
"train_index_num_vectors": 30000,
"nlist": 128,
"nprobes": 128,
"pq_encoder_code_size": 8,
"pq_encoder_m": 8,
"encoder": "pq",

"train_model_id": "test-model",
"train_operation_retries": 100,
"train_operation_poll_period": 0.5,
"train_search_size": 10000,

"target_index_max_num_segments": 1,
"target_index_force_merge_timeout": 300,
"hnsw_ef_search": 100,
"hnsw_ef_construction": 100,
"query_k": 100,
"query_body": {
     "docvalue_fields" : ["_id"],
     "stored_fields" : "_none_"
},

"query_data_set_format": "hdf5",
"query_data_set_corpus": "cohere-100k",
"query_count": 10000

}
Running

opensearch-benchmark execute-test --target-hosts ${ENDPOINT} --workload vectorsearch --workload-params ${PARAMS_FILE} --pipeline benchmark-only --test-procedure train-test --kill-running-processes

Also set the
"knn.model.cache.size.limit" : "25%",

What is the expected behavior?

Training should have gone through with out any failure

What is your host/environment?

Running on r6i.4xlarge node ( SIngle data node)
with a pod
opensearchJavaOpts: "-Xmx12G -Xms12G"
resources:
requests:
cpu: "2000m"
memory: "8Gi"
limits:
memory: "32Gi"
cpu: "4"
OS version 15.0

Do you have any screenshots?

If applicable, add screenshots to help explain your problem.

Do you have any additional context?

Add any other context about the problem.

@layavadi layavadi added bug Something isn't working untriaged labels Oct 24, 2024
@layavadi layavadi changed the title [BUG] Training step crashes with out any error for FAISS IVF with quantisation strep [BUG] Training step crashes with out any error for FAISS IVF with quantisation step Oct 24, 2024
@layavadi
Copy link
Author

It was failing due to time out. Default polling period was not sufficient. After increasing the polling period the training went fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working untriaged
Projects
None yet
Development

No branches or pull requests

1 participant