[BUG] Training step crashes with out any error for FAISS IVF with quantisation step #465

layavadi · 2024-10-24T05:29:14Z

What is the bug?

While training with the Cohere 100k corpus, 30k training vectors, and the pq encoder (m=8, code_size=8), the training step fails silently without any error.

Running train-knn-model [ 0% done]
[ERROR] Cannot execute-test. Error in load generator [0]
Cannot run task [train-knn-model]: Failed to create model: test-model within 100 retries

How can one reproduce the bug?

PARAM file used in the benchmark
cat train-faiss-cohere-100k-768-ip.json
{
"target_index_name": "target_index",
"target_field_name": "target_field",
"target_index_body": "indices/faiss-index.json",
"target_index_primary_shards": 1,
"target_index_replica_shards": 0,
"target_index_dimension": 768,
"target_index_space_type": "innerproduct",
"target_index_bulk_size": 100,
"target_index_bulk_index_data_set_format": "hdf5",
"target_index_bulk_index_data_set_corpus": "cohere-100k",
"target_index_bulk_indexing_clients": 10,

"train_index_name": "train_index",
"train_field_name": "train_field",
"train_method_engine": "faiss",
"train_index_body": "indices/train-index.json",
"train_index_primary_shards": 1,
"train_index_replica_shards": 0,

"train_index_bulk_size": 100,
"train_index_bulk_index_data_set_format": "hdf5",
"train_index_bulk_index_data_set_corpus": "cohere-100k",
"train_index_bulk_indexing_clients": 5,
"train_index_num_vectors": 30000,
"nlist": 128,
"nprobes": 128,
"pq_encoder_code_size": 8,
"pq_encoder_m": 8,
"encoder": "pq",

"train_model_id": "test-model",
"train_operation_retries": 100,
"train_operation_poll_period": 0.5,
"train_search_size": 10000,

"target_index_max_num_segments": 1,
"target_index_force_merge_timeout": 300,
"hnsw_ef_search": 100,
"hnsw_ef_construction": 100,
"query_k": 100,
"query_body": {
     "docvalue_fields" : ["_id"],
     "stored_fields" : "_none_"
},

"query_data_set_format": "hdf5",
"query_data_set_corpus": "cohere-100k",
"query_count": 10000

}
Command used to run the benchmark:

opensearch-benchmark execute-test --target-hosts ${ENDPOINT} --workload vectorsearch --workload-params ${PARAMS_FILE} --pipeline benchmark-only --test-procedure train-test --kill-running-processes

Also set the following k-NN setting:
"knn.model.cache.size.limit": "25%"

What is the expected behavior?

Training should complete without any failure.

What is your host/environment?

Running on an r6i.4xlarge node (single data node)
with a pod
opensearchJavaOpts: "-Xmx12G -Xms12G"
resources:
  requests:
    cpu: "2000m"
    memory: "8Gi"
  limits:
    memory: "32Gi"
    cpu: "4"
OS version 15.0

layavadi · 2024-10-24T10:27:34Z

The failure was caused by a timeout: the default polling period was not sufficient. After increasing the polling period, training completed successfully.
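For anyone hitting the same message, here is a rough sketch of why the retry budget runs out. It assumes the benchmark checks the model state once per retry and sleeps for the poll period in between; the actual workload code is not shown in this issue, so the function names, the `get_state` callback, and the state strings below are hypothetical. With the values from the params file above (`train_operation_retries` = 100, `train_operation_poll_period` = 0.5), the wait budget is only about 50 seconds.

```python
import time
from typing import Callable


def wait_budget_seconds(retries: int, poll_period: float) -> float:
    """Approximate upper bound on how long a poll loop waits for training."""
    return retries * poll_period


def wait_for_model(
    get_state: Callable[[], str],  # hypothetical callback returning the model state
    retries: int,
    poll_period: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the model is ready; return False if the retries run out.

    The state names "created" and "failed" are assumptions made for this
    sketch, not taken from the benchmark source.
    """
    for _ in range(retries):
        state = get_state()
        if state == "created":
            return True
        if state == "failed":
            raise RuntimeError("model training failed")
        sleep(poll_period)
    return False


# Values from the params file in this issue: roughly 50 seconds in total.
print(wait_budget_seconds(100, 0.5))
# Illustrative larger poll period (not necessarily the value used for the fix).
print(wait_budget_seconds(100, 5.0))
```

If training takes longer than the budget, the loop gives up even though training may still be running on the cluster, which would match the "within 100 retries" message. Raising either the poll period or the retry count extends the budget.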

layavadi added the bug (Something isn't working) and untriaged labels Oct 24, 2024

layavadi changed the title ~~[BUG] Training step crashes with out any error for FAISS IVF with quantisation strep~~ [BUG] Training step crashes with out any error for FAISS IVF with quantisation step Oct 24, 2024

layavadi closed this as completed Oct 24, 2024
