[FEATURE] Add z-score for the normalization processor #376 #470
base: feature/z-score-normalization
Conversation
Signed-off-by: Samuel Herman <[email protected]>
Reopening the PR previously opened at #468, but this time against the feature branch instead of main.
Hi @navneet1v @martin-gaievski @heemin32, this is the new PR, opened this time against the feature branch. Feel free to continue providing your feedback here, as I closed the original PR.
Couple of generic things:
- Did you test the accuracy and performance of your solution? For our implementation we used the BEIR challenge framework with some custom scripts; ideally the results should look something like those in the blog post where the feature was announced.
- Please fix all CI checks; you can simulate them by running gradle check locally.
new TopDocs(new TotalHits(0, TotalHits.Relation.EQUAL_TO), new ScoreDoc[0]),
new TopDocs(
    new TotalHits(3, TotalHits.Relation.EQUAL_TO),
    new ScoreDoc[] { new ScoreDoc(3, 0.98058068f), new ScoreDoc(4, 0.39223227f), new ScoreDoc(2, -1.37281295f) }
Can you please add a simple formula or a method as part of the code comments, so we can understand how that score is calculated from the provided individual scores? Having a reference to a method description is good, but not the same. Something like what you added for the integ test assertions would be good.
done
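For reference, a minimal sketch of how such z-score values can be derived from the raw subquery scores. This is illustrative only; the method name and the example raw scores below are assumptions, not taken from this PR.

```java
// Illustrative z-score normalization for a single subquery's scores:
// z = (score - mean) / standardDeviation, using the population standard deviation.
static float[] zScoreNormalize(float[] scores) {
    float mean = 0f;
    for (float score : scores) {
        mean += score;
    }
    mean /= scores.length;

    float sumSquaredDeviations = 0f;
    for (float score : scores) {
        sumSquaredDeviations += (score - mean) * (score - mean);
    }
    float stdDev = (float) Math.sqrt(sumSquaredDeviations / scores.length);

    float[] normalized = new float[scores.length];
    for (int i = 0; i < scores.length; i++) {
        normalized[i] = (scores[i] - mean) / stdDev;
    }
    return normalized;
}
```

For example, hypothetical raw scores of 0.7, 0.5, and -0.1 have mean ≈ 0.3667 and population standard deviation ≈ 0.3399, which yields z-scores of ≈ 0.9806, 0.3922, and -1.3728, consistent with the values asserted in the snippet above.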
@martin-gaievski my results look as follows; overall, z-score is showing the best numbers, even among the hybrid query results. For BM25 as the baseline:
For neural search:
For min-max hybrid (weights 0.4, 0.3, 0.3):
For z-score hybrid (weights 0.4, 0.3, 0.3):
Note: Edited to reformat the results into a table
@samuel-oci That looks reasonable. Can you please add more info:
Sure @martin-gaievski
Dataset: Scifact
Scripts:
PORT=50365
HOST=localhost
URL="$HOST:$PORT"
curl -XPUT -H "Content-Type: application/json" $URL/_ingest/pipeline/nlp-pipeline -d '
{
"description": "An example neural search pipeline",
"processors" : [
{
"text_embedding": {
"model_id": "AXA30IsByAqY8FkWHdIF",
"field_map": {
"passage_text": "passage_embedding"
}
}
}
]
}'
curl -XDELETE $URL/scifact
curl -XPUT -H "Content-Type: application/json" $URL/scifact -d '
{
"settings": {
"index.knn": true,
"default_pipeline": "nlp-pipeline"
},
"mappings": {
"properties": {
"passage_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name":"hnsw",
"engine":"lucene",
"space_type": "l2",
"parameters":{
"m":16,
"ef_construction": 512
}
}
},
"passage_text": {
"type": "text"
},
"passage_key": {
"type": "text"
},
"passage_title": {
"type": "text"
}
}
}
}'
curl -XPUT -H "Content-Type: application/json" $URL/_search/pipeline/norm-minmax-pipeline-hybrid -d '
{
"description": "Post processor for hybrid search",
"phase_results_processors": [
{
"normalization-processor": {
"normalization": {
"technique": "min_max"
},
"combination": {
"technique": "arithmetic_mean",
"parameters": {
"weights": [
0.4,
0.3,
0.3
]
}
}
}
}
]
}'
curl -XPUT -H "Content-Type: application/json" $URL/_search/pipeline/norm-zscore-pipeline-hybrid -d '
{
"description": "Post processor for hybrid search",
"phase_results_processors": [
{
"normalization-processor": {
"normalization": {
"technique": "z_score"
},
"combination": {
"technique": "arithmetic_mean",
"parameters": {
"weights": [
0.4,
0.3,
0.3
]
}
}
}
}
]
}'
To use later with:
PORT=50365
MODEL_ID="AXA30IsByAqY8FkWHdIF"
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=ingest
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=evaluate --method=bm25
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=evaluate --method=neural --pipelines=norm-minmax-pipeline --os_model_id=$MODEL_ID
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=evaluate --method=hybrid --pipelines=norm-minmax-pipeline-hybrid --os_model_id=$MODEL_ID
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=evaluate --method=hybrid --pipelines=norm-zscore-pipeline-hybrid --os_model_id=$MODEL_ID
Signed-off-by: Samuel Herman <[email protected]>
Force-pushed from c5ad3c8 to 8d7c3d9
Signed-off-by: samuel-oci <[email protected]>
FYI: I noticed a few of the IT tests (which were not changed in this PR) are broken after merging with the upstream feature branch.
@samuel-oci thank you for sharing the details of the benchmark. It's not exactly what we used to run benchmarks on our side. Is it possible for you to adjust some things and run one more round? This way we can compare your numbers apples to apples with those we got before. Here is the list of what needs to be adjusted:
Sure thing @martin-gaievski, I reproduced the results with the settings you suggested above and used trec-covid as the dataset. Moreover, I included L2 normalization this time as well, as an additional reference alongside min-max and z-score. Got the following results:
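For reference, the three normalization techniques being compared rescale each subquery's score list roughly as follows (standard definitions, stated here for context rather than quoted from this thread):

$$\text{min-max: } s' = \frac{s - s_{\min}}{s_{\max} - s_{\min}}, \qquad \text{L2: } s' = \frac{s}{\sqrt{\sum_j s_j^2}}, \qquad \text{z-score: } s' = \frac{s - \mu}{\sigma}$$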
Signed-off-by: Samuel Herman <[email protected]>
Codecov Report
Attention:
Additional details and impacted files
@@ Coverage Diff @@
##          feature/z-score-normalization    #470      +/-   ##
===================================================================
- Coverage                         84.37%   84.34%   -0.03%
- Complexity                          498      523      +25
===================================================================
  Files                                40       41       +1
  Lines                              1491     1559      +68
  Branches                            228      247      +19
===================================================================
+ Hits                               1258     1315      +57
- Misses                              133      138       +5
- Partials                            100      106       +6

☔ View full report in Codecov by Sentry.
@samuel-oci from the data you've provided, we need more information/datapoints to understand z-score performance better. Can you please run the same test for the other datasets mentioned in the blog https://opensearch.org/blog/hybrid-search/? The idea is to find whether z-score performs better than the existing techniques. This is the list of datasets we used:
I think DBPedia can be a problem due to its large size (longest time to ingest data), so you can skip it; the rest should be doable. Another point: I reviewed the configuration I shared with you previously, and there is one adjustment you'll need to make. This is the mapping we used in our benchmarking:
There are 12 shards in that configuration, but we also used 3 data nodes, so I've set the number of shards to 4. If you want to recreate our setup exactly, then you need 12 shards on 3 nodes. But in our case we were doing that to measure latencies, and it seems you have good numbers there, so it's not a concern. We're working on adding all these details to a separate issue to formalize the intake process for new techniques; that's a work in progress now: #444
@martin-gaievski, while the PR is mostly trivial, I think the main issue I see here is actually the time and effort it takes to set up and reproduce results for what are overall quite small datasets and workloads, which shouldn't require an external environment. Regarding the benchmark itself, we would also need to change the combiner logic to something more z-score friendly; the current combination techniques have some limitations because they only support scores greater than 0. Here is an example for the same benchmark on the scifact dataset; this time I also added a combiner that can take negative values into account for z-score in the arithmetic mean (a proper z-score combiner should not be an arithmetic mean, with or without negatives, but we can use it as an approximation for now).
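As a rough illustration of the kind of combiner change described above, here is a hypothetical sketch of a weighted arithmetic mean that does not assume non-negative inputs; the method name and the null-handling are made up for illustration and are not the combiner actually used in this benchmark.

```java
// Hypothetical weighted arithmetic mean over normalized subquery scores that allows
// negative inputs (as produced by z-score), instead of assuming scores >= 0.
// A null entry means the document was not returned by that subquery and is skipped.
static float combineWeightedMean(Float[] normalizedScores, float[] weights) {
    float weightedSum = 0f;
    float weightTotal = 0f;
    for (int i = 0; i < normalizedScores.length; i++) {
        if (normalizedScores[i] == null) {
            continue;
        }
        weightedSum += weights[i] * normalizedScores[i];
        weightTotal += weights[i];
    }
    return weightTotal == 0f ? 0f : weightedSum / weightTotal;
}
```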
@samuel-oci I agree that testing for such a change is the main effort timewise, but it is an absolute necessity. The main point is that we need a strong case for why we're adding it. My view is that what we have now is a baseline, and anything we add after that should be compared against it and added if it works better for some/all cases. As soon as it becomes part of the codebase it can be used by any customer, and it's a maintainer's responsibility to respond to requests like: "is/when is this technique better than technique X?" The main datapoint we're looking for now is: how are z-score results better/worse than min-max and l2 on different datasets? For scifact it was shown before that z-score gives better NDCG, so that isn't a new finding. I would love to see results for the other datasets. Did you check how the combiner for negative scores affects the scores from min-max and l2? If those scores remain the same we can make such a combiner the default one, or you can make it configurable if z-score only shows good results with that combiner.
Hi @martin-gaievski, that makes sense to me. If I understand your point, there are two things you are looking for:
For how many datasets should we benchmark this? Or is it a more fluid limit of as long as it takes to find the answers to the previous two questions?
I didn't check the new combiner code on l2 and min-max yet; I was hoping to contribute in parts and leave the combiner for later.
Signed-off-by: Samuel Herman <[email protected]>
@samuel-oci are you planning to continue work on this feature? Checking since there has been no activity for the last couple of weeks.
@martin-gaievski yes, I just added the commits that include the scripts used for testing as well; hopefully those will benefit others too when using BEIR for neural-search testing. It's been a bit of a challenge to get my hands on hardware that will allow me to run the tests on the larger datasets in BEIR, but I think I should be able to get those numbers soon.
@samuel-oci is this still being worked on?
Hi @samuel-oci, is this still being worked on?
No, not working on this anymore; feel free to close this.
Next action items: deep dive needed on search relevance, and perform benchmarking on some more datasets.
Description
This change implements #376
Issues Resolved
Resolving #376
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.