Duplicate id in index #8788
Hi @bobrik. Is there any chance this index was written with Elasticsearch 1.2.0? Please could you provide the output of this request:
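(The request itself was lost in extraction; judging by the response below, with its _explanation block and _routing field, it was presumably something along these lines. The id is taken from the hits below; everything else is an assumption:)

```
curl -s 'localhost:9200/statistics-20141110/_search?pretty' -d '{
  "explain": true,
  "fields": ["_routing"],
  "query": { "ids": { "values": ["1jC2LxTjTMS1KHCn0Prf1w"] } }
}'
```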
Routing is automatically inferred from @key:
{
"took" : 1744,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [ {
"_shard" : 3,
"_node" : "YOK_20U7Qee-XSasg0J8VA",
"_index" : "statistics-20141110",
"_type" : "events",
"_id" : "1jC2LxTjTMS1KHCn0Prf1w",
"_score" : 1.0,
"_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
"fields" : {
"_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
},
"_explanation" : {
"value" : 1.0,
"description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
"details" : [ {
"value" : 1.0,
"description" : "boost"
}, {
"value" : 1.0,
"description" : "queryNorm"
} ]
}
}, {
"_shard" : 3,
"_node" : "YOK_20U7Qee-XSasg0J8VA",
"_index" : "statistics-20141110",
"_type" : "events",
"_id" : "1jC2LxTjTMS1KHCn0Prf1w",
"_score" : 1.0,
"_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
"fields" : {
"_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
},
"_explanation" : {
"value" : 1.0,
"description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
"details" : [ {
"value" : 1.0,
"description" : "boost"
}, {
"value" : 1.0,
"description" : "queryNorm"
} ]
}
} ]
}
}
The index was created on 1.3.4; we upgraded from 1.0.1 to 1.3.2 on 2014-09-22.
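(In 1.x, routing could be derived from a document field via the _routing path setting in the type mapping. Since _routing always equals @key in the hits above, the mapping presumably contained a fragment like the following; the actual mapping was never shown in the thread, so this is an assumption:)

```
{
  "events": {
    "_routing": { "path": "@key" }
  }
}
```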
Hi @bobrik. Hmm, these two docs are on the same shard! Do you ever run updates on these docs? Could you send the output of this command, please?
Of course they are, that's how routing works :) I didn't run any updates, because my code only does indexing. It doesn't even know the ids that Elasticsearch assigns.
{
"took" : 51,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [ {
"_shard" : 3,
"_node" : "YOK_20U7Qee-XSasg0J8VA",
"_index" : "statistics-20141110",
"_type" : "events",
"_id" : "1jC2LxTjTMS1KHCn0Prf1w",
"_score" : 1.0,
"_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
"fields" : {
"_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
},
"_explanation" : {
"value" : 1.0,
"description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
"details" : [ {
"value" : 1.0,
"description" : "boost"
}, {
"value" : 1.0,
"description" : "queryNorm"
} ]
}
}, {
"_shard" : 3,
"_node" : "YOK_20U7Qee-XSasg0J8VA",
"_index" : "statistics-20141110",
"_type" : "events",
"_id" : "1jC2LxTjTMS1KHCn0Prf1w",
"_score" : 1.0,
"_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
"fields" : {
"_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
},
"_explanation" : {
"value" : 1.0,
"description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
"details" : [ {
"value" : 1.0,
"description" : "boost"
}, {
"value" : 1.0,
"description" : "queryNorm"
} ]
}
} ]
}
}
Sorry @bobrik - I gave you the wrong request, it should be:
And so you're using auto-assigned IDs? Did any of your shards migrate to other nodes, or did a primary fail during optimization?
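(The corrected request is also missing from the extraction; since the second response below carries a _version field, it presumably just added version to the search body, roughly:)

```
curl -s 'localhost:9200/statistics-20141110/_search?pretty' -d '{
  "explain": true,
  "version": true,
  "fields": ["_routing"],
  "query": { "ids": { "values": ["1jC2LxTjTMS1KHCn0Prf1w"] } }
}'
```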
{
"took" : 46,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [ {
"_shard" : 3,
"_node" : "YOK_20U7Qee-XSasg0J8VA",
"_index" : "statistics-20141110",
"_type" : "events",
"_id" : "1jC2LxTjTMS1KHCn0Prf1w",
"_score" : 1.0,
"_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
"fields" : {
"_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
},
"_explanation" : {
"value" : 1.0,
"description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
"details" : [ {
"value" : 1.0,
"description" : "boost"
}, {
"value" : 1.0,
"description" : "queryNorm"
} ]
}
}, {
"_shard" : 3,
"_node" : "YOK_20U7Qee-XSasg0J8VA",
"_index" : "statistics-20141110",
"_type" : "events",
"_id" : "1jC2LxTjTMS1KHCn0Prf1w",
"_score" : 1.0,
"_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
"fields" : {
"_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
},
"_explanation" : {
"value" : 1.0,
"description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
"details" : [ {
"value" : 1.0,
"description" : "boost"
}, {
"value" : 1.0,
"description" : "queryNorm"
} ]
}
} ]
}
}
I bet you wanted this:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [ {
"_shard" : 3,
"_node" : "YOK_20U7Qee-XSasg0J8VA",
"_index" : "statistics-20141110",
"_type" : "events",
"_id" : "1jC2LxTjTMS1KHCn0Prf1w",
"_version" : 1,
"_score" : 1.0,
"_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
"fields" : {
"_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
},
"_explanation" : {
"value" : 1.0,
"description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
"details" : [ {
"value" : 1.0,
"description" : "boost"
}, {
"value" : 1.0,
"description" : "queryNorm"
} ]
}
}, {
"_shard" : 3,
"_node" : "YOK_20U7Qee-XSasg0J8VA",
"_index" : "statistics-20141110",
"_type" : "events",
"_id" : "1jC2LxTjTMS1KHCn0Prf1w",
"_version" : 1,
"_score" : 1.0,
"_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
"fields" : {
"_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
},
"_explanation" : {
"value" : 1.0,
"description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
"details" : [ {
"value" : 1.0,
"description" : "boost"
}, {
"value" : 1.0,
"description" : "queryNorm"
} ]
}
} ]
}
}
There were many migrations, but not during optimization, unless ES moves shards after a new index is created. Basically, at 00:00 a new index is created, and at 00:45 optimization of the old indices starts.
Do you have client nodes that are pre-1.3.3?
@s1monw the index was created on 1.3.4:
Nov 11 is definitely after Sep 30, so it shouldn't be #7729. We don't have client nodes; everything goes over HTTP. But yeah, we use bulk indexing and automatically assigned ids.
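(For illustration: a bulk request with auto-assigned ids simply omits _id from the action line. A minimal sketch, reusing a document from the output above:)

```
curl -s -XPOST 'localhost:9200/statistics-20141110/_bulk' --data-binary '{"index":{"_type":"events"}}
{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"}
'
```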
Hi @bobrik (you guessed right). OK, we're going to need more info. Please could you send:
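(The list of requests was lost in extraction, but judging by the two responses that follow, they were presumably the index settings and segments APIs:)

```
curl -s 'localhost:9200/statistics-20141110/_settings?pretty'
curl -s 'localhost:9200/statistics-20141110/_segments?pretty'
```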
{
"statistics-20141110" : {
"settings" : {
"index" : {
"codec" : {
"bloom" : {
"load" : "false"
}
},
"uuid" : "JZXC-8C3TFC71EnMGMHSWw",
"number_of_replicas" : "0",
"number_of_shards" : "5",
"version" : {
"created" : "1030499"
}
}
}
}
}
{
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"indices" : {
"statistics-20141110" : {
"shards" : {
"0" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "hBg3FpLGQw6B9l-Hil2c8Q"
},
"num_committed_segments" : 2,
"num_search_segments" : 2,
"segments" : {
"_gga" : {
"generation" : 21322,
"num_docs" : 14939669,
"deleted_docs" : 0,
"size_in_bytes" : 1729206228,
"memory_in_bytes" : 4943008,
"committed" : true,
"search" : true,
"version" : "4.9.0",
"compound" : false
},
"_isc" : {
"generation" : 24348,
"num_docs" : 10913518,
"deleted_docs" : 0,
"size_in_bytes" : 1254410507,
"memory_in_bytes" : 4101712,
"committed" : true,
"search" : true,
"version" : "4.9.0",
"compound" : false
}
}
} ],
"1" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "ajMe-w2lSIO0Tz5WEUs4qQ"
},
"num_committed_segments" : 2,
"num_search_segments" : 2,
"segments" : {
"_7i7" : {
"generation" : 9727,
"num_docs" : 7023269,
"deleted_docs" : 0,
"size_in_bytes" : 803299557,
"memory_in_bytes" : 2264472,
"committed" : true,
"search" : true,
"version" : "4.9.0",
"compound" : false
},
"_i01" : {
"generation" : 23329,
"num_docs" : 14689581,
"deleted_docs" : 0,
"size_in_bytes" : 1659303375,
"memory_in_bytes" : 4788872,
"committed" : true,
"search" : true,
"version" : "4.9.0",
"compound" : false
}
}
} ],
"2" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "hyUu93q7SRehHBVZfSmvOg"
},
"num_committed_segments" : 2,
"num_search_segments" : 2,
"segments" : {
"_9wx" : {
"generation" : 12849,
"num_docs" : 8995444,
"deleted_docs" : 0,
"size_in_bytes" : 1035711205,
"memory_in_bytes" : 3326288,
"committed" : true,
"search" : true,
"version" : "4.9.0",
"compound" : false
},
"_il1" : {
"generation" : 24085,
"num_docs" : 13205585,
"deleted_docs" : 0,
"size_in_bytes" : 1510021893,
"memory_in_bytes" : 4343736,
"committed" : true,
"search" : true,
"version" : "4.9.0",
"compound" : false
}
}
} ],
"3" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "hyUu93q7SRehHBVZfSmvOg"
},
"num_committed_segments" : 2,
"num_search_segments" : 2,
"segments" : {
"_8pc" : {
"generation" : 11280,
"num_docs" : 10046395,
"deleted_docs" : 0,
"size_in_bytes" : 1143637974,
"memory_in_bytes" : 4003824,
"committed" : true,
"search" : true,
"version" : "4.9.0",
"compound" : false
},
"_hwt" : {
"generation" : 23213,
"num_docs" : 13226096,
"deleted_docs" : 0,
"size_in_bytes" : 1485110397,
"memory_in_bytes" : 4287544,
"committed" : true,
"search" : true,
"version" : "4.9.0",
"compound" : false
}
}
} ],
"4" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "hyUu93q7SRehHBVZfSmvOg"
},
"num_committed_segments" : 2,
"num_search_segments" : 2,
"segments" : {
"_91i" : {
"generation" : 11718,
"num_docs" : 8328558,
"deleted_docs" : 0,
"size_in_bytes" : 953452801,
"memory_in_bytes" : 2822712,
"committed" : true,
"search" : true,
"version" : "4.9.0",
"compound" : false
},
"_hms" : {
"generation" : 22852,
"num_docs" : 14848927,
"deleted_docs" : 0,
"size_in_bytes" : 1673336536,
"memory_in_bytes" : 4777472,
"committed" : true,
"search" : true,
"version" : "4.9.0",
"compound" : false
}
}
} ]
}
}
}
}
If a bulk index request fails due to a disconnect, an unavailable shard, etc., the request is retried once before actually failing. However, even in case of failure the documents might already be indexed. For autogenerated ids the request must not add the documents again, and therefore canHaveDuplicates must be set to true. closes elastic#8788
Reopening because the test added with #9125 just failed and the failure is reproducible (roughly 1 in 10 runs with the same seed and added stress), see http://build-us-00.elasticsearch.org/job/es_core_master_window-2012/725/
When an indexing request is retried (due to a lost connection, node closed, etc.), a flag 'canHaveDuplicates' is set to true on the indexing request that is sent the second time. This was to make sure that even when an indexing request for a document with an autogenerated id comes in, we do not have to update unless this flag is set, and can instead only append. However, it might happen that, for a retry or for replication, the indexing request that has canHaveDuplicates set to true (the retried request) arrives at the destination before the original request that has it set to false. In this case both requests add a document, and we have duplicated the document. This commit adds a workaround: remove the optimization for autogenerated ids and always update the document. The assumption is that this will not slow down indexing by more than 10 percent, see: http://benchmarks.elasticsearch.org/ closes elastic#8788
This PR removes the optimization for autogenerated ids. Previously, when ids were autogenerated by Elasticsearch, there was no check whether a document with the same id already existed; instead, the new document was simply appended. However, due to Lucene improvements this optimization does not add much value. In addition, under rare circumstances it might cause duplicate documents: when an indexing request is retried (due to a lost connection, node closed, etc.), a flag 'canHaveDuplicates' is set to true on the indexing request that is sent the second time. This was to make sure that even when an indexing request for a document with an autogenerated id comes in, we do not have to update unless this flag is set, and can instead only append. However, it might happen that, for a retry or for replication, the indexing request that has canHaveDuplicates set to true (the retried request) arrives at the destination before the original request that has it set to false. In this case both requests add a document, and we have duplicated the document. This commit adds a workaround: remove the optimization for autogenerated ids and always update the document. The assumption is that this will not slow down indexing by more than 10 percent, see: http://benchmarks.elasticsearch.org/ closes #8788 closes #9468
We've just seen this issue for the second time. The first time produced only a single duplicate; this time produced over 16,000, across a comparatively tiny index (< 300k docs). We're using 1.3.4, doing bulk indexing via the Java client API. However, we're not using autogenerated ids, so from my reading of the fix for this issue it's unlikely to help us. Should I open a separate issue, or should this one be reopened? Miscellaneous other info:
@mrec It would be great if you could open a new issue. Please also add a query that finds duplicates together with
Is there a way you can make the Elasticsearch logs from the time when you had the network issues available?
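(For reference, one way to hunt for duplicates is a terms aggregation on _uid with min_doc_count: 2. This is only a sketch: it assumes fielddata can be built on _uid, which can be very memory-hungry on a large index:)

```
curl -s 'localhost:9200/statistics-20141110/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "duplicate_ids": {
      "terms": { "field": "_uid", "min_doc_count": 2, "size": 100 }
    }
  }
}'
```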
Original issue description from @bobrik:
I decided to reindex my data to take advantage of doc_values, but one of 30 indices (~120m docs in each) ended up with fewer documents after reindexing. I reindexed again and docs disappeared again. Then I bisected the problem to specific docs and found that some docs in the source index have duplicate ids.
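(Enabling doc_values in 1.x meant reindexing with a mapping along these lines; the destination index name and the field names here are assumptions, borrowed from the documents shown earlier in the thread:)

```
curl -XPUT 'localhost:9200/statistics-20141110-reindexed/events/_mapping' -d '{
  "events": {
    "properties": {
      "@key":   { "type": "string", "index": "not_analyzed", "doc_values": true },
      "@value": { "type": "string", "index": "not_analyzed", "doc_values": true }
    }
  }
}'
```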
Here are two indices, source and destination:
Segments of the problematic index:
The only thing that happened to the index besides indexing was optimizing it down to 2 segments per shard.
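(Presumably via the 1.x optimize API, something like:)

```
curl -XPOST 'localhost:9200/statistics-20141110/_optimize?max_num_segments=2'
```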