
Duplicate id in index #8788

Closed
bobrik opened this issue Dec 5, 2014 · 14 comments
Assignees
Labels
>bug :Core/Infra/Core Core issues without another label critical :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v1.4.3 v1.5.0 v2.0.0-beta1

Comments

@bobrik
Contributor

bobrik commented Dec 5, 2014

I decided to reindex my data to take advantage of doc_values, but one of 30 indices (~120m docs each) ended up with fewer documents after reindexing. I reindexed again, and docs disappeared again.

Then I bisected the problem down to specific docs and found that some docs in the source index have duplicate ids.

curl -s "http://web245:9200/statistics-20141110/_search?pretty&q=_id:1jC2LxTjTMS1KHCn0Prf1w"
{
  "took" : 1156,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"}
    }, {
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"}
    } ]
  }
}

Here are two indices, source and destination:

health status index                  pri rep docs.count docs.deleted store.size pri.store.size
green  open   statistics-20141110      5   0  116217042            0     12.3gb         12.3gb
green  open   statistics-20141110-dv   5   1  116216507            0     32.3gb         16.1gb

Segments of problematic index:

index               shard prirep ip            segment generation docs.count docs.deleted    size size.memory committed searchable version compound
statistics-20141110 0     p      192.168.0.190 _gga         21322   14939669            0   1.6gb     4943008 true      true       4.9.0   false
statistics-20141110 0     p      192.168.0.190 _isc         24348   10913518            0   1.1gb     4101712 true      true       4.9.0   false
statistics-20141110 1     p      192.168.0.245 _7i7          9727    7023269            0   766mb     2264472 true      true       4.9.0   false
statistics-20141110 1     p      192.168.0.245 _i01         23329   14689581            0   1.5gb     4788872 true      true       4.9.0   false
statistics-20141110 2     p      192.168.1.212 _9wx         12849    8995444            0 987.7mb     3326288 true      true       4.9.0   false
statistics-20141110 2     p      192.168.1.212 _il1         24085   13205585            0   1.4gb     4343736 true      true       4.9.0   false
statistics-20141110 3     p      192.168.1.212 _8pc         11280   10046395            0     1gb     4003824 true      true       4.9.0   false
statistics-20141110 3     p      192.168.1.212 _hwt         23213   13226096            0   1.3gb     4287544 true      true       4.9.0   false
statistics-20141110 4     p      192.168.2.88  _91i         11718    8328558            0 909.2mb     2822712 true      true       4.9.0   false
statistics-20141110 4     p      192.168.2.88  _hms         22852   14848927            0   1.5gb     4777472 true      true       4.9.0   false

The only thing that happened to this index besides indexing was optimizing it down to 2 segments per shard.
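The doc-count gap above (116,217,042 in the source vs 116,216,507 after reindex, i.e. 535 fewer docs) is exactly what collapsing duplicate _ids on reindex would produce. As a hypothetical offline check (assuming you have dumped (_id, _source) pairs, e.g. via a scan/scroll; `find_duplicate_ids` is not part of any client library), duplicates can be counted like this:

```python
from collections import Counter

def find_duplicate_ids(hits):
    """hits: iterable of (_id, _source) pairs, e.g. from a scan/scroll dump."""
    counts = Counter(doc_id for doc_id, _ in hits)
    # keep only ids that occur in more than one document
    return {doc_id: n for doc_id, n in counts.items() if n > 1}

hits = [
    ("1jC2LxTjTMS1KHCn0Prf1w", {"@value": "149"}),
    ("1jC2LxTjTMS1KHCn0Prf1w", {"@value": "149"}),
    ("someOtherId", {"@value": "7"}),
]
assert find_duplicate_ids(hits) == {"1jC2LxTjTMS1KHCn0Prf1w": 2}
```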

@clintongormley
Contributor

Hi @bobrik

Is there any chance this index was written with Elasticsearch 1.2.0?

Please could you provide the output of this request:

curl -s "http://web245:9200/statistics-20141110/_search?pretty&q=_id:1jC2LxTjTMS1KHCn0Prf1w&explain&fields=_source,_routing"

@bobrik
Contributor Author

bobrik commented Dec 9, 2014

Routing is automatically inferred from @key

{
  "took" : 1744,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    }, {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    } ]
  }
}

The index was created on 1.3.4; we upgraded from 1.0.1 to 1.3.2 on 2014-09-22.

@clintongormley
Contributor

Hi @bobrik

Hmm, these two docs are on the same shard! Do you ever run updates on these docs? Could you send the output of this command please?

curl -s "http://web245:9200/statistics-20141110/_search?pretty&q=_id:1jC2LxTjTMS1KHCn0Prf1w&explain&fields=_source,_routing,_version"

@bobrik
Contributor Author

bobrik commented Dec 9, 2014

Of course they are, that's how routing works :)

I didn't run any updates, because my code only does indexing. It doesn't even know the ids that elasticsearch assigns.

{
  "took" : 51,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    }, {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    } ]
  }
}
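Identical routing values always land on the same shard because ES 1.x picks the shard as hash(routing) modulo number_of_shards (the default hash in 1.x is DjbHashFunction). A rough illustration, using plain djb2 as a stand-in hash; the exact bucket numbers below are not what ES computes, only the determinism is the point:

```python
def djb2(s: str) -> int:
    # djb2 string hash, a stand-in for ES 1.x's DjbHashFunction
    h = 5381
    for c in s:
        h = (h * 33 + ord(c)) & 0xFFFFFFFF
    return h

def shard_for(routing: str, number_of_shards: int = 5) -> int:
    # same routing value -> same hash -> same shard, every time
    return djb2(routing) % number_of_shards

r = "client_belarussia_msg_sended_from_mutual__22_1"
assert shard_for(r) == shard_for(r)   # deterministic
assert 0 <= shard_for(r) < 5          # always a valid shard number
```

So two copies of a doc with the same routing can only ever be duplicates within one shard, never spread across shards.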

@clintongormley
Contributor

Sorry @bobrik - I gave you the wrong request, it should be:

curl -s "http://web245:9200/statistics-20141110/_search?pretty&q=_id:1jC2LxTjTMS1KHCn0Prf1w&explain&fields=_source,_routing,version"

And so you're using auto-assigned IDs? Did any of your shards migrate to other nodes, or did a primary fail during optimization?

@s1monw
Contributor

s1monw commented Dec 9, 2014

I think this is caused by #7729. @bobrik, are you coming from < 1.3.3 with this index, and are you using bulk?

@bobrik
Contributor Author

bobrik commented Dec 9, 2014

curl -s 'http://web605:9200/statistics-20141110/_search?pretty&q=_id:1jC2LxTjTMS1KHCn0Prf1w&explain&fields=_source,_routing,version'
{
  "took" : 46,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    }, {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    } ]
  }
}

I bet you wanted this:

curl -s 'http://web605:9200/statistics-20141110/_search?pretty&q=_id:1jC2LxTjTMS1KHCn0Prf1w&explain&fields=_source,_routing' -d '{"version":true}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_version" : 1,
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    }, {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_version" : 1,
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    } ]
  }
}

There were many migrations, but not during optimization, unless ES moves shards after a new index is created. Basically, at 00:00 a new index is created, and at 00:45 optimization of the old indices starts.

@s1monw
Contributor

s1monw commented Dec 9, 2014

Do you have client nodes that are pre-1.3.3?

@bobrik
Contributor Author

bobrik commented Dec 9, 2014

@s1monw the index was created on 1.3.4:

[2014-09-30 12:03:49,991][INFO ][node                     ] [statistics04] version[1.3.3], pid[17937], build[ddf796d/2014-09-29T13:39:00Z]
[2014-09-30 14:03:19,205][INFO ][node                     ] [statistics04] version[1.3.4], pid[89485], build[a70f3cc/2014-09-30T09:07:17Z]

Nov 11 is definitely after Sep 30. Shouldn't be #7729 then.

We don't have client nodes; everything is over HTTP. But yeah, we use bulk indexing and automatically assigned ids.
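For what it's worth, one general way to make bulk retries idempotent is to stop relying on auto-generated ids and derive the _id from the document content; a retried request then overwrites the same document instead of appending a duplicate. A hypothetical sketch (`doc_id` is my illustration, not part of any client library):

```python
import hashlib
import json

def doc_id(doc: dict) -> str:
    # Deterministic id from the canonicalized document body, so a retried
    # bulk request indexes to the same _id rather than creating a new doc.
    payload = json.dumps(doc, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:22]

doc = {"@timestamp": "2014-11-10T14:30:00+0300",
       "@key": "client_belarussia_msg_sended_from_mutual__22_1",
       "@value": "149"}
assert doc_id(doc) == doc_id(dict(doc))  # stable across retries
```

The trade-off is that genuinely identical events (same key, value, and timestamp) would also collapse into one document, which may or may not be what you want for statistics data.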

@clintongormley
Contributor

Hi @bobrik

(you guessed right about version=true :) )

OK - we're going to need more info. Please could you send:

curl -s 'http://web605:9200/statistics-20141110/_settings?pretty'
curl -s 'http://web605:9200/statistics-20141110/_segments?pretty'

@bobrik
Contributor Author

bobrik commented Dec 9, 2014

{
  "statistics-20141110" : {
    "settings" : {
      "index" : {
        "codec" : {
          "bloom" : {
            "load" : "false"
          }
        },
        "uuid" : "JZXC-8C3TFC71EnMGMHSWw",
        "number_of_replicas" : "0",
        "number_of_shards" : "5",
        "version" : {
          "created" : "1030499"
        }
      }
    }
  }
}
{
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "indices" : {
    "statistics-20141110" : {
      "shards" : {
        "0" : [ {
          "routing" : {
            "state" : "STARTED",
            "primary" : true,
            "node" : "hBg3FpLGQw6B9l-Hil2c8Q"
          },
          "num_committed_segments" : 2,
          "num_search_segments" : 2,
          "segments" : {
            "_gga" : {
              "generation" : 21322,
              "num_docs" : 14939669,
              "deleted_docs" : 0,
              "size_in_bytes" : 1729206228,
              "memory_in_bytes" : 4943008,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            },
            "_isc" : {
              "generation" : 24348,
              "num_docs" : 10913518,
              "deleted_docs" : 0,
              "size_in_bytes" : 1254410507,
              "memory_in_bytes" : 4101712,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            }
          }
        } ],
        "1" : [ {
          "routing" : {
            "state" : "STARTED",
            "primary" : true,
            "node" : "ajMe-w2lSIO0Tz5WEUs4qQ"
          },
          "num_committed_segments" : 2,
          "num_search_segments" : 2,
          "segments" : {
            "_7i7" : {
              "generation" : 9727,
              "num_docs" : 7023269,
              "deleted_docs" : 0,
              "size_in_bytes" : 803299557,
              "memory_in_bytes" : 2264472,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            },
            "_i01" : {
              "generation" : 23329,
              "num_docs" : 14689581,
              "deleted_docs" : 0,
              "size_in_bytes" : 1659303375,
              "memory_in_bytes" : 4788872,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            }
          }
        } ],
        "2" : [ {
          "routing" : {
            "state" : "STARTED",
            "primary" : true,
            "node" : "hyUu93q7SRehHBVZfSmvOg"
          },
          "num_committed_segments" : 2,
          "num_search_segments" : 2,
          "segments" : {
            "_9wx" : {
              "generation" : 12849,
              "num_docs" : 8995444,
              "deleted_docs" : 0,
              "size_in_bytes" : 1035711205,
              "memory_in_bytes" : 3326288,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            },
            "_il1" : {
              "generation" : 24085,
              "num_docs" : 13205585,
              "deleted_docs" : 0,
              "size_in_bytes" : 1510021893,
              "memory_in_bytes" : 4343736,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            }
          }
        } ],
        "3" : [ {
          "routing" : {
            "state" : "STARTED",
            "primary" : true,
            "node" : "hyUu93q7SRehHBVZfSmvOg"
          },
          "num_committed_segments" : 2,
          "num_search_segments" : 2,
          "segments" : {
            "_8pc" : {
              "generation" : 11280,
              "num_docs" : 10046395,
              "deleted_docs" : 0,
              "size_in_bytes" : 1143637974,
              "memory_in_bytes" : 4003824,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            },
            "_hwt" : {
              "generation" : 23213,
              "num_docs" : 13226096,
              "deleted_docs" : 0,
              "size_in_bytes" : 1485110397,
              "memory_in_bytes" : 4287544,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            }
          }
        } ],
        "4" : [ {
          "routing" : {
            "state" : "STARTED",
            "primary" : true,
            "node" : "hyUu93q7SRehHBVZfSmvOg"
          },
          "num_committed_segments" : 2,
          "num_search_segments" : 2,
          "segments" : {
            "_91i" : {
              "generation" : 11718,
              "num_docs" : 8328558,
              "deleted_docs" : 0,
              "size_in_bytes" : 953452801,
              "memory_in_bytes" : 2822712,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            },
            "_hms" : {
              "generation" : 22852,
              "num_docs" : 14848927,
              "deleted_docs" : 0,
              "size_in_bytes" : 1673336536,
              "memory_in_bytes" : 4777472,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            }
          }
        } ]
      }
    }
  }
}

brwe added a commit to brwe/elasticsearch that referenced this issue Jan 2, 2015
If a bulk index request fails due to a disconnect, an unavailable shard, etc., the request is
retried once before actually failing. However, even in case of failure the documents
might already be indexed. For autogenerated ids the retried request must not add the
documents again, and therefore canHaveDuplicates must be set to true.

closes elastic#8788
@brwe brwe closed this as completed in f45e6ae Jan 2, 2015
brwe added a commit that referenced this issue Jan 2, 2015
brwe added a commit that referenced this issue Jan 2, 2015
brwe added a commit that referenced this issue Jan 2, 2015
@brwe
Contributor

brwe commented Jan 2, 2015

Reopening because the test added with #9125 just failed and the failure is reproducible (about 1 in 10 runs with the same seed and added stress), see http://build-us-00.elasticsearch.org/job/es_core_master_window-2012/725/

@brwe brwe reopened this Jan 2, 2015
brwe added a commit to brwe/elasticsearch that referenced this issue Jan 28, 2015
When an indexing request is retried (due to a lost connection, node shutdown, etc.),
a flag 'canHaveDuplicates' is set to true on the indexing request
that is sent the second time. This was to make sure that even
when an indexing request for a document with an autogenerated id comes in,
we do not have to update unless this flag is set, and instead only append.

However, it might happen that, on a retry or on replication, the
indexing request that has canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that has it set to false.
In this case both requests add a document, and we end up with a duplicate.
This commit adds a workaround: remove the optimization for auto-generated
ids and always update the document.
The assumption is that this will not slow down indexing by more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes elastic#8788
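The race in that commit message can be modeled in a few lines. The toy engine below is my simplification, not the real InternalEngine: it appends blindly when canHaveDuplicates is false (the auto-generated-id optimization) and deletes-then-adds when it is true. If the retried request (flag true) overtakes the original (flag false), the blind append creates the duplicate:

```python
class ToyEngine:
    def __init__(self):
        self.docs = []  # (id, source) pairs, standing in for a Lucene index

    def index(self, doc_id, source, can_have_duplicates):
        if can_have_duplicates:
            # safe path: remove any existing doc with this id, then add
            self.docs = [(i, s) for i, s in self.docs if i != doc_id]
            self.docs.append((doc_id, source))
        else:
            # optimization: auto-generated id assumed fresh, blind append
            self.docs.append((doc_id, source))

engine = ToyEngine()
# retried request (canHaveDuplicates=True) arrives BEFORE the original (False):
engine.index("1jC2LxTjTMS1KHCn0Prf1w", {"@value": "149"}, can_have_duplicates=True)
engine.index("1jC2LxTjTMS1KHCn0Prf1w", {"@value": "149"}, can_have_duplicates=False)
assert len(engine.docs) == 2  # duplicated document
```

In the intended order (original first, retry second) the delete-then-add on the retry leaves a single copy, which is why the bug only shows up when the requests are reordered.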
brwe added a commit that referenced this issue Jan 29, 2015
This PR removes the optimization for auto-generated ids.
Previously, when ids were auto-generated by elasticsearch, there was no
check whether a document with the same id already existed; instead the new
document was simply appended. However, due to Lucene improvements this
optimization no longer adds much value. In addition, under rare circumstances it might
cause duplicate documents:

When an indexing request is retried (due to a lost connection, node shutdown, etc.),
a flag 'canHaveDuplicates' is set to true on the indexing request
that is sent the second time. This was to make sure that even
when an indexing request for a document with an autogenerated id comes in,
we do not have to update unless this flag is set, and instead only append.

However, it might happen that, on a retry or on replication, the
indexing request that has canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that has it set to false.
In this case both requests add a document, and we end up with a duplicate.
This commit adds a workaround: remove the optimization for auto-generated
ids and always update the document.
The assumption is that this will not slow down indexing by more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes #8788
closes #9468
brwe added a commit that referenced this issue Jan 29, 2015
@brwe brwe closed this as completed in 0a07ce8 Jan 29, 2015
brwe added a commit that referenced this issue Feb 4, 2015
@mrec

mrec commented Apr 8, 2015

We've just seen this issue for the second time. The first time produced only a single duplicate; this time produced over 16000, across a comparatively tiny index (< 300k docs). We're using 1.3.4, doing bulk indexing with the Java client API's BulkProcessor and TransportClient.

However, we're not using autogenerated ids, so from my reading of the fix for this issue it's unlikely to help us. Should I open a separate issue, or should this one be reopened?

Miscellaneous other info:

  • The index has not been migrated from an earlier version.
  • Around the time the duplicates appeared, we saw problems in other (non-Elastic) parts of the system. I can't see any way that they could directly cause the duplication, but it's possible that network issues were the common cause of both.
  • We still have the index containing duplicates for now, though it may not last long; this is on an alpha cluster that gets reset fairly often.
  • I'm very much a newbie to Elastic, so may be missing something obvious.

@brwe
Contributor

brwe commented Apr 8, 2015

@mrec It would be great if you could open a new issue. Please also add a query that finds the duplicates, with the explain option set, and the output of that query like above.
Something like:

curl -s 'http://HOST:PORT/YOURINDEX/_search?pretty&q=_id:A_DUPLICATE_ID&explain&fields=_source,_routing' -d '{"version":true}'

Is there a way you can make the elasticsearch logs from the time of the network issues available?
Also, the output of
curl -s 'http://HOST:PORT/YOURINDEX/_segments?pretty'
might be helpful.
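For hunting duplicates without knowing an id up front, a terms aggregation on _uid with min_doc_count: 2 returns only uids that occur in more than one document. The sketch below only builds the request body; whether _uid is cheap to aggregate on is an assumption about the target cluster (on 1.x this loads _uid fielddata, which can be prohibitively heavy on a 100M+ doc index):

```python
import json

body = {
    "size": 0,  # we only want the aggregation buckets, not the hits
    "aggs": {
        "dup_uids": {
            # every bucket returned occurs in 2+ documents, i.e. is duplicated
            "terms": {"field": "_uid", "min_doc_count": 2, "size": 100}
        }
    },
}
print(json.dumps(body, indent=2))
```

The body would be POSTed to `HOST:PORT/YOURINDEX/_search`.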

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
@clintongormley clintongormley added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Engine :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 13, 2018
fixmebot bot referenced this issue in VectorXz/elasticsearch Apr 22, 2021
fixmebot bot referenced this issue in VectorXz/elasticsearch May 28, 2021
fixmebot bot referenced this issue in VectorXz/elasticsearch Aug 4, 2021