
Duplicate id in index #8788

Closed
bobrik opened this issue Dec 5, 2014 · 14 comments
Assignees
Labels
>bug :Core/Infra/Core Core issues without another label critical :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v1.4.3 v1.5.0 v2.0.0-beta1

Comments

@bobrik
Contributor

bobrik commented Dec 5, 2014

I decided to reindex my data to take advantage of doc_values, but one of 30 indices (~120m docs each) ended up with fewer documents after reindexing. I reindexed again, and docs disappeared again.

Then I bisected the problem down to specific docs and found that some docs in the source index have duplicate ids.

curl -s "http://web245:9200/statistics-20141110/_search?pretty&q=_id:1jC2LxTjTMS1KHCn0Prf1w"
{
  "took" : 1156,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"}
    }, {
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"}
    } ]
  }
}

Here are two indices, source and destination:

health status index                  pri rep docs.count docs.deleted store.size pri.store.size
green  open   statistics-20141110      5   0  116217042            0     12.3gb         12.3gb
green  open   statistics-20141110-dv   5   1  116216507            0     32.3gb         16.1gb

Segments of problematic index:

index               shard prirep ip            segment generation docs.count docs.deleted    size size.memory committed searchable version compound
statistics-20141110 0     p      192.168.0.190 _gga         21322   14939669            0   1.6gb     4943008 true      true       4.9.0   false
statistics-20141110 0     p      192.168.0.190 _isc         24348   10913518            0   1.1gb     4101712 true      true       4.9.0   false
statistics-20141110 1     p      192.168.0.245 _7i7          9727    7023269            0   766mb     2264472 true      true       4.9.0   false
statistics-20141110 1     p      192.168.0.245 _i01         23329   14689581            0   1.5gb     4788872 true      true       4.9.0   false
statistics-20141110 2     p      192.168.1.212 _9wx         12849    8995444            0 987.7mb     3326288 true      true       4.9.0   false
statistics-20141110 2     p      192.168.1.212 _il1         24085   13205585            0   1.4gb     4343736 true      true       4.9.0   false
statistics-20141110 3     p      192.168.1.212 _8pc         11280   10046395            0     1gb     4003824 true      true       4.9.0   false
statistics-20141110 3     p      192.168.1.212 _hwt         23213   13226096            0   1.3gb     4287544 true      true       4.9.0   false
statistics-20141110 4     p      192.168.2.88  _91i         11718    8328558            0 909.2mb     2822712 true      true       4.9.0   false
statistics-20141110 4     p      192.168.2.88  _hms         22852   14848927            0   1.5gb     4777472 true      true       4.9.0   false

The only thing that happened to this index besides indexing was optimizing it down to 2 segments per shard.
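The doc-count gap above (116,217,042 in the source vs 116,216,507 after reindex, i.e. 535 fewer docs) is exactly what collapsing duplicate _ids on reindex would produce. As a hypothetical offline check (assuming you have dumped (_id, _source) pairs, e.g. via a scan/scroll; `find_duplicate_ids` is not part of any client library), duplicates can be counted like this:

```python
from collections import Counter

def find_duplicate_ids(hits):
    """hits: iterable of (_id, _source) pairs, e.g. from a scan/scroll dump."""
    counts = Counter(doc_id for doc_id, _ in hits)
    # keep only ids that occur in more than one document
    return {doc_id: n for doc_id, n in counts.items() if n > 1}

hits = [
    ("1jC2LxTjTMS1KHCn0Prf1w", {"@value": "149"}),
    ("1jC2LxTjTMS1KHCn0Prf1w", {"@value": "149"}),
    ("someOtherId", {"@value": "7"}),
]
assert find_duplicate_ids(hits) == {"1jC2LxTjTMS1KHCn0Prf1w": 2}
```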

@clintongormley
Contributor

Hi @bobrik

Is there any chance this index was written with Elasticsearch 1.2.0?

Please could you provide the output of this request:

curl -s "http://web245:9200/statistics-20141110/_search?pretty&q=_id:1jC2LxTjTMS1KHCn0Prf1w&explain&fields=_source,_routing"

@bobrik
Contributor Author

bobrik commented Dec 9, 2014

Routing is automatically inferred from @key

{
  "took" : 1744,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    }, {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    } ]
  }
}

The index was created on 1.3.4; we upgraded from 1.0.1 to 1.3.2 on 2014-09-22.

@clintongormley
Contributor

Hi @bobrik

Hmm, these two docs are on the same shard! Do you ever run updates on these docs? Could you send the output of this command please?

curl -s "http://web245:9200/statistics-20141110/_search?pretty&q=_id:1jC2LxTjTMS1KHCn0Prf1w&explain&fields=_source,_routing,_version"

@bobrik
Contributor Author

bobrik commented Dec 9, 2014

Of course they are, that's how routing works :)

I didn't run any updates, because my code only does indexing. It doesn't even know the ids that elasticsearch assigns.

{
  "took" : 51,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    }, {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    } ]
  }
}
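Identical routing values always land on the same shard because ES 1.x picks the shard as hash(routing) modulo number_of_shards (the default hash in 1.x is DjbHashFunction). A rough illustration, using plain djb2 as a stand-in hash; the exact bucket numbers below are not what ES computes, only the determinism is the point:

```python
def djb2(s: str) -> int:
    # djb2 string hash, a stand-in for ES 1.x's DjbHashFunction
    h = 5381
    for c in s:
        h = (h * 33 + ord(c)) & 0xFFFFFFFF
    return h

def shard_for(routing: str, number_of_shards: int = 5) -> int:
    # same routing value -> same hash -> same shard, every time
    return djb2(routing) % number_of_shards

r = "client_belarussia_msg_sended_from_mutual__22_1"
assert shard_for(r) == shard_for(r)   # deterministic
assert 0 <= shard_for(r) < 5          # always a valid shard number
```

So two copies of a doc with the same routing can only ever be duplicates within one shard, never spread across shards.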

@clintongormley
Contributor

Sorry @bobrik - I gave you the wrong request, it should be:

curl -s "http://web245:9200/statistics-20141110/_search?pretty&q=_id:1jC2LxTjTMS1KHCn0Prf1w&explain&fields=_source,_routing,version"

And so you're using auto-assigned IDs? Did any of your shards migrate to other nodes, or did a primary fail during optimization?

@s1monw
Contributor

s1monw commented Dec 9, 2014

I think this is caused by #7729. @bobrik, are you coming from < 1.3.3 with this index, and are you using bulk?

@bobrik
Contributor Author

bobrik commented Dec 9, 2014

curl -s 'http://web605:9200/statistics-20141110/_search?pretty&q=_id:1jC2LxTjTMS1KHCn0Prf1w&explain&fields=_source,_routing,version'
{
  "took" : 46,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    }, {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    } ]
  }
}

I bet you wanted this:

curl -s 'http://web605:9200/statistics-20141110/_search?pretty&q=_id:1jC2LxTjTMS1KHCn0Prf1w&explain&fields=_source,_routing' -d '{"version":true}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_version" : 1,
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    }, {
      "_shard" : 3,
      "_node" : "YOK_20U7Qee-XSasg0J8VA",
      "_index" : "statistics-20141110",
      "_type" : "events",
      "_id" : "1jC2LxTjTMS1KHCn0Prf1w",
      "_version" : 1,
      "_score" : 1.0,
      "_source":{"@timestamp":"2014-11-10T14:30:00+0300","@key":"client_belarussia_msg_sended_from_mutual__22_1","@value":"149"},
      "fields" : {
        "_routing" : "client_belarussia_msg_sended_from_mutual__22_1"
      },
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:events#1jC2LxTjTMS1KHCn0Prf1w _uid:markers#1jC2LxTjTMS1KHCn0Prf1w _uid:precise#1jC2LxTjTMS1KHCn0Prf1w _uid:rfm_users#1jC2LxTjTMS1KHCn0Prf1w), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    } ]
  }
}

There were many migrations, but not during optimization, unless ES moves shards after a new index is created. Basically, at 00:00 a new index is created, and at 00:45 optimization of the old indices starts.

@s1monw
Contributor

s1monw commented Dec 9, 2014

Do you have client nodes that are pre-1.3.3?

@bobrik
Contributor Author

bobrik commented Dec 9, 2014

@s1monw the index was created on 1.3.4:

[2014-09-30 12:03:49,991][INFO ][node                     ] [statistics04] version[1.3.3], pid[17937], build[ddf796d/2014-09-29T13:39:00Z]
[2014-09-30 14:03:19,205][INFO ][node                     ] [statistics04] version[1.3.4], pid[89485], build[a70f3cc/2014-09-30T09:07:17Z]

Nov 11 is definitely after Sep 30. Shouldn't be #7729 then.

We don't have client nodes; everything is over HTTP. But yeah, we use bulk indexing and automatically assigned ids.
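For what it's worth, one general way to make bulk retries idempotent is to stop relying on auto-generated ids and derive the _id from the document content; a retried request then overwrites the same document instead of appending a duplicate. A hypothetical sketch (`doc_id` is my illustration, not part of any client library):

```python
import hashlib
import json

def doc_id(doc: dict) -> str:
    # Deterministic id from the canonicalized document body, so a retried
    # bulk request indexes to the same _id rather than creating a new doc.
    payload = json.dumps(doc, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:22]

doc = {"@timestamp": "2014-11-10T14:30:00+0300",
       "@key": "client_belarussia_msg_sended_from_mutual__22_1",
       "@value": "149"}
assert doc_id(doc) == doc_id(dict(doc))  # stable across retries
```

The trade-off is that genuinely identical events (same key, value, and timestamp) would also collapse into one document, which may or may not be what you want for statistics data.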

@clintongormley
Contributor

Hi @bobrik

(you guessed right about version=true :) )

OK - we're going to need more info. Please could you send:

curl -s 'http://web605:9200/statistics-20141110/_settings?pretty'
curl -s 'http://web605:9200/statistics-20141110/_segments?pretty'

@bobrik
Contributor Author

bobrik commented Dec 9, 2014

{
  "statistics-20141110" : {
    "settings" : {
      "index" : {
        "codec" : {
          "bloom" : {
            "load" : "false"
          }
        },
        "uuid" : "JZXC-8C3TFC71EnMGMHSWw",
        "number_of_replicas" : "0",
        "number_of_shards" : "5",
        "version" : {
          "created" : "1030499"
        }
      }
    }
  }
}
{
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "indices" : {
    "statistics-20141110" : {
      "shards" : {
        "0" : [ {
          "routing" : {
            "state" : "STARTED",
            "primary" : true,
            "node" : "hBg3FpLGQw6B9l-Hil2c8Q"
          },
          "num_committed_segments" : 2,
          "num_search_segments" : 2,
          "segments" : {
            "_gga" : {
              "generation" : 21322,
              "num_docs" : 14939669,
              "deleted_docs" : 0,
              "size_in_bytes" : 1729206228,
              "memory_in_bytes" : 4943008,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            },
            "_isc" : {
              "generation" : 24348,
              "num_docs" : 10913518,
              "deleted_docs" : 0,
              "size_in_bytes" : 1254410507,
              "memory_in_bytes" : 4101712,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            }
          }
        } ],
        "1" : [ {
          "routing" : {
            "state" : "STARTED",
            "primary" : true,
            "node" : "ajMe-w2lSIO0Tz5WEUs4qQ"
          },
          "num_committed_segments" : 2,
          "num_search_segments" : 2,
          "segments" : {
            "_7i7" : {
              "generation" : 9727,
              "num_docs" : 7023269,
              "deleted_docs" : 0,
              "size_in_bytes" : 803299557,
              "memory_in_bytes" : 2264472,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            },
            "_i01" : {
              "generation" : 23329,
              "num_docs" : 14689581,
              "deleted_docs" : 0,
              "size_in_bytes" : 1659303375,
              "memory_in_bytes" : 4788872,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            }
          }
        } ],
        "2" : [ {
          "routing" : {
            "state" : "STARTED",
            "primary" : true,
            "node" : "hyUu93q7SRehHBVZfSmvOg"
          },
          "num_committed_segments" : 2,
          "num_search_segments" : 2,
          "segments" : {
            "_9wx" : {
              "generation" : 12849,
              "num_docs" : 8995444,
              "deleted_docs" : 0,
              "size_in_bytes" : 1035711205,
              "memory_in_bytes" : 3326288,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            },
            "_il1" : {
              "generation" : 24085,
              "num_docs" : 13205585,
              "deleted_docs" : 0,
              "size_in_bytes" : 1510021893,
              "memory_in_bytes" : 4343736,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            }
          }
        } ],
        "3" : [ {
          "routing" : {
            "state" : "STARTED",
            "primary" : true,
            "node" : "hyUu93q7SRehHBVZfSmvOg"
          },
          "num_committed_segments" : 2,
          "num_search_segments" : 2,
          "segments" : {
            "_8pc" : {
              "generation" : 11280,
              "num_docs" : 10046395,
              "deleted_docs" : 0,
              "size_in_bytes" : 1143637974,
              "memory_in_bytes" : 4003824,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            },
            "_hwt" : {
              "generation" : 23213,
              "num_docs" : 13226096,
              "deleted_docs" : 0,
              "size_in_bytes" : 1485110397,
              "memory_in_bytes" : 4287544,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            }
          }
        } ],
        "4" : [ {
          "routing" : {
            "state" : "STARTED",
            "primary" : true,
            "node" : "hyUu93q7SRehHBVZfSmvOg"
          },
          "num_committed_segments" : 2,
          "num_search_segments" : 2,
          "segments" : {
            "_91i" : {
              "generation" : 11718,
              "num_docs" : 8328558,
              "deleted_docs" : 0,
              "size_in_bytes" : 953452801,
              "memory_in_bytes" : 2822712,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            },
            "_hms" : {
              "generation" : 22852,
              "num_docs" : 14848927,
              "deleted_docs" : 0,
              "size_in_bytes" : 1673336536,
              "memory_in_bytes" : 4777472,
              "committed" : true,
              "search" : true,
              "version" : "4.9.0",
              "compound" : false
            }
          }
        } ]
      }
    }
  }
}

brwe added a commit to brwe/elasticsearch that referenced this issue Jan 2, 2015
If a bulk index request fails due to a disconnect, an unavailable shard, etc., the request is
retried once before actually failing. However, even in case of failure the documents
might already be indexed. For autogenerated ids the retried request must not add the
documents again, and therefore canHaveDuplicates must be set to true.

closes elastic#8788
@brwe brwe closed this as completed in f45e6ae Jan 2, 2015
brwe added a commit that referenced this issue Jan 2, 2015
brwe added a commit that referenced this issue Jan 2, 2015
brwe added a commit that referenced this issue Jan 2, 2015
@brwe
Contributor

brwe commented Jan 2, 2015

Reopening because the test added with #9125 just failed and the failure is reproducible (about 1 in 10 runs with the same seed and added stress), see http://build-us-00.elasticsearch.org/job/es_core_master_window-2012/725/

@brwe brwe reopened this Jan 2, 2015
brwe added a commit to brwe/elasticsearch that referenced this issue Jan 28, 2015
When an indexing request is retried (due to a lost connection, node shutdown, etc.),
a flag 'canHaveDuplicates' is set to true on the indexing request
that is sent the second time. This was to make sure that even
when an indexing request for a document with an autogenerated id comes in,
we do not have to update unless this flag is set, and instead only append.

However, it might happen that, on a retry or on replication, the
indexing request that has canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that has it set to false.
In this case both requests add a document, and we end up with a duplicate.
This commit adds a workaround: remove the optimization for auto-generated
ids and always update the document.
The assumption is that this will not slow down indexing by more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes elastic#8788
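The race in that commit message can be modeled in a few lines. The toy engine below is my simplification, not the real InternalEngine: it appends blindly when canHaveDuplicates is false (the auto-generated-id optimization) and deletes-then-adds when it is true. If the retried request (flag true) overtakes the original (flag false), the blind append creates the duplicate:

```python
class ToyEngine:
    def __init__(self):
        self.docs = []  # (id, source) pairs, standing in for a Lucene index

    def index(self, doc_id, source, can_have_duplicates):
        if can_have_duplicates:
            # safe path: remove any existing doc with this id, then add
            self.docs = [(i, s) for i, s in self.docs if i != doc_id]
            self.docs.append((doc_id, source))
        else:
            # optimization: auto-generated id assumed fresh, blind append
            self.docs.append((doc_id, source))

engine = ToyEngine()
# retried request (canHaveDuplicates=True) arrives BEFORE the original (False):
engine.index("1jC2LxTjTMS1KHCn0Prf1w", {"@value": "149"}, can_have_duplicates=True)
engine.index("1jC2LxTjTMS1KHCn0Prf1w", {"@value": "149"}, can_have_duplicates=False)
assert len(engine.docs) == 2  # duplicated document
```

In the intended order (original first, retry second) the delete-then-add on the retry leaves a single copy, which is why the bug only shows up when the requests are reordered.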
brwe added a commit that referenced this issue Jan 29, 2015
This PR removes the optimization for auto-generated ids.
Previously, when ids were auto-generated by elasticsearch, there was no
check whether a document with the same id already existed; instead the new
document was simply appended. However, due to Lucene improvements this
optimization no longer adds much value. In addition, under rare circumstances it might
cause duplicate documents:

When an indexing request is retried (due to a lost connection, node shutdown, etc.),
a flag 'canHaveDuplicates' is set to true on the indexing request
that is sent the second time. This was to make sure that even
when an indexing request for a document with an autogenerated id comes in,
we do not have to update unless this flag is set, and instead only append.

However, it might happen that, on a retry or on replication, the
indexing request that has canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that has it set to false.
In this case both requests add a document, and we end up with a duplicate.
This commit adds a workaround: remove the optimization for auto-generated
ids and always update the document.
The assumption is that this will not slow down indexing by more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes #8788
closes #9468
brwe added a commit that referenced this issue Jan 29, 2015
@brwe brwe closed this as completed in 0a07ce8 Jan 29, 2015
brwe added a commit that referenced this issue Feb 4, 2015
@mrec

mrec commented Apr 8, 2015

We've just seen this issue for the second time. The first time produced only a single duplicate; this time produced over 16000, across a comparatively tiny index (< 300k docs). We're using 1.3.4, doing bulk indexing with the Java client API's BulkProcessor and TransportClient.

However, we're not using autogenerated ids, so from my reading of the fix for this issue it's unlikely to help us. Should I open a separate issue, or should this one be reopened?

Miscellaneous other info:

  • The index has not been migrated from an earlier version.
  • Around the time the duplicates appeared, we saw problems in other (non-Elastic) parts of the system. I can't see any way that they could directly cause the duplication, but it's possible that network issues were the common cause of both.
  • We still have the index containing duplicates for now, though it may not last long; this is on an alpha cluster that gets reset fairly often.
  • I'm very much a newbie to Elastic, so may be missing something obvious.

@brwe
Contributor

brwe commented Apr 8, 2015

@mrec It would be great if you could open a new issue. Please also add a query that finds the duplicates, with the explain option set, and the output of that query like above.
Something like:

curl -s 'http://HOST:PORT/YOURINDEX/_search?pretty&q=_id:A_DUPLICATE_ID&explain&fields=_source,_routing' -d '{"version":true}'

Is there a way you can make the elasticsearch logs from the time of the network issues available?
Also, the output of
curl -s 'http://HOST:PORT/YOURINDEX/_segments?pretty'
might be helpful.
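For hunting duplicates without knowing an id up front, a terms aggregation on _uid with min_doc_count: 2 returns only uids that occur in more than one document. The sketch below only builds the request body; whether _uid is cheap to aggregate on is an assumption about the target cluster (on 1.x this loads _uid fielddata, which can be prohibitively heavy on a 100M+ doc index):

```python
import json

body = {
    "size": 0,  # we only want the aggregation buckets, not the hits
    "aggs": {
        "dup_uids": {
            # every bucket returned occurs in 2+ documents, i.e. is duplicated
            "terms": {"field": "_uid", "min_doc_count": 2, "size": 100}
        }
    },
}
print(json.dumps(body, indent=2))
```

The body would be POSTed to `HOST:PORT/YOURINDEX/_search`.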

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
@clintongormley clintongormley added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Engine :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 13, 2018
fixmebot bot referenced this issue in VectorXz/elasticsearch Apr 22, 2021
fixmebot bot referenced this issue in VectorXz/elasticsearch May 28, 2021
fixmebot bot referenced this issue in VectorXz/elasticsearch Aug 4, 2021