added force_document_id option to ES output enable resend data avoiding duplicated ES documents, fix #7891 #8019
…oiding duplicated ES documents, fix influxdata#7891
I think I'd rather rename force_document_id to avoid_duplicates or something, but is there any reason not to turn this on for everyone and remove the config option? Under some circumstances, Telegraf assumes it can resend the same metrics with no ill effects downstream. I can't imagine a case where it'd be desirable to see all the duplicates.
Hello @ssoroka, thank you for the fast review. About the name "force_document_id": I chose it because "document_id" is a term familiar to people who work with ES and Logstash, and it could help them identify the same property in both tools (Telegraf / Logstash).
About turning the property on by default: I set it to false by default to keep this change backwards compatible. I can't know how everyone sending Telegraf data to Elastic uses it, and setting the property to true by default would be a breaking change in some rare cases, so I'd prefer to maintain backwards compatibility for now and perhaps change the default in the future. Anyway, I'm open to consensus and to changing both things if other people give us their opinion. Thank you very much.
@lpic10 do you want to weigh in here before I merge?
Concerning the default option, I don't know if there is a valid scenario in which Telegraf users would want the same data stored twice (maybe when there are genuinely duplicated log lines or metric points?). I understand that InfluxDB does that deduplication by timestamp and tags/fields automatically, so storing duplicates is probably not something most people expect. It could make sense to have this enabled by default on the ES output even if there is a potential performance impact on the Telegraf side. About the config name, there seems to be no consistency among the other tools sending data to ES (e.g. fluentd calls it "hash_id" and beats/logstash calls it "fingerprint"). But in all cases this option is configurable. For me both
Ok. I think I like the idea of changing this to avoid duplicates by default, since this is more likely the expected behavior. In that case, I'd make a new config option called
IMHO, in the Elasticsearch world it's not natural to perform this kind of deduplication, and Elasticsearch users probably do not expect this behavior by default. Also, this behavior has a penalty in write throughput. From the official Elastic documentation: 'Use auto-generated ids
That's good feedback @melodous. I'll change my position to say we should choose the faster option by default and allow users interested in this to turn it on.
@toni-moreno I'd say it's good to merge whenever you want. Let me know if you plan to rename or want to merge as is.
…oiding duplicated ES documents, fix influxdata#7891 (influxdata#8019)
This PR solves #7891
The document ID is computed as sha256(concat(timestamp, measurement, series-hash)), which makes it possible to resend or update data without creating duplicated ES documents.
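A minimal sketch of how such a deterministic ID could be derived, assuming the timestamp is taken in nanoseconds and the series hash is an unsigned 64-bit value. The helper name `documentID`, its parameter types, and the decimal string encoding of each part are illustrative assumptions, not the code from this PR.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// documentID is a hypothetical helper (not the PR's actual code). It hashes
// the concatenation of timestamp, measurement name and series hash, so the
// same metric always maps to the same Elasticsearch document ID.
// Assumption: the parts are concatenated as plain strings without a separator.
func documentID(tsNanos int64, measurement string, seriesHash uint64) string {
	h := sha256.New()
	h.Write([]byte(strconv.FormatInt(tsNanos, 10)))
	h.Write([]byte(measurement))
	h.Write([]byte(strconv.FormatUint(seriesHash, 10)))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	const ts int64 = 1600000000000000000 // example timestamp in nanoseconds

	first := documentID(ts, "cpu", 42)
	resend := documentID(ts, "cpu", 42)  // same metric sent again
	later := documentID(ts+1, "cpu", 42) // same series, different timestamp

	fmt.Println(first == resend) // identical inputs give an identical ID
	fmt.Println(first == later)  // a different timestamp gives a different ID
}
```

Because a resent metric yields the same ID, indexing it again would overwrite the existing document instead of adding a new one, which is the trade-off against auto-generated IDs discussed in the review.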
Tested by loading the same data several times; no duplicated documents were generated.
I hope this PR helps everybody.