-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prevent duplicate time, sequence_number on write #695
Conversation
Seems like a good catch. Is this just 'time', or if any column is duplicated, will this happen? I'm not sure, since I am new to this code base, but I think this kind of validation should happen higher up the API stack -- in src/api/http/api.go. Sanitize all incoming JSON there, removing all duplicates for example (without actually caring what column names are). Today it may be the 'time' column, tomorrow it will be something else. What about adding a unit test? |
I think sanitize all JSON leads to performance degradation, And the func ConvertToDataStoreSeries at src/common/serialize_series.go treats "time" and "sequence_number" exceptionally. So if the other column is duplicated, last column wins. Main objective of pull request above is to prevent panic which leads to broken shard data. You can test by send write request with duplicate "time" field. After that, influxdb panics and you cannot repair old series data. |
What I mean is can you add a unit test to your change which sends duplicate 'time' columns, and confirms that the system doesn't blow up? And that the later time wins? |
Oh, by "sanitize", I just mean the column field, removing any dupes, since it seems to have a fairly constrained specification. I agree that sanitizing the entire JSON blob doesn't make much sense. |
What about points? When you remove duplicate column, you have to remove obsolete point too because serialize series loops over Columns. (and that means "doesn't make sense" situation) Think about this situation.
If you remove "b" only, column "c" will have value 2, like this.
I don't think unit test helps. Also, tests on master is already failing on my macbook.
|
Good point about the corresponding data -- I hadn't thought of that. I was only thinking in terms of duplicate column names, no more. I still don't follow why, with your fix in place, a unit test couldn't be added to prove this issue was no longer an issue. It would prevent a regression in the future. Isn't that useful? |
Special case "time" and "sequence_number" is limited to func ConvertToDataStoreSeries. Anyone who will work on this source will notice that without reading unit test. And. as I said, test on master is failing already. Not working test is meaningless. Don't you think? I added tests anyway. |
I merged in the pr but changed the implementation to check for duplicates in the |
@jvshahid -- maybe I am missing something, but I don't see how the PR that was merged differs from what was first proposed. Where is the code that checks for dupes in Columns? |
I was so curious about those needless conversations... P.S. Your patch is much better than mine. |
Oh forgot to mention. Do you mind signing the CLA. You can find it on the website, http://influxdb.com/community/cla.html |
I have signed CLA. thank you. |
duplicate "time" or "sequence_number" column leads to panic. prevent this by filter out invalid request.
test request
result