-
Notifications
You must be signed in to change notification settings - Fork 105
Document schema conversion on data import #1594
Conversation
23f52dc
to
8d1e109
Compare
docs/whisper-data-import.md
Outdated
|
||
### Schema conversion | ||
|
||
The whisper importer reader requires the user to provide the path to the storage schemas used at the import destination, it then reads and parses those schemas. During the import it iterates over all the whisper files that need to be imported and processes each of them as following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make it very clear this is NOT the schemas file used locally on the graphite setup.
docs/schema-conversion-logic.md
Outdated
|
||
This is how the schema conversion logic generates a destination archive: | ||
|
||
* In the set of source archives which it has received as input, it finds the archive with the shortest retention which still contains enough data to satisfy the retention of the destination archive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does that mean in this case, that it will try to use the first archive? that seems suboptimal:
dest: 15s:1d
src: 1s:1d,15s:5d
worse what if the src is 10s:1d,15s:5d
? It can't convert 10s to 15s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you just found a bug. I was going through that scenario where 1s:1d,15s:5d
gets converted to 15s:1d
, and it doesn't produce the expected result. let me write a unit test for that and then fix it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll consider that high prio because we're planning to use the schema conversions in a prod instance soon. I don't think we'll have to do a conversion which hits this bug there, afaict, but i'd still rather get it fixed asap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not to distract from the bug here, but 10s:1d,15s:5d
isnt a valid retention policy, at least not when using Whisper. https://graphite.readthedocs.io/en/latest/whisper.html#archives-retention-and-precision
the precision of a longer retention archive must be divisible by precision of next lower retention archive.
Should we also enforce this in Metrictank?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch -> #1599
docs/schema-conversion-logic.md
Outdated
This is how the schema conversion logic generates a destination archive: | ||
|
||
* In the set of source archives which it has received as input, it finds the archive with the shortest retention which still contains enough data to satisfy the retention of the destination archive | ||
* In the same set of source archives, it finds the archive with the longest retention that still has a higher or equal resolution as the destination archive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similarly here, will it pick the 2nd one? the first one seems better suited
dest: 4h:1y
src: 2h:2y,3h:3y
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that if the 1st (high resolution) one has enough data, then the result is going to be better if we use it to re-generate the rollup with the adjusted aggregation span. but at the moment it would use the 2nd one and generate an approximation based on it. so this correctly documents the behavior, which we can change in the future with another PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, does your existing PR fix it? link to it here. or does it not? -> can you create a new ticket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/schema-conversion-logic.md
Outdated
#### Increasing the interval | ||
|
||
In the above described process, it can happen that some of the used input archives have a lower interval than the destination archive. In this case the conversion process will use the same aggregation logic as Metrictank uses internally to generate rollups to decrease the resolution (increase interval) of the input archive(s) to make it match the destination archive. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens when going from interval 10s to 15s?
or 4m to 5m ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still does an approximation which is certainly not perfect, but as good as possible. When #1609 is done then this should be a bit smarter and regenerate those archives from a higher resolution input, if possible. but in some situations there may not be any higher resolution input available, in these cases it will still do such an "unclean" conversion
It is not simple :D this looks pretty good. the examples help a lot |
Agreed, good call on the examples |
Is there something else that needs to be changed before merging this? I know that it currently documents a behavior which is not optimal, as concluded in the comments, but it does accurately document the current behavior. Once we improve the behavior, then we can update the doc to reflect the improved behavior accordingly. There is already an issue to do so #1609 |
e583517
to
3f71d6f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there something else that needs to be changed before merging this? I know that it currently documents a behavior which is not optimal, as concluded in the comments, but it does accurately document the current behavior. Once we improve the behavior, then we can update the doc to reflect the improved behavior accordingly. There is already an issue to do so #1609
I'll defer to @Dieterbe , but aside from my minor/probably skippable comment, I don't see any issues
README.md
Outdated
@@ -78,6 +78,8 @@ Otherwise data loss of current chunks will be incurred. See [operations guide]( | |||
* [Graphite](https://github.com/grafana/metrictank/blob/master/docs/graphite.md) | |||
* [Metadata](https://github.com/grafana/metrictank/blob/master/docs/metadata.md) | |||
* [Tags](https://github.com/grafana/metrictank/blob/master/docs/tags.md) | |||
* [Whisper data import](https://github.com/grafana/metrictank/blob/master/docs/whisper-data-import.md) | |||
* [Schema conversion logic](https://github.com/grafana/metrictank/blob/master/docs/schema-conversion-logic.md) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these 2 are way too related to be split into separate features.
can't we just have 1 page about data importing, with a section about whisper imports, and a section about schema conversion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I just split it because I figured it is likely that sooner or later the schema conversion logic would also be used for something other than the Whisper importer. But if that happens then we can just split it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have consolidated these two pages
Co-Authored-By: Dieter Plaetinck <[email protected]>
24cca18
to
42d877e
Compare
docs/data-importing.md
Outdated
* Generates chunks from the points which have been generated by the schema conversion logic | ||
* Chunks are then sent to the importer-writer, which writes them into the store used by the destination cluster | ||
|
||
Note that this conversion allows to convert from one storage-schema to another, but it does not support changes to the storage-aggregations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could use more clarification.
If src aggregation says "avg", then we can't create "max". does that mean it will error? what will happen?
what happens if we want, say "avg" and "sum", or "avg", "sum" and "cnt"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
none of the tools or functions in the code allow the user to specify what aggregation functions they want to be used. all the whisper-importer-reader currently does is to read the whisper header (which includes the aggregation function) and then it uses that aggregation function.
if somebody would import data into a Metrictank which has a storage-aggregations.conf
that specifies different aggregation functions than were used in the whisper files, then the resulting situation would be the same as if another Metrictank with different storage-aggregations.conf
would have persisted its rollups in the store, the data just wouldn't be found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this is super useful. can you please add this to the document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have added it bff5e58
docs/data-importing.md
Outdated
|
||
# Schema conversion logic | ||
|
||
Metrictank comes with schema conversion logic (`mdata/importer/converter.go`) which can be used to convert data from one storage schema into another one. This is currently only used by the Whisper importer, but may in the future also be used to import from other types of data sources. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our end user docs should not talk about specific source code.
if we want to describe our code in detail to developers, then it should go in devdocs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed the reference to that file.
documents how the whisper importer and how the schema conversion logic works.
it's not simple, so i tried to illustrate it with examples.
fixes #1583