Serialize DateTime as long to improve JSON serde performance #4038
Conversation
How does it compare to smile encoding? That is what is used by brokers/historicals.
It works with smile encoding; it is not a replacement. It is based on jsonMapper and smileMapper. @b-slim
@kaijianding this test is reporting the results for the default JSON ser/deser, and my question was whether it would be the same improvement if you use the smile mapper.
Yes, this improvement applies to both jsonMapper and smileMapper. I updated the benchmark code to use smileMapper, please check. @b-slim
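For illustration, the long-form serialization is just a Jackson serializer registered through a module, so the same module plugs into both a JSON-backed and a Smile-backed ObjectMapper. The following is a minimal sketch assuming Joda-Time, jackson-databind, and jackson-dataformat-smile on the classpath; it is not the PR's actual code:

```java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.JsonSerializer;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializerProvider;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;
import org.joda.time.DateTime;

import java.io.IOException;

public class DateTimeAsLongModuleSketch
{
  public static void main(String[] args) throws IOException
  {
    // A Jackson module that writes Joda DateTime as epoch millis instead of an ISO string.
    SimpleModule module = new SimpleModule("DateTimeAsLong");
    module.addSerializer(DateTime.class, new JsonSerializer<DateTime>()
    {
      @Override
      public void serialize(DateTime value, JsonGenerator gen, SerializerProvider serializers) throws IOException
      {
        gen.writeNumber(value.getMillis());
      }
    });

    // The same module plugs into a JSON-backed mapper and a Smile-backed mapper.
    ObjectMapper jsonMapper = new ObjectMapper().registerModule(module);
    ObjectMapper smileMapper = new ObjectMapper(new SmileFactory()).registerModule(module);

    DateTime now = DateTime.now();
    System.out.println(jsonMapper.writeValueAsString(now));         // a bare number, e.g. 1494389954000
    System.out.println(smileMapper.writeValueAsBytes(now).length);  // same value in Smile's binary form
  }
}
```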
@kaijianding What is performance like if you serialize DateTimes as a timestamp + timezone offset, using a JSON array of
Usually the broker and historicals use the same timezone, so the current implementation is good enough for most cases.
Hmm, I don't think we need to worry about brokers/historicals being in non-UTC time zones, we have always said this is not a supported configuration. I was more thinking of supporting cases where a historical would return non-UTC time zones for other reasons, like when the "granularity" is a PeriodGranularity with a non-UTC time zone.
I tested timeseries and groupBy queries, and the results are correct for a non-UTC PeriodGranularity (we are in the China time zone and use a +08:00 PeriodGranularity all the time; the results are as expected). These two queries call granularity.truncate(), so the time zone is not a problem. I updated the benchmark code to test [timestamp, timezone offset] as a [long, int] pair. In this format it is slower than the pure timestamp format, but I think it is still good enough. @gianm
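For reference, the [timestamp, timezone offset] format being compared could be written with a Jackson serializer along these lines. This is only an illustrative sketch assuming Joda-Time and Jackson, not the code used in the PR or the benchmark:

```java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.JsonSerializer;
import com.fasterxml.jackson.databind.SerializerProvider;
import org.joda.time.DateTime;

import java.io.IOException;

/**
 * Writes a DateTime as [epochMillis, zoneOffsetMillis] so the receiver can
 * rebuild a fixed-offset timestamp. Slightly slower than writing a bare long,
 * but it preserves the time zone offset.
 */
public class DateTimeAsArraySerializer extends JsonSerializer<DateTime>
{
  @Override
  public void serialize(DateTime value, JsonGenerator gen, SerializerProvider serializers) throws IOException
  {
    gen.writeStartArray();
    gen.writeNumber(value.getMillis());
    gen.writeNumber(value.getZone().getOffset(value.getMillis()));
    gen.writeEndArray();
  }
}
```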
Can the benchmark be added to https://github.com/druid-io/druid/tree/master/benchmarks in the same style as is used there? That way the testing is standardized.
I just wrote some benchmark code to prove this PR can improve JSON ser/deser performance, but it is not really a benchmark against the current Druid code.
That's something nice (again). Would it be possible to get that feature also when talking to a historical directly?
The historical also uses QueryResource, so it works when talking to a historical directly, as long as you specify the serializeDateTimeAsLong flag in the query context. @KenjiTakahashi
Oh, it's in the context! I somehow thought it was an option directly on the query. Works fine, thanks.
It reduces time by 10%~20% in huge groupBy queries in my production environment. For
10%~20% sounds about right. In #3740 we benchmarked a roughly 30% improvement for removing timestamps completely (set to null) from the historical -> broker communication when they aren't necessary.
Should I serialize and deserialize the time zone in this PR? It may yield less improvement. My case uses a PeriodGranularity with a +08:00 time zone; groupBy and timeseries queries can handle the long timestamp correctly via granularity.truncate(), so there is no need to serialize and deserialize the time zone for these two queries. @gianm
@kaijianding hmm, maybe time zone isn't necessary then, if the granularity object on the broker "knows" the proper time zone and can reapply it. It sounds like that's what you're saying, in which case just using the millis should be fine. Really I'm just pushing to do something we can have on all the time, automatically, rather than something that requires users to set a special context flag. Then a lot more users will get the benefit, since most users just stick with defaults for things like this.
OK. Then I will make serializeDateTimeAsLongInner=true the default, and serialize and deserialize the time zone. @gianm
I've seen about 20% as well, back when we were profiling. There are multiple problems with JSON as an efficient serialization for larger sets of data; maybe at some point we'd be able to introduce something better for this purpose?
Yeah, it makes sense to explore options other than JSON. Druid uses JSON since all the original query types (timeBoundary, timeseries, topN, search) return relatively small result sets and so serde overhead of results doesn't matter much. But for groupBy, select, and scan, it does matter.
Can you fix conflicts and address the latest comment from @gianm about enabling the long serialization by default? Can you also verify that leaving out the time zone in the serialization works for queries other than GroupBy and Timeseries? We're planning on wrapping up the 0.10.1 release this week, this PR seems useful to include if possible.
👍
@@ -18,6 +18,8 @@ The query context is used for various query configuration parameters. The follow
|bySegment | `false` | Return "by segment" results. Primarily used for debugging, setting it to `true` returns results associated with the data segment they came from |
|finalize | `true` | Flag indicating whether to "finalize" aggregation results. Primarily used for debugging. For instance, the `hyperUnique` aggregator will return the full HyperLogLog sketch instead of the estimated cardinality when this flag is set to `false` |
|chunkPeriod | `P0D` (off) | At the broker node level, long interval queries (of any type) may be broken into shorter interval queries to parallelize merging more than normal. Broken up queries will use a larger share of cluster resources, but may be able to complete faster as a result. Use ISO 8601 periods. For example, if this property is set to `P1M` (one month), then a query covering a year would be broken into 12 smaller queries. The broker uses its query processing executor service to initiate processing for query chunks, so make sure "druid.processing.numThreads" is configured appropriately on the broker. [groupBy queries](groupbyquery.html) do not support chunkPeriod by default, although they do if using the legacy "v1" engine. |
|serializeDateTimeAsLong| `false` | If true, DateTime is serialized as long in the result returned by the broker and in the data transport between the broker and compute node |
|serializeDateTimeAsLongInner| `false` | If true, DateTime is serialized as long in the data transport between the broker and compute node |
I would call this serializeDateTimeAsLongInternal
Also, if setting this flag always improves performance, I would just make the default 'true'.
👍
It is much faster to serialize and deserialize DateTime as long in the JSON transport between the broker and historicals if the user is sure there is no time zone problem.
It is very useful when issuing a select query or a big groupBy query whose granularity is small, like PT10S.
serializeDateTimeAsLongInner=true means DateTime is serialized as long only between the broker and historicals.
serializeDateTimeAsLong=true means DateTime is also serialized as long in the result returned to the user, for users who need a long instead of a string, e.g. to display the DateTime in a different format.
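For a client (or the broker) that receives the long form but still wants DateTime objects, a tolerant deserializer can accept either representation. The following is a minimal sketch assuming Joda-Time and Jackson, not the PR's actual implementation:

```java
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.DeserializationContext;
import com.fasterxml.jackson.databind.JsonDeserializer;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;

import java.io.IOException;

/**
 * Accepts either an epoch-millis long (written when DateTime-as-long is
 * enabled) or the default ISO-8601 string, and returns a UTC DateTime.
 */
public class TolerantDateTimeDeserializer extends JsonDeserializer<DateTime>
{
  @Override
  public DateTime deserialize(JsonParser p, DeserializationContext ctxt) throws IOException
  {
    if (p.getCurrentToken() == JsonToken.VALUE_NUMBER_INT) {
      return new DateTime(p.getLongValue(), DateTimeZone.UTC);
    }
    return DateTime.parse(p.getText());
  }
}
```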
The benchmark code and its results show that the improvement applies to both jsonMapper and smileMapper.
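The original benchmark code and numbers are not reproduced here. A rough, self-contained micro-benchmark in the same spirit might look like the sketch below, assuming jackson-databind, jackson-datatype-joda, and joda-time on the classpath; class and constant names are hypothetical, it uses a plain timing loop instead of the JMH setup used in Druid's benchmarks module, and real measurements should use JMH:

```java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.JsonSerializer;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;
import com.fasterxml.jackson.databind.SerializerProvider;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.fasterxml.jackson.datatype.joda.JodaModule;
import org.joda.time.DateTime;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DateTimeSerdeBenchmarkSketch
{
  private static final int ROWS = 100_000;
  private static final int ITERATIONS = 20;

  public static void main(String[] args) throws IOException
  {
    // Synthetic result rows: one DateTime every 10 seconds.
    List<DateTime> rows = new ArrayList<>(ROWS);
    for (int i = 0; i < ROWS; i++) {
      rows.add(new DateTime(1_500_000_000_000L + i * 10_000L));
    }

    // Baseline: ISO-8601 strings, roughly what the default result serialization produces.
    ObjectMapper isoMapper = new ObjectMapper()
        .registerModule(new JodaModule())
        .disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);

    // Candidate: DateTime written as epoch millis.
    SimpleModule asLong = new SimpleModule("DateTimeAsLong");
    asLong.addSerializer(DateTime.class, new JsonSerializer<DateTime>()
    {
      @Override
      public void serialize(DateTime value, JsonGenerator gen, SerializerProvider serializers) throws IOException
      {
        gen.writeNumber(value.getMillis());
      }
    });
    ObjectMapper longMapper = new ObjectMapper().registerModule(asLong);

    // Crude timing loop with no JVM warmup; use JMH for real measurements.
    System.out.println("ISO string : " + time(isoMapper, rows) + " ms");
    System.out.println("long millis: " + time(longMapper, rows) + " ms");
  }

  private static long time(ObjectMapper mapper, List<DateTime> rows) throws IOException
  {
    long start = System.nanoTime();
    for (int i = 0; i < ITERATIONS; i++) {
      mapper.writeValueAsBytes(rows);
    }
    return (System.nanoTime() - start) / 1_000_000;
  }
}
```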