Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default Vector timestamp output format is incompatible with default Clickhouse settings #5797

Open
nnz1024 opened this issue Jan 2, 2021 · 3 comments
Labels
sink: clickhouse Anything `clickhouse` sink related type: bug A code related bug.

Comments

@nnz1024
Copy link

nnz1024 commented Jan 2, 2021

Vector Version

vector 0.11.1 (v0.11.1 x86_64-unknown-linux-musl 2020-12-17)

Vector Configuration File

[sinks.haproxy_ch]
    type = "clickhouse"
    inputs = ["sortout.haproxy_access"]

    endpoint = "http://127.0.0.1:8123"

    database = "logs"
    table = "haproxy"

    encoding.only_fields = [ "timestamp",  "..."]

Field timestamp previously coerced to Vector timestamp

[transforms.json_coercer]
    type = "coercer"
    inputs = ["json_parser"]

    types."time" = "timestamp|%e/%B/%Y:%T %z"

Column timestamp in Clickhouse is DateTime

CREATE TABLE logs.haproxy
(
    `timestamp` DateTime,
    ...

Expected Behavior

Default Vector output timestamp format is expected to be compatible with default Clickhouse timestamp input format.

Actual Behavior

It is not. Clickhouse' DateTime parsing is determined by date_time_input_format setting, which allows two options:

  • basic — as documentation says, "ClickHouse can parse only the basic YYYY-MM-DD HH:MM:SS or YYYY-MM-DD format". This is the default behavior.
  • best_effort — can also parse ISO 8601 formats.

Vector' Clickhouse sink have option encoding.timestamp_format, which also may accept two options: rfc3339 (the default) or unix.

When using default Vector Clickhouse sink with default encoding.timestamp_format together with Clickhouse with default date_time_input_format, Clickhouse rejects data with error

Code: 27, e.displayText() = DB::Exception: Cannot parse input: expected '\"' before: 'Z\"}\\n': (while read the value of key timestamp): (at row 1)\n`

Proposed solutions

Setting Vector to use Unix timestamps in Clickhouse sink solves problem for me:

    encoding.timestamp_format = "unix"

However, I think this must be at least noted in the documentation.
Also, maybe it is worth to consider changing default value encoding.timestamp_format to unix, to get the correct behavior "out of the box".

Adding new timestamp_format option basic may also be an option, however, this cannot be the optimal default value IMHO, because timezone interpretation by Clickhouse may depend on column setting or, if none set, from system timezone, which is may be useful only when selected explicitly, i.e. user knows what she/he is doing.

References

#1634

@nnz1024 nnz1024 added the type: bug A code related bug. label Jan 2, 2021
@jszwedko jszwedko added the sink: clickhouse Anything `clickhouse` sink related label Aug 5, 2022
@ethack
Copy link

ethack commented Sep 8, 2022

I'm using the following workaround in VRL to format my timestamp in the default way ClickHouse expects:

.ts = format_timestamp(.ts, "%F %T%.9f") ?? .ts

I read in a ClickHouse issue that the basic date format is faster than the best_effort setting. I can't find the issue anymore, but the best effort path tries parsing a bunch of ways so it make sense.

@DarkWanderer
Copy link
Contributor

DarkWanderer commented Dec 20, 2022

The downside to using unix timestamp is that it is in seconds. best_effort may be slower but at the moment it is the only way to handle DateTime64. The best solution to this would be to add explicit RFC3339 option to ClickHouse

P.S. I'm trying to add support for best_effort in #15644.

P.P.S. There may be an argument for having it "on" as default for future releases - so that the sink works by default. An option for optimization by switching to unix and second-precision timestamp will also be retained.

@suiluj
Copy link

suiluj commented Jun 1, 2023

@DarkWanderer thanks for adding this option! the option date_time_best_effort: true worked for me.

First i tried to convert the timestamp field to unix nanoseconds and set the encoding to unix in the sink

.timestamp = format_timestamp!(.timestamp, "%s%f")

But this did not work for some reasong (perhaps wrong paramters or format). My guess is that the sink setting overwrites the transform format or something like that.
I don't know if this would even result in a better performance.

Your date_time_best_effort: true option works perfectly out of the box so I just keep it! :)

Part of my yaml configuration of the clickhouse sink:

internal_vector_clickhouse:
    type: clickhouse
    inputs:
      - transform_internal_logs
    date_time_best_effort: true
    encoding:
        timestamp_format: rfc3339

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sink: clickhouse Anything `clickhouse` sink related type: bug A code related bug.
Projects
None yet
Development

No branches or pull requests

5 participants