
_tracking lacks sub-second data precision on time #42

Open
bryanlandia opened this issue Jul 24, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@bryanlandia

Problem:
Data analytics teams may need to differentiate between events at the sub-second level, but _tracking uses a simple DateTime field.

Suggestion:
Consider using at least DateTime64(3) for millisecond precision, since the tracking logs themselves have microsecond specificity.

-- add a millisecond-precision column and backfill it from the existing data
ALTER TABLE _tracking ADD COLUMN time_ms DateTime64(3);
-- note: ALTER TABLE ... UPDATE runs as an asynchronous mutation in ClickHouse;
-- wait for it to finish (check system.mutations) before dropping the old column
ALTER TABLE _tracking UPDATE time_ms = toDateTime64(time, 3) WHERE 1;
ALTER TABLE _tracking DROP COLUMN time;
ALTER TABLE _tracking RENAME COLUMN time_ms TO time;

then do the same for the events table

I believe the Vector parse_timestamp function as written should include the milliseconds from tracking log entries, so they should be available.
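For what it's worth, a Vector remap transform that preserves sub-second precision might look like the sketch below. This is hypothetical: the transform name, input name, and exact format string in the deployment's actual Vector config may differ; the key point is that a format string containing %.f captures the fractional seconds.

```toml
[transforms.parse_tracking]
type = "remap"
inputs = ["tracking_logs"]
source = '''
# sketch: parse_timestamp keeps the fractional-second digits when the
# format string includes %.f
.time = parse_timestamp!(.time, format: "%Y-%m-%dT%H:%M:%S%.f%z")
'''
```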

@DawoudSheraz DawoudSheraz moved this from Pending Triage to Backlog in Tutor project management Jul 25, 2024
@bryanlandia
Author

bryanlandia commented Jul 31, 2024

On a very busy server it turned out we were getting duplicate time values in _tracking even with millisecond precision, so I had to change it to a DateTime64(4). Many of the duplicates came from automated health-check requests, but not all of them.
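As a quick way to check whether a given precision is enough, a query along these lines (a sketch against the _tracking table from above) counts events that share an identical timestamp:

```sql
-- count events sharing an identical timestamp; any rows returned
-- mean the current precision cannot distinguish all events
SELECT time, count() AS n
FROM _tracking
GROUP BY time
HAVING n > 1
ORDER BY n DESC
LIMIT 10;
```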

@bryanlandia
Author

bryanlandia commented Aug 1, 2024

I'd just use DateTime64(6), since a DateTime64 takes 8 bytes regardless of precision. You can go up to 9 decimal places for seconds, but the tracking logs themselves only go to microseconds.
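To illustrate (a hedged sketch; the literal timestamp here is made up), DateTime64(6) retains the full microsecond value from a log entry:

```sql
-- DateTime64(6) keeps all six fractional digits of the input
SELECT toDateTime64('2024-07-31 12:00:00.123456', 6) AS t;
```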

@bryanlandia
Author

Some more info: you will also need to update the Vector ClickHouse sink with...

[sinks.clickhouse]
...
# support time field with microseconds
date_time_best_effort = true
encoding.timestamp_format = "rfc3339"

As of Vector 0.34 it supports encoding.timestamp_format = "unix_us", which should also work without requiring date_time_best_effort.

See also vectordotdev/vector#5797

@Danyal-Faheem
Collaborator

Hey Bryan, sorry for the late response. There was a maintainership change in progress.

I've looked into the issue you mentioned and the PR you created (thank you so much for that, by the way).

I can see how making this change would allow finer-grained control over the analytics. However, we are also concerned about the performance impact it might have for existing users. I noticed you mentioned trying this out on a server with 1.2B rows; I would be very interested to know whether you have any timing metrics for that upgrade process.

@bryanlandia
Author

Hi @Danyal-Faheem, no problem, and sorry for my own delay! I do have some notes on that upgrade process and will try to get them to you in the next day or two.
