-
-
Notifications
You must be signed in to change notification settings - Fork 681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added datetime support #1396
Added datetime support #1396
Conversation
c8a066d
to
d6f27bc
Compare
Questions: @fulmicoton
|
Codecov Report
@@ Coverage Diff @@
## main #1396 +/- ##
==========================================
- Coverage 94.23% 94.22% -0.01%
==========================================
Files 236 237 +1
Lines 43685 43939 +254
==========================================
+ Hits 41165 41400 +235
- Misses 2520 2539 +19
Continue to review full report at Codecov.
|
Yes if it is possible that would be great. If it is not possible let me know.
If possible, I'd keep DateTime without timezone in tantivy and have a notion of timezone in quickwit... but I might not have thought this through so let me know what would be the obstacle. (Query parsing I assume?) |
We can leave timezone thing for quickwit, this just means we will let the Tantivy users know that when a timezone is not specified in the input, we assume UTC. I think that's also good for a library |
8a07c12
to
4b1ce74
Compare
examples/date_time_field.rs
Outdated
fn main() -> tantivy::Result<()> { | ||
// # Defining the schema | ||
let mut schema_builder = Schema::builder(); | ||
let mut date_formats = HashSet::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
surely this needs to be a ordered. HashSet is a very bad idea here.
src/schema/field_type.rs
Outdated
@@ -58,13 +65,14 @@ pub enum Type { | |||
Json = b'j', | |||
} | |||
|
|||
const ALL_TYPES: [Type; 9] = [ | |||
const ALL_TYPES: [Type; 10] = [ | |||
Type::Str, | |||
Type::U64, | |||
Type::I64, | |||
Type::F64, | |||
Type::Bool, | |||
Type::Date, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old Date type was actually a DateTime
(yes the name sucks).
Is the plan to add DateTime and deprecate the older Date, or to replace right away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So all these two types DateTime
and PreciseDateTime
exist because my plan is to keep Type::Date working with a deprecated label. If we can drop the old DateTime type and accept the breaking change let me know your opinion.
src/schema/date_time_options.rs
Outdated
|
||
// Accepted input format for parsing input as DateTime. | ||
#[serde(default = "default_input_formats")] | ||
input_formats: HashSet<DateTimeFormat>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cannot be an HashSet. Order matters here.
src/schema/date_time.rs
Outdated
} | ||
|
||
impl fmt::Debug for PreciseDateTime { | ||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { | |
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { |
src/schema/date_time.rs
Outdated
OffsetDateTime::from_unix_timestamp_nanos(timestamp as i128) | ||
} | ||
} | ||
.map_err(|error| error.to_string()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.map_err(|error| error.to_string()) | |
.map_err(ToString::to_string) |
src/schema/date_time.rs
Outdated
} | ||
|
||
/// Convert to UTC `OffsetDateTime` | ||
pub fn into_utc(self) -> OffsetDateTime { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea
src/schema/date_time.rs
Outdated
} | ||
|
||
/// Returns the underlying timestamp. | ||
pub fn get_timestamp(&self) -> i64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name does not make it clear that this is with the given precision...
get_truncated_timestamp
get_inner_timestamp
...
src/schema/date_time.rs
Outdated
/// Converts to the underlying timestamp. | ||
/// TODO: kept for backward compatibility, should we remove it? | ||
pub const fn into_timestamp(self) -> i64 { | ||
let Self { timestamp, .. } = self; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.timestamp would have worked too
src/schema/date_time.rs
Outdated
} | ||
|
||
impl DateTime { | ||
/// Create new instance from UNIX timestamp. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Create new instance from UNIX timestamp. | |
/// Create new instance from UNIX timestamp (in seconds). |
(I agree UNIX timestamp generally suggests seconds, but we might as well be explicit here)
src/schema/date_time_options.rs
Outdated
input_formats: HashSet<DateTimeFormat>, | ||
|
||
// Internal storage precision, used to avoid storing | ||
// very large numbers when not needed. This optimizes compression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it apply to
- indexing
- stored
- fast
- index sorting
or only - fast
or only
-fast
-indexing?
It is not an obvious question and it should be detailed here.
src/schema/date_time_options.rs
Outdated
|
||
impl From<DateTimeOptions> for DateTimeParsersHolder { | ||
fn from(opts: DateTimeOptions) -> Self { | ||
let mut string_parsers: VecDeque<StringDateTimeParser> = VecDeque::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not Vec?
src/schema/date_time_options.rs
Outdated
for (index, parser) in self.string_parsers.iter().enumerate() { | ||
if let Ok(date_time) = parser(&value) { | ||
// Move successful parser in front of queue. | ||
self.string_parsers.swap(0, index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see...
I think Vec has swap too.
I am not sure this is a good idea.
So with the HashSet defintion, the output of quickwit right now is not really deterministic at compile time.
Date format can be ambiguous ( think about MM-DD-YYY vs DD-MM-YYYY and 01/03/2021).
Which means that our output depends on the order of our date formats.
By using a HashSet, you make the output of quickwit dependent on the implementation of hashset.
(A change in the rust standard library fro instance, would break unit tests.)
We need to stay away from HashSet.
I would also stay away from the reordering logic, but I wonder...
So it has two functions I believe:
- a) performance.
- b) best effort to deal with the ambiguity described above.
a) can be partially addressed by extracting ez to check predicate for each format.
For instance on the date string length.
- If the date is not exactly 10chrs. skip format XXXXX.
b) is tricky. With your solution we can "eventually" find the right format when we reach a date that is unambiguous (e.g. 24/02/2021). On the other hand, I think we should give all the tools to the user to tell use the right thing. I'd rather not do the swapping logic and just rely on the priority logic.
src/schema/date_time_options.rs
Outdated
#[derive(Clone)] | ||
pub struct DateTimeParsersHolder { | ||
/// Functions for parsing string input as datetime. | ||
string_parsers: VecDeque<StringDateTimeParser>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vec?
src/schema/date_time.rs
Outdated
pub DateTime, | ||
); | ||
|
||
impl Deref for PreciseDateTime { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the inner DateTime is incorrect right? The precision here is arbitrary.
We don't want to expose it.
src/schema/date_time.rs
Outdated
// It only exists to provide support the underlying fast | ||
// field value for both `[Value::Date]` and `[Value::DateTime]` variants. | ||
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)] | ||
pub struct PreciseDateTime( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to TruncatedTimestamp maybe.
src/schema/date_time.rs
Outdated
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { | ||
f.debug_struct("DateTime") | ||
.field("timestamp", &self.0.timestamp) | ||
.field("precision", &self.0.precision) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the precision is wrong isn't it? Also we don't want to print DateTime
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Precision is always correct, whether you use DateTime::from_unix_timestamp(u64)
or DateTime::from_utc_with_precision(utc_dt, precison)
constructor. Maybe the names should be:
- DateTime -> DateTimeWithSecondPrecision (old DateTime)
- PreciseDateTime -> DateTimeWithCustomPrecision (for new fast field)
I don't understand the logic around PreciseDateTime. In the FastField we store truncated values I assume? (e.g. a milliseconds precision timestamp for instance). |
So PreciseDateTime is the new date time with actual precision, currently, DateTime API was enhanced to conform to the old DateTime that's why I kept the old API and added few things. This allowed using DateTime like before and not breaking existing code. PreciseDateTime is the one that actually makes use of the new API. As I said in the comment it only exist because I could implement both fastfield types for the same DateTime. I don't want to rip something without us agreeing so DateTime behaves as the way it use to with a precison -> seconds and the timestamp is second. when using the DateTime field, precison can change to millis or ... . |
@evanxg852000 I don't see how this is possible... In the FastValue of PreciseDateTime, from assumes that we have seconds in the fast field, and the conversion to a fastvalue is taking whatever truncated value we had. |
src/indexer/json_term_writer.rs
Outdated
@@ -325,7 +327,7 @@ impl<'a> JsonTermWriter<'a> { | |||
self.close_path_and_set_type(T::to_type()); | |||
self.term_buffer | |||
.as_mut() | |||
.extend_from_slice(val.to_u64().to_be_bytes().as_slice()); | |||
.extend_from_slice(val.to_u64(None).to_be_bytes().as_slice()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fulmicoton I am not applying the precision here to DateTime. my reasoning is that this is all indexing though we are pushing fast values. Am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great question!
FastValue
is a poorly named trait.
It is not only used for fast values, it is also used for the inverted index.
In reality it offers a strictly monotonic mapping from the type to U64 (and its reciprocal).
We could be named MonoMapU64
or something like that.
Can you fill a ticket to do this rename?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back to your question... in the json_term_writer
we should apply some truncation on dates.
I suggest we always truncate with second precision?
We should probably truncate to seconds always for regular (=not Json) date fields too; and only use the precision setting for fast fields.
Besides this is the current behavior of tantivy in a sense.
src/lib.rs
Outdated
} | ||
|
||
/// Convert to UNIX timestamp in microseconds. | ||
pub const fn into_timestamp_nanos(self) -> i64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we should ditch the nanos one. It could suggest we have nanos precision.
src/lib.rs
Outdated
/// Create new from UNIX timestamp in nanoseconds. | ||
pub const fn from_timestamp_nanos(nanoseconds: i128) -> Self { | ||
Self { | ||
timestamp_micros: (nanoseconds as i64 / 1_000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should divide into an i128
first and then convert to i64
. See for yourself. Can you add the corresponding unit test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inlined it and added a test for it.
src/lib.rs
Outdated
} | ||
|
||
/// Create new from UNIX timestamp in nanoseconds. | ||
pub const fn from_timestamp_nanos(nanoseconds: i128) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove that one (and inline the implem in from_utc
)
src/fastfield/mod.rs
Outdated
@@ -62,7 +63,9 @@ pub trait FastValue: Clone + Copy + Send + Sync + PartialOrd + 'static { | |||
/// Converts a value to u64. | |||
/// | |||
/// Internally all fast field values are encoded as u64. | |||
fn to_u64(&self) -> u64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand you use the field type for truncation on datetime, but it seems too convoluted. Let's not change the to_u64
.
Instead, let's put the truncation at the fast field writer level.
src/fastfield/writer.rs
Outdated
@@ -305,7 +335,7 @@ impl IntFastFieldWriter { | |||
pub fn add_document(&mut self, doc: &Document) { | |||
match doc.get_first(self.field) { | |||
Some(v) => { | |||
self.add_val(super::value_to_u64(v)); | |||
self.add_val(super::value_to_u64(v, &self.field_type)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can put the truncation logic here.
let mut val =
super::value_to_u64(v, self.field_type);
if let Some(precision) = self.precision_opt {
val = precision.truncate(val);
}
What's a bit lame is that we will have to do it elsewhere for indexing.
src/indexer/segment_writer.rs
Outdated
@@ -248,7 +248,7 @@ impl SegmentWriter { | |||
FieldType::Date(_) => { | |||
for value in values { | |||
let date_val = value.as_date().ok_or_else(make_schema_error)?; | |||
term_buffer.set_u64(date_val.to_u64()); | |||
term_buffer.set_u64(date_val.to_u64(None)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See discussion in json_term_writer.
We want some truncation here.
I suggest we truncate to the second all of the time.
|
||
/// DateTime Precision | ||
#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)] | ||
pub enum DatePrecision { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"precision": "Milliseconds"
does not look great in a JSON.
`"precision": "milliseconds" would look better IMHO
can you add attriubte to make the serde serialization look nicer?
(serde lower should do the job.)
src/collector/histogram_collector.rs
Outdated
@@ -72,7 +72,7 @@ impl HistogramComputer { | |||
return; | |||
} | |||
let delta = value - self.min_value; | |||
let delta_u64 = delta.to_u64(); | |||
let delta_u64 = delta.to_u64(None); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove that line anyway I think.
delta_u64 = delta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments inline
506197e
to
22df8d6
Compare
src/schema/date_time_options.rs
Outdated
|
||
impl Default for DateOptions { | ||
fn default() -> Self { | ||
Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have clippy warnings here. Either use derive(Default) or add an annotation to make sure Clippy keeps quiet.
I'm ok with both solutions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used Default, shorter & the recommended Clippy way
a5929b4
to
b4a80b0
Compare
… field precision configurable
e61a59f
to
0035969
Compare
Good job! |
Thanks for this! Any chance of a release with this included soon? 🙏 |
@twe4ked The next release will be a breaking release, so we try to add more stuff into it. Maybe a release in around 1 month |
Implements the new DateTime field supporting feature in Tantivy.
Closes #1408