speed up date time parsing #19545

Merged
merged 23 commits into from
Jan 24, 2017

Conversation

shashi
Contributor

@shashi shashi commented Dec 9, 2016

This PR builds on @simonbyrne's #18000 and makes the (120x 😆) speedup available for custom date formats. It also reworks the DateFormat type so that it is parameterized enough for an efficient parsing function to be "generated". Incidentally, this also makes it easier to add new features to the parser.

before:

julia> @benchmark DateTime("2010-10-10T10:10:10.10")
BenchmarkTools.Trial: 
  memory estimate:  3.11 kb
  allocs estimate:  85
  --------------
  minimum time:     20.125 μs (0.00% GC)
  median time:      21.233 μs (0.00% GC)
  mean time:        21.818 μs (1.38% GC)
  maximum time:     3.128 ms (96.19% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

after:

julia> @benchmark DateTime("2010-10-10T10:10:10.10")
BenchmarkTools.Trial: 
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     164.320 ns (0.00% GC)
  median time:      168.183 ns (0.00% GC)
  mean time:        168.666 ns (0.00% GC)
  maximum time:     291.389 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     750
  time tolerance:   5.00%
  memory tolerance: 1.00%

Also works fast for custom date format strings.

julia> @benchmark DateTime("01-01-2010X53:23:12", dateformat"dd-mm-YXS:M:H")
BenchmarkTools.Trial: 
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     148.979 ns (0.00% GC)
  median time:      151.564 ns (0.00% GC)
  mean time:        153.853 ns (0.00% GC)
  maximum time:     327.571 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     818
  time tolerance:   5.00%
  memory tolerance: 1.00%

fixes #15888, #13644
supersedes #18000
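
For readers unfamiliar with the string macro used in the benchmark above, here is a minimal sketch of parsing with a prebuilt format. It assumes a current Julia release, where Dates is a standard library loaded with `using Dates` (at the time of this PR it lived in Base.Dates). The sample string and the `EU_FORMAT` name are illustrative and not taken from the PR.

```julia
using Dates

# dateformat"..." builds the DateFormat object when the macro is expanded,
# so the format string is not interpreted again at each parse.
const EU_FORMAT = dateformat"dd/mm/yyyy HH:MM"   # hypothetical name

dt = DateTime("24/01/2017 13:21", EU_FORMAT)

# A plain String format is also accepted; the DateFormat then has to be
# obtained at run time instead of being baked into the code.
dt_from_string = DateTime("24/01/2017 13:21", "dd/mm/yyyy HH:MM")
```

Both calls should yield the same DateTime; the prebuilt format is the variant the benchmarks in this description exercise.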

@shashi shashi requested review from nalimilan and quinnj December 9, 2016 10:50
@shashi shashi added the dates (Dates, times, and the Dates stdlib module) and performance (Must go faster) labels Dec 9, 2016
@shashi shashi requested a review from simonbyrne December 9, 2016 10:51

"""
DateTime(dt::AbstractString, df::DateFormat) -> DateTime

Construct a `DateTime` by parsing the `dt` date string following the pattern given in
the [`DateFormat`](@ref) object. Similar to
the [`DateFormat`](:func:`Dates.DateFormat`) object. Similar to
Contributor

bad rebase here

@@ -1009,6 +1009,8 @@ export
# dates
Date,
DateTime,
DateFormat,
@dateformat_str,
Contributor

should be added to the manual index if exported

Contributor Author

How do I do this?

Contributor

see contributing.md

Contributor Author

If you meant I need to add this to the manual, I have. I was wondering if there's some kind of list I need to update when you said "manual index"

Contributor

not talking about doc/manual, talking about doc/stdlib for the docstring. yes there is a list, again see contributing.md

Contributor Author

Ah, okay. Will do.

@StefanKarpinski
Member

Amazing work, @shashi! I would note that this still doesn't preclude having even more efficient custom code for the three or so common date formats.

@shashi
Contributor Author

shashi commented Dec 10, 2016

@StefanKarpinski that's right, and you don't even have to give them a name and add them to a type hierarchy. :) Just dispatch on typeof(dateformat"mycommonformat").
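
A minimal sketch of the dispatch idea in the comment above, assuming a current Julia release where each `dateformat"..."` literal has its own concrete DateFormat type. The names `ISODateFormat` and `parse_date` are hypothetical, and the fixed-width fast path assumes ASCII input of the exact shape yyyy-mm-dd.

```julia
using Dates

# The concrete type of one particular format; other format strings give other types.
const ISODateFormat = typeof(dateformat"yyyy-mm-dd")   # hypothetical name

# Specialised method: selected only when the format is exactly yyyy-mm-dd.
# Assumes a 10-character ASCII string such as "2017-01-24".
function parse_date(str::AbstractString, ::ISODateFormat)
    y = parse(Int, SubString(str, 1, 4))
    m = parse(Int, SubString(str, 6, 7))
    d = parse(Int, SubString(str, 9, 10))
    return Date(y, m, d)
end

# Generic fallback for every other DateFormat.
parse_date(str::AbstractString, df::DateFormat) = Date(str, df)

a = parse_date("2017-01-24", dateformat"yyyy-mm-dd")   # specialised method
b = parse_date("24.01.2017", dateformat"dd.mm.yyyy")   # generic fallback
```

No new named type or type hierarchy is needed; the format literal itself selects the method.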

@ViralBShah
Member

Can this be backported to 0.5?

@shashi
Contributor Author

shashi commented Dec 10, 2016

It breaks compatibility with the previous DateFormat type (presumably DateFormat was an implementation detail) and changes parsing behavior in some corner cases (see diff of tests)...

@@ -268,7 +275,7 @@ f = "y m d"
@test_throws ArgumentError Dates.Date(" 1 1 32",f)
@test_throws ArgumentError Dates.Date("# 1 1 32",f)
# can't find 1st space delimiter,s o fails
Contributor

comment no longer applies?

rata = xs[7] + 1000*(xs[6] + 60*xs[5] + 3600*xs[4] + 86400*Base.Dates.totaldays(xs[1],xs[2],xs[3]))
return DateTime(Base.Dates.UTM(rata))
end

Member

What's the advantage of having this duplicate constructor versus:

DateTime(t::NTuple{7,Int}) = DateTime(t[1], t[2], t[3], t[4], t[5], t[6], t[7])

Member

Current implementation will run into issues with integer overflow on 32-bit systems

Contributor Author

There was a cost to splatting so I added this, forgot to remove / define one with the other. Was going to do this, thanks!

The tests are failing on 32 bit machines because of overflow. I modified this method to convert the input tuple to Int32s but wasn't able to reproduce the failure. I'm wondering how tests passed in the previous implementation if the constructor is the problem. Do you have an idea of where exactly it breaks?

Contributor

Or just DateTime(t...)
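
The two alternatives discussed in this thread can be sketched side by side. This assumes a current Julia release with `using Dates`; the helper names are hypothetical, and the manual version converts to Int64 up front as one possible way to avoid the 32-bit overflow mentioned above (it is not necessarily the fix that was committed).

```julia
using Dates

# Option 1: forward the tuple to the ordinary seven-argument constructor.
datetime_from_tuple(t::NTuple{7,Int}) = DateTime(t...)

# Option 2: compute the millisecond count by hand, as in the snippet under review.
# On a 32-bit system Int is Int32, and 86400 * 1000 * totaldays(...) exceeds
# typemax(Int32), so every component is widened to Int64 first.
# Unlike option 1, this performs no range validation of the fields.
function datetime_from_tuple_manual(t::NTuple{7,Int})
    y, m, d, h, mi, s, ms = map(Int64, t)
    rata = ms + 1000 * (s + 60 * mi + 3600 * h + 86400 * Dates.totaldays(y, m, d))
    return DateTime(Dates.UTM(rata))
end

t = (2017, 1, 24, 13, 21, 0, 0)
x = datetime_from_tuple(t)
y = datetime_from_tuple_manual(t)
```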

return R(reorder_args(parts, fmt.field_order, fmt.field_defaults, err_idx)::NTuple{7,Int})

@label error
return R((err_idx,state,0,0,0,0,0), true)
Member

If you want to indicate null here you want false

Contributor Author

fixed, added a test

Contributor

Yes, this was a 0.6 change, so may give different results than 0.5

end
end

_create_timeobj(tup, T::Type{DateTime}) = T(tup)
Contributor

why not T(tup...)?

Contributor Author

I changed this in a later commit. This review business is a bit confusing. Thanks for reviewing!

@simonbyrne
Contributor

This looks pretty awesome. A few minor changes mostly pointed out by @omus.

@label error
return R((err_idx,state,0,0,0,0,0), true)
end
end
Contributor Author

@vtjnash do you know of a way to make this not generated, but still get the same performance?

Member

I know the classic way of doing it in a static language (hardcoding a jump table / switch / vtable), but afaik there isn't really a good way to express that optimization in Julia right now (other than by doing a generated function like this). fwiw, since this only uses @nexpr, it appears to be a completely valid generated function.
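
To make the generated-function approach concrete, here is a toy sketch that unrolls a parser over a tuple of tokens with the Base.Cartesian macros. It is not the code from this PR: the token types `FixedDigits` and `Delim` and the functions `tryparse_token` and `parse_all` are hypothetical, the input is assumed to be ASCII, and the syntax targets a current Julia release.

```julia
using Base.Cartesian: @nexprs, @ntuple

# Hypothetical token types: a fixed number of digits, or one literal character.
struct FixedDigits
    n::Int
end
struct Delim
    c::Char
end

# Each method returns (value, next_position, ok).
function tryparse_token(t::FixedDigits, s::String, pos::Int)
    val = 0
    for _ in 1:t.n
        pos <= ncodeunits(s) || return (0, pos, false)
        c = s[pos]
        isdigit(c) || return (0, pos, false)
        val = 10 * val + (c - '0')
        pos += 1
    end
    return (val, pos, true)
end
function tryparse_token(t::Delim, s::String, pos::Int)
    (pos <= ncodeunits(s) && s[pos] == t.c) || return (0, pos, false)
    return (0, pos + 1, true)
end

# The generated body contains one statically dispatched call per tuple slot,
# instead of a loop over a heterogeneous collection.
@generated function parse_all(tokens::T, s::String) where {T<:Tuple}
    N = fieldcount(T)
    quote
        pos = 1
        @nexprs $N i -> begin
            (val_i, pos, ok) = tryparse_token(tokens[i], s, pos)
            ok || return nothing
        end
        return @ntuple $N val
    end
end

tokens = (FixedDigits(4), Delim('-'), FixedDigits(2), Delim('-'), FixedDigits(2))
good = parse_all(tokens, "2017-01-24")
bad = parse_all(tokens, "2017/01/24")
```

Because the number and types of the tokens are known from the tuple type, the compiler sees straight-line code with concrete call targets, which is the effect a hand-written jump table would give in a static language.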

@omus
Member

omus commented Dec 12, 2016

I am concerned about these changes as they break support for custom parser slots which don't return integers. Specifically, these new changes won't work with parsing TimeZone types.

is used. This object is passed as the last argument to
`tryparsenext` and `format` defined for each `AbstractDateToken` type.
"""
immutable DateLocale{lang} end
Member

I strongly recommend against baking lang into a type parameter. This will actually make the API more awkward to extend than presenting it to the user as a dict.

Contributor Author

Okay. I can make this type hold two dictionaries.

Contributor

We may want to think about how this can be generalised to other locale-based print functions (e.g. decimal separators), though that may be better in a different PR.
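
As an illustration of the dictionary-based design suggested in this thread, the sketch below registers a locale the way current Julia releases of the Dates standard library allow. It shows the API as it exists today, which may differ from the state of the code at this point in the review; the French name lists are sample data.

```julia
using Dates

french_months = ["janvier", "février", "mars", "avril", "mai", "juin",
                 "juillet", "août", "septembre", "octobre", "novembre", "décembre"]
french_months_abbrev = ["janv", "févr", "mars", "avril", "mai", "juin",
                        "juil", "août", "sept", "oct", "nov", "déc"]
french_days = ["lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche"]

# A locale is an ordinary value stored in a global dictionary keyed by name,
# not a type parameter.
Dates.LOCALES["french"] = Dates.DateLocale(french_months, french_months_abbrev,
                                           french_days, [""])

d = Date(2017, 1, 24)
month_fr = Dates.monthname(d; locale="french")
day_fr = Dates.dayname(d; locale="french")
parsed = Date("24 janvier 2017", DateFormat("d U y", "french"))
```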

@simonbyrne
Contributor

@nanosoldier runbenchmarks("dates", vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@davidanthoff
Contributor

Might it make sense to do #16928 as part of this as well?

@omus
Member

omus commented Dec 14, 2016

I'm refactoring @shashi's code to clean it up and maybe find some additional performance tweaks (I've found a couple). I'll be updating #19519 with this work unless someone feels I should open a new PR and close #19519.

@ViralBShah
Member

I don't see any reason to open a new PR.

@simonbyrne
Contributor

Interesting: I wonder why the printing is now slower?

@shashi
Contributor Author

shashi commented Dec 15, 2016

@simonbyrne I did not benchmark printing speeds, but had to redo printing to accommodate the new DateFormat. There should be easy speedups available.

@JeffBezanson
Member

This is some excellent work. Real julia wizardry on display here!

@shashi
Contributor Author

shashi commented Dec 15, 2016

Thanks @JeffBezanson 😄

@shashi
Contributor Author

shashi commented Dec 15, 2016

@omus I have merged your changes in here with the required conflict resolutions.

Do pull them, and let me know when you have more updates so that I can pull them into this PR.

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@tkelman tkelman added this to the 0.6.0 milestone Jan 19, 2017
@tkelman
Contributor

tkelman commented Jan 19, 2017

The regressions are real, but we should get this in before feature freeze. @omus @shashi have you looked into fixing them?

@shashi
Contributor Author

shashi commented Jan 20, 2017

I tried reverting the change. On my machine there's barely any change. I don't know how #19545 (comment) gave a green tick for these benchmarks.

@tkelman
Contributor

tkelman commented Jan 20, 2017

I'm not sure, but it's possible this might be hitting problems related to #20025 ?

@tkelman
Contributor

tkelman commented Jan 20, 2017

That good set of results was just before we merged the llvm 3.9 upgrade, so newer llvm changes the performance characteristics of the patches here w.r.t. those benchmarks.

@StefanKarpinski
Member

Wait, so what's the status here? The string change is so kickass that it makes these speed ups irrelevant?

@shashi
Contributor Author

shashi commented Jan 21, 2017

@StefanKarpinski No, there's a regression in converting DateTime to string which I'm trying to track down. Parsing is still much much faster with this.

dateformat"YYYY-mm-dd\THH:MM:SS"
else
dateformat"YYYY-mm-dd\THH:MM:SS.s"
end
Contributor Author

@omus @quinnj is it okay if we always use "YYYY-mm-dd\THH:MM:SS.s" here? This type instability adds to the printing slow down. Users wanting to ignore milliseconds can specify dateformat"YYYY-mm-dd\THH:MM:SS"

Member

Fine by me.

Contributor Author

Oh, wait. This is not really a problem...

Member

Glad it's not a problem. If we were to change this I think I would prefer "YYYY-mm-dd\THH:MM:SS.sss"

@shashi
Contributor Author

shashi commented Jan 23, 2017

@nanosoldier runbenchmarks("dates", vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@quinnj
Member

quinnj commented Jan 23, 2017

DateTime printing looks good! Do we just need to apply the same changes to Date?

write(io, str)
else
l = endof(str)
write(io, SubString(str,l-(n-1),l))
end
Member

Should probably save endof(str) before the if.

if millisecond(dt) == 0
format(dt, dateformat"YYYY-mm-dd\THH:MM:SS", 24)
else
format(dt, dateformat"YYYY-mm-dd\THH:MM:SS.s", 26)
Contributor

are milliseconds in a DateTime only ever a single digit?

Contributor Author

No, s stands for any number of digits: s gives .5, .05, or .005, while sss would give .500 or .050.
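
A short sketch of the difference between the one-character and three-character millisecond codes, assuming the formatting behaviour of the Dates standard library in current Julia releases. The sample values are illustrative.

```julia
using Dates

half_second = DateTime(2017, 1, 24, 13, 21, 0, 500)   # 500 ms
fifty_ms = DateTime(2017, 1, 24, 13, 21, 0, 50)       # 50 ms

# With a single s, trailing zeros of the fraction are dropped.
short_500 = Dates.format(half_second, dateformat"HH:MM:SS.s")
short_050 = Dates.format(fifty_ms, dateformat"HH:MM:SS.s")

# With sss, the fraction is always padded to three digits.
padded_500 = Dates.format(half_second, dateformat"HH:MM:SS.sss")
padded_050 = Dates.format(fifty_ms, dateformat"HH:MM:SS.sss")
```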

Contributor

so is the 26 a minimum?

Contributor Author

No, it's usually 23 chars. I added 3 more bytes to keep this fast even when the year is -10000 ;) I'm not sure this really matters, though: doesn't Julia allocate the buffer backing an array in multiples of 16 bytes or something?

Contributor

Not sure. As long as this is just a rough-guess buffer preallocation and things work fine if the actual output is longer or shorter then I'll just go with whatever the benchmarks say.
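
The buffer-size argument discussed here is only a preallocation hint. A rough sketch of the same idea with an explicit IOBuffer follows, assuming a current Julia release; `iso_string` is a hypothetical helper, and it relies on the `Dates.format(io, dt, fmt)` method, which is internal rather than documented API.

```julia
using Dates

# Format into a buffer sized by a rough guess. The output may be longer or
# shorter than the hint; the buffer grows as needed.
function iso_string(dt::DateTime)
    io = IOBuffer(sizehint=26)   # 23 characters is typical, plus some slack
    Dates.format(io, dt, dateformat"yyyy-mm-dd\THH:MM:SS.sss")
    return String(take!(io))
end

s = iso_string(DateTime(2017, 1, 24, 13, 21, 0, 5))
```

Whether a hint of 24 or 26 makes a measurable difference depends on the allocator, so the benchmarks remain the deciding factor, as noted above.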

@shashi shashi force-pushed the sh/fast_date_parsing branch from 51a7cbc to cbe5577 Compare January 24, 2017 06:26
@shashi
Contributor Author

shashi commented Jan 24, 2017

Date formatting (to arbitrary formats) is 5x faster than before. So, in a bit of a cop-out, I changed string(::Date) to be the same as before. I guess the speedup mainly comes from not allocating an IOBuffer object.

@nanosoldier runbenchmarks("dates", vs = ":master")

end
end

function Base.string(dt::Date)
Contributor

do you want to delete the now-duplicate one at line 46 then?

Contributor Author

Oh, yes, I overlooked that when resolving merge conflicts.

@shashi shashi force-pushed the sh/fast_date_parsing branch from cbe5577 to 2a86b75 Compare January 24, 2017 08:00
@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@shashi shashi merged commit 947347f into master Jan 24, 2017
format the `tok` token from `dt` and write it to `io`. The formatting can
be based on `locale`.

all subtypes of `AbstractDateToken` must define this method in order
Contributor

First word in a sentence should be capitalized

@tkelman tkelman deleted the sh/fast_date_parsing branch January 24, 2017 13:21
omus added a commit to JuliaTime/TimeZones.jl that referenced this pull request Mar 16, 2017
Performance improvements introduced in Base.Dates no longer allow for
TimeZones to extend the functionality.

JuliaLang/julia#19545
Successfully merging this pull request may close these issues.

slow DateTime parsing