Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TimerWheel usually ignores Timers execution if tickDuration is more than 1ns #73

Closed
Spikhalskiy opened this issue Sep 1, 2016 · 13 comments

Comments

@Spikhalskiy
Copy link
Contributor

Spikhalskiy commented Sep 1, 2016

Looks like currently TimerWheel is effectively broken.

Test:

int TICK_PERIOD_NS = 10;
int TIMEOUT_NS = 12;
controlTimestamp = 0;

final AtomicLong firedTimestamp = new AtomicLong(-1);
final TimerWheel wheel = new TimerWheel(this::getControlTimestamp, 12, TimeUnit.NANOSECONDS, 8);
final Runnable task = () -> firedTimestamp.set(wheel.clock().nanoTime());

final TimerWheel.Timer timer1 = wheel.newTimeout(TIMEOUT_NS, TimeUnit.NANOSECONDS, task);

assertEquals("Nothing expired on 10 timestamp", 0, wheel.expireTimers()); //which is fine
controlTimestamp = 10;
assertEquals("Nothing expired on 10 timestamp", 0, wheel.expireTimers()); //mmm... 12 should be here
controlTimestamp = 20;
assertEquals("Nothing expired on 20 timestamp... it's already definitely an error", 0, wheel.expireTimers());

This test pass. So... tick is 10ns. We schedule execution on 12ns timestamp. Nothing executed on 10ns, nothing executed on 20ns. You can put any value in TICK_PERIOD_NS which % TICK_PERIOD_NS != 0.

Reason:

TimerWheel#

if (0 >= timer.remainingRounds)
{
    timer.remove();
    timer.state = TimerState.EXPIRED;
    if (now >= timer.deadline)
    {
        ++timersExpired;
        timer.task.run();
    }
}
else
{
    timer.remainingRounds--;
}
    final long calculatedIndex = deadline / tickDurationInNs;

So... we schedule Timer to tick number in the "floor" div mode and after that we usually get now <= timer.deadline() if now is around right tick time. As a result - we mark Timer as Expired and do nothing + remove it with timer.remove().

Proposals:

  1. Main one. Stop double check deadline value. In any inaccuracy of external scheduler that makes a call of expireTimers() a bit earlier - Timers execution will be postponed additionally for ticksPerWheel * tickDurationNs. So deadline should be used only to determine tickIndex and for nothing more.

WheelTimer already depends on the external scheduler accuracy, because it fully depends on triggering ticks more or less on right intervals outside. So, it should be fine to check only remainingRounds on current tick.

if (0 >= timer.remainingRounds)
{
    timer.remove();
    timer.state = TimerState.EXPIRED;
    ++timersExpired;
    timer.task.run();
}
else
{
    timer.remainingRounds--;
}
  1. Now we schedule to ticks by division in "floor" mode. For current implementation with deadline, scheduling would be safer in the "ceiling" mode. But, after implementing 1), we should assign the nearest tick as it should be done according to the idea of HashedWheelTimer.
//IntMath is a guava class, just to reference what do I mean
final long calculatedIndex = IntMath.divide(deadline, tickDurationInNs, RoundingMode.HALF_DOWN);

*) Another approach to fix design - stop doing currentTick++; on each expireTimers() invocation and rework tick triggering logic. So, when we trigger expireTimers() outside - we calculate how much ticks passed since previous execution using clock and expire-run all Timers in passed ticks. This is even better, because we stop to be dependent on accuracy of external scheduler not in terms of frequency, but in terms of accuracy and things like STW pauses which could cause ticks loosing.

And one more note about this code:
now >= timer.deadline
is fine if we work with milliseconds. But incorrect for nanoseconds because of long overflow. The only correct way to compare nanosecond timestamps is: now - timer.deadline >= 0

@Spikhalskiy Spikhalskiy changed the title TimerWheel usually postpones Timers execution if tickDuration is more than 1ns TimerWheel usually ignores Timers execution if tickDuration is more than 1ns Sep 1, 2016
@tmontgomery
Copy link
Contributor

This is currently not something we use and therefore not something we intend to address. Feel free to send a PR, though. But it is not something we currently support.

Originally the design of TimerWheel assumed a tickDuration of at least millisecond for the use case. Also, it assumes you don't schedule too close to the current time.

@Spikhalskiy
Copy link
Contributor Author

Originally the design of TimerWheel assumed a tickDuration of at least millisecond for the use case.

Implementation is broken with any values > 1 nanosecond. So, for a millisecond it's broken too.

too close to the current time.

Unrelated. Tick is 10ms, schedule timeout for 69ms - it would be broken too according to this code.
deadline would be 69000ns, it will go to tick 60000 - it would be ignored according to description above

@Spikhalskiy
Copy link
Contributor Author

Spikhalskiy commented Sep 1, 2016

The statement, that you have code in a pretty new library that you don't use and support looks very strange.

Before using a class from Agrona I should ask if this class you use and support?
Or Agrona is not a general-purpose library and just nobody should use it as a common library, because it could be unsafe, because this class maybe abandoned?

I will think about PR :)

@tmontgomery
Copy link
Contributor

Good point. We shall look into getting rid of TimerWheel.

@Spikhalskiy
Copy link
Contributor Author

Removing functionality from general-purpose project released with description "High Performance data structures and utility methods for Java and C++" only because somebody, even if it's parent project, doesn't use this it anymore? Cool general purpose library. Ok, I got how do you feel about Agrona, it's unsafe to use in external projects.

@tmontgomery
Copy link
Contributor

As with ANY library or piece of OSS code, you use at your own discretion. In addition, the support of any OSS code is up to the author(s).

This work has graciously been sponsored to produce a set of OSS projects. We have moved those pieces that are generic into this project for people to use as they desire without using the entire larger project.

TimerWheel, specifically, we stopped using because we have a better means to handle timers. But we kept it around for others to use here. As we don't have the time, resources, nor interest in the support of the current TimerWheel, there might be no reason to keep it around. However, I have asked some users if they require it. If so, we will support it on our own time.

@Spikhalskiy
Copy link
Contributor Author

Spikhalskiy commented Sep 1, 2016

The question is not about licenses and definitions, it's about attitude.

It's fine to don't develop some piece of general purpose libraries you don't interested in right now, but leave things in just a broken state, or respond with meanings "we don't use, nobody cares" is not fine for general purpose library and public APIs.

Yes, you are fine with the formal point, but it's not safe to use a library with such attitude of maintainers - nothing more.

About PRs - of course, no sense to contribute to general purpose library with such an attitude. There are others.

@mjpt777
Copy link
Contributor

mjpt777 commented Sep 1, 2016

It was suggested some time ago that the TimerWheel be migrated to a project such as JCTools. I think it is the only class in here that is not being actively used. Even so I put in the time to review and integrate your PR this morning. I also did it with a quick turn around. How you choose to see the world I cannot have any influence over.

We do care. We also have to be practical. If no one is using it that wants to help contribute then who should support it? I'm willing to put in some time but there is only so much I can give.

@Spikhalskiy
Copy link
Contributor Author

Spikhalskiy commented Sep 1, 2016

@mjpt777 Quick review, feedback and merging PR is beautiful, thanks for this! It was a first positive opinion.

On the same time, a reaction like "we don't use this - we don't support this, even if it's broken" above for general purpose library looks pretty strange for me - that's all that I'm talking about.

If no one is using it that wants to help contribute then who should support it?

For example, I use this class and I want to contribute because I have a problem. I already made some new stuff on the morning. And I can prepare a fix for the current problem too. So I spent time to describe it in issue description and put it in a review for maintainers.

What I'm waiting for is something like

  1. 1 approach looks fine, sorry, don't have time to take look into, we love contributions
  2. 2 approach looks more solid, could you been interested in preparing PR
  3. Oh, I will take care of it

So, I'm waiting for some feedback, if it actually looks like a bug, or I just don't understand something. If I propose different ways of solving the problem - I'm waiting for some recommendations, which is a preferable way from maintainers point of view. Or right way is completely different.

Instead, I see:

This is currently not something we use and therefore not something we intend to address. Feel free to send a PR, though. But it is not something we currently support.

This answer is really a "Too long to read, nobody cares here, you can prepare PR as much as you want".

So, an answer to your question - everyone should support this who use this. And I don't require you to come and fix this right now because something is broken. And I didn't talk about it. I supposed to prepare fix for this issue and just waited some adequate advice of maintainer, not "TL;DR We don't use it - we don't support this".

Hope, I described how I see the world. And in the most open-source projects, I contributed in, my view of the world matches to culture and attitudes in this projects.

About subject:
From this JCTool discussion looks like people use it and it's the "main" implementation now, so it's worth to support it or yes, move it somewhere where people will support it.

@tmontgomery
Copy link
Contributor

tmontgomery commented Sep 1, 2016

@Spikhalskiy my apologies for phrasing things incorrectly initially. We do care. However, TimerWheel is not something we do use for various reasons. And, unfortunately, not something we can devote a lot of time to. As you submitted an issue instead of a PR for this, I was under the impression you wanted us to address it. Now that you have explained your meaning, it is clearer. Thanks!

I think it is worth explaining what is not in the Javadoc for TimerWheel to frame the discussion on usage.

The intended initial usage of TimerWheel (which, again, we don't use) was to repeatedly call expireTimers() guarded by computeDelayInMs() as part of a duty cycle which is either busy spinning or parking for very very short durations. Somewhat like the unit tests. Which is very different than how you call expireTimers in your test. The resolution of the timers was supposed to be well within this cycle time and thus a scenario as you mention was unlikely under most usage. The check on deadline was for more accuracy for multiple timers and because it was assumed that the expire would be called many times within a tick. This still has issues and is flawed, but was a tradeoff for the usage that was in place originally.

There are many ways to address this as you mention. Removing the deadline check is fine if timer ordering is not a concern. But insufficient.

The expiration logic is quite flawed. If I had the time, I would completely rewrite this with a slightly different approach. As I do not, the best way to go is to migrate this out of Agrona unless its usage is required.

@Spikhalskiy
Copy link
Contributor Author

Spikhalskiy commented Sep 1, 2016

@tmontgomery Thanks. Apologies for too sensitive reaction from my side too.

The intended initial usage of TimerWheel (which, again, we don't use) was to repeatedly call expireTimers() as part of a duty cycle which is either busy spinning or parking for very very short durations.

Maybe it was initial usage, but it doesn't looks right with the current code. We have a tickDurationNs. Each call to expireTimers() we increase current tick pointer. It looks like we should call expireTimers() once per tickDurationNs nanosecond. If we do it another way - ticks and putting Timer to specific tick in array has no sense and works not according to the idea, so position of task in array has no sense because we would just very quickly check whole array.

In any case - failing test for this scenario:

    @Test
    public void fail()
    {
        long ACTUALLY_TRIGGERING_EACH_MS = 1;

        controlTimestamp = 0;
        final AtomicLong firedTimestamp = new AtomicLong(-1);
        final TimerWheel wheel = new TimerWheel(this::getControlTimestamp, 5, TimeUnit.MILLISECONDS, 8);
        final Runnable task = () -> firedTimestamp.set(wheel.clock().nanoTime());

        wheel.newTimeout(11, TimeUnit.MILLISECONDS, task);

        for (int i = 0; i < 1000; i++) {
            controlTimestamp += TimeUnit.MILLISECONDS.toNanos(ACTUALLY_TRIGGERING_EACH_MS);
            wheel.expireTimers();
        }
        assertTrue(firedTimestamp.get() > 0);
    }

One more test that could help to figure out bug that I'm talking about (tests below are with reasonable calling expireTimers() only once per tickDurationNs):

    @Test
    public void fail1()
    {
        long ACTUALLY_TRIGGERING_EACH_MS = 1800; //2000 - would be OK

        controlTimestamp = 0;
        final AtomicLong firedTimestamp = new AtomicLong(-1);
        final TimerWheel wheel = new TimerWheel(this::getControlTimestamp, 2, TimeUnit.SECONDS, 8);
        final Runnable task = () -> firedTimestamp.set(wheel.clock().nanoTime());

        wheel.newTimeout(11, TimeUnit.SECONDS, task);

        for (int i = 0; i < 10000; i++) {
            controlTimestamp += TimeUnit.MILLISECONDS.toNanos(ACTUALLY_TRIGGERING_EACH_MS);
            wheel.expireTimers();
        }
        assertTrue(firedTimestamp.get() > 0);
    }

If we will trigger timer each 1800ms instead of 2s - task would never be executed at all. Just never.

Triggering of timer a bit earlier at least once could leave to completely loosing task. Take a look, triggering expireTimers() 1ns before right time (in terms of our NanoClock time) leads to completely task loosing:

    @Test
    public void fail2()
    {
        long TRIGGERING_MS = 2000;

        controlTimestamp = 0;
        final AtomicLong firedTimestamp = new AtomicLong(-1);
        final TimerWheel wheel = new TimerWheel(this::getControlTimestamp, TRIGGERING_MS, TimeUnit.MILLISECONDS, 8);
        final Runnable task = () -> firedTimestamp.set(wheel.clock().nanoTime());

        controlTimestamp += TimeUnit.MILLISECONDS.toNanos(TRIGGERING_MS) - 1; //1 ns
        wheel.newTimeout(TRIGGERING_MS, TimeUnit.MILLISECONDS, task);
        wheel.expireTimers();

        controlTimestamp += TimeUnit.MILLISECONDS.toNanos(TRIGGERING_MS) - 1; //1 ns
        wheel.expireTimers();

        for (int i = 0; i < 10000; i++) {
            controlTimestamp += TimeUnit.MILLISECONDS.toNanos(TRIGGERING_MS);
            wheel.expireTimers();
        }

        assertTrue(firedTimestamp.get() > 0);
    }

The source of all of this is the mistake in design, that we mix information about time in a discreet form:

  1. currentTime = numberOfExpiredTimersCalls() * tickDuration - we actually use this time when we schedule task and put it into the tick array
  2. and another time source - we have time that we get from NanoClock.
    Plus, we never synchronize them between each other. So, as a result of GC we could get different times in this virtual sources, depends on expireTimers() caller behaviour.

I would take this implementation and will make something based on NanoClock only as a time source or based on expireTimers() calls count only. Would send PR when it would be done, maybe you will want to adopt it here. If not - would just keep in my project.

@tmontgomery
Copy link
Contributor

Yes, calling expireTimers() before the computeDelayInMs() is, I believe, the source of the main problem. i.e. calling expire too early.

This is why the usage guarded the call. The original usage in the busy spin was (going from memory) of the form:

if (computeDelayInMs() <= 0)
{
    expireTimers();
}

The intention was to only call expireTimers() after the tick. Which is a flawed idea anyway.

The test shouldHandleExpiringTimersInPreviousTicks was to test going past. But not before the tick.

@Spikhalskiy
Copy link
Contributor Author

Put the class here https://github.com/Spikhalskiy/hashed-wheel-timer, simplified and fixed unsafe stuff, time/tick managing design. Would continue development in this repo, but it should be very close to the original version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants