APScheduler 4.0 progress tracking #465

Open
15 of 20 tasks
agronholm opened this issue Sep 29, 2020 · 190 comments

@agronholm
Owner

agronholm commented Sep 29, 2020

I'm opening this issue as an easy way for interested parties to track development progress of the next major APScheduler release (v4.0).

Terminology changes in v4.0

The old term of "Job", as it was, is gone, replaced by the following concepts which are closer to the terminology used by Celery:

  • Task definition: a uniquely named callable coupled with configuration like maximum number of instances, misfire grace time etc.
  • Schedule: binds a trigger with a task definition
  • Job: queued work item for an executor (binds to a task definition, and optionally a schedule)
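
To make the relationship between the three concepts concrete, here is a minimal sketch using plain dataclasses. It is an illustration only, not APScheduler's actual classes: every name (`TaskDefinition`, `Schedule`, `Job`, `spawn_job`) and field is hypothetical, and the trigger is reduced to a fixed interval.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Optional
from uuid import UUID, uuid4


@dataclass
class TaskDefinition:
    """A uniquely named callable plus its configuration (hypothetical model)."""
    id: str
    func: Callable[..., object]
    max_instances: int = 1
    misfire_grace_time: Optional[timedelta] = None


@dataclass
class Schedule:
    """Binds a trigger (here just a fixed interval) to a task definition."""
    id: str
    task_id: str
    interval: timedelta
    next_fire_time: datetime


@dataclass
class Job:
    """A queued work item; schedule_id stays None when launched directly."""
    task_id: str
    schedule_id: Optional[str] = None
    id: UUID = field(default_factory=uuid4)


def spawn_job(schedule: Schedule) -> Job:
    """Create a job for a due schedule and advance its next fire time."""
    job = Job(task_id=schedule.task_id, schedule_id=schedule.id)
    schedule.next_fire_time += schedule.interval
    return job
```

A job created without a schedule (`Job(task_id="greet")`) corresponds to the "launch a task immediately" case listed among the planned changes.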

Also, the term "executor" is now being changed to "worker".

Notice that the terminology may still change before the final release!

Planned major changes

v4.0 is a ground-up redesign that aims to fix all the long-standing flaws found in APScheduler over the years.

Checked boxes are changes that have already been implemented.

  • Async-first design, with support for asyncio and trio (via AnyIO)
  • Static typing friendly (PEP 561)
  • Support for serializers other than pickle
  • Broader time zone support, including zoneinfo time zones (PEP 615)
  • Drop support for Python < 3.7
  • Calendar interval trigger
  • Stateful triggers
  • threshold value for AndTrigger (resolves issues with contained IntervalTrigger instances)
  • The interval trigger should start right away and not after the first interval ("Interval" scheduler skipping first iteration #375)
  • Persistent store sharing among multiple schedulers (arguably the most needed feature ever for APScheduler)
  • Decoupling of schedulers and workers
  • Schedule-level jitter support
  • Context-local job metadata information
  • Easy launching of tasks immediately without needing a schedule
  • Failure resilience for persistent data stores (so they don't crash the scheduler on a temporary outage)

Potential extra features I would like to have:

You will notice that I have dropped a number of features from master. Some I may never add back to v4.0, even if requested, but do voice your wishes in this issue (and this issue only – I will summarily close such requests in new tickets). Others have been removed only temporarily to give me space for the redesign.

Features on the chopping block

  • Twisted scheduler (may be usable through the async scheduler if AnyIO ever gets Twisted support)
  • Tornado scheduler (just use the async scheduler)
  • Gevent scheduler (does not play well with the new architecture)
  • Qt scheduler (difficult to test/maintain)
  • Redis as a data store (may not have sophisticated enough querying capabilities)
  • Rethink data store (the company has gone belly up some time ago)
  • Zookeeper as a data store (may not have sophisticated enough querying capabilities)

Being on the chopping block does not mean the feature will be gone forever! It may return in a subsequent minor release or even before the 4.0 final release if I deem it feasible to implement on top of the new architecture.

@agronholm
Owner Author

The master branch is now in a state where both the async and sync schedulers work, albeit with a largely incomplete feature set. Next I will focus on getting the first implementation of shareable data stores, based on asyncpg. I've made some progress on that a while back but got sidetracked by other projects, particularly AnyIO.

@codingadvocate

Regarding Twisted scheduler on the chopping block for APScheduler v4.

My main OSS project is a multi-process app, that spins up many Twisted reactors in those processes, where several of the sub-processes use APScheduler inside the reactor (https://github.com/opencontentplatform/ocp). What would be a safe replacement scheduler if the twisted version is being removed?

@agronholm
Owner Author

So you run multiple schedulers? Are you sharing job stores among them?

The main reason I'm thinking of dropping (explicit) Twisted support is because it carries a heavy burden of legacy with it. I will play around with it and see if I can make it work at least with the asyncio reactor. If it can be made to work with a small amount of glue, I will take it off the chopping block.

@codingadvocate

Yes, it runs multiple instances of the schedulers - with their own independent job stores.

I understand the need for software redesigns, and I'm certainly not pushing back or trying to make more work for you. Just trying to understand what the recommendation would be. Maybe I could fall back to using APS' BackgroundScheduler since I don't spin it up until after the reactors are running? Either way, I saw the note and want to ensure I follow whatever happens on that one.

Either way, thank you for the solid project.

@agronholm
Owner Author

Are the jobs you run typically asynchronous (returning Deferreds) or synchronous (run in threads)?

@codingadvocate

The initial setup with creating job definitions is synchronous. Any updates to previous job definitions or newly created jobs (stored/managed in a DB) occur regularly in an asynchronous manner (LoopingCall that returns a Deferred). And all the work with job runtime (execution/management/reporting/cleanup) occurs in non-reactor threads.

@agronholm
Owner Author

Ok, so it sounds like the actual job target functions are synchronous, correct? Then you would be able to make do with the synchronous scheduler, yes?

@codingadvocate

If you're saying so, then yes. I defer to your knowledge there. I selected TwistedScheduler since the user guide's choosing-the-right-scheduler section said to do so when building a Twisted application.

I apologize for compounding the response with a question, but it's related. How is the thread pool and thread count handled if I use something other than the TwistedScheduler? Will the job run inside Twisted's thread pool, or inside BackgroundScheduler's thread pool? Do I need to extend both?

Does constructing the BackgroundScheduler with an explicit max_workers count (example below) do anything when it's running inside Twisted's reactor?

self.scheduler = BackgroundScheduler({
    'apscheduler.executors.default': {
        'class': 'apscheduler.executors.pool:ThreadPoolExecutor',
        'max_workers': '25'
    }
})

@agronholm
Owner Author

Will the job run inside Twisted's thread pool, or inside BackgroundScheduler's thread pool? Do I need to extend both?

The sync scheduler (including 3.x's BackgroundScheduler) knows nothing about Twisted's thread pool. The Twisted scheduler in 3.x differs from BackgroundScheduler only in that its default executor uses the Twisted reactor's internal thread pool. It doesn't even have async support!

I want to provide first class async support in APScheduler 4.x. If I can do that with Twisted without having to create an entire ecosystem of Twisted specific components, then I'm open to doing that.

@agronholm
Owner Author

I just added a few items to the description:

  • External workers
  • Schedule-level jitter support
  • Ability to cancel jobs
  • Timeouts for jobs
  • Redis as data store
  • Zookeeper as data store
  • "executor" being renamed to "worker"

@agronholm
Owner Author

I am open to it, but only as soon as their API stabilizes. As it stands, every beta release breaks backward compatibility. I have more important issues to work on. I don't think v4.0 will have OpenTelemetry support but I will consider adding it to a minor update release once they are in GA.

@agronholm
Owner Author

A lot of progress has been made on the core improvements of v4.0. Vast code refactorings have taken place. The data store system is really taking shape now.

I've added "Failure resilience for persistent data stores" to the task list. It's one of the most frequent deployment issues with APScheduler, so I'm making sure that it's adequately addressed in v4.0.

I'm not sure what to do with the event system. I may rip it out entirely until I can figure out exactly how it should work. I know users will want to know when a job completes or a misfire occurs etc., so it will be implemented in some form at least before the first release.

I will post another comment when I've pushed these changes to the repository.

@agronholm
Owner Author

I hit a snag with the synchronous version of the scheduler. I tried to use the AnyIO blocking portal system to run background tasks but I had to conclude that it won't work that way. I have an idea for that though.

@jykae

jykae commented Dec 9, 2020

@agronholm do you have any estimate when 4.0 would be released?

@agronholm
Owner Author

I had hoped at least for an alpha at this point, but the design problems in the sync version killed the momentum I had. I have not done any significant F/OSS development since. I am still committed to getting 4.0 done, but due to pressure at work I don't think I can work on it before Christmas holidays.

@williamwwwww

@agronholm How will you make the job store shareable among multiple schedulers?

@agronholm
Owner Author

@agronholm How will you make the job store shareable among multiple schedulers?

By coordination and notifications shared between schedulers. Notifications are optional but recommended, and without notifications the schedulers will periodically check for due schedules. How all this works is specific to each store implementation.
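
The "notifications are optional, otherwise poll periodically" behaviour can be sketched with a plain `threading.Event`. This is only an illustration under stated assumptions, not the actual data store code: `ScheduleWaker`, `notify`, `wait` and `poll_interval` are hypothetical names.

```python
import threading


class ScheduleWaker:
    """Wakes a scheduler loop on notification, or after poll_interval seconds."""

    def __init__(self, poll_interval: float) -> None:
        self.poll_interval = poll_interval
        self._event = threading.Event()

    def notify(self) -> None:
        """Called when another scheduler announces a new or updated schedule."""
        self._event.set()

    def wait(self) -> bool:
        """Block until notified or until the poll interval elapses.

        Returns True if woken by a notification, False on timeout. Either
        way, the caller should then check the store for due schedules.
        """
        notified = self._event.wait(self.poll_interval)
        self._event.clear()
        return notified
```

With notifications in place the poll interval only acts as a safety net; without them it bounds how late a scheduler can notice a schedule added by a peer.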

@ahmet2mir

Hello @agronholm

Impressive task list and thanks for apscheduler.

My big Christmas wish is "locking" (probably the idea behind persistent store sharing).

I use APScheduler on several web nodes, and each node has some workers.

Today, I subclass the scheduler, store, etc. to add locking.

Instead of using add_job I call queue_job, which creates an event. Every worker wakes up, and the first one to take the job locks it (using NX with Redis plus the Redlock algorithm).
When the job runs past a certain time, I mark it as "dead" and our alerting tells us about the dead job.

For me it's mandatory that a task never belongs to a worker: the job must sit in a queue so that any worker, including the one that queued it, can process it.

To achieve this I added the following keys in Redis (alongside the jobs and running keys): "ready", "locked", "dead", "failed", "done".

  • queuing adds the job to ready
  • the queued event wakes up the workers, which try to lock the job
  • when the lock is acquired, the job moves from ready to jobs_key (which is what you use to process the job)
  • a listener is added on the task: on success the job moves to the done key and the lock is released; otherwise it moves to the failed key and the lock is released
  • if the job holds a lock and its status is never acknowledged, it moves to dead and the lock is released (this part is tricky because what counts as dead depends on the nature of the job)
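
The lifecycle described in the list above can be sketched as a small in-memory state machine. This is a simplified stand-in, not the Redis-based implementation: a `threading.Lock` plays the role of the atomic `SET NX`, and `JobQueue` with all its methods is a hypothetical name.

```python
import threading
from typing import Dict, Optional

READY, LOCKED, DONE, FAILED, DEAD = "ready", "locked", "done", "failed", "dead"


class JobQueue:
    """In-memory sketch of the ready -> locked -> done/failed/dead flow."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._state: Dict[str, str] = {}
        self._owner: Dict[str, str] = {}

    def enqueue(self, job_id: str) -> None:
        with self._mutex:
            self._state[job_id] = READY

    def try_lock(self, job_id: str, worker: str) -> bool:
        """Only the first worker to ask gets the job (stands in for SET NX)."""
        with self._mutex:
            if self._state.get(job_id) != READY:
                return False
            self._state[job_id] = LOCKED
            self._owner[job_id] = worker
            return True

    def finish(self, job_id: str, worker: str, success: bool) -> None:
        """Record the outcome and release the lock held by this worker."""
        with self._mutex:
            if self._owner.get(job_id) != worker:
                raise RuntimeError("job is not locked by this worker")
            self._state[job_id] = DONE if success else FAILED
            del self._owner[job_id]

    def mark_dead(self, job_id: str) -> None:
        """Give up on a locked job that never reported a result."""
        with self._mutex:
            if self._state.get(job_id) == LOCKED:
                self._state[job_id] = DEAD
                self._owner.pop(job_id, None)

    def status(self, job_id: str) -> Optional[str]:
        with self._mutex:
            return self._state.get(job_id)
```

Deciding when to call `mark_dead` is the tricky part mentioned above; the sketch leaves that policy to the caller.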

I'm a big fan of Sidekiq (and also Faktory)

And I will be very happy with something like

In the "main"

def myfunc(x, y):
    print(x, y)

scheduler = Scheduler(...)
# register myfunc as a valid callable to avoid pickle on func
scheduler.register('myfunc', myfunc)
scheduler.start()

Then in code

# note that myfunc is in string
job = scheduler.queue('myfunc', kwargs={"x": 1, "y": 2})
print(job.status) # ready - no one process it
...
print(job.status) # pending - someone process it
...
print(job.status) # done - success

Why not Celery ?

I don't want to set up the full Celery/Flower stack. My tasks are simple, and I'm a bit too lazy to repackage an entire app, or to split a few lines of code into small libraries, just so Celery can run my code (and to split the config, credentials, etc. as well).
I prefer to use Celery only when necessary.

I don't know if I'm being clear (I'm not a native English speaker).

@agronholm
Owner Author

@ahmet2mir APScheduler 4.0 already has the proper synchronization mechanisms in place.

What's still missing is the synchronous API. I've come to a realization that I cannot simply copy the async API and remove the async keywords because cancellation isn't going to work with the sync API, and AnyIO's BlockingPortal mechanism (as it is currently) is inadequate for cases where you need to start background tasks. I must address this issue first and then come back to finish the basic APScheduler 4.0 API.

@agronholm
Owner Author

While 4.0 is being worked on, I've gone back to the 3.x branch for a bit and fixed a number of bugs and other annoyances.

@agronholm
Owner Author

Tests on async/sync workers (formerly: executors) are passing now, but the sync worker tests are strangely slow and I want to get to the bottom of that before moving forward.

@agronholm
Owner Author

Slowness in worker tests resolved: it was a race condition in which the notification about the newly added job was sent before the listener was in place, causing the data store to wait for the 1 second timeout to expire before checking for new jobs again.
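
The race can be reproduced in miniature. The sketch below assumes a broker that delivers notifications only to listeners registered at publish time; `Broker` and `wait_for_job` are hypothetical names, not APScheduler code.

```python
import threading
from typing import List


class Broker:
    """Delivers notifications only to listeners subscribed at publish time."""

    def __init__(self) -> None:
        self._listeners: List[threading.Event] = []

    def subscribe(self) -> threading.Event:
        event = threading.Event()
        self._listeners.append(event)
        return event

    def publish(self) -> None:
        for event in self._listeners:
            event.set()


def wait_for_job(event: threading.Event, timeout: float) -> bool:
    """Return True if woken by a notification, False if the timeout expired."""
    return event.wait(timeout)
```

Publishing before subscribing means the notification is lost and `wait_for_job` sits out the full timeout; subscribing first lets it return immediately. That ordering is the whole fix.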

I'll move on to completing the synchronous scheduler code now. I'm also very close to releasing AnyIO v2.1.0 which is a critical dependency for APScheduler 4.

@agronholm
Owner Author

agronholm commented Feb 13, 2021

Tests for both sync and async schedulers pass, but the tests run into delays caused by the new schedule/job notifications not working as intended, plus the sync scheduler tests are causing lots of "Task exception was never retrieved" errors outside of the actual tests which I will have to investigate. I'm considering making an alpha release once these issues have been ironed out.

@agronholm
Owner Author

agronholm commented Feb 14, 2021

After hours of debugging, I finally figured out that I was needlessly creating a new task group in the worker's run() method and overwriting the outer task group as a worker attribute. The odd errors went away after I fixed that.

@agronholm
Owner Author

I've just pushed a big batch of changes that implement data store sharing on PostgreSQL (via asyncpg) and MongoDB (via motor). There are a lot of rough edges but at least the whole test suite passes now (at least locally – CI seems to have some troubles). In the coming days I'll try to polish the code base to the point where I can at least make an alpha release.

Feel free to try it out, but you'll have to look at the test suite for some idea on how to use it since I haven't updated the docs yet. Also, the database schema will change before the final release (tasks accounting is not currently done) so expect to have to throw out your schedules and jobs.

@NixBiks

NixBiks commented Nov 11, 2023

Stateful triggers contain state which is saved after the trigger is used to calculate new fire times for a schedule. All triggers are stateful in APScheduler 4.

Got it. And it isn't possible to share state between jobs on the same worker currently, right? E.g. I want to reuse a database connection for a schedule (and then close it once the schedule is "done"). Maybe it can be done via events now that I think of it.

@agronholm
Owner Author

For schedules, "stateful" means that its jobs retain some internal state which is then saved after the execution of the job. Sharing database connections is out of scope anyway since you can't serialize them.

@agronholm
Owner Author

I've released another alpha, with tons of fixes/workarounds for less capable RDBMS (sqlite, mysql). Explicit task configuration is also in there. As usual, this update requires wiping your data store and starting over.

@franz101

Thank you so much Alex, I just started using 3.x, do you see any specific date around a production ready 4.x release? @agronholm

@agronholm
Owner Author

I'm sure I can get a beta out before the end of the year (I am furloughed most of December so I have plenty of time to work on APScheduler), but production? That depends on what issues come up in testing. Q2/2024? Not impossible at least.

@franz101

Thank you so much for being transparent, really appreciate your community effort <3

@agronholm
Owner Author

Some good news again. I'm making significant progress on the cleanup feature, which periodically purges expired job results and, now, also finished schedules (these are no longer purged right after the last job is submitted to the store). With luck, I can push these changes to GitHub this weekend.

I've also opened two discussions I would like your input on:

  1. Data store extensibility
  2. Serialization and security

@agronholm
Owner Author

The automated cleanup is now in. That's 3 out of 4 blockers completed for the beta release. My idea of a "beta" release is that it's feature complete but may still contain bugs. I would like to get the data store schema settled so that there won't be any need for nuking the data stores after an upgrade to a newer beta. To that end, the first bullet point of my previous comment needs to be addressed ASAP. In the absence of any feedback on that issue, my plan is to introduce dynamic fields and to correspondingly reduce the number of columns to only those that need to be indexed and queried against.

The last blocker is now the implementation of maximum running jobs limits. Ideally there would be two levels of such limits: task and schedule level. The total number of jobs with the same task ID would never be allowed to rise above the task-level limit, and the number of jobs with the same schedule ID would never exceed the schedule-level limit.
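
As a rough illustration of the two-level limit, the sketch below keeps one counter per task ID and one per schedule ID, and refuses to start a job if either limit would be exceeded. It is an in-memory approximation under stated assumptions; `RunningJobLimiter` and its methods are hypothetical, and a real data store would have to perform this check atomically in the database.

```python
from collections import Counter
from typing import Optional


class RunningJobLimiter:
    """Tracks running jobs per task and per schedule (illustrative only)."""

    def __init__(self) -> None:
        self._by_task = Counter()
        self._by_schedule = Counter()

    def try_acquire(
        self,
        task_id: str,
        task_limit: Optional[int],
        schedule_id: Optional[str] = None,
        schedule_limit: Optional[int] = None,
    ) -> bool:
        """Reserve a slot for a job, or return False if a limit is reached."""
        if task_limit is not None and self._by_task[task_id] >= task_limit:
            return False
        if (
            schedule_id is not None
            and schedule_limit is not None
            and self._by_schedule[schedule_id] >= schedule_limit
        ):
            return False
        self._by_task[task_id] += 1
        if schedule_id is not None:
            self._by_schedule[schedule_id] += 1
        return True

    def release(self, task_id: str, schedule_id: Optional[str] = None) -> None:
        """Free the slot once the job has finished."""
        self._by_task[task_id] -= 1
        if schedule_id is not None:
            self._by_schedule[schedule_id] -= 1
```

Note that a job spawned from a schedule counts against both limits, so the task-level limit caps the total across all schedules that share the task.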

The promised import/export feature will likely not land in the first beta, but probably in the second one.

@agronholm
Owner Author

Alright, so we're in 2024 now. I know what I said about the beta, but I got sidetracked by two other projects of mine that needed urgent work on them. There will be another alpha as soon as:

  • APScheduler 4.0.0 Bug #803 is resolved
  • The limit_jobreleased_size branch is merged with tests (prevents serialization error with the asyncpg event broker when the traceback is too long)
  • Data stores have code to periodically update their jobs' expiration times so they're not deleted before completion

@camsmith

Hey @agronholm , wanted to ask whether it would be helpful to submit issues for bugs in the 4.0.0 releases at this point, or if it would just be best to wait? I'd like to move my system to 4.0.0 because it has some settings that 3.10.9 does not, but I'm running into some problems here and there. Thanks again.

@agronholm
Owner Author

It might be helpful, but remember that it's still in alpha state for a reason.

@JBrut22

JBrut22 commented Apr 15, 2024

Hey, I am really loving the new version. It is a lot easier to use than the other options available. I am using AsyncScheduler. Diving into the code, I can see why it is a lot of work to get this version released to the world! I will take a look at the issue below and see if I can make some changes. I also noted that the latest commits may address a few of these issues.

I read through much of the discussion, but thought it prudent to add my own thoughts on v4.0.0a4:

  • add_job fails to ever run the job; I think this has already been noted. Basically, this should not be used currently...
    • As best I can tell, this is likely because the job does not register a scheduled_fire_time in the database, so it never runs. Not sure what happens if another job gets added from the schedule.
    • If it is a scheduled job you want to move up, you can remove the schedule and then add it again to have it run right now.
    • If you use add_job as above, then it seems to get stuck and never run any jobs at all.
    • I did create a quick workaround, which completely wipes the job history using SQL on startup (or via a FastAPI endpoint). This works great if you only want to use schedules and never add jobs.
    • Job results only seem to be an option for add_job, so basically they do not work at the moment as far as I can tell. Not really a big issue...

Workaround, only for scheduling; if you are manually running or adding jobs, this will not help you.
It deletes all the jobs, sets running jobs to 0 (in case the scheduler failed during a job), and adds 10 minutes to each schedule so the jobs will start again in a bit. It works for my purposes, but might not be suitable for everyone. I run this at startup just in case, and I created an API endpoint to run it in case the scheduler hangs, adding scheduler.stop() and scheduler.start_in_background().

Of course, I think there is another fix, referenced here.

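from sqlalchemy import text

# `engine` (a SQLAlchemy Engine) and `logger` are assumed to be defined elsewhere.


def fix_scheduler():
    """
    Reset the scheduler's records: delete all jobs, zero the running job
    counters and push every schedule 10 minutes into the future.
    """
    statements = [
        "DELETE FROM jobs",
        "UPDATE tasks SET running_jobs = 0",
        # 600 seconds, assuming next_fire_time is stored in microseconds
        "UPDATE schedules SET next_fire_time = next_fire_time + 600 * 1000000",
    ]
    status_code = 200
    try:
        # Run the statements one by one in a single transaction;
        # engine.begin() commits on success and rolls back on error
        with engine.begin() as conn:
            for statement in statements:
                conn.execute(text(statement))

        msg = 'Scheduler records fixed.'
    except Exception as e:
        logger.error(e)
        status_code = 500
        msg = f'Failed to fix scheduler records: {e}'

    return {"message": msg, 'status_code': status_code}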

@agronholm
Owner Author

Hi, it's been a while! I just released a new alpha with a metric ton of fixes, and a handful of new features too! Importantly, data stores now finally have a clean-up procedure which will remove expired job results and finished schedules. Schedules can now also be paused and unpaused (contributed by @WillDaSilva). This restores a 3.x series feature in an even more powerful form. Kudos to the 3 people who contributed fixes too!

As usual, the data store schemas have changed in a backwards incompatible manner, so you need to start from scratch when updating. This should stop happening once the beta is out.

@agronholm
Owner Author

There are tests making sure add_job() works as intended. If you have evidence to the contrary, please file an issue with a minimal working example.

@hmilkovi

hmilkovi commented Jun 6, 2024

In the current state (not yet beta), it would be great to add an option to restart the worker after X processed jobs (the same behaviour as Gunicorn) to combat memory leaks in third-party libraries and code.

Reasons why I would like it:

  • slow memory leaks are often detected only in production, and this is an easy mitigation while investigating the leak
  • we have the same in Celery (I know the use case is not 1:1)

Desired flow:

  1. keep a global counter
  2. if counter > X, pause processing and wait for all running jobs to finish
  3. restart the worker

Please give feedback on the idea; if people like it, I offer my help to make a PR.

@agronholm
Owner Author

Gunicorn is an entry point (or a launcher, if you like), so it's fine for it to do restarts at will. APScheduler, however, is merely a library, and it is not acceptable for it to be restarting its host process. If you want such a mechanism, it will have to be implemented elsewhere.

@hmilkovi

hmilkovi commented Jun 6, 2024

I agree it's out of scope. Many thanks for the fast response, I really appreciate your work.

@adk23333

hi, I just noticed this project. I would like to know if the documentation for v4.0 can be queried now?

@agronholm
Owner Author

What do you mean by "queried"?

@adk23333

What do you mean by "queried"?

Sorry, I meant "view the documentation".

@agronholm
Owner Author

The documentation is here. You just had to select master from the drop-down menu at the bottom left.

@adk23333

You just had to select master from the drop-down menu at the bottom left.

thanks

@parthiibank

@agronholm, any idea when we move to beta or stable release which can be used for production? TIA

@agronholm
Owner Author

I was already going to release the beta at the end of July, but then some critical issues were reported by users against master. Currently I'm busy getting AnyIO v4.5.0 out the door, and after that I will focus on APScheduler once again. I'm afraid to give any time tables, but getting the beta out is my next top priority after AnyIO.

@parthiibank

Thanks @agronholm. We hit a roadblock while trying to use 3.10 with clustering. We got undesired behaviors when we ran 3 instances with a shared data store. We are eagerly waiting for a stable release. We have done PoCs with master, and for our simple use case master is working fine.

@PraveenKumarM21

@agronholm, I'm using APScheduler 4.0.0a5 with Redis as the event broker and MongoDB as the data store. My Flask application runs as multiple instances, which in turn creates multiple scheduler instances. This causes duplication of tasks. Any idea on this? Thank you.

@agronholm
Owner Author

Could you please not ask questions in this thread, and create a new discussion for that? Any messages posted here will notify a lot of people. My suggestion while waiting for the beta is to use the master directly, as it has tons of fixes already in it.

@agronholm
Owner Author

It's been a while again. This weekend I decided to give the old 3.x branch an overhaul because it was causing difficulties for users. I got a lot of work done over there, including some that will help with the eventual 4.x migration:

  • ZoneInfo time zones now work with the built-in triggers (they are transparently converted)
  • APScheduler 3.x is now compatible with later releases of tzlocal
  • CalendarIntervalTrigger was backported to 3.x
  • Export/import of jobs is now implemented in 3.x (4.x implementation TBD)

Check out the full details in the version history.

I realize that a year ago I estimated the beta to be released in December (2023), and that hasn't happened yet. I was going to do it in July but then a flurry of bug reports came in which I deemed necessary to be fixed before the beta, and the mental exhaustion, plus work from a gazillion other projects hit me hard. But I'm still working on it when I can. I have a hard deadline on finishing another major release before the end of the year, but I'll try to dedicate some time towards fixing the APScheduler 4.x issues too.
