Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create cron utility extension #3437

Closed
MeltyBot opened this issue May 24, 2022 · 19 comments
Closed

Create cron utility extension #3437

MeltyBot opened this issue May 24, 2022 · 19 comments

Comments

@MeltyBot
Copy link
Contributor

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/3522

Originally created by @alxthm on 2022-05-24 14:54:47


For simple meltano deployments, it would be great to have a cron-based orchestrator using the meltano schedule configuration, without having to set-up an Airflow instance with a db, users, etc.

The meltano UI already makes it possible to manually run meltano schedules, so this would be a nice complement 😄

@MeltyBot
Copy link
Contributor Author

@WillDaSilva
Copy link
Member

@pandemicsyn would #5962 resolve this issue? I think it would, since it could essentially do what I suggested here: https://gitlab.com/meltano/meltano/-/issues/3522#note_962685280. Though I'm not sure about having Meltano itself edit the user's crontab, rather than Meltano printing the crontab lines to stdout.

@pandemicsyn
Copy link
Contributor

@pandemicsyn would #5962 resolve this issue? I think it would, since it could essentially do what I suggested here: https://gitlab.com/meltano/meltano/-/issues/3522#note_962685280. Though I'm not sure about having Meltano itself edit the user's crontab, rather than Meltano printing the crontab lines to stdout.

Yep, i think #5962 would solve this. I'd probably lean toward not using user crontab entries and generating meltano cron file that can be dropped in /etc/cron.d but that seems like a user preferences thing and controllable through a cron export plugin config.

@alxthm
Copy link

alxthm commented Jun 29, 2022

Based on your input, I made a small python script to transform meltano schedules (both job and elt schedule, in json format) into a crontab file:
https://gist.github.com/alxthm/b62eceb1e47f0320e5bd82a67f33e042

I am currently using it this way:

meltano schedule list --format=json | python make_crontab.py >crontab.file
crontab crontab.file

It relies on a CRONTAB_MELTANO_EXECUTABLE environment variable to specify the path to meltano (in case it is not in the crontab PATH).

Happy if it can help (and open to feedback if you have any ofc).

@aaronsteers
Copy link
Contributor

aaronsteers commented Aug 8, 2022

meltano invoke cron-util schedule
meltano invoke cron-util deschedule

Ideally, this would be able to schedule and deschedule/disable the active list of scheduled items, perhaps with an option to specify which environment should be scheduled for.

@aaronsteers
Copy link
Contributor

The pattern ideally would also work for:

  • AWS Cloudwatch Events (could drive step functions or ECS executions)

@longtomjr
Copy link
Contributor

I don't know if it will be out of scope for this feature request, but it might be valuable to have an option to integrate with cron / dead man's switch tools like https://healthchecks.io/.

It might also be valuable for setups based on the proposal at #5962.

The simplest way I can think of would be to add a way to have an option for an extra command that will be be added to the cron entry.

So instead of generating

0 0 * * * (cd /project && meltano run ...) 2>&1 | /usr/bin/logger -t meltano

meltano generates the following cron

0 0 * * * (cd /project && meltano run ...) 2>&1 | /usr/bin/logger -t meltano && <user defined command here>

WillDaSilva added a commit that referenced this issue Aug 11, 2022
A simple first step towards #3437.

TODO:
- [ ] Add tests.
- [ ] Open an issue to add a utility plugin using the new plugin SDK that invokes `meltano schedule list --format=cron`, and installs/uninstalls the output.
- [ ] Open an issue to add `on_success` and `on_failure` hooks to the cron plugin, as suggested in #3437 (comment).
@WillDaSilva WillDaSilva changed the title Add cron as orchestrator plugin Create cron utility extension Sep 6, 2022
@WillDaSilva
Copy link
Member

Brain dump about the cron utility extension:

cron-ext commands:

meltano invoke cron [list] # List the Meltano schedule cron entries
meltano invoke cron install [schedule UUID...] # Install the Meltano schedule cron entries into the current user's crontab
meltano invoke cron uninstall [schedule UUID...] -a --all # Uninstall the Meltano schedule cron entries from the current user's crontab

Template cron entry:

cron_entry = f"{interval} (cd {meltano_project_dir} && export PATH={path} {venv_export} && {command}) 2>&1 | /usr/bin/logger -t meltano # <UUID>"

meltano_project_dir will be determined by traversing up the directory tree until meltano.yml
is found. Because this extension can only be invoked using meltano invoke, we should always be
able to find meltano.yml in this way.

Each Meltano cron entry should have an inline comment with a UUID identifying the schedule entry. This will allow for its updating/removal later on. Cron does not permit in-line comments, but in practice they work fine because they're treated as part of the command, which is passed into the shell, which ignores everything after the #.

If running as root, we could drop the entries into /etc/cron.d/meltano. Note that the filename used within /etc/cron.d/ must not contain a ., and must be owned as root. Excerpt from the cron man pages:

[...] the files under these directories have to be pass some sanity checks including the following: be executable, be owned by root, not be writable by group or other and, if symlinks, point to files owned by root. Additionally, the file names must conform to the filename requirements of run-parts: they must be entirely made up of letters, digits and can only contain the special signs underscores ('_') and hyphens ('-'). Any file that does not conform to these requirements will not be executed by run-parts. For example, any file containing dots will be ignored. This is done to prevent cron from running any of the files that are left by the Debian package management system when handling files in /etc/cron.d/ as configuration files (i.e. files ending in .dpkg-dist, .dpkg-orig, and .dpkg-new).

If not running as root, we should install it as a user crontab using the crontab command:

subprocess.run(('crontab', '-'), input='new crontab content (will overwrite old content)')

For consistency and simplicity, we can probably get by with just the crontab approach. Using crontab for root instead of /etc/cron.d is unusual, but acceptable.

The Meltano crontab entries should have a header and footer that allows this extension to identify it, thereby preventing the extension from messing up other entries in the user crontab. If the header/footer is found, we should operate within it. Otherwise we should append a new Meltano section to the crontab.

----- BEGIN MELTANO CRONTAB SECTION -----
entry 1
entry 2
...
----- END MELTANO CRONTAB SECTION -----

Install/uninstall should install/uninstall all schedules listed by meltano schedule list by default, but can install specific ones if provide their UUIDs. If install finds an existing cron entry with a matching UUID, then it will be replaced by the new one. Otherwise the new one is appended within the Meltano section of the crontab. If uninstall is provided the -a/--all flag, it should uninstall all cron entries in the Meltano section of the crontab, rather than only those listed by meltano schedule list.

Because we may have to share a crontab with other entries (if not running as root), we cannot set environment variables for the whole crontab, nor can we depend on them having been set in any particular way. Notably this means that the command in each cron entry may be executed by any shell by default.

We will want to get around this issue by explicitly running each entry with /bin/sh (or maybe /bin/bash if there is a good enough reason to use it), and by exporting the env vars for each entry prior to running its commands. Env vars that should be exported (and written into the crontab) include:

  • PATH
  • VIRTUAL_ENV (optional - only export if $VIRTUAL_ENV is set)

With those set it should be the case that simply running meltano from the shell being used by cron runs meltano in whatever Python envrionment it was run in when this plugin was run.

Each cron entry command is going to be very long using the above approach, and quoting the command properly may be tricky. Putting the command content into a file is a common approach to keep it clean, but that introduces some additional issues. Notably, we need somewhere to store these files. With /etc/cron.d it's easy because we could make a subdirectory to put them in, and use the strict name rules to prevent cron from looking inside for crontabs, but we don't have access to /etc/cron.d when not running as root. Maybe we can use the .meltano directory in the project to store them (under a cron subdirectory).

@aaronsteers
Copy link
Contributor

@aaronsteers - Note to self, review above ☝️

@edgarrmondragon, @pandemicsyn - fyi above for discussion/review.

@pandemicsyn
Copy link
Contributor

I, personally, would have preferred we not actually attempt to install/manage per user crontab's and just printed out the entries instead (because I think that might be more friendly when folks are deploying this via ansible/salt/etc), but I think if we're going to actually install and manage user crontab entries then what @WillDaSilva laid out seems solid 🤘

For consistency and simplicity, we can probably get by with just the crontab approach. Using crontab for root instead of /etc/cron.d is unusual, but acceptable.

Big +1. not a huge fan of our behavior being very different for these two paths. I'd also rather we be consistent and predictable and use the per user crontab entries across the board.

Maybe we can use the .meltano directory in the project to store them (under a cron subdirectory).
If install finds an existing cron entry with a matching UUID, then it will be replaced by the new one.

This should probably replace the cron entry AND that wrapping script (incase it got deleted?).

Definitely not required, but might be nice if list flags orphaned entries or entries that are missing their wrapping script or vice-versa - and need to be re-installed (or maybe a dedicated "fix" command?).

@WillDaSilva
Copy link
Member

I, personally, would have preferred we not actually attempt to install/manage per user crontab's and just printed out the entries instead (because I think that might be more friendly when folks are deploying this via ansible/salt/etc), [...]

@pandemicsyn That will be supported by running meltano invoke cron list.

We probably also want to support meltano invoke cron list --installed, to show the entries that have already been installed.

Definitely not required, but might be nice if list flags orphaned entries [...]

Good idea. Supporting --orphaned for list and uninstall is probably worthwhile.

[...] or entries that are missing their wrapping script or vice-versa - and need to be re-installed [...]

Not sure what you mean by "wrapping script" here.

@pandemicsyn
Copy link
Contributor

Not sure what you mean by "wrapping script" here.

@WillDaSilva I meant these:

Putting the command content into a file is a common approach to keep it clean, but that introduces some additional issues. Notably, we need somewhere to store these files.

@WillDaSilva
Copy link
Member

@pandemicsyn @aaronsteers My proposal (and current implementation) relies on each schedule having an associated UUID. Currently schedules provided by Meltano do not have any unique identifier. I see 3 options we can take:

  • We generate a UUID from a hashed canonical representation of the schedule (what is currently implemented, with issues).
  • We have the cron extension set random UUIDs for each installed schedule, and so the UUID is only useful for selecting which entries to uninstall. It cannot be used to update an entry, selectively install entries, or remove orphaned entries.
  • We update Meltano to give schedules a UUID. The cron extension would have limited capabilities (by using one of the 2 other approaches, or erroring with a useful error message) when dealing with schedules created by an other version of Meltano.

Which approach do you think we should take? I'm partial to the third approach.

@WillDaSilva
Copy link
Member

Looks like we'll be forced into putting the command and env vars into a script, and managing that. The maximum length for a cron command is 999 characters.

@WillDaSilva
Copy link
Member

We probaly want to break the "begin/end Meltano section" blocks in the crontab up by project. We can use the project ID in the header and footer comments.

Additionally, inline comments with the schedule ID are no longer required, as we can identify the schedules in the crontab based off of which script they're calling. The script path is <project directory>/.meltano/cron-ext/<schedule UUID>.sh.

@WillDaSilva
Copy link
Member

@aaronsteers @pandemicsyn I've created https://github.com/meltano/cron-ext, and opened meltano/cron-ext#1

The license for the repository is currently MIT. Is that suitable, or should it be Apache-2.0 instead?

We can start to use that repository to track features/bugs/issues for the cron extension going forwards. For instance, multi-project support by putting the project ID in the section header and footer is probably worth pulling into its own issue.

@pandemicsyn
Copy link
Contributor

pandemicsyn commented Sep 8, 2022

@pandemicsyn @aaronsteers My proposal (and current implementation) relies on each schedule having an associated UUID. Currently schedules provided by Meltano do not have any unique identifier. I see 3 options we can take:

The schedule name is "supposed to be" unique. Although in practice, not everything will complain if a schedule with a duplicate name is found:

(melty-3.8) ➜  rundev meltano schedule add yetanothertest --job trackertest --interval="@hourly"
2022-09-08T22:50:41.907628Z [info     ] Environment 'dev' is active
Schedule 'yetanothertest' already exists.

The most likely way to end up with a dupe is if someone manually edits the yaml. If we encounter a dupe we should raise an error. Even without that, I'm not sure you strictly speaking - need a UUID. Could hash some subset of the schedule fields.

The license for the repository is currently MIT. Is that suitable, or should it be Apache-2.0 instead?

I'll defer to @aaronsteers for that but fwiw - i think most of the other repo's are Apache licensed.

@WillDaSilva
Copy link
Member

Closing as complete now that meltano/cron-ext#1 has been merged, and the outstanding features/fixes discussed here have all been logged at https://github.com/meltano/cron-ext/issues

@aaronsteers
Copy link
Contributor

aaronsteers commented Sep 16, 2022

The license for the repository is currently MIT. Is that suitable, or should it be Apache-2.0 instead?

I'll defer to @aaronsteers for that but fwiw - i think most of the other repo's are Apache licensed.

@WillDaSilva and @pandemicsyn - Just circling back here to confirm the above. MIT and Apache 2.0 are both okay. Apache 2.0 is indeed preferred.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants