Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce 1 session = 1 run #1329

Merged
merged 16 commits into from
Mar 14, 2022
Merged

Enforce 1 session = 1 run #1329

merged 16 commits into from
Mar 14, 2022

Conversation

merelcht
Copy link
Member

@merelcht merelcht commented Mar 8, 2022

Description

After the discussion about #1273, it has been decided that a session can only ever have one run. Currently, we've got the concept of a run_id in the codebase, which is mainly a remnant from the journal and doesn't serve a purpose. I will write up a summary of the whole discussion, but baseline is that a session can only have 1 run, and so the session_id and run_id will always be the same.

Development notes

  • Removed run_id where it didn't serve a purpose.
  • Changed the run_params pipeline hook specs to refer to session_id instead of run_id.
  • Enforce/inform users that they shouldn't run multiple runs during 1 session: 55fb343

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@merelcht merelcht changed the title Remove the run_id after it was agreed that 1 session = 1 run Enforce 1 session = 1 run Mar 9, 2022
@merelcht merelcht linked an issue Mar 9, 2022 that may be closed by this pull request
@@ -411,7 +406,7 @@ class DataValidationHooks:
expectation_suite,
)
expectation_context.run_validation_operator(
"action_list_operator", assets_to_validate=[batch], run_id=run_id
"action_list_operator", assets_to_validate=[batch]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering whether GE would still work without the run_id argument here. It appears to be an optional argument but I think it might be quite useful for users to keep track of validations done over multiple runs.

Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure that we want to remove run_id from after_catalog_created rather than replacing it with session_id? 🤔 Or is there just not an easy way to access session_id at the point we would need it?

@merelcht
Copy link
Member Author

@AntonyMilneQB I think both of your points are valid and so perhaps instead of removing the run_id it makes more sense to just replace it with session_id.

Copy link
Contributor

@lorenabalan lorenabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love that we're getting rid of all the run_id passed around everywhere, code feels so much more breathable now!!
I'm ambivalent about the error on re-run, I have no strong opinion for or against it. On the one hand it does the job, on the other hand I worry that it's gonna end up throwaway code a bit further down the line. I have an open question about what should happen in case the pipeline run fails and maybe that'll clarify things in my head.

tests/framework/session/test_session_extension_hooks.py Outdated Show resolved Hide resolved
@@ -379,7 +398,8 @@ def run( # pylint: disable=too-many-arguments,too-many-locals
)

try:
run_result = runner.run(filtered_pipeline, catalog, hook_manager, run_id)
run_result = runner.run(filtered_pipeline, catalog, hook_manager)
self._run_called = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is just in case the run is successful. If there's a problem with the pipeline halfway through, we can call session.run() again. Is that the intended course? Makes sense for debugging, if a little confusing for the user (but I've run it multiple times to failure, it allowed me to that then!).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, so my thinking was that if you're in an interactive workflow to experiment/debug your pipeline it would be overhead if you need to recreate a new session if the pipeline didn't run successfully. At the same time this is a niche case just for those people that would actually do a run inside a notebook/ipython session.

kedro/framework/session/session.py Outdated Show resolved Hide resolved
tests/framework/session/test_session.py Outdated Show resolved Hide resolved
tests/framework/session/test_session.py Show resolved Hide resolved
@idanov
Copy link
Member

idanov commented Mar 10, 2022

I wonder whether it's worth keeping the concept of a run_id, but only setting it to the session_id, in order to have a much less distruptive change in APIs and also concepts. Would that make it compatible with Kedro Viz without the need for a new release?

Also a lot of hooks here were receiving the run_id and now no longer will receive it, but the users might still need to know the run_id / session_id at that particular hook. Have you checked with users if they need the run_id (now session_id) in those hooks or that's no longer needed?

@merelcht
Copy link
Member Author

I wonder whether it's worth keeping the concept of a run_id, but only setting it to the session_id, in order to have a much less distruptive change in APIs and also concepts. Would that make it compatible with Kedro Viz without the need for a new release?

I am against keeping the concept of the run_id, because I feel that it would just cause confusion and we'd end up having discussions in the future again about what the difference is between run_id and session_id

Also a lot of hooks here were receiving the run_id and now no longer will receive it, but the users might still need to know the run_id / session_id at that particular hook. Have you checked with users if they need the run_id (now session_id) in those hooks or that's no longer needed?

However, I do hear this point and Antony was saying something similar. I think I got too excited removing the run_id altogether from the hook specs, because in our codebase it doesn't really serve a purpose but you're right in that it might be used by users for reasons unknown to me. I will change this so that the specs will now have the session_id.

Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I'm actually now not so sure about the hook specs after I realised we already have save_version there, sorry 😬 No strong feelings though - up to you.

kedro/framework/session/session.py Outdated Show resolved Hide resolved
@@ -290,7 +291,7 @@ def _get_catalog(
feed_dict=feed_dict,
save_version=save_version,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can save_version ever be different from session_id now actually? 🤔 On seconds thoughts maybe we don't need to keep this in the hook spec after all if it's just a duplicate of save_version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment they will always be the same, but this goes back to the question whether we should allow users to have a custom save_version, which requires user research. With the mindset of "don't add things that could potentially be used in the future", I'll remove it for now.

Copy link
Contributor

@lorenabalan lorenabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a huge fan of getting rid of passing the id around everywhere. :( I seriously doubt it was a heavily used feature by any user, I'm worried we're leaving around code that's only hypothetical and probably 90% dead. Having said that, it's not a hill I'm willing to die on anytime soon, so happy to let it go for now.
We should add something in the release notes though that we are replacing run_id with sth else in the hook specs, and that users should update their code accordingly (1 short migration guide note).

@merelcht
Copy link
Member Author

I was a huge fan of getting rid of passing the id around everywhere. :( I seriously doubt it was a heavily used feature by any user, I'm worried we're leaving around code that's only hypothetical and probably 90% dead. Having said that, it's not a hill I'm willing to die on anytime soon, so happy to let it go for now.

I hear you, I would've preferred to remove it as well, but I think there's an argument around being able to find back the run through the id when using external monitoring/orchestration systems. Anyway, I think this needs user research before we can make a final decision on removing it.

We should add something in the release notes though that we are replacing run_id with sth else in the hook specs, and that users should update their code accordingly (1 short migration guide note).

Yes I'll add this!

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants