-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨🐛 Follow-up removing long running http requests from dy-sidecar (⚠️ devops) #3232
✨🐛 Follow-up removing long running http requests from dy-sidecar (⚠️ devops) #3232
Conversation
…car-replace-long-running
…car-replace-long-running
…car-replace-long-running
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe consider adressing my requests, especially the process that is started explicitly without a termination timeout.
thanks for this
services/director-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/scheduler/task.py
Show resolved
Hide resolved
) | ||
if failed_while_saving_state_and_outputs: | ||
# Since user data is important and must be saved, take no further | ||
# action and wait for manual intervention from support. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please log a clear line here that can be used to create a graylog alert, like so:
logger.log("Dynamic sidecar SIDECARNAME of node uuid NODEUUID failed to save its data and needs manual intervention. This is a #SIMCORE_GRAYLOGALERT"
We could ake a mail alert for "#SIMCORE_GRAYLOGALERT"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GitHK see the proposal I did for the same issue below. the graylog alert is a very good idea. @mrnicegyu11 can you define one, or that would be a ENV variable ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would advise against this one. Since this will be triggered every 5 seconds. You can pickup errors with sidecars if you try to search for OEC:
in the logs. If you want that precise error. Search for
Could not contact dynamic-sidecar to save service state or upload outputs
Or maybe would you want me to replace that message with the above suggestion?
result = await _write_file_and_spawn_process( | ||
compose_spec_yaml, | ||
command=f'docker-compose --project-name {settings.DYNAMIC_SIDECAR_COMPOSE_NAMESPACE} --file "{{file_path}}" start', | ||
process_termination_timeout=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am quite sure we would want a termination timeout here, maybe of one hour or something high.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure how to deal with these. I will have a PR following this one with a review of all the timeouts.
@@ -59,7 +60,8 @@ async def task_create_service_containers( | |||
) | |||
shared_store.container_names = assemble_container_names(shared_store.compose_spec) | |||
|
|||
logger.debug("Validated compose-spec:\n%s", f"{shared_store.compose_spec}") | |||
# it is useful to see what compose spec was used to start the containers | |||
logger.warning("Validated compose-spec:\n%s", f"{shared_store.compose_spec}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like an INFO, and thus the default loglevel should be info and this message should maybe be info
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is "info" but something a support person would like to see. This is only displayed to you if you log it at level warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree with @mrnicegyu11 here. let's not write the code for how the log level is set in a specific deployment. it makes no sense to make it a warning. If you want to see info messages in the logs just set the log level to info. you can probably set something like DY_SIDECAR_LOG_LEVEL=info right @GitHK ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add "Info:" to the string of the log thx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is not what I want. I want to have a warning level and also see this message. If I set the log level to DY_SIDECAR_LOG_LEVEL=INFO
it gets too spammy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking very good, please have a look at my comments and we can discuss if something is unclear
services/director-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/scheduler/task.py
Outdated
Show resolved
Hide resolved
services/director-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/scheduler/task.py
Outdated
Show resolved
Hide resolved
services/director-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/scheduler/task.py
Outdated
Show resolved
Hide resolved
services/director-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/scheduler/task.py
Show resolved
Hide resolved
services/director-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/scheduler/task.py
Outdated
Show resolved
Hide resolved
|
||
# - cleanup: after removing? | ||
logger.warning( | ||
"Removed %s service after error, no state saving was required.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from the code above, I see that it will run this if cas_save is False, but also if were_services_created is False, where then this message is wrong.
And this brings the following question: Are you then saving the state if the services were not created? If yes why and is it necessary? Anyway you need to change the message...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is correct for this to run if were_services_created is False or can_save is False. Both start from False values.
State is never saved if the services are not created.
I've tried to reword the message, but the meaning is the same for me, since it is what happens.
@@ -132,14 +143,23 @@ async def test_skip_observation_cycle_after_error( | |||
scheduler: DynamicSidecarsScheduler, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about creating a test per use case? just to be on the safe side
Like deleting the node, this should always work right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I've added all the possible cases. Please have a look at the tests.
services/dynamic-sidecar/src/simcore_service_dynamic_sidecar/modules/long_running_tasks.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! please have a look at the last comments.
services/director-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/scheduler/task.py
Show resolved
Hide resolved
services/director-v2/tests/unit/test_modules_dynamic_sidecar_scheduler_task.py
Show resolved
Hide resolved
@@ -59,7 +60,8 @@ async def task_create_service_containers( | |||
) | |||
shared_store.container_names = assemble_container_names(shared_store.compose_spec) | |||
|
|||
logger.debug("Validated compose-spec:\n%s", f"{shared_store.compose_spec}") | |||
# it is useful to see what compose spec was used to start the containers | |||
logger.warning("Validated compose-spec:\n%s", f"{shared_store.compose_spec}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about this one?
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please just check this last comment?
Add graylog mail alert for
Could not contact dynamic-sidecar to save service state or upload outputs
when deploying thisWhat do these changes do?
dynamic-sidecar
dynamic-sidecar
to save the state directories (usually workspace)dynamic-sidecar
to push the outputs portsdirector-v2
webserver
webserver
was passing the wrong parameter to the director-v2 in the API when closing a service. It is now possible to remove nodes once more. This was no longer possible.Related issue/s
How to test
Checklist