Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨🐛 Follow-up removing long running http requests from dy-sidecar (⚠️ devops) #3232

Merged
merged 115 commits into from
Aug 8, 2022

Conversation

GitHK
Copy link
Contributor

@GitHK GitHK commented Aug 3, 2022

⚠️ devops

Add graylog mail alert for Could not contact dynamic-sidecar to save service state or upload outputs when deploying this

What do these changes do?

dynamic-sidecar

director-v2

  • all failing dynamic-sidecars will now be removed from the system (only the ones which had issues while saving data will still be kept and require manual intervention)

webserver

  • fixed an issue where the webserver was passing the wrong parameter to the director-v2 in the API when closing a service. It is now possible to remove nodes once more. This was no longer possible.

Related issue/s

How to test

Checklist

  • Unit tests for the changes exist
  • Runs in the swarm

@GitHK GitHK marked this pull request as ready for review August 5, 2022 08:15
@GitHK GitHK requested review from mrnicegyu11 and odeimaiz August 5, 2022 08:15
Copy link
Member

@mrnicegyu11 mrnicegyu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe consider adressing my requests, especially the process that is started explicitly without a termination timeout.

thanks for this

)
if failed_while_saving_state_and_outputs:
# Since user data is important and must be saved, take no further
# action and wait for manual intervention from support.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please log a clear line here that can be used to create a graylog alert, like so:

logger.log("Dynamic sidecar SIDECARNAME of node uuid NODEUUID failed to save its data and needs manual intervention. This is a #SIMCORE_GRAYLOGALERT"

We could ake a mail alert for "#SIMCORE_GRAYLOGALERT"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GitHK see the proposal I did for the same issue below. the graylog alert is a very good idea. @mrnicegyu11 can you define one, or that would be a ENV variable ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would advise against this one. Since this will be triggered every 5 seconds. You can pickup errors with sidecars if you try to search for OEC: in the logs. If you want that precise error. Search for
Could not contact dynamic-sidecar to save service state or upload outputs
Or maybe would you want me to replace that message with the above suggestion?

result = await _write_file_and_spawn_process(
compose_spec_yaml,
command=f'docker-compose --project-name {settings.DYNAMIC_SIDECAR_COMPOSE_NAMESPACE} --file "{{file_path}}" start',
process_termination_timeout=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am quite sure we would want a termination timeout here, maybe of one hour or something high.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure how to deal with these. I will have a PR following this one with a review of all the timeouts.

@@ -59,7 +60,8 @@ async def task_create_service_containers(
)
shared_store.container_names = assemble_container_names(shared_store.compose_spec)

logger.debug("Validated compose-spec:\n%s", f"{shared_store.compose_spec}")
# it is useful to see what compose spec was used to start the containers
logger.warning("Validated compose-spec:\n%s", f"{shared_store.compose_spec}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an INFO, and thus the default loglevel should be info and this message should maybe be info

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is "info" but something a support person would like to see. This is only displayed to you if you log it at level warning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree with @mrnicegyu11 here. let's not write the code for how the log level is set in a specific deployment. it makes no sense to make it a warning. If you want to see info messages in the logs just set the log level to info. you can probably set something like DY_SIDECAR_LOG_LEVEL=info right @GitHK ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add "Info:" to the string of the log thx

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not what I want. I want to have a warning level and also see this message. If I set the log level to DY_SIDECAR_LOG_LEVEL=INFO it gets too spammy.

Copy link
Member

@sanderegg sanderegg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking very good, please have a look at my comments and we can discuss if something is unclear


# - cleanup: after removing?
logger.warning(
"Removed %s service after error, no state saving was required.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the code above, I see that it will run this if cas_save is False, but also if were_services_created is False, where then this message is wrong.
And this brings the following question: Are you then saving the state if the services were not created? If yes why and is it necessary? Anyway you need to change the message...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is correct for this to run if were_services_created is False or can_save is False. Both start from False values.
State is never saved if the services are not created.
I've tried to reword the message, but the meaning is the same for me, since it is what happens.

@@ -132,14 +143,23 @@ async def test_skip_observation_cycle_after_error(
scheduler: DynamicSidecarsScheduler,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about creating a test per use case? just to be on the safe side
Like deleting the node, this should always work right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I've added all the possible cases. Please have a look at the tests.

@GitHK GitHK requested review from sanderegg and mrnicegyu11 August 5, 2022 12:21
@mrnicegyu11 mrnicegyu11 changed the title ✨🐛 Follow-up removing long running http requests from dy-sidecar ✨🐛 Follow-up removing long running http requests from dy-sidecar (DEVOPS!) Aug 5, 2022
@mrnicegyu11 mrnicegyu11 changed the title ✨🐛 Follow-up removing long running http requests from dy-sidecar (DEVOPS!) ✨🐛 Follow-up removing long running http requests from dy-sidecar (🚧 DEVOPS!) Aug 5, 2022
Copy link
Member

@sanderegg sanderegg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! please have a look at the last comments.

@@ -59,7 +60,8 @@ async def task_create_service_containers(
)
shared_store.container_names = assemble_container_names(shared_store.compose_spec)

logger.debug("Validated compose-spec:\n%s", f"{shared_store.compose_spec}")
# it is useful to see what compose spec was used to start the containers
logger.warning("Validated compose-spec:\n%s", f"{shared_store.compose_spec}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about this one?

@GitHK GitHK changed the title ✨🐛 Follow-up removing long running http requests from dy-sidecar (🚧 DEVOPS!) ✨🐛 Follow-up removing long running http requests from dy-sidecar (⚠️ devops) Aug 5, 2022
@sonarqubecloud
Copy link

sonarqubecloud bot commented Aug 5, 2022

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.3% 0.3% Duplication

Copy link
Member

@sanderegg sanderegg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please just check this last comment?

@sanderegg sanderegg merged commit 4a0057e into ITISFoundation:master Aug 8, 2022
@GitHK GitHK deleted the pr-osparc-followup-long-running branch August 8, 2022 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a:director-v2 issue related with the director-v2 service a:dynamic-sidecar dynamic-sidecar service
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dy-sidecar issue: sometimes containers are created but not started
3 participants