Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to not delete stopped agents #27

Open
cj-matti-niemenmaa opened this issue Jan 24, 2019 · 9 comments
Open

Option to not delete stopped agents #27

cj-matti-niemenmaa opened this issue Jan 24, 2019 · 9 comments
Assignees

Comments

@cj-matti-niemenmaa
Copy link

Setting the idle time after which instances are stopped seems to result in them getting not only stopped, but also deleted. This is unfortunate when you'd like to cache things on the instance's disk, such as large git repositories or intermediate build artifacts.

It would be great to have an option to avoid this deletion, and keep the instances alive but stopped.

Would removing the following code be a safe way of achieving this, preventing the plugin from ever terminating instances? Or is it unable to e.g. restart the stopped instances?

if ("TERMINATED" == it.status) {
GlobalScope.launch(image.coroutineContext) {
try {
LOG.info("Removing terminated instance $name")
deleteVm(GoogleCloudInstance(image, it.name, zone))
} catch (e: Exception) {
LOG.infoAndDebugDetails("Failed to remove instance $name", e)
}
}
}

@dtretyakov
Copy link
Contributor

@matti-cujo, originally this piece of code was intended to remove preempted instances, so to resolve your use case we need to check whether cloud image has enabled "preemptible instances" and remove terminated instances only if it's true.

@cj-matti-niemenmaa
Copy link
Author

Apparently that's not the only place that causes terminated instances to be deleted, there's also this:

if (myImageDetails.behaviour.isDeleteAfterStop) {
LOG.info("Removing virtual machine ${instance.name} due to cloud image settings")
myApiConnector.deleteVm(instance)
} else {
LOG.info("Stopping virtual machine ${instance.name}")
myApiConnector.stopVm(instance)
}

Where the isDeleteAfterStop comes from CloneBehaviour:

public enum CloneBehaviour {
START_STOP(false, true),
FRESH_CLONE(true, false),
ON_DEMAND_CLONE(false, false);
private final boolean myDeleteAfterStop;
private final boolean myUseOriginal;
CloneBehaviour(final boolean deleteAfterStop, final boolean useOriginal) {
myDeleteAfterStop = deleteAfterStop;
myUseOriginal = useOriginal;
}
public boolean isDeleteAfterStop() {
return myDeleteAfterStop;
}

And it turns out that the FRESH_CLONE behaviour is used unconditionally:

override fun getBehaviour(): CloneBehaviour {
return CloneBehaviour.FRESH_CLONE
}

Meaning that isDeleteAfterStop is always true. In fact, START_STOP and ON_DEMAND_CLONE are completely unused in the code. (And as a side note, isUseOriginal is also unused, so START_STOP and ON_DEMAND_CLONE would be equivalent.)

@cj-matti-niemenmaa
Copy link
Author

It turns out that START_STOP doesn't entirely work: the instances are never actually stopped! Apparently the stop request fails with HTTP error 411:

28-Jan-2019 07:10:57.109 SEVERE [Gax-39] com.google.common.util.concurrent.AbstractFuture.executeListener RuntimeException while executing runnable com.google.common.util.concurrent.Futures$4@3188b1c6 with executor MoreExecutors.directExecutor()
 java.lang.IllegalArgumentException: Unrecognized http status code: 411
        at com.google.api.gax.httpjson.HttpJsonStatusCode.httpStatusToStatusCode(HttpJsonStatusCode.java:102)
        at com.google.api.gax.httpjson.HttpJsonExceptionCallable$ExceptionTransformingFuture.onFailure(HttpJsonExceptionCallable.java:99)
        at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1123)
        at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:435)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:900)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:811)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:675)
        at com.google.api.core.AbstractApiFuture$InternalSettableFuture.setException(AbstractApiFuture.java:95)
        at com.google.api.core.AbstractApiFuture.setException(AbstractApiFuture.java:77)
        at com.google.api.core.SettableApiFuture.setException(SettableApiFuture.java:52)
        at com.google.api.gax.httpjson.HttpRequestRunnable.run(HttpRequestRunnable.java:145)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Which suggests that the Content-Length header isn't getting set on the underlying HTTP request, and the server not liking that. Is there a way of doing something about that at this level — isn't this the Google SDK's job?

@cj-matti-niemenmaa
Copy link
Author

I tried updating the Google SDK to 0.79.0-alpha and the issue persisted, and I also complained over at googleapis/google-cloud-java#3735, which describes a very similar problem though with a different API.

For the record, updating the SDK also required the following patch before it worked (or seemed to work) correctly:

diff --git google-cloud-server/src/main/kotlin/jetbrains/buildServer/clouds/google/connector/GoogleApiConnectorImpl.kt google-cloud-server/src/main/kotlin/jetbrains/buildServer/clouds/google/connector/GoogleApiConnectorImpl.kt
index 5235e58..20e5537 100644
--- a/google-cloud-server/src/main/kotlin/jetbrains/buildServer/clouds/google/connector/GoogleApiConnectorImpl.kt
+++ b/google-cloud-server/src/main/kotlin/jetbrains/buildServer/clouds/google/connector/GoogleApiConnectorImpl.kt
@@ -461,7 +461,7 @@ class GoogleApiConnectorImpl : GoogleApiConnector {
     }
 
     private fun <T : ClientSettings<T>> getResourceId(value: String, settings: T): String {
-        return "projects/${value.removePrefix(settings.endpoint)}"
+        return value.removePrefix(settings.endpoint)
     }
 
     companion object {

@dtretyakov
Copy link
Contributor

@matti-cujo, it seems that the problem with resource identifiers was caused by resolution of googleapis/google-cloud-java#3604 where has changed the way to parse resource identifiers withing Google Compute SDK library. Will try upgrading the plugin to latest version and resolve found problems.

@dtretyakov
Copy link
Contributor

@matti-cujo, regarding this feature request - the same option is currently available in TeamCity Azure cloud agents plugins, it's called "Reuse allocated virtual machines": https://github.com/JetBrains/teamcity-azure-agent/wiki/Managed-Image

It removes stopped instances only when source image has changed and instead of removal just stops them.

@wuservices
Copy link

Along the lines of caching, being able to reuse stopped instances would also really help for things like Maven or Ivy / SBT cache. When booting from a clean image, these dependencies take quite a while to download. I can snapshot an image from a pre-warmed cache, but that doesn't work too well because dependencies will change over time or we start building different projects, and it's a pain to keep creating new images just to refresh the cache.

@brodieve
Copy link

@dtretyakov any chance the new new api library resolves this?

@darkn3rd
Copy link

For the caching the libraries, that's a good idea, but it could an anti-pattern in this scenario. For testing, you really want a clean environment, as results may be invalid. This goes back to the controlled experiment concept. Maybe docker is a good solution, with layered images, and an image built with best packages, such as copying the manifest (pom.xml) on a separate layer, so that package downloads only happen when the manifest changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants