User/tom/fix/s2 pctasks perf #291
Conversation
CI failure:
Fixed in 2bf2bb3 by bumping the version of azurite we use.
Force-pushed from 957f7b2 to 7bb3371.
The BlobStorage.file_exists method had a lot of overhead compared to azure.storage.blob. Tracked down to the ContainerClientWrapper context manager. Cache that, and don't explicitly close stuff.

```
storage = BlobStorage("pctasksteststaging", "taskio")

%timeit storage.file_exists("status/0a3e7fd0-a9ae-43ea-9d1f-fc4e8cb30ef3/copy/0/copy/status-00b39.txt")
1.4 s ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

client = azure.storage.blob.ContainerClient.from_container_url("https://pctasksteststaging.blob.core.windows.net/taskio", credential=azure.identity.DefaultAzureCredential())

%timeit client.get_blob_client("status/0a3e7fd0-a9ae-43ea-9d1f-fc4e8cb30ef3/copy/0/copy/status-00b39.txt").exists()
239 ms ± 34.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
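For illustration, a minimal sketch of that caching approach under assumed names (`CachedBlobStorage` and its methods are hypothetical, not the pctasks implementation): hold one `ContainerClient` per storage object and reuse it, rather than constructing and closing a wrapper on every call.

```python
from functools import cached_property

from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient


class CachedBlobStorage:
    """Hypothetical sketch: reuse a single ContainerClient instead of
    creating (and explicitly closing) one per operation."""

    def __init__(self, account_name: str, container_name: str) -> None:
        self.account_url = f"https://{account_name}.blob.core.windows.net"
        self.container_name = container_name

    @cached_property
    def _container_client(self) -> ContainerClient:
        # Built once on first use, then reused for every call.
        return ContainerClient(
            self.account_url,
            self.container_name,
            credential=DefaultAzureCredential(),
        )

    def file_exists(self, blob_name: str) -> bool:
        return self._container_client.get_blob_client(blob_name).exists()
```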
Force-pushed from 7bb3371 to 136ee66.
Profiling with memray showed a lot of allocations (leaks? not sure) under `process_output_if_available`. Some of this is inevitable: we are reading data from the network. But some of it seems to be overhead from creating very many ContainerClient / BlobClient objects. We'll see if this helps. Uncertain.
I think that e93c2bf fixed another (hopefully last?) memory "leak" that was causing issues in the orchestrator pod. The fix there is relatively simple: share the underlying HTTP infrastructure rather than recreating it per client. Some of those allocations are necessary (we are reading data, after all). But some of it seemed to be overhead from establishing connections and other HTTP infrastructure ("pipelines" / "transports" in the Azure SDK docs) that could maybe be shared. It might be premature, but this query against our Container Insights maybe shows that we've got it fixed?

```
let FilteredPerf = Perf
| where ObjectName == "K8SContainer" and CounterName == "memoryRssBytes"
| where InstanceName endswith "main";
FilteredPerf
| summarize min(CounterValue) by InstanceName
| join kind=inner (FilteredPerf) on InstanceName
| project InstanceName, MemoryGrowth=CounterValue - min_CounterValue, TimeGenerated
| summarize avg(MemoryGrowth) by InstanceName, bin(TimeGenerated, 1m)
| render timechart
```

Both of those were generated with
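For context, here is a minimal sketch of one way the Azure SDK lets clients share HTTP infrastructure (the helper and names below are illustrative; the PR may handle this differently, e.g. by caching whole clients): pass a single transport instance to every client so they all reuse one connection pool instead of each building their own.

```python
from azure.core.pipeline.transport import RequestsTransport
from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient

# One transport (and one credential) shared by every client we create,
# rather than each client establishing its own connections.
shared_transport = RequestsTransport()
shared_credential = DefaultAzureCredential()


def make_container_client(account: str, container: str) -> ContainerClient:
    """Hypothetical factory that reuses the shared transport."""
    return ContainerClient(
        f"https://{account}.blob.core.windows.net",
        container,
        credential=shared_credential,
        transport=shared_transport,
    )
```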
One of the sentinel-2 item creation tasks failed with an odd error:
Trying to make sense of that. Something went wrong when we tried to generate that SAS token:
I can produce a similar error with this diff, which exaggerates the duration of the SAS token by extending the expiry:

```diff
diff --git a/pctasks/core/pctasks/core/storage/blob.py b/pctasks/core/pctasks/core/storage/blob.py
index 059152e6..20c00f79 100644
--- a/pctasks/core/pctasks/core/storage/blob.py
+++ b/pctasks/core/pctasks/core/storage/blob.py
@@ -343,7 +343,7 @@ class BlobStorage(Storage):
         attached credentials) to generate a container-level SAS token.
         """
         start = Datetime.utcnow() - timedelta(hours=10)
-        expiry = Datetime.utcnow() + timedelta(hours=24 * 7)
+        expiry = Datetime.utcnow() + timedelta(hours=24 * 8)
         permission = ContainerSasPermissions(
             read=read,
             write=write,
```

That gives:

```
In [1]: from pctasks.core.storage.blob import *

In [2]: storage = BlobStorage("sentinel2l2a01", "sentinel2-l2")

In [3]: _ = storage._generate_container_sas()
---------------------------------------------------------------------------
HttpResponseError Traceback (most recent call last)
Cell In[3], line 1
----> 1 _ = storage._generate_container_sas()
File ~/src/Microsoft/planetary-computer-tasks/pctasks/core/pctasks/core/storage/blob.py:353, in BlobStorage._generate_container_sas(self, read, list, write, delete)
346 expiry = Datetime.utcnow() + timedelta(hours=24 * 8)
347 permission = ContainerSasPermissions(
348 read=read,
349 write=write,
350 delete=delete,
351 list=list,
352 )
--> 353 key = self._get_client()._account_client.get_user_delegation_key(
354 key_start_time=start, key_expiry_time=expiry
355 )
356 sas_token = generate_container_sas(
357 self.storage_account_name,
358 self.container_name,
(...)
362 expiry=expiry,
363 )
364 return sas_token
File ~/src/Microsoft/planetary-computer-tasks/.direnv/python-3.10.14/lib/python3.10/site-packages/azure/core/tracing/decorator.py:78, in distributed_trace.<locals>.decorator.<locals>.wrapper_use_tracer(*args, **kwargs)
76 span_impl_type = settings.tracing_implementation()
77 if span_impl_type is None:
---> 78 return func(*args, **kwargs)
80 # Merge span is parameter is set, but only if no explicit parent are passed
81 if merge_span and not passed_in_parent:
File ~/src/Microsoft/planetary-computer-tasks/.direnv/python-3.10.14/lib/python3.10/site-packages/azure/storage/blob/_blob_service_client.py:225, in BlobServiceClient.get_user_delegation_key(self, key_start_time, key_expiry_time, **kwargs)
221 user_delegation_key = self._client.service.get_user_delegation_key(key_info=key_info,
222 timeout=timeout,
223 **kwargs) # type: ignore
224 except HttpResponseError as error:
--> 225 process_storage_error(error)
227 return parse_to_internal_user_delegation_key(user_delegation_key)
File ~/src/Microsoft/planetary-computer-tasks/.direnv/python-3.10.14/lib/python3.10/site-packages/azure/storage/blob/_shared/response_handlers.py:184, in process_storage_error(storage_error)
181 error.args = (error.message,)
182 try:
183 # `from None` prevents us from double printing the exception (suppresses generated layer error context)
--> 184 exec("raise error from None") # pylint: disable=exec-used # nosec
185 except SyntaxError as exc:
186 raise error from exc
File <string>:1
HttpResponseError: The value for one of the XML nodes is not in the correct format.
RequestId:ca881233-e01e-0024-5317-b11c0e000000
Time:2024-05-28T15:55:00.0351181Z
ErrorCode:InvalidXmlNodeValue
xmlnodename:2024-06-05T15:54:50Z
xmlnodevalue:2024-06-05T15:54:50Z
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidXmlNodeValue</Code><Message>The value for one of the XML nodes is not in the correct format.
RequestId:ca881233-e01e-0024-5317-b11c0e000000
Time:2024-05-28T15:55:00.0351181Z</Message><XmlNodeName>2024-06-05T15:54:50Z</XmlNodeName><XmlNodeValue>2024-06-05T15:54:50Z</XmlNodeValue></Error>
```

Under the assumption of "something weird about clocks", I'm going to adjust the duration of that token so it stays within the allowed window.
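A rough sketch of that kind of adjustment (the buffer value and helper below are assumptions, not the PR's exact change): keep the user delegation key's expiry comfortably inside the service's seven-day maximum so minor clock skew between client and service can't push it over.

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.blob import (
    BlobServiceClient,
    ContainerSasPermissions,
    generate_container_sas,
)

ACCOUNT = "sentinel2l2a01"      # example values from the session above
CONTAINER = "sentinel2-l2"


def make_container_sas(read: bool = True, write: bool = False) -> str:
    service = BlobServiceClient(
        f"https://{ACCOUNT}.blob.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    now = datetime.now(timezone.utc)
    start = now - timedelta(minutes=15)
    # Stay clearly under the 7-day maximum for user delegation keys so
    # clock skew can't make the service reject the expiry time.
    expiry = now + timedelta(days=6, hours=23)
    key = service.get_user_delegation_key(
        key_start_time=start, key_expiry_time=expiry
    )
    return generate_container_sas(
        ACCOUNT,
        CONTAINER,
        user_delegation_key=key,
        permission=ContainerSasPermissions(read=read, write=write),
        start=start,
        expiry=expiry,
    )
```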
Hopefully avoids the invalid token we ran into on the last run.
That finished. Planning to merge this tonight and start the process of rolling it and #294 out to production Thursday.
Description
We've observed some issues running the sentinel-2 STAC pipeline on pctasks. The orchestrator pod had high memory and CPU usage.
This PR adjusts the workflow orchestrator to free up some memory (by releasing finished tasks) and adds a setting to limit the number of concurrent tasks.
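As an illustration of the concurrency limit (the actual pctasks setting name and orchestrator internals are not shown in this PR), a semaphore-based sketch:

```python
import asyncio

# Hypothetical cap; the real pctasks setting name may differ.
MAX_CONCURRENT_TASKS = 50


async def submit_task(task_id: str) -> None:
    """Placeholder for submitting and awaiting a single task."""
    await asyncio.sleep(0.1)


async def run_workflow(task_ids: list[str]) -> None:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_TASKS)

    async def run_one(task_id: str) -> None:
        # At most MAX_CONCURRENT_TASKS tasks are in flight at any time.
        async with semaphore:
            await submit_task(task_id)

    await asyncio.gather(*(run_one(t) for t in task_ids))


# asyncio.run(run_workflow([f"task-{i}" for i in range(200)]))
```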
Type of change
How Has This Been Tested?
Checklist: