Debugging memory issues in IMERG #227
This optimization only works when `nitems_per_file == 1`.

@sharkinsspatial, to work around the issues you were seeing, something like the following should help:

```python
import numpy as np
import zarr

from pangeo_forge_recipes.recipes.xarray_zarr import _gather_coordinate_dimensions


def consolidate_coordinate_dimensions(config):
    target_mapper = config.target.get_mapper()
    group = zarr.open(target_mapper, mode="a")
    # https://github.com/pangeo-forge/pangeo-forge-recipes/issues/214
    # Intersect the dims from the array metadata with the Zarr group
    # to handle coordinateless dimensions.
    dims = set(_gather_coordinate_dimensions(group)) & set(group)
    combine_dims = {x.name: x for x in config.file_pattern.combine_dims}
    for dim in dims:
        arr = group[dim]
        chunks = arr.shape  # consolidate to a single chunk spanning the whole dimension
        dtype = arr.dtype
        compressor = arr.compressor
        fill_value = arr.fill_value
        order = arr.order
        filters = arr.filters
        # Performance optimization: get the values from combine_dims if possible.
        # This avoids many small reads from the non-consolidated coordinate dimension.
        # https://github.com/pangeo-forge/pangeo-forge-recipes/issues/227
        if dim in combine_dims and combine_dims[dim].nitems_per_file == 1:
            data = np.asarray(combine_dims[dim].keys, dtype=dtype)
        else:
            data = arr[:]  # this triggers reads from the target
        attrs = dict(arr.attrs)
        new = group.array(
            dim,
            data,
            chunks=chunks,
            dtype=dtype,
            compressor=compressor,
            fill_value=fill_value,
            order=order,
            filters=filters,
            overwrite=True,
        )
        new.attrs.update(attrs)
```

That should run pretty quickly. (Make sure to still run ….)
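For concreteness, a hypothetical way to invoke that snippet once the recipe has finished storing its chunks; the `recipe` name is an assumption, and the metadata re-consolidation is just a reminder, not part of the snippet above:

```python
# Hypothetical usage sketch: `recipe` is assumed to be the XarrayZarrRecipe
# whose target has already been populated by store_chunk.
import zarr

consolidate_coordinate_dimensions(recipe)

# The coordinate arrays changed chunking, so re-consolidate the zarr metadata.
zarr.consolidate_metadata(recipe.target.get_mapper())
```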
Hmmm, I've had a thought: why does the memory usage level off before we're done with all the reads / writes? Running a smaller version of the earlier profile, we see that memory use peaks at 20s, while the job finishes at 30s. What if we don't really have a memory leak? What if we're trying to run all 360,000 writes at once?

It's just a guess, but if we limit the maximum concurrency, we might see a different chart. For this run, we have an `asyncio.Semaphore` limiting the number of concurrent uploads to 100:

```python
import os
import time
import asyncio

import azure.storage.blob.aio

N = 10_000
data = b"\x02\x013\x08\x08\x00\x00\x00\x08\x00\x00\x00\x18\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"  # 0

# Observation: we don't see a memory leak when just creating the bc;
# we actually have to put data.


async def put(acc, i, sem):
    async with sem:
        async with acc.get_blob_client(f"async/{i}") as bc:
            await bc.upload_blob(data=data, overwrite=True, metadata={"is_directory": "false"})


async def main():
    conn_str = os.environ["AZURE_CONNECTION_STRING"]
    async with azure.storage.blob.aio.ContainerClient.from_connection_string(conn_str, "gpm-imerg-debug") as acc:
        sem = asyncio.Semaphore(100)
        coros = (put(acc, i, sem) for i in range(N))
        t0 = time.time()
        await asyncio.gather(*coros)
        t1 = time.time()
        print(f"processed {N} blobs in {t1 - t0:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```

Interestingly, we aren't any slower with the limited-concurrency version. I think that's because we were CPU bound anyway (with event loop / Python overhead). So the next steps might be to look into fsspec backends to see if there's a way to limit concurrency, and design a way to set that from the recipe. I'll look into adlfs.
cc @martindurant, in case you see an obvious place where concurrency could be limited (context at #227 (comment), but the tl;dr is that it'd be nice to have a way to limit the concurrency of the async fsspec methods).

A possible fix is in …, but that feels very ad hoc and too narrow. Why just limit concurrency in …? I don't immediately see a way to do this maximum-concurrency thing generically in fsspec, unless it were to define some protocol. So I think for now it'll have to be implemented independently in each of adlfs / s3fs / gcsfs.
Drat. I know there are some examples that break without it, but I guess not the test suite. I think something from the recipe tutorials. This is a good motivation to perhaps run the tutorials as part of the test suite.
xref #173
This adds a new keyword to AzureBlobFileSystem to limit the number of concurrent connections. See pangeo-forge/pangeo-forge-recipes#227 (comment) for some motivation. In that situation, we had a single FileSystem instance that was generating many concurrent write requests through `.pipe`. So many, in fact, that we were seeing memory issues from creating all the BlobClient connections simultaneously. This adds an asyncio.Semaphore instance to the AzureBlobFileSystem that controls the number of concurrent BlobClient connections. The default of None is backwards-compatible (no limit).
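As a rough illustration of the pattern that change describes (this is not the actual adlfs code; the class and argument names below are made up):

```python
import asyncio


class ThrottledPuts:
    """Toy sketch: gate an async put coroutine behind an optional semaphore."""

    def __init__(self, put_coro, max_concurrency=None):
        self._put = put_coro
        # None means "no limit", mirroring the backwards-compatible default.
        self._sem = asyncio.Semaphore(max_concurrency) if max_concurrency else None

    async def put(self, key, data):
        if self._sem is None:
            return await self._put(key, data)
        async with self._sem:  # at most max_concurrency puts in flight at once
            return await self._put(key, data)
```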
I agree that we should pursue an upstream fix. But there are also a few other possibilities.

**Make time coordinate contiguous from the beginning.** What if we explicitly specify that the …? Pangeo Forge should be able to handle this situation using locks. There will be some write contention between processes, which could be managed via our locking mechanism. This is not possible with the current code but could be implemented fairly easily.

**Fix handling of encoding to avoid the big write in ….**
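A minimal sketch of the first idea, assuming an xarray-based template write to the target (the dataset, path, and sizes below are illustrative, not the recipe's actual code):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for the full-size IMERG target template.
times = pd.date_range("2000-06-01", periods=1_000, freq="30min")
ds = xr.Dataset(
    {"precip": (("time",), np.zeros(len(times)))},
    coords={"time": times},
).chunk({"time": 1})  # data variables chunked per time step, like the recipe

# Writing `time` as a single contiguous chunk means the coordinate never has to
# be consolidated from hundreds of thousands of tiny objects later.
ds.to_zarr(
    "imerg-template.zarr",  # illustrative local path
    compute=False,          # dask-backed data variables are not written yet
    encoding={"time": {"chunks": (len(times),)}},
)
```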
The snippet

```python
# now explicitly write the sequence coordinate to avoid missing data
# when reopening
if concat_dim in zgroup:
    zgroup[concat_dim][:] = 0
```

was added to avoid CF time decoding errors related to decoding an uninitialized time dimension, which can crop up when opening the target with xarray. However, we have not been opening the target with xarray in `store_chunk` since b603c98. Therefore, perhaps those lines can indeed be removed. I'm going to look into this.
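For reference, the failure mode those lines guard against looks roughly like this (a sketch, assuming `target_store` points at a freshly initialized target whose `time` values are still uninitialized fill values):

```python
import xarray as xr

target_store = "target.zarr"  # assumed path/mapper for the partially written target

try:
    # decode_times=True is the default; CF-decoding "<units> since <epoch>"
    # against uninitialized fill integers can raise here.
    ds = xr.open_zarr(target_store)
except (ValueError, OverflowError) as exc:
    print(f"time decoding failed: {exc}")
    # Workaround when inspecting such a target:
    ds = xr.open_zarr(target_store, decode_times=False)
```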
I've been assuming that the locking would essentially kill parallelism while storing data chunks. Is that not the case? Ah, (maybe) it's OK, because writing a variable like …
If the locks are fine-grained enough (e.g. only on the …)
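A minimal sketch of what fine-grained locking could look like, using dask.distributed.Lock purely for illustration (this is not pangeo-forge-recipes' actual locking utility, and it assumes a running distributed Client):

```python
from dask.distributed import Lock


def write_time_slice(group, start, stop, values):
    # Only writes to the shared `time` coordinate contend on this lock;
    # the much larger data-variable writes proceed with no locking at all.
    with Lock("time-coordinate"):
        group["time"][start:stop] = values
```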
Great. I'll look into doing the coordinate consolidation earlier (in …).
You'll need to revisit this line:

pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py, lines 231 to 234 in 78a274b

We currently are assuming that all variables with … For this to work, you may need to add an argument to …
> async fsspec methods

Perhaps there could also be a global config option to set …
Note that you set …
Closes pangeo-forge#227 (maybe, might be worth testing)
@martindurant Is the expanded use of …
Exactly that.
Dropping a few notes here for the day:
@sharkinsspatial and I are seeing some slowness and high memory usage on the client when running the imerg recipe. This has lots of steps in the time dimension.
I've traced something that looks like a memory leak to `azure.storage.blob`. This script just puts a bunch of 24-byte objects: … I don't see an easy workaround (I also don't really understand where the memory leak might be, and hopefully won't need to find out).
So making a bunch of small writes isn't great for performance (and, it turns out, memory on Azure). Can we avoid the small writes in the first place?
One of the batches of small writes happens at

pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py, lines 109 to 112 in 78a274b

That writes `0` for each chunk in the concat dim (`time`, typically). In the case of IMERG, that's ~360,000 writes. @rabernat mentioned this was to avoid an issue with `decode_times`. For better or worse, the test suite passes with those lines removed...

The second place we interact with a bunch of small objects is when consolidating the dimension coordinates at

pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py, lines 564 to 578 in 78a274b

When possible, we could instead take the values from `recipe.file_pattern.combine_dims[0].keys`, which is the DatetimeIndex we provided to the FilePattern. Then we just have one "big" write at the end, and no small reads.
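To make the "one big write" point concrete, here is a toy zarr (v2 API) example with the sizes shrunk down; the array names and in-memory store are purely illustrative:

```python
import numpy as np
import zarr

store = zarr.MemoryStore()
group = zarr.group(store=store)

# chunks=1 mirrors the per-time-step layout: every element becomes its own object.
small = group.zeros("time_small", shape=10_000, chunks=1, dtype="i8")
small[:] = np.arange(10_000)
print(sum(k.startswith("time_small/") for k in store))  # ~10,001 keys: metadata + one object per element

# A single contiguous chunk: the same data lands in one object, one write.
big = group.zeros("time_big", shape=10_000, chunks=10_000, dtype="i8")
big[:] = np.arange(10_000)
print(sum(k.startswith("time_big/") for k in store))    # 2 keys: metadata + a single chunk
```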