Add nodata example to collections #129
Conversation
For me, this seems like the most useful example. In particular, if we want to parallelize a collection download, we can first fetch all the keys and then split them into batches (see the sketch below). I would also like to add `nodata` to both the items and collections `iter()`, but I don't know how to do it.
Closes #124
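A minimal sketch of that key-batching idea, assuming a `foo_store` collection handle as in the docs and that `iter()` accepts a `key` filter; the worker function, pool size, and batch size are illustrative, not part of the client API:

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 1000  # illustrative; tune to the workload

def fetch_batch(batch_keys):
    # Hypothetical worker: fetch full records for one batch of keys,
    # assuming the collection supports a `key` filter on iteration.
    return list(foo_store.iter(key=batch_keys))

# 1. Fetch keys only; nodata=True skips downloading the item bodies.
keys = [item["_key"] for item in foo_store.iter(nodata=True, meta=["_key"])]

# 2. Split the flat key list into fixed-size batches.
batches = [keys[i:i + BATCH_SIZE] for i in range(0, len(keys), BATCH_SIZE)]

# 3. Download the batches in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_batch, batches))
```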
scrapinghub/client/collections.py (Outdated)

- get 1000th item key::

    >>> keys = foo_store.list(nodata=True, meta=["_key"])
For this, wouldn't it be slightly better to use `.iter()` with https://docs.python.org/3/library/itertools.html#itertools.islice, to avoid building a big list?
scrapinghub/client/collections.py (Outdated)

    >>> import itertools
    >>> keys = foo_store.iter(nodata=True, meta=["_key"])
    >>> next(itertools.islice(keys, 1000, 1001))
If your list has only 1k elements (and it's important because iter & list handlers return up to 1k elements max by default), this won't work:
>>> next(itertools.islice([1,2,3], 3, 4))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
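(As an aside, a defensive variant sketched here rather than taken from the diff above: passing a default to `next()` makes running past the end return `None` instead of raising.)

```python
>>> import itertools
>>> # A default makes next() return it instead of raising StopIteration.
>>> next(itertools.islice([1, 2, 3], 3, 4), None) is None
True
```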
Also, getting the Nth item's key is simpler (and more lightweight) by providing the index param, or by getting the item by its index, isn't it? In that case, you don't need to go through the elements at all. There might be additional filters, so it probably won't work as you want in all cases, but as a general approach for an abstract scenario (i.e. as an example in the docs) that's way better.
>>> job.items.list(index=1000)
{'field': 'data', ...}
>>> job.items.get(1000)
{'field': 'data', ...}
I'd suggest picking another example to demonstrate the nodata feature, something like:
- iterate over keys only (doesn't fetch items data)
>>> for elem in foo_store.iter(nodata=True, meta=["_key"]):
...     print(elem)
{'_key': '002d050ee3ff6192dcbecc4e4b4457d7'}
I thought it was obvious that we know our dataset size is bigger than 1000, isn't it?

> (and it's important because iter & list handlers return up to 1k elements max by default)

It seems they return all elements by default, and islice is the fastest way: https://jupyter.scrapinghub.com/user/v/lab/tree/shared/Experiments/Arche/collections_nodata.ipynb

But I made the example simpler.
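A rough, self-contained sketch of that comparison, with a local generator standing in for `foo_store.iter(nodata=True, meta=["_key"])`; the stream size and repeat count are illustrative:

```python
import itertools
import timeit

def fake_iter(n=100_000):
    # Stand-in for foo_store.iter(nodata=True, meta=["_key"]).
    return ({"_key": format(i, "032x")} for i in range(n))

# Materialize the whole stream, then index the 1000th element.
t_list = timeit.timeit(lambda: list(fake_iter())[1000], number=20)

# Stop consuming the stream right after the 1000th element.
t_islice = timeit.timeit(
    lambda: next(itertools.islice(fake_iter(), 1000, 1001)), number=20
)

print(f"full list: {t_list:.3f}s  islice: {t_islice:.3f}s")
```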
> I thought it was obvious that we know our dataset size is bigger than 1000, isn't it?

Maybe, but it's very common for users to glance through the examples quickly, and their assumptions can be totally different 🙂 Thanks!
Thanks @manycoding 👍