Add nodata example to collections #129

Merged
merged 3 commits into from
Aug 19, 2019

Conversation

manycoding
Contributor

For me it seems the most useful example. In particular, if we want to parallelize collections download, we can first fetch all keys and then split them into batches.
I would also like to add `nodata` to both items and collections `iter()`, but I don't know how to do it.
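
As a rough illustration of the "fetch all keys, then split into batches" idea, here is a minimal sketch. It assumes the keys-only iteration yields dicts shaped like `{'_key': ...}` (as in the example output later in this thread); `batched_keys` and `fake_records` are hypothetical names used only for illustration, and a generator stands in for the real store call.

```python
from itertools import islice


def batched_keys(records, batch_size):
    """Yield lists of at most batch_size key strings from an iterable of {'_key': ...} dicts."""
    iterator = iter(records)
    while True:
        # islice pulls the next batch_size records without materializing the rest
        batch = [record["_key"] for record in islice(iterator, batch_size)]
        if not batch:
            return
        yield batch


# Stand-in for a keys-only iteration (assumption); a real call would look like
# foo_store.iter(nodata=True, meta=["_key"]) from this PR.
fake_records = ({"_key": f"key{i}"} for i in range(5))
batches = list(batched_keys(fake_records, 2))
print(batches)  # [['key0', 'key1'], ['key2', 'key3'], ['key4']]
```

Each batch could then be handed to a separate worker; how the workers fetch the items for their keys is left out here because it depends on the client API.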
@manycoding manycoding requested a review from vshlapakov August 19, 2019 11:05
@manycoding
Contributor Author

Closes #124


- get 1000th item key::

>>> keys = foo_store.list(nodata=True, meta=["_key"])
Member

For this, wouldn’t it be slightly better to use `.iter()` with `itertools.islice` (https://docs.python.org/3/library/itertools.html#itertools.islice), to avoid building a big list?


>>> import itertools
>>> keys = foo_store.iter(nodata=True, meta=["_key"])
>>> next(itertools.islice(keys, 1000, 1001))
Contributor

If your list has only 1k elements (and it's important because iter & list handlers return up to 1k elements max by default), this won't work:

>>> next(itertools.islice([1,2,3], 3, 4))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
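
One way to sidestep the `StopIteration` shown above is to give `next()` a default value. The sketch below follows the `nth` recipe from the `itertools` documentation; it is a generic illustration on plain Python iterables, not part of the client library.

```python
from itertools import islice


def nth(iterable, n, default=None):
    """Return the item at zero-based index n, or default if the iterable is shorter."""
    return next(islice(iterable, n, None), default)


short = [1, 2, 3]
print(nth(short, 3))  # None, because index 3 is past the end
print(nth(short, 2))  # 3
```

Note that index `n` is zero-based, so `nth(keys, 1000)` returns the 1001st element.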

Also, getting the Nth item's key is simpler (and more lightweight) by passing the index param or getting the item by its index, isn't it? In this case, you don't need to go through the elements at all. There might be additional filters, so it probably won't work as you want in all cases, but as a general approach for an abstract scenario (i.e. as an example in the docs) that's way better.

>>> job.items.list(index=1000)
{'field': 'data', ...}

>>> job.items.get(1000)
{'field': 'data', ...}

I'd suggest picking another example to demonstrate the nodata feature, something like:

- iterate over keys only (doesn't fetch items data)

    >>> for elem in foo_store.iter(nodata=True, meta=["_key"]):
    ...     print(elem)
    {'_key': '002d050ee3ff6192dcbecc4e4b4457d7'}

Contributor Author

@manycoding manycoding Aug 19, 2019

I thought it's obvious that we know our dataset size is bigger than 1000, isn't it?

> (and it's important because iter & list handlers return up to 1k elements max by default)

It seems they return all elements by default, and islice is the fastest way: https://jupyter.scrapinghub.com/user/v/lab/tree/shared/Experiments/Arche/collections_nodata.ipynb

But I made the example simpler.

Contributor

> I thought it's obvious that we know our dataset size is bigger than 1000, isn't it?

Maybe, but users very often skim through the examples quickly, and their assumptions can be totally different 🙂 Thanks!

@vshlapakov vshlapakov merged commit e4db20a into master Aug 19, 2019
@vshlapakov vshlapakov deleted the nodata branch August 19, 2019 17:28
@vshlapakov
Contributor

Thanks @manycoding 👍

3 participants