Add nodata example to collections #129
Conversation
For me, this seems like the most useful example. In particular, if we want to parallelize a collection download, we can first fetch all the keys and then split them into batches (see the sketch below). I would also like to add `nodata` to both the items and collections `iter()`, but I don't know how to do it.
Closes #124
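A minimal sketch of that key-batching idea, assuming a `foo_store` collection handle as in the docs and that `iter()` accepts a `key` filter; the worker function, pool size, and batch size are illustrative, not part of the client API:

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 1000  # illustrative; tune to the workload

def fetch_batch(batch_keys):
    # Hypothetical worker: fetch full records for one batch of keys,
    # assuming the collection supports a `key` filter on iteration.
    return list(foo_store.iter(key=batch_keys))

# 1. Fetch keys only; nodata=True skips downloading the item bodies.
keys = [item["_key"] for item in foo_store.iter(nodata=True, meta=["_key"])]

# 2. Split the flat key list into fixed-size batches.
batches = [keys[i:i + BATCH_SIZE] for i in range(0, len(keys), BATCH_SIZE)]

# 3. Download the batches in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_batch, batches))
```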
scrapinghub/client/collections.py (Outdated)

- get 1000th item key::

    >>> keys = foo_store.list(nodata=True, meta=["_key"])
For this, wouldn't it be slightly better to use `.iter()` with https://docs.python.org/3/library/itertools.html#itertools.islice, to avoid building a big list?
scrapinghub/client/collections.py (Outdated)

    >>> import itertools
    >>> keys = foo_store.iter(nodata=True, meta=["_key"])
    >>> next(itertools.islice(keys, 1000, 1001))
If your list has only 1k elements (and it's important because iter & list handlers return up to 1k elements max by default), this won't work:
>>> next(itertools.islice([1,2,3], 3, 4))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
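(As an aside, a defensive variant sketched here rather than taken from the diff above: passing a default to `next()` makes running past the end return `None` instead of raising.)

```python
>>> import itertools
>>> # A default makes next() return it instead of raising StopIteration.
>>> next(itertools.islice([1, 2, 3], 3, 4), None) is None
True
```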
Also, getting the Nth item's key is simpler (and more lightweight) by providing the index param, or by getting the item by its index, isn't it? In that case, you don't need to go through the elements at all. There might be additional filters, so it probably won't work as you want in all cases, but as a general approach for an abstract scenario (i.e. as an example in the docs) that's way better.
>>> job.items.list(index=1000)
{'field': 'data', ...}
>>> job.items.get(1000)
{'field': 'data', ...}
I'd suggest picking another example to demonstrate the nodata feature, something like:
- iterate over keys only (doesn't fetch items data)
>>> for elem in foo_store.iter(nodata=True, meta=["_key"]):
...     print(elem)
{'_key': '002d050ee3ff6192dcbecc4e4b4457d7'}
I thought it was obvious that we know our dataset size is bigger than 1000, isn't it?

> (and it's important because iter & list handlers return up to 1k elements max by default)

It seems they return all elements by default, and islice is the fastest way: https://jupyter.scrapinghub.com/user/v/lab/tree/shared/Experiments/Arche/collections_nodata.ipynb

But I made the example simpler.
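A rough, self-contained sketch of that comparison, with a local generator standing in for `foo_store.iter(nodata=True, meta=["_key"])`; the stream size and repeat count are illustrative:

```python
import itertools
import timeit

def fake_iter(n=100_000):
    # Stand-in for foo_store.iter(nodata=True, meta=["_key"]).
    return ({"_key": format(i, "032x")} for i in range(n))

# Materialize the whole stream, then index the 1000th element.
t_list = timeit.timeit(lambda: list(fake_iter())[1000], number=20)

# Stop consuming the stream right after the 1000th element.
t_islice = timeit.timeit(
    lambda: next(itertools.islice(fake_iter(), 1000, 1001)), number=20
)

print(f"full list: {t_list:.3f}s  islice: {t_islice:.3f}s")
```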
> I thought it was obvious that we know our dataset size is bigger than 1000, isn't it?

Maybe, but it's very common for users to glance through the examples quickly, and their assumptions can be totally different 🙂 Thanks!
Thanks @manycoding 👍