Add TileDataset #63

alexanderwerning · 2023-11-15T18:51:13Z

TileDataset should be more efficient than concatenating the input dataset with itself when the number of repetitions is large (in my case, in the thousands).
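A minimal sketch of the idea, assuming a tiled view only needs the input and a repetition count. `TileSketch` is a hypothetical stand-in, not the class added in this PR; it stores one reference to the input instead of one per repetition and maps indices back with a modulo.

```python
class TileSketch:
    """Hypothetical stand-in for a tiled view; not the PR's TileDataset."""

    def __init__(self, data, reps):
        self.data = data  # a single reference, regardless of reps
        self.reps = reps

    def __len__(self):
        return self.reps * len(self.data)

    def __getitem__(self, index):
        # Only non-negative indices are handled in this sketch.
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self.data[index % len(self.data)]


tiled = TileSketch([1, 2, 3], reps=1000)
first_six = [tiled[i] for i in range(6)]
```

The memory cost of the view stays constant in `reps`, whereas a concatenation keeps one dataset object per repetition.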

boeddeker · 2023-11-15T18:58:31Z

lazy_dataset/core.py

+                iterable = self.input_dataset.__iter__(with_key=True)
+            else:
+                iterable = self.input_dataset
+            for example in iterable:


Could you use yield from?
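For illustration, the two forms below are equivalent for plain iteration. Both functions are hypothetical and only assume that the input is iterable; they are not the code from the diff.

```python
def tiled_iter_loop(dataset, reps):
    """Explicit inner loop, as in the hunk above."""
    for _ in range(reps):
        for example in dataset:
            yield example


def tiled_iter_yield_from(dataset, reps):
    """Same behavior, delegating to the input with `yield from`."""
    for _ in range(reps):
        yield from dataset


loop_result = list(tiled_iter_loop([1, 2, 3], 2))
yield_from_result = list(tiled_iter_yield_from([1, 2, 3], 2))
```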

boeddeker · 2023-11-15T19:05:54Z

lazy_dataset/core.py

+                item = item + len(self)
+                if item < 0:
+                    raise IndexError(_item)
+            if item > self.repetitions * len(self.input_dataset):


Use len(self)?
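A sketch of the bounds check written against a single length value, which is what `len(self)` would supply inside the class. `normalize_index` is a hypothetical helper, and the sketch uses `>=` for the upper bound because the last valid index is `length - 1`.

```python
def normalize_index(item, length):
    """Map a possibly negative index to the range [0, length)."""
    _item = item  # keep the original value for the error message
    if item < 0:
        item = item + length
        if item < 0:
            raise IndexError(_item)
    if item >= length:
        raise IndexError(_item)
    return item


last = normalize_index(-1, 6)
```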

boeddeker · 2023-11-15T19:14:33Z

Does the following code work?

import lazy_dataset
ds = lazy_dataset.new([1, 2, 3])
ds = ds.shuffle(reshuffle=True)
ds = ds.tile(4).catch()
list(ds)

We have to handle the combination of a non-ordered dataset (e.g. reshuffle), tile, copy(freeze) and indexing.
We use copy(freeze) and indexing too often (e.g. in prefetch) to introduce a breaking change.
The result should be the same as for a non-ordered dataset (e.g. reshuffle) with tile and iter.

I see two solutions:

1. Use TileDataset only when the input is ordered.
2. Convert a TileDataset to the old type (multiple datasets with concat).
   - In this case, we should modify the repr/str so that the user recognizes this.
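One way to read the concern, as a pure-Python sketch that does not use lazy_dataset: `ReshuffleSketch`, `freeze`, `tile_by_iteration` and `tile_by_indexing` are all hypothetical names. Tiling by iteration draws a fresh order for every repetition, while indexing into a single frozen order with a modulo repeats that one order.

```python
import random


class ReshuffleSketch:
    """Hypothetical non-ordered dataset: a new order on every iteration."""

    def __init__(self, data, seed=0):
        self.data = list(data)
        self.rng = random.Random(seed)

    def __iter__(self):
        order = self.data[:]
        self.rng.shuffle(order)
        return iter(order)

    def freeze(self):
        # Draw one order and keep it, so that indexing becomes possible.
        return list(self)


def tile_by_iteration(dataset, reps):
    # Each repetition re-iterates the input and may see a different order.
    for _ in range(reps):
        yield from dataset


def tile_by_indexing(frozen, reps):
    # Every repetition reuses the same frozen order.
    n = len(frozen)
    return [frozen[i % n] for i in range(reps * n)]


ds = ReshuffleSketch(range(5))
iterated = list(tile_by_iteration(ds, 4))
indexed = tile_by_indexing(ds.freeze(), 4)
```

In the indexed variant all four blocks of five examples are identical, which is the behavioral difference the two proposed solutions are meant to avoid.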

boeddeker · 2023-11-15T19:31:28Z

lazy_dataset/core.py

+        """
+        if isinstance(item, str):
+            return self.input_dataset[item]
+        elif isinstance(item, numbers.Integral):


I am not sure whether this code will affect performance.

How about changing the code to the following?

input_len = len(self.input_dataset)
if not (-self.repetitions <= item // input_len < self.repetitions):
    raise IndexError(_item)
return self.input_dataset[item % input_len]
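The suggestion above can be checked in isolation. This is a sketch under the assumption that the input supports `len` and integer indexing and is not empty; `tiled_getitem` is a hypothetical free function, not a method from the PR.

```python
def tiled_getitem(data, reps, item):
    """Index into `data` as if it were repeated `reps` times."""
    input_len = len(data)  # assumed to be non-zero in this sketch
    # Floor division gives the repetition an index falls into,
    # for positive and negative indices alike.
    if not (-reps <= item // input_len < reps):
        raise IndexError(item)
    # Python's modulo is non-negative for a positive divisor,
    # so negative indices wrap around correctly.
    return data[item % input_len]


data = [10, 20, 30]
values = [tiled_getitem(data, 2, i) for i in (0, 5, -1, -6)]
```

A single range test replaces the separate negative-index and upper-bound branches.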

boeddeker · 2023-11-15T19:34:32Z

lazy_dataset/core.py

+
+    """
+
+    def __init__(self, input_dataset, repetitions):


Could you rename repetitions to reps?
I prefer to have names close to numpy. (We don't want to do the same as pytorch, where you have to learn new argument names because they differ from numpy.)

boeddeker · 2023-11-15T19:35:42Z

lazy_dataset/core.py

+            datasets = [ds.shuffle() for ds in datasets]
+            return self.__class__.concatenate(*datasets)
+        else:
+            return TileDataset(self, reps)


To minimize overhead: Could you return self when reps is one? concatenate does this already.

Add TileDataset

7dd1bb0
