Allow for tokenizers/preprocessors to change batch size #866

Aphoh · 2025-01-24T05:05:51Z

Small change to how the number of batches is counted. Instead of assuming that $n$ examples from the shard iterator means $n$ examples after tokenizing/processing, it takes the length from the output of the tokenizer. I had a use case that involved combining multiple samples during the processing stage, and ran into a bug where the number of batches stored in the offsets was incorrect.
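A minimal sketch of the counting difference, assuming a hypothetical processor `pack_pairs` that merges consecutive samples (neither `pack_pairs` nor `count_rows` is part of Levanter; they only illustrate why the output length has to be measured rather than assumed):

```python
from typing import Callable, Dict, Iterable, List


def pack_pairs(batch: List[str]) -> List[str]:
    """Hypothetical processor: joins consecutive pairs of samples into one row."""
    return [" ".join(batch[i:i + 2]) for i in range(0, len(batch), 2)]


def count_rows(
    batches: Iterable[List[str]],
    process: Callable[[List[str]], List[str]],
) -> Dict[str, int]:
    """Count examples read from the source and rows actually produced."""
    examples_read = 0
    rows_written = 0
    for batch in batches:
        processed = process(batch)
        examples_read += len(batch)
        # Take the length from the processor's output, not from the input batch.
        rows_written += len(processed)
    return {"examples_read": examples_read, "rows_written": rows_written}


counts = count_rows([["a", "b", "c", "d"], ["e", "f", "g"]], pack_pairs)
print(counts)
```

With a processor like this, the two counts diverge, so any offset derived from the input count would point past the end of what was actually written.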

dlwh · 2025-01-24T06:31:45Z

thanks for this. I think it's not quite right in the case of preemption. In particular, we use open_shard_at_row with this value when we resume tokenization. What we would need to do is save both "rows_in" and "rows_out" and only use rows_in for open_shard_at_row. Does that make sense?

Aphoh · 2025-01-24T06:48:32Z

@dlwh ah yeah totally! I'll try to get that worked in.
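The resumption scheme described above could be sketched roughly as follows. This is a toy model, not Levanter's implementation: `ShardProgress`, `process_shard`, and the list-based `open_shard_at_row` here are hypothetical stand-ins, and preemption is simulated with a `max_batches` cutoff. The point is only that the reader is reopened at `rows_in`, while `rows_out` tracks what landed in the cache.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class ShardProgress:
    rows_in: int = 0   # raw rows consumed from the shard
    rows_out: int = 0  # processed rows written to the cache


def open_shard_at_row(shard: List[str], row: int) -> Iterator[str]:
    """Toy stand-in for a shard reader: yields raw rows starting at index `row`."""
    return iter(shard[row:])


def pack_pairs(batch: List[str]) -> List[str]:
    """Hypothetical processor that halves the number of rows."""
    return [" ".join(batch[i:i + 2]) for i in range(0, len(batch), 2)]


def process_shard(shard: List[str], progress: ShardProgress, batch_size: int,
                  max_batches: int, out: List[str]) -> None:
    # Resume from the number of *input* rows already consumed.
    rows = list(open_shard_at_row(shard, progress.rows_in))
    for n, start in enumerate(range(0, len(rows), batch_size)):
        if n >= max_batches:  # simulate being preempted mid-shard
            break
        batch = rows[start:start + batch_size]
        processed = pack_pairs(batch)
        out.extend(processed)
        progress.rows_in += len(batch)
        progress.rows_out += len(processed)


shard = list("abcdefgh")
progress = ShardProgress()
cache: List[str] = []

process_shard(shard, progress, batch_size=4, max_batches=1, out=cache)   # preempted
process_shard(shard, progress, batch_size=4, max_batches=10, out=cache)  # resumed
print(progress, cache)
```

If the resume step used `rows_out` instead, it would reopen the shard at row 2 and reprocess rows that are already in the cache.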

dlwh

sorry, one small rename then i'm happy

dlwh · 2025-01-24T17:48:19Z

src/levanter/store/cache.py

@@ -473,7 +473,11 @@ def _monitor_metrics(self):
 class CacheLedger:
    # NB: unlike the old cache, the mere existence of a ledger doesn't mean the cache is finished
    total_num_rows: int
-    shard_rows: Dict[str, int]
+    """Number of outputted rows in the cache"""
+    shard_rows_in: Dict[str, int]


sorry, can we rename this one back to just shard_rows (leave comment) so that we don't invalidate all other caches

Aphoh · 2025-01-25

Ah gotcha yeah... Do I need to do some other logic to make shard_rows_out an optional key during deserialization?

dlwh · 2025-01-25

i'm pretty sure if you give it a default value it will be fine. I'll check it against a cache before merging
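As an illustration of the default-value idea, here is a sketch using plain `dataclasses`. It assumes nothing about how Levanter actually serializes its ledger: `LedgerSketch` and `ledger_from_dict` are hypothetical, and the field layout is only modeled on the names mentioned in this thread.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict


@dataclass
class LedgerSketch:
    total_num_rows: int
    shard_rows: Dict[str, int]
    # Assumed new field; the default lets ledgers written before the change still load.
    shard_rows_out: Dict[str, int] = field(default_factory=dict)


def ledger_from_dict(data: Dict[str, Any]) -> LedgerSketch:
    """Build a ledger from a decoded JSON dict, ignoring unknown keys."""
    known = {f.name for f in fields(LedgerSketch)}
    return LedgerSketch(**{k: v for k, v in data.items() if k in known})


# A ledger saved by older code has no "shard_rows_out" key.
old_ledger = ledger_from_dict({"total_num_rows": 10, "shard_rows": {"shard_0": 10}})
new_ledger = ledger_from_dict(
    {"total_num_rows": 5, "shard_rows": {"shard_0": 10}, "shard_rows_out": {"shard_0": 5}}
)
print(old_ledger)
print(new_ledger)
```

Keeping the existing `shard_rows` key and giving the added field a default means old caches deserialize without a migration step, which is the concern raised in the review.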

Allow for tokenizers/preprocessors to change batch size

3462813

Track in/out batches separately for correct resumption

900f50f

dlwh reviewed Jan 24, 2025


Aphoh commented Jan 24, 2025

dlwh commented Jan 24, 2025

Aphoh commented Jan 24, 2025

dlwh left a comment

dlwh Jan 24, 2025

Aphoh Jan 25, 2025

dlwh Jan 25, 2025
