dynamic padding via collate_fn
#761
@KamWithK I see you explored a similar issue here (#603) a while ago. Did you ever figure out a workaround? What I did find is that I can use the Hugging Face […]. Here is an example that works because I am padding within […].

Of course, without using Hugging Face, any custom padding function would do, but with HF it is trivial: you'd only have to pass […].
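Since the example itself did not survive above, here is a minimal sketch of the idea: padding inside a `TransformSpec` via a Hugging Face tokenizer. The `"text"` field, the model name, and the added `input_ids` column are assumptions, not the poster's original code:

```python
import numpy as np
from petastorm import TransformSpec
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model

def _tokenize_and_pad(df):
    # `df` is the pandas DataFrame for one row group handed in by
    # make_batch_reader; padding=True pads to the longest sequence in it.
    encoded = tokenizer(list(df["text"]), padding=True)
    df["input_ids"] = [np.asarray(ids, dtype=np.int64) for ids in encoded["input_ids"]]
    return df

transform = TransformSpec(
    _tokenize_and_pad,
    # Declare the new column: (name, numpy dtype, shape, is_nullable).
    edit_fields=[("input_ids", np.int64, (None,), False)],
)
# Then: make_batch_reader(url, transform_spec=transform)
```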
I would also like to add a […]. Thanks!
Sorry for the delayed response... been super busy at my day job. Would something like this user-defined collate function work for you?

```python
import numpy as np
import pandas as pd

from petastorm import make_batch_reader
from petastorm.unischema import Unischema


def test_read_with_collate(tmp_path):
    data = pd.DataFrame({"str": ["a", "bc"], "varlen_nums": [[1], [3, 4]]})
    path = tmp_path / 'data'
    url = f"file://{path}"
    data.to_parquet(path)

    def collate_lists_fn(column_name: str, schema: Unischema, values):
        # Pad every list in the column with zeros up to the longest list.
        max_len = max(map(len, values))
        return np.asarray([np.pad(v, (0, max_len - len(v)), 'constant', constant_values=0)
                           for v in values])

    # `collate_lists_fn` is the new argument proposed in the draft PR below.
    with make_batch_reader(url, collate_lists_fn=collate_lists_fn) as reader:
        actual = list(reader)
        assert len(actual) == 1
        np.testing.assert_equal(actual[0].varlen_nums, [[1, 0], [3, 4]])
        np.testing.assert_equal(actual[0].str, ["a", "bc"])
```

Here is a draft PR: #772
Yes, this is it! How would this differ from TransformSpec, though? Would the input also be a dataframe, or would it be a torch Tensor?
Hmmm, I think I take it back. It's not conceptually different from TransformSpec, so we probably should not introduce another tool that does the same thing. Moreover, the PR would do the collation in the main process/thread and not in the workers, which is not necessary in this case. So, I am not sure then. How do you think we could improve the API to suit your needs better? Is there anything that needs to be done?
I personally ended up having to inherit the petastorm […]. Below is what I did.
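The original subclass code is missing above; this is a minimal sketch of what inheriting the petastorm `DataLoader` for batch-level padding could look like, assuming `make_reader`-style rows (the loader's `collate_fn` receives a list of per-row dicts). The `pad_collate` helper and its field handling are illustrative assumptions, not the poster's code:

```python
import numpy as np
import torch
from petastorm.pytorch import DataLoader


def pad_collate(batch):
    # `batch` is a list of per-row dicts coming out of the reader.
    out = {}
    for key in batch[0]:
        values = [row[key] for row in batch]
        if isinstance(values[0], np.ndarray) and values[0].ndim == 1:
            # Variable-length field: pad with zeros to this batch's max length.
            max_len = max(len(v) for v in values)
            out[key] = torch.from_numpy(np.stack(
                [np.pad(v, (0, max_len - len(v)), 'constant') for v in values]))
        else:
            try:
                out[key] = torch.as_tensor(np.asarray(values))
            except (TypeError, ValueError):
                out[key] = values  # e.g. string fields stay as a plain list
    return out


class PaddingDataLoader(DataLoader):
    """petastorm DataLoader that pads variable-length fields per batch."""

    def __init__(self, reader, **kwargs):
        # Rows are buffered unpadded; padding happens only at collation time.
        super().__init__(reader, collate_fn=pad_collate, **kwargs)
```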
At a high level, this same idea can be used to store really anything in a tuple in the buffer, as long as it is properly transformed at the batch level for input during training. Hopefully this idea can be used by others where the pandas transform is falling short, and/or by those contributing to the […].
Interesting. I wanted to use the […].

To answer @selitvin, I'm not sure. I was hoping to be able to pass a […].
@ianbenlolo I think your idea is certainly preferable, as it works within the petastorm library's current functionality. In many cases dynamically padding tensors as I did isn't needed (but it can increase training speed). I assume I could have worked out what I did with pandas, but I personally found there to be too much overhead in its expectations of data input and output. You'll notice in my code I removed […].
Hm, so it turns out what I wanted to do (pass a pad_collate function) is in fact possible with the […]. This now confuses me, though, as to the purpose of […].
So, I am confused: why is the padding in TransformSpec not sufficient?
Agree, it's confusing. Indeed, there are two different ways of doing this: either preparing all the data in TransformSpec so that it can be automatically collated, or doing the transformation during the collation. To give some context on the design choices that led to the current implementation: […]
I would like to dynamically pad my tensors by way of the `collate_fn` argument that can be passed to `petastorm.pytorch.DataLoader`, but I am seemingly thwarted by `make_batch_reader` here; thus it appears `make_batch_reader` prevents the user from shoring up tensor size through the dataloader. Or is this possible and I'm just missing how to do so? `collate_fn` can take care of the variable-length values on a batch-by-batch basis. Otherwise it seems like I'd need to pad all the data in my Spark DataFrame, which increases data size substantially, slows training, and I assume slows I/O through petastorm in general. What I would like to do looks something like below, where the function passed to `collate_fn` would dynamically pad my variable-length values.
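The example the issue refers to was not preserved; this is a sketch of the intended usage, with a minimal zero-padding `collate_fn` (the dataset URL, batch size, and field handling are placeholders and assumptions, and a fuller `pad_collate` is sketched in the comments above):

```python
import numpy as np
import torch
from petastorm import make_reader
from petastorm.pytorch import DataLoader

def pad_collate(batch):
    # Minimal version: pad every 1-D ndarray field with zeros to the
    # longest value in the batch; leave other fields untouched.
    out = {}
    for key in batch[0]:
        values = [row[key] for row in batch]
        if isinstance(values[0], np.ndarray) and values[0].ndim == 1:
            max_len = max(len(v) for v in values)
            out[key] = torch.from_numpy(np.stack(
                [np.pad(v, (0, max_len - len(v)), 'constant') for v in values]))
        else:
            out[key] = values
    return out

with make_reader("file:///tmp/my_dataset") as reader:  # placeholder URL
    loader = DataLoader(reader, batch_size=32, collate_fn=pad_collate)
    for batch in loader:
        ...  # each variable-length field is now a dense [batch, max_len] tensor
```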