LH5iterator behaviour if buffer size doesnt divide table size #118

Open
tdixon97 opened this issue Nov 13, 2024 · 3 comments
Labels: lh5 (HDF5 I/O), question (Further information is requested)

Comments

@tdixon97
Contributor

When reading an LGDO.Table with lh5.LH5Iterator, if the buffer size does not divide the size of the table, some data shows up twice in the last iteration.

For example:

import lgdo
from lgdo import Array, Table, lh5

print("LGDO version = ", lgdo.__version__)

# create input data: a 7-row table with a single column
arr = Array([1, 2, 3, 4, 5, 6, 7])
tab = Table(size=7)
tab.add_field("arr", arr)
print(f"data = {arr} length = {len(arr)}")
lh5.write(tab, "tab", "test.lh5", wo_mode="of")

# read with LH5Iterator; the buffer length (5) does not divide the table size (7)
for lh5_obj, _, _ in lh5.LH5Iterator("test.lh5", "tab/arr", buffer_len=5):
    print(f"data = {lh5_obj.view_as('np')},    length = {len(lh5_obj.view_as('np'))}")

Which results in:

LGDO version =  1.9.0
data = [1 2 3 4 5 6 7] length = 7
data = [1 2 3 4 5],    length = 5
data = [6 7 3 4 5],    length = 5

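So in the last iteration the buffer is not cleared for the rows that are not needed: the first two entries (6, 7) are new, while the remaining three (3, 4, 5) are left over from the previous iteration. This could mean that some rows of the Table are processed twice...
What do you think, @iguinn?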

@ManuelHu
Contributor

You always have to use the n_rows_read returned by the iterator; otherwise you will get duplicate rows. This is, at the moment, expected and documented(?) behaviour.

Also see the related #107 (issue) and #109 (PR).
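To illustrate why the row count matters, here is a minimal pure-Python sketch of a reader that reuses one fixed-size buffer. It is a toy model, not LGDO's implementation: the function `iterate_with_buffer` and its internals are hypothetical, and only the name `n_rows_read` is taken from the comment above.

```python
def iterate_with_buffer(data, buffer_len):
    """Toy model of a chunked reader that reuses one fixed-size buffer.

    Hypothetical helper for illustration only; this is not LGDO code.
    Yields (buffer, start_row, n_rows_read) for each chunk.
    """
    buffer = [None] * buffer_len
    for start in range(0, len(data), buffer_len):
        chunk = data[start:start + buffer_len]
        n_rows_read = len(chunk)
        # Only the first n_rows_read slots are overwritten;
        # the remaining slots keep stale rows from the previous chunk.
        buffer[:n_rows_read] = chunk
        yield buffer, start, n_rows_read


data = [1, 2, 3, 4, 5, 6, 7]
raw = []    # what you see if you ignore n_rows_read
valid = []  # what you see if you truncate to n_rows_read

for buf, start, n_rows_read in iterate_with_buffer(data, buffer_len=5):
    raw.append(list(buf))
    valid.append(buf[:n_rows_read])

print(raw)    # [[1, 2, 3, 4, 5], [6, 7, 3, 4, 5]]
print(valid)  # [[1, 2, 3, 4, 5], [6, 7]]
```

In the loop from the original report, the equivalent would presumably be to keep the third value yielded by the iterator instead of discarding it with `_`, and to slice the view with it, e.g. `lh5_obj.view_as('np')[:n_rows]`. This assumes that the third value is the number of rows read.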

@tdixon97
Contributor Author

OK, makes sense, but I do not think it's well documented (@iguinn ?), and it is potentially a bit confusing and can lead to errors.

@iguinn
Contributor

iguinn commented Nov 14, 2024

Hi Toby, I agree that it is confusing, and I am hoping we can remove the need for n_rows_read altogether, as Manuel said.

@gipert added the question and lh5 labels on Nov 21, 2024
@gipert changed the title from "Bug in LH5iterator if buffer size doesnt divide table size" to "LH5iterator behaviour if buffer size doesnt divide table size" on Nov 21, 2024