Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When I filed #3123, I was surprised to discover that concatenating many Utf8 elements is expected to panic once the total size exceeds 2 GB, even though each individual element is much smaller. That constraint was really unexpected! It makes sense once you understand the storage model, but I didn't, so it took me by surprise.
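For anyone else who hits this, here is a minimal sketch of the storage model as I now understand it (it assumes the arrow crate's StringArray / LargeStringArray and their value_offsets() accessor; treat it as illustrative rather than authoritative): Utf8 keeps all of an array's text in a single value buffer indexed by i32 offsets, which is where the ~2 GB cap comes from, while LargeUtf8 uses i64 offsets.

```rust
use arrow::array::{LargeStringArray, StringArray};

fn main() {
    // Utf8 (StringArray) indexes a single value buffer with i32 offsets,
    // so the *total* text in one array must fit in i32::MAX bytes (~2 GiB),
    // no matter how small each individual element is.
    let small = StringArray::from(vec!["hello", "arrow"]);
    assert_eq!(small.value_offsets(), &[0i32, 5, 10]);

    // LargeUtf8 (LargeStringArray) has the same layout but with i64 offsets,
    // which is why it is the safe choice when the total might exceed 2 GiB.
    let large = LargeStringArray::from(vec!["hello", "arrow"]);
    assert_eq!(large.value_offsets(), &[0i64, 5, 10]);
}
```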
Describe the solution you'd like
I'm not sure how to surface this knowledge better. When I first skimmed the data type docs, I walked away thinking that LargeUtf8 is for cases where an individual element is large (it wasn't even clear to me that "large" meant > 2 GB) and that I should use Utf8 for everything else. What I should have taken away instead is: use LargeUtf8 everywhere except where you can guarantee that no array will ever hold more than 2 GB of text in total.
Maybe we just need a prominent statement in the Physical Memory Layout guide and the DataType doc string explaining that you can never build a Utf8 array whose total text size exceeds 2 GB.
Describe alternatives you've considered
This feels like a landmine, and I wish Arrow could transparently convert between these types as needed. Ideally there would just be a single Utf8 type that internally decides which offset type it uses.
Alternatively, I wish the concat kernel would check for this sort of overflow up front and return a more explicit error, something like: "I've been asked to concat 2 Utf8 arrays into an array that will be over 2 GB and I cannot do that: these arrays need to be LargeUtf8 instead". After all, the kernel already knows the input lengths, so it can check them before concatenating.
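As a rough sketch of what that pre-check could look like (concat_strings is a hypothetical helper I made up, not an existing arrow-rs API; it leans on arrow::compute::concat and arrow::compute::cast): sum the value-buffer lengths of the inputs first, and only fall back to LargeUtf8 when the total would overflow i32 offsets.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringArray};
use arrow::compute::{cast, concat};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

/// Hypothetical helper: concatenate string arrays, falling back to
/// LargeUtf8 when the combined text would overflow Utf8's i32 offsets.
fn concat_strings(arrays: &[StringArray]) -> Result<ArrayRef, ArrowError> {
    // Total bytes of string data across all inputs: for each array, the
    // last offset minus the first (this also handles sliced arrays).
    let total_bytes: usize = arrays
        .iter()
        .map(|a| (a.value_offsets()[a.len()] - a.value_offsets()[0]) as usize)
        .sum();

    if total_bytes <= i32::MAX as usize {
        // The result still fits in i32 offsets, so stay in Utf8.
        let refs: Vec<&dyn Array> = arrays.iter().map(|a| a as &dyn Array).collect();
        concat(&refs)
    } else {
        // Cast every input to LargeUtf8 (i64 offsets) before concatenating,
        // instead of letting the offset arithmetic overflow.
        let large: Vec<ArrayRef> = arrays
            .iter()
            .map(|a| cast(&(Arc::new(a.clone()) as ArrayRef), &DataType::LargeUtf8))
            .collect::<Result<_, ArrowError>>()?;
        let refs: Vec<&dyn Array> = large.iter().map(|a| a.as_ref()).collect();
        concat(&refs)
    }
}
```

The kernel itself presumably couldn't silently change the result type this way without surprising callers, but the same up-front length check could at least drive a clearer error message.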