[FEA] Make cudf::size_type 64-bit #3958
Comments
🤣
We have these APIs on the Python side already, so we'd ideally need these APIs sooner rather than later, though I do believe the PyArrow CPU side will handle splitting chunks for us.
This seems like a really reasonable way to ensure that we are compatible with larger columns almost for free. If we want to revisit later that might make sense, but right now I can imagine that the discussions alone for going to a chunked representation would be long.
My biggest concern is the amount of memory that this will take and the performance impact it will have. In a lot of the data sets we encounter, the length of the strings is small, often very small. Doing this in many of these cases will effectively double the amount of memory used to store a string, and hence cut the performance of all string operations in half, assuming that they are bound by memory access speed. This will also be true for arrays when they show up. Not to mention that we will use a lot more precious GPU memory to store offsets that we could have used to store actual data. I just want to be sure that if we do this for performance reasons we do some performance and capacity measurements before and after to see what impact this will actually have.
@revans2 that is a valid concern. We could handle that by having a string datatype variant that stores offsets in int32_t. Similar to how we have different TIMESTAMP datatypes, we could implement STRING variants. This would be more of a future optimization that we don't have to implement right now, but we know that we could do this in the future and not necessarily paint ourselves into a corner, so to speak, by adopting a 64-bit size_type.
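For illustration only, here is a minimal sketch of what such a variant split could look like. None of these names are real libcudf symbols; the idea simply mirrors Arrow's existing string (32-bit offsets) vs. large_string (64-bit offsets) distinction, with the narrower layout chosen whenever the data allows it.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical: two string layouts distinguished only by offset width.
enum class string_layout { OFFSETS_32, OFFSETS_64 };

// Pick the narrower layout whenever the total character count allows it.
string_layout choose_layout(std::uint64_t total_chars)
{
  auto const max32 =
    static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
  return total_chars <= max32 ? string_layout::OFFSETS_32
                              : string_layout::OFFSETS_64;
}
```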
The reason for switching to 64-bit is more about possibility than perf. Right now there are ways to break it with data that is well within the limits of the GPU. I agree there will be a perf penalty, which we can evaluate, but the other alternatives are pretty bleak: either tell users "don't make your columns too big" or take on another massive refactor project.
There are a number of cudf functions that use cudf::size_type to create intermediate data structures (e.g. build an array of row offsets in some kernel and then use gather/scatter to create a new table). Doubling the size of cudf::size_type would double the temporary memory used by these methods, essentially halving the size of the table you could effectively operate on with one GPU. To get around that you could either change these to use an int32 directly (although as GPU memory sizes increase that would be an artificial limit), or create 32- and 64-bit versions of these internal algorithms and call the 32-bit version if the number of rows < 2 billion. Could make things a bit more complicated.
Quick scan of the code doesn't turn up much in the way of portability problems. Pure mismatches between size_type and int/int32_t are harder to detect without actually compiling and testing, of course, but at least at the surface level there don't appear to be any deep-rooted issues with implicit assumptions that size_type == 4 bytes.
@ChuckHastings do you have a pointer to an example? I think that would be helpful.
To @revans2's comment, we have actually already run into the case I mentioned of a string column with < 2B strings but > 2B characters. Not saying it's a common case, but it's an ugly situation.
Groupby/Join are an example of this. The hash map uses size_type.
How big is the hash map?
Join creates the following intermediate products:
For the intermediates, could we detect that the column size is < 2B and use int32 for the hash map and VectorPair?
That's what I was intending by my suggestion above. We could create a template parameter that identifies whether to use int32 or int64 for the hash map and vector pair and then call the correct version depending on the size of the tables.
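A rough sketch of that dispatch pattern, under the assumption that the index width is a template parameter. These function names are invented for illustration and are not libcudf symbols:

```cpp
#include <cstdint>
#include <limits>

// The index width used for the hash map and gather-map vectors is a template
// parameter; everything inside is written in terms of IndexType.
template <typename IndexType>
void join_impl(std::int64_t left_rows, std::int64_t right_rows)
{
  // ... build hash map and output index vectors using IndexType ...
}

// Runtime dispatch: use the narrow instantiation whenever both inputs fit.
void join_dispatch(std::int64_t left_rows, std::int64_t right_rows)
{
  bool const fits_in_32 =
    left_rows <= std::numeric_limits<std::int32_t>::max() &&
    right_rows <= std::numeric_limits<std::int32_t>::max();
  if (fits_in_32) {
    join_impl<std::int32_t>(left_rows, right_rows);  // half the intermediate memory
  } else {
    join_impl<std::int64_t>(left_rows, right_rows);  // needed for > 2B rows
  }
}
```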
Just as a note, we will want to update any lines like this one that hard-code this type for higher-level code (Python and/or Java).
Yes, the Java code would have to be updated to handle this. It is not a big deal though, and I am happy to do it. I have also run into situations where a string column cannot be stored because there are too many characters in it, so I know that it is a real issue. But, just to play devil's advocate... When we first started work on Spark and mentioned that there were algorithms in Spark that would spill to disk so they could operate on data larger than what fits in host memory, we were told that was not really a priority because we could always partition the data into smaller pieces. Why is that not true here too? Why is this a question of capability when you have the option to partition the data into smaller pieces? Yes, I know that cudf itself is not going to be partitioning the data, a la chunked Arrow vectors, but Spark chunks the data itself without any cudf knowledge, and I know Dask can as well.
Yep agreed. Just making sure that point doesn't get lost in the shuffle (pun not intended 😉). Yep, that's true of both frameworks. There's a tradeoff between using algorithms that operate on local data vs. distributed data. The increased size would let users tune this more appropriately for their workflows to get the desired performance.
I think this is the core issue. If we could have max(int32_t) strings I think everyone would be a happy camper, but having max(int32_t) bytes severely limits the number of elements you can have in a table before needing to partition it, which can adversely impact performance.
The main thing is that libcudf does not want to have to write algorithms that operate on chunked columns. We prefer dask/spark to do the distributed algorithms, and libcudf to provide the primitives on which to build them. If we can solve the strings column problem by having a different size_type for the size of character arrays, that might reduce the current pain. Note that this problem probably affects other nested types too (e.g. lists). @jrhemstad @davidwendt this suggestion goes against the grain of the current design of strings columns / nested columns, where the data array (characters, in the case of strings) is just a child column.
It's not really possible to change the ...
I noticed a few places in the libcudf code where columns are created with ... Regardless, it got me thinking that there is no reason the chars column needs to be created as ... With ..., this change would only affect strings code. Also, I think the change would be minimal, affecting mostly the factories used to create strings columns and any code that accesses offsets and chars directly (e.g. ...). I cannot speak to how this would affect the Cython code interface, though.
I've been mulling over the same idea. Some of the concerns I've had:
I had the same thoughts but here is my rationalization:
From the discussions above, I think supporting large columns (strings as well as other types) will require converting ... I started experimenting with changing ... Finally, I agree with the concern about this change affecting column types that have offsets child columns. Changing the offsets columns to ...
This seems like it would be very non-trivial, as you'd have to make the decision somewhere, when making a new dictionary column, about whether you're going to use 32- or 64-bit indices. I don't know how you make that decision without just assuming you always use the largest size, or first introspecting the data to determine what size you need. The same applies for nested types. At some point you need to make a decision about what size indices you need to use. In general, I don't know how you do that without just always using the biggest size, or running the computation twice and instantiating code paths for all possible required sizes.
Fixed point types have a similar issue. Knowing it's "fixed point" isn't enough; you need to know the scale value as well, so you can't generalize fixed point types as arithmetic, and every place that uses fixed point types needs to be specialized. I wonder if it would make sense to have an OFFSET type_id that also comes with additional info in the data_type indicating its size. It would be a big undertaking to get everything that uses offsets to work this way, but once you do, you've insulated yourself well.
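Purely as a sketch of the OFFSET-type idea above (these are not the real libcudf type_id or data_type definitions), the extra width field lets code that walks offsets branch in one place rather than being templated everywhere:

```cpp
#include <cstdint>

// Hypothetical type id: OFFSET is a first-class type whose width is carried
// alongside it rather than implied by a fixed size_type.
enum class type_id : std::int32_t { /* ..., */ OFFSET /* , ... */ };

struct data_type {
  type_id id;
  std::int32_t offset_width_bytes;  // 4 or 8; only meaningful when id == OFFSET
};

// Example consumer: reads an offset value without caring which width was chosen.
std::int64_t read_offset(data_type t, void const* offsets, std::int64_t row)
{
  if (t.offset_width_bytes == 4) {
    return static_cast<std::int32_t const*>(offsets)[row];
  }
  return static_cast<std::int64_t const*>(offsets)[row];
}
```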
For strings, I've found that creating the offsets array by first building the array of sizes and then performing a scan over the sizes to get the offset values is faster than doing a single transform-scan over the size-calculation functor (mainly because of the aggressive inlining in thrust). This means that if we limit individual strings in a column to int32 sizes, we could first build the sizes column with int32 values and then add a reduce using an int64 accumulator to calculate the total size. So the process introduces an extra reduce step. If the size is small enough, we can scan into the existing int32 array to build the offsets output column. Otherwise, we need to scan into a new int64 array and the original int32 array is then discarded. Limiting an individual string size to max int32 seems reasonable since we could not store many of these anyway. But I don't think we could make the same restriction for other types like nested lists.
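A minimal sketch of that sizes-reduce-scan workflow using Thrust, written as an assumed standalone example rather than the actual libcudf implementation (error handling and writing the final offsets entry are omitted):

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <cstdint>
#include <limits>

void build_offsets(thrust::device_vector<std::int32_t> const& sizes)
{
  // Extra reduce step: accumulate in 64 bits so the total cannot overflow
  // even when the sum of the int32 sizes exceeds 2^31.
  std::int64_t const total =
    thrust::reduce(sizes.begin(), sizes.end(), std::int64_t{0});

  if (total <= std::numeric_limits<std::int32_t>::max()) {
    // Small enough: scan into a 32-bit offsets column (n + 1 entries).
    thrust::device_vector<std::int32_t> offsets(sizes.size() + 1, 0);
    thrust::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin());
    // The last entry (== total) would be written separately in real code.
  } else {
    // Too big for 32-bit offsets: scan into a 64-bit offsets column instead,
    // passing a 64-bit initial value so the running sum is carried in 64 bits.
    thrust::device_vector<std::int64_t> offsets(sizes.size() + 1, 0);
    thrust::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin(),
                           std::int64_t{0});
  }
}
```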
A list column's offsets are tracking the row offset of the immediate child column, and I think it's quite reasonable to have a limit of 2^31 child rows per parent list row, similar to the 2^31 limit on bytes for a single row of a string column. That's a huge improvement from where we are now. This would enable a list-of-strings column where the list column offsets are 32-bit but the string column offsets are 64-bit. I don't know enough about the libcudf nested type offsets-of-offsets handling to see if that's going to be problematic in practice.
I wouldn't expect it to be too bad. Most things that process lists end up just fobbing off the final leaf work to the appropriate type handler (strings, etc). So as long as strings inherently handled 32/64-bit offsets, it should be ok. There are oddball edge cases like contiguous_split, which gloms string and list offsets together as the same type, but that's relatively easy to fix.
I hit the wall with this string limit (2**32) today when trying to load a large dataset with 185M rows and 2 string features. The dataset comes from the Kaggle TalkingData competition. In real life we can expect much larger datasets, so I believe it's very important to expand that limit in cudf.
Hi @titericz, I would highly recommend using a framework built on top of cudf in this situation instead of cudf directly, i.e. using ...
To update this issue, we've tentatively decided that we will NOT be changing cudf::size_type.
Is it possible to have a new ...?
So we double our library size and compile time? |
If that's the only option then yes, we do. This problem isn't going away, and it's becoming a more common problem in the RecSys space where datasets are massive. This is a huge blocker for adoption of NVTabular by the bigger companies unless we find a solution.
Overall dataset size doesn't really matter so long as you partition it into small enough pieces.
Yeah, this is a good (and general) solution. Given this idea, I think it would be a huge benefit to the users to add a utility function that outputs a list of ...
The utility API (let's call it ...) ...
Of course, the users can implement this utility function very easily, but it's very helpful if we provide it.
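For concreteness, here is one possible shape of such a utility, written as a hypothetical host-side helper (the function name and the assumption that per-row character counts are already available are both invented for this sketch). The returned row indices could then be passed to something like cudf::split to produce partitions that each stay under the 32-bit character limit:

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Given per-row character counts, return row indices at which to split so
// that no partition's total character count exceeds `limit`.
std::vector<std::int64_t> find_split_points(
  std::vector<std::int64_t> const& row_char_counts,
  std::int64_t limit = std::numeric_limits<std::int32_t>::max())
{
  std::vector<std::int64_t> splits;
  std::int64_t running = 0;
  for (std::size_t row = 0; row < row_char_counts.size(); ++row) {
    if (running + row_char_counts[row] > limit) {
      splits.push_back(static_cast<std::int64_t>(row));  // start a new partition here
      running = 0;
    }
    running += row_char_counts[row];
  }
  return splits;
}
```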
@EvenOldridge We have discussed this pretty thoroughly, and we have pretty much decided that we will not be making size_type 64-bit. We should probably open a new issue. I made the error of making the title of this issue a proposed solution rather than a description of the problem.
If this can be solved through partitioning, that's awesome. We're working on examples that show the difference in perf and the errors that occur on data where the partitions are too large, and that will hopefully prevent a lot of errors on the customer side. But we commonly run into issues with text data, and also occasionally with output sets that are larger than the size_type limits, that I'd like to have a clear solution for. It doesn't have to be this one. @rjzamora was suggesting that it should be possible to overcome that on our end. Still might be worth thinking about at the cudf level too, though, for those cases.
I would be interested in the examples for working with partitions and cugraph and cudf to work around the limits. A little unclear how much of this will be exposed to users at the Python level vs. having to drop down to C++. I've been creating blocks in my algorithm using cudf, which works for the most part, but with 1.7B directed edges I also have to run the thing twice, inverting my columns, so I can treat it as undirected because I couldn't simply have a single dataframe with all the edges in both directions. I've also struggled to get anything going with cudf and dask.
I've decided to close it as "wontfix" for now. The string column size limit has a lot of salience with groups using libcudf because it's a common first limit that teams encounter when scaling distributed workloads. For text-processing workloads, the 2 GB string column limit will be hit before general out-of-memory errors. However, the string column size limit can generally be worked around by adjusting the data partitioning. Raising this limit in libcudf has been the subject of a LOT of discussion, only some of which is captured in this issue. For each option, the performance cost and engineering effort are too great to prioritize this issue. Here is a list of ideas that have been evaluated and disqualified:
Is your feature request related to a problem? Please describe.
cudf::size_type is currently an int32_t, which limits column size to two billion elements (MAX_INT). Moreover, it limits child column size to the same. This causes problems, for example, for string columns, where there may be fewer than 2B strings, but the character data to represent them could easily exceed 2B characters.
A 32-bit size was originally chosen to ensure compatibility with Apache Arrow, which dictates that Arrow arrays have a 32-bit size, and that larger arrays are made by chunking into individual Arrays.
Describe the solution you'd like
Change size_type to be an int64_t. Handle compatibility with Arrow by creating Arrow chunked arrays in the libcudf to_arrow interface (not yet created), and combining Arrow chunked arrays in the libcudf from_arrow interface. This can be dealt with when we create these APIs.
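As a sketch of the from_arrow side of that idea (an assumed approach, not the actual libcudf interface), a chunked array could be collapsed into a single Arrow array before a cudf column is built from it. Note that for string data this concatenation can itself overflow 32-bit offsets unless Arrow's large_string (64-bit offset) type is used:

```cpp
#include <arrow/api.h>
#include <arrow/array/concatenate.h>

// Collapse a ChunkedArray into one Array; the result could then be handed to
// a (hypothetical) single-column from_arrow path.
arrow::Result<std::shared_ptr<arrow::Array>> combine_chunks(
  std::shared_ptr<arrow::ChunkedArray> const& chunked)
{
  return arrow::Concatenate(chunked->chunks(), arrow::default_memory_pool());
}
```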
Describe alternatives you've considered
Chunked columns. This would be very challenging -- supporting chunked columns in every algorithm would result in complex distributed algorithms and implementations, where libcudf currently aims to be communication agnostic / ignorant. In other words, a higher level library handles distributed algorithms.
Additional context
A potential downside: @felipeblazing called us brave for considering supporting chunked columns. If we implement this feature request, perhaps he will not consider us quite so brave. :(