[DISCUSSION] libcudf column abstraction redesign #1443
My pain points:

Pain Points

Pain points
Pain Points
Mostly reiterating what I think are the most important things to fix first. Basically, the problem is that gdf_column lacks encapsulation.
I have an idea regarding step 5. Revise to: 5. Initial Refactor
I think this will help flush out more issues in the abstraction. We could keep it at one candidate feature, done by a different engineer than the column implementer, but two would be nice.
I would also like to tighten up the schedule a bit. I would like to get the basic abstraction designed and implemented by the 0.8 release. If we can't do that, then I think the scope is too broad and should be narrowed. Since most of our pain points agree, I don't think the scope is too large at this point.
Updated with @harrism's feedback about schedule and the initial refactor process.
Pain Points
@jlowe and everyone else: About column names - do we actually use column names at all in libcudf? I'm not asking about Python code, just the C/C++.
My pain points:
@eyalroz ,
By "the algorithms" I mean functions like cudf::scan(), not the cudf::column class.
It's really nice!
The column names are filled in by the CSV and Parquet parsers. There needs to be some way to convey the names discovered during parsing back to the caller. However, I could see some debate about whether they really need to be stored directly in the column structure.
@eyalroz and @kovaltan
Lazy evaluation is not a part of the design goals here. But I think it will become an important feature, likely combined with JITing, so we should at least think ahead to how this might work and not paint ourselves into a corner. One of the key pieces of lazy evaluation is that it can eliminate intermediate results, and as such will not need to allocate the corresponding memory. So if we want to be able to eliminate memory allocation, we cannot pre-allocate it for every operation. I would propose, instead, that the operation itself do the device memory allocation. Not allocate the column class, but allocate the device memory the column ultimately points to. We could structure the column class so that it knows how to "allocate" and "free" memory, which can be overridden by the user creating it. That should give us the flexibility to do just about anything we want with memory management of the columns, while also letting the algorithms themselves decide when to allocate memory, and even how much memory to allocate.
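One way to sketch the "column knows how to allocate and free, but the policy is overridable" idea. This is a hypothetical illustration, not libcudf code: host malloc/free stand in for device memory, and a real version would default to cudaMalloc or RMM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <utility>

// Hypothetical sketch: a column that defers (de)allocation to
// user-supplied functions, so an algorithm constructing its own
// output can decide when and how much memory to allocate.
struct column {
  using alloc_fn = std::function<void*(std::size_t)>;
  using free_fn  = std::function<void(void*)>;

  explicit column(std::size_t size_bytes,
                  alloc_fn alloc   = [](std::size_t n) { return std::malloc(n); },
                  free_fn  dealloc = [](void* p) { std::free(p); })
      : size_{size_bytes}, free_{std::move(dealloc)} {
    data_ = alloc(size_bytes);  // allocation happens inside the column
  }
  ~column() { free_(data_); }

  // Copying is deleted to avoid double-frees in this short sketch.
  column(column const&) = delete;
  column& operator=(column const&) = delete;

  void* data() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  void* data_;
  std::size_t size_;
  free_fn free_;
};
```

An algorithm could then construct `column{n, pool_alloc, pool_free}` itself, sizing `n` only after it knows the output is actually needed.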
@kovaltan I think you may have misunderstood Mark's comment here:
I believe he was referring to how outputs are presently returned. For example:
However, we have identified that it is desirable to be able to return outputs from functions rather than use output parameters (for various reasons: clearer style, delayed JIT evaluation, etc.). Therefore, we want to make that safer via ownership semantics. As such, we will likely be moving in the direction of algorithms allocating the device memory for outputs. To be clear:
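The ownership-transfer idea above can be sketched in plain C++ (no CUDA needed to see the point). `fake_column`, `scan_into`, and `inclusive_scan` are invented names for this sketch, not libcudf APIs:

```cpp
#include <cassert>
#include <memory>
#include <numeric>
#include <vector>

// Stand-in for a device-backed column.
struct fake_column {
  std::vector<int> data;
};

// Old style: caller pre-allocates an output; ownership is ambiguous.
void scan_into(fake_column const& in, fake_column* out) {
  out->data.resize(in.data.size());
  std::partial_sum(in.data.begin(), in.data.end(), out->data.begin());
}

// Proposed style: the algorithm allocates and returns an owning handle.
std::unique_ptr<fake_column> inclusive_scan(fake_column const& in) {
  auto out = std::make_unique<fake_column>();
  out->data.resize(in.data.size());
  std::partial_sum(in.data.begin(), in.data.end(), out->data.begin());
  return out;  // ownership transfers to the caller
}
```

With the second form, a lazy/JITed pipeline can simply never call `inclusive_scan` for an eliminated intermediate, and no memory is ever allocated for it.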
@revans2 :
You are preaching to the choir - I definitely agree (in fact I would love to work on faster JITing using PTX, regardless of cudf). This is part of why I've proposed removing some of the features of columns for the basic/fundamental abstraction - flexibility for later use.
But why are we certain that the allocating code will be part of what gets JITed together? Clearly, to construct operators by JITting we would use composable building-blocks which are not the functions in
I'm thinking about a design which is reminiscent of
@jlowe :
Certainly, but the parsers are not part of libcudf; so why should this extra information be passed to libcudf if it doesn't get used there at all?
Since when? The CSV, Parquet, and soon ORC and JSON-lines parsers are all part of libcudf.
@harrism : Yes, right, I misspoke. That is part of the libcudf codebase. But the library itself, or rather the API functions which process columns, don't take CSVs and parse them, nor do they make calls to CSV-parsing code. And the names aren't used (it seems). So, what if the names were simply parsed into a separate array in the
Ah, I've misunderstood Mark's comment.
Pros:
Cons:
You've alluded to the solution. It's orthogonal to the column design, but ultimately we would like it such that you can pass a custom allocator into any cudf algorithm, similar to how you can with Thrust.
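A minimal sketch of that Thrust-like idea: the algorithm takes an allocator and routes its output (and any temporaries) through it. `malloc_allocator` and `make_sequence` are invented for illustration; they are not cudf or Thrust APIs, and host memory stands in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// A trivial allocator; a real one might hand out slices of an RMM pool.
struct malloc_allocator {
  void* allocate(std::size_t n) { return std::malloc(n); }
  void deallocate(void* p) { std::free(p); }
};

// An "algorithm" parameterized on its allocator, Thrust-style.
// The caller frees the result through the same allocator.
template <typename Allocator>
int* make_sequence(std::size_t n, Allocator& alloc) {
  int* out = static_cast<int*>(alloc.allocate(n * sizeof(int)));
  for (std::size_t i = 0; i < n; ++i) out[i] = static_cast<int>(i);
  return out;
}
```

Because the allocator is a template parameter, swapping in a pool, a pinned-memory, or a logging allocator requires no change to the algorithm itself.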
Regarding column names: I think they should eventually not be part of the column itself but part of a parent array of columns that would include the schema, containing the column names and their relationships (this will also allow preserving schema information when reading/writing columnar formats).
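That separation could look roughly like this: columns stay anonymous, and a parent table owns the schema. All names here (`plain_column`, `table`, `index_of`) are hypothetical, not libcudf's.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A column with no name of its own.
struct plain_column { /* device buffers would live here */ };

// The parent object pairs anonymous columns with a schema, so parsers
// can report discovered names without bloating the column struct.
struct table {
  std::vector<plain_column> columns;
  std::vector<std::string> names;  // schema: parallel to `columns`

  // Look up a column by name; returns names.size() when not found.
  std::size_t index_of(std::string const& name) const {
    for (std::size_t i = 0; i < names.size(); ++i)
      if (names[i] == name) return i;
    return names.size();
  }
};
```

CSV/Parquet readers would fill in `names`, while the compute functions, which never use names, would operate on `columns` alone.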
Pain points
The thing is, libcudf doesn't provide all possible common functionality, and AFAIK it's not supposed to. Granted, though, the exact scope of libcudf doesn't seem to be put in writing well enough anywhere in the repository. Also, again, can you be specific about concrete optimization opportunities which will be missed due to columns being immutable after construction?
Pain points
@thomcom : Following your listing of points:
Actually @eyalroz we DO have
Thanks for your response @eyalroz.
I'm sorry if I'm way out in left field here. I have a fantasy that with careful design we could, for example, transpose 10GB of columns in a few operations, simply by shuffling labels around. This would be powerful.
Regarding all my "let's support matrix-styled data" comments - if this is
@revans2 @jlowe @felipeblazing @harrism and I are discussing whether or not an allocator should be exposed to libcudf users to control how columns are allocated. I.e., should the libcudf functions accept an allocator? If RMM w/ pool mode is "fast enough" such that memory allocations are no longer a bottleneck, then is there another reason you'd need control over how the memory is allocated? Would it be a showstopper to not provide control over how the device memory is allocated? (Just making sure we're covering our bases and understand all the use cases.)
@jrhemstad : What about @felipeblazing 's allocator use scenarios mentioned in this comment?
The two cases I can think of where allocators could be useful:
The design is in progress, but moving this to 0.9 to keep it open.
Creating and interfacing with the gdf_column C struct is rife with problems. We need better abstraction(s) to make libcudf a safer and more pleasant library to use and develop.

In lieu of adding support for any new column types, the initial goal of a cudf::column design is to ease the current pain points in creating and interfacing with the column data structure in libcudf.

Goals:
gdf_column structure

Non-Goals
Note that a "Non-Goal" is not something that we want to expressly forbid in our redesign, but rather something that is not the focus of the current effort. Whenever possible, we can make design decisions that will enable these "Non-Goals" sometime in the future, so long as those decisions do not compromise the timely accomplishment of the above "Goals".

Process
1. Gather pain points
2. Derive requirements
3. Design Mock Up
4. Implementation
5. Initial Refactor
gdf_column to the new abstraction(s)
6. Full Refactor