Specializing for sparse arrays #114
Sorry about the delay, I meant to reply. There is at present no code at all for sparse arrays; they are just blindly indexed over like any other AbstractArray. I'm not actively working on this. But I know Will Kimmerer is working on some related things, and maybe this will all get hooked up. For TensorOperations.jl, there is also https://github.com/Jutho/SparseArrayKit.jl, which does some N-dimensional sparse stuff. A few years ago Peter Ahrens was working on some things too (https://www.youtube.com/watch?v=Rp7sTl9oPNI is the very nice talk), but as far as I know Tortilla.jl is gathering dust in someone's desk drawer. That work was related to https://github.com/tensor-compiler/taco.
Ok, neat. From some of the talks at JuliaCon it does seem like sparse operations are still quite a bit in flux in the ecosystem. I may think about this a little, especially experimenting to see whether I can make things faster within the current implementation of Tullio (i.e. using some of the extra parameters that Tullio accepts), and think about some documentation. But any actual specialization should probably wait until later, as there seem to be other priorities for this project.
@mcabbott can you assign this issue to me? I am beginning to make progress on my port of TACO. I'll probably be able to do a simple SpMV in ~2 weeks. Generating serial/unoptimized loops for everything TACO does can, I think, be ready early this Fall; then I'll need to consult some 🧙s about the LoopVec/CUDA backends. It'll start separate from Tullio, because honestly parts of Tullio are terrifying (and my interface will require several rounds of iteration), but the plan is to shove it all in here eventually. Some things I'll drop here that I'm thinking about: […]
And the big one: how much of this should go into Tullio? Data structures will be separate, but in TACO everything is fairly tightly integrated. Letting me/Tullio choose how you iterate is good and all, but until Tullio becomes part of Base :) there should be a way to operate on them outside. But I don't think anyone has ever figured out how to sync sparse iterators generally.
@Wimmerer This is exciting. Are you aware of, or planning to iterate with, the compiler plugin work? Also CC @tkf, who is also interested in structured loop representations, IRs, and iteration over complex data structures with JuliaFolds. MLIR might have some good ideas too. For a concrete use case, I'm personally interested in eventually exploring things like https://github.com/facebookresearch/SparseConvNet and https://pdfs.semanticscholar.org/5125/a16039cabc6320c908a4764f32596e018ad3.pdf; these aren't really covered well by other tensor frameworks.
The compiler plugin stuff is on my radar, but only after I finish the basic code generation. I'm aiming to make several passes (haha, get it) after I provide the basic TACO functionality, to make it highly extensible etc. The sparse convolution stuff is definitely on my radar; if it can be encoded with Tullio, I'm hoping it just drops out of that capability. If not, I'll work on providing special functionality.
This sounds great!
Not just to you, trust me :) This package is very much the result of me feeling out what's possible / useful / fast. The "several rounds of iteration" have mostly yet to happen, blame the pandemic for not lasting long enough, perhaps... I have some idea how a cleaner parser would look, and something slightly more like an IR instead of a giant (almost) NamedTuple. The ideal future might have such a thing shared between various packages. I haven't spent the time to understand e.g. LoopVectorization's representation. But it might be too early to standardise -- from what little I know about TACO it sounds like this may need to keep more information about how indices relate to each other, which is irrelevant in the dense case. There is also the question of whether this representation knows about types, or just notation.
I am open, BTW, to removing particularly weird edge cases if desirable.
I tried to write a gradient for reductions over […]. It might be simpler if the gradients were done individually. Right now the same loop nest writes to […]. The gradient for […]
Yes, this we will have to see. Gaining thousands of lines of code I don't understand to maintain wouldn't be optimal. But winding up with a Tullio bus factor > 1 would also be a nice outcome. One potential dividing line to think about: at present Tullio is almost entirely ignorant of types; it generates worker functions based on what packages are loaded, and only the runtime dispatch that calls one of those looks at array types. This means it contains no generated functions, which greatly simplifies debugging. Might trying to keep this be a useful guide to what goes where: array-neutral index-tracking can be done here, but the specialisation for particular storage happens elsewhere?
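To make that dividing line concrete, here is a toy sketch of the pattern (names invented, not Tullio's real internals): the macro writes plain worker functions up front, and only a small runtime check picks between them based on the arrays it actually receives.

```julia
# One worker per backend, generated ahead of time. Here both bodies are the
# same plain loop; in Tullio the strided path would use a LoopVectorization kernel.
function work_strided!(y, A, x)
    for j in axes(A, 2), i in axes(A, 1)
        y[i] += A[i, j] * x[j]
    end
    return y
end
work_generic!(y, A, x) = work_strided!(y, A, x)   # fallback: generic indexing

function matvec!(y, A, x)
    fill!(y, zero(eltype(y)))
    # the only place array types are inspected: a runtime branch, no @generated
    if A isa StridedArray && x isa StridedArray
        return work_strided!(y, A, x)
    else
        return work_generic!(y, A, x)
    end
end
```

Only the one `isa` check looks at types; everything upstream is type-ignorant notation handling.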
Could you elaborate on this? Aren't sparse (or dense) computations described by einsum notation equivalent to the join operation in the SQL/relational algebra sense? Typical dot-product algorithms for sparse arrays are essentially a sort-merge join. So haven't DB people been doing this for ages? I have a somewhat generic lazy join implementation based on the JuliaFolds protocol, but it is also accumulating dust in my […]
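To make the sort-merge point concrete, a minimal sketch (mine, not code from the thread) of a sparse-sparse dot product over sorted index lists:

```julia
# Dot product of two sparse vectors stored as sorted (index, value) pairs.
# The two cursors advance exactly as in a database sort-merge join.
function sparse_dot(ia::Vector{Int}, va::Vector{Float64},
                    ib::Vector{Int}, vb::Vector{Float64})
    s = 0.0
    p, q = 1, 1
    while p <= length(ia) && q <= length(ib)
        if ia[p] == ib[q]        # indices match: multiply and advance both
            s += va[p] * vb[q]
            p += 1; q += 1
        elseif ia[p] < ib[q]     # advance whichever side is behind
            p += 1
        else
            q += 1
        end
    end
    return s
end

sparse_dot([1, 4, 7], [1.0, 2.0, 3.0], [4, 5, 7], [10.0, 20.0, 30.0])  # 110.0
```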
Did you consider something the other way around (a "type-erasing" approach)? I can imagine lowering each data structure to a canonical form which is a simple immutable […]
I definitely have to know about types; each axis is encoded as one of several types to allow for coiterating properly.
I'm hopeful I can support everything. If not then the sparse part can error if it finds something weird.
This is, I think, the critical issue. I need to know quite a bit about the arguments. Chris Elrod recommended a macro which then creates a generated function. I don't believe there's any other way. I do think that knowing more about types could be useful to the rest of Tullio, but it's also nice to just blindly (not in a bad way) use AbstractArray indexing.
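For reference, a hedged sketch of the macro-plus-generated-function pattern mentioned here (all names are invented, and this is one minimal reading, not the planned design): the macro only records the notation, while the generated function sees concrete types and can emit different loops for, say, CSC versus dense storage.

```julia
using SparseArrays

macro spmv!(y, A, x)
    :(_spmv!($(esc(y)), $(esc(A)), $(esc(x))))
end

@generated function _spmv!(y, A, x)
    if A <: SparseMatrixCSC
        quote  # walk only the stored entries, column by column
            fill!(y, zero(eltype(y)))
            rows, vals = rowvals(A), nonzeros(A)
            for j in axes(A, 2), p in nzrange(A, j)
                y[rows[p]] += vals[p] * x[j]
            end
            y
        end
    else
        quote  # dense fallback: plain nested loops
            fill!(y, zero(eltype(y)))
            for j in axes(A, 2), i in axes(A, 1)
                y[i] += A[i, j] * x[j]
            end
            y
        end
    end
end
```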
That's what I mean. When described by einsum notation this is fairly easy. It'd be nice to make these data structures work more generally, but then you have to somehow help the user sync their own iterators; it's an interface problem, really. I aim to provide as much flexibility as possible when it comes to merging indices, but it'll still be done effectively internally. If users want to write their own loops it'll be painful in any case beyond two tensors of low dimensionality, though I suppose that's true to a lesser extent for dense tensors as well. You could think of some function […]
TACO.jl + Tullio will have one primary data structure, […]. Background: inside there would be something like […]. This brings us to two questions: […]
Edit: http://groups.csail.mit.edu/commit/papers/2020/kjolstad-thesis.pdf is the thesis I'm basing this on. I have several extensions in mind, but at the start it's basically a port of that.
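As background, a rough sketch of the thesis's "level" idea (my reading, not the actual TACO.jl design): each dimension of a tensor is stored as a level, and composing level types recovers familiar formats, e.g. a dense level over a compressed level is CSR.

```julia
struct DenseLevel              # every position along this dimension exists
    n::Int
end
struct CompressedLevel         # only stored positions, located via pos/idx
    pos::Vector{Int}           # pos[i]:pos[i+1]-1 are the entries under parent i
    idx::Vector{Int}           # coordinates of those stored entries
end
struct LevelTensor{L<:Tuple}
    levels::L                  # one level per dimension, outermost first
    vals::Vector{Float64}      # stored values, in level order
end

# A 2x3 CSR matrix: dense rows over compressed columns.
# Row 1 stores columns 1 and 3 (values 10, 20); row 2 stores column 2 (30).
csr = LevelTensor((DenseLevel(2), CompressedLevel([1, 3, 4], [1, 3, 2])),
                  [10.0, 20.0, 30.0])
```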
It has been noted many times (and it feels rather painful to criticize everyone's favorite package), but LoopVectorization.jl's AST-based approach cannot escape the lack of inter-procedural optimization. To be a real compiler, the work has to be done at the correct layer, so that very basic abstractions like functions work. As already mentioned in this thread, the compiler-plugin-based approach seems to be a more promising direction.
These are exactly what I meant by layout information. The point of type erasure is twofold: (1) you don't need to distinguish […]
Iterators are state machines. Complex state machines are not compiler-friendly, and writing correct state machines is not very easy. An alternative approach is to use higher-order functions, which turned out to be a very friendly interface both to the compiler and to users. This can probably be understood by recognizing that higher-order functions are essentially control-flow-graph manipulation tools: they let users move basic blocks around in a safe way, without exposing any of the guts of the compiler. Another limitation of the iterator-based approach is that it is not straightforward to leverage properties like associativity and commutativity of the reduction (which are essential for parallelism and vectorization).

(That said, I've been thinking about generating fold-equivalent code from type-level state machines. But this is tricky to do correctly; for example, it can generate invalid code on GPU if used with certain synchronization mechanisms.)

But the reason I didn't comment on the higher-order function approach initially was that it is much less structured than the computations describable by einsum. Since the key to enabling optimizations is to add more structure to the program, composing a pre-defined set of […]
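A small illustration of the contrast drawn here (my sketch, not tkf's code): the same reduction written by driving an iterator state machine explicitly, versus handing the loop to a higher-order fold, which leaves the callee free to exploit properties like associativity.

```julia
function sum_iterate(xs)          # caller drives the state machine
    it = iterate(xs)
    s = 0.0
    while it !== nothing
        x, state = it
        s += x
        it = iterate(xs, state)
    end
    return s
end

sum_fold(xs) = foldl(+, xs; init = 0.0)   # callee owns the control flow

sum_iterate(1.0:4.0) == sum_fold(1.0:4.0) == 10.0
```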
@tkf this is very enlightening, thank you. What TACO does is take tensors whose layout is essentially […]
Step 3 uses known properties of each […]. As I understand it, you're essentially recommending that Tullio (eventually) accept AbstractArrays that are either already represented in terms of a tree* of AbstractIndex levels, or […].

*Note: by tree I mean that the coordinates of a tensor are represented as a "levelized" tree, where each level/dimension is an AbstractIndex which defines how you arrive at positions or coordinates in the next dimension/level of the tree.

On the iteration subject, do you think I should just ignore "non-einsum" interfaces for now? It's a much harder problem, and probably better solved with a good foundation and experience in Tullio.
@Wimmerer have you seen https://arxiv.org/abs/2104.05372?
I did briefly! The approach of https://github.com/shashi/ArrayMeta.jl was to lower index notation to types in the macro, and then do everything in generated functions. My lesson from tinkering with it is that this is just much harder to work with, so for now Tullio is an experiment to see how far you can get just generating code you can look at. (It's possible this will eventually run out of steam; e.g. while it's easy to do first derivatives up front, if you want an N-th derivative maybe that has to be generated on demand.) But I'm not quite sure if that generated-function approach is what you had in mind. Right now even the tiling and multi-threading just work by dividing the longest index range in half, repeatedly, using runtime size but not type. That seems like the first place where it would pay to know the strides of dense arrays, but in examples where I tried to exploit this, I didn't see clear gains. Of course, using LoopVectorization to generate the innermost kernels means that for SIMD it does know the strides. I presume Tullio could equally well hand this task to any more compiler-integrated replacement library.
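A schematic version of that halving strategy, under stated assumptions (the threshold, names, and the `fill_block!` kernel in the usage comment are invented; Tullio's real scheduler differs in detail):

```julia
# Recursively split the longest index range in half and spawn one half,
# until blocks are small enough to run serially. Uses runtime sizes only.
function threaded!(kernel!, ranges::Tuple; block = 10_000)
    if prod(length, ranges) <= block
        return kernel!(ranges...)                 # small enough: run here
    end
    d = argmax(map(length, ranges))               # pick the longest range
    r = ranges[d]
    mid = first(r) + length(r) ÷ 2 - 1
    lo = Base.setindex(ranges, first(r):mid, d)
    hi = Base.setindex(ranges, (mid + 1):last(r), d)
    task = Threads.@spawn threaded!(kernel!, lo; block)
    threaded!(kernel!, hi; block)
    wait(task)
end

# e.g. threaded!((is, js) -> fill_block!(out, is, js), (1:2000, 1:2000))
```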
If the index types of matrix […]. For dense arrays, this is enough information. I did try using some map like […]
Can you expand on this a little? Are you picturing something like the […]?
I see from a very quick glance at the thesis that, while it builds many ways to run and transform these calculations, it does not say much about how to automatically choose the best one. Tullio at present provides very few knobs to control such things, as crude heuristics (and LV's modelling) seem to work OK. But if this is to be user-specified, that's another level of interface to think about.
@AriMKatz I've seen it but not read it yet; I will shortly :). You've provided quite a useful reading list!
The thesis doesn't do much optimization, and it also doesn't use workspaces etc. There are 5+ more theses from other people in that lab in 2020 that provide some of that. I think maximizing the number of knobs available to Tullio is something this can help provide.
I'd be interested to have a call about this, and about Tullio, at some point in the next 1-2 weeks. Essentially I have the following question: how should type information be provided to my code? A generated function? Abuse of multiple dispatch? Is there a better way to do this now with compiler plugins, @tkf? For generating the iteration graph I need at least the types of the axes (and probably 5+ parameters for each, for different settings like […]). For now I will probably verify that things work using multiple dispatch on a few trivial cases (CSR/CSC), and then move on to a function like […]. Stages: […]
|
I think it would be useful to loop in the compiler plugin people @aviatesk, @femtomc, and @Keno for the call as well, because IIUC those layers (to support the stages you mentioned) are still being developed, and I think the idea is for them to co-evolve with applications like this. They can also help answer those questions. Keno also had interest in developing a similar TACO-like compiler. Perhaps also the other folks working on structured loop representations/codegen, such as @chriselrod and @vchuravy, to share ideas and potentially infrastructure. More prior art: https://openai.com/blog/triton/ (super interesting approach). Though I think the Dex "pointful" paper is really promising.
I'm happy to join :) Please feel free to ping me on discourse/slack/zulip.
Yes, that's what I was thinking!
I guess it depends on your interest? I do think a "non-einsum" iteration interface is interesting too, and that's basically what I do in JuliaFolds. But I think focusing on einsum has large benefits. I wouldn't call einsum "easier"; certain things are simplified (e.g., you probably aren't going to support […]). Also, an array-focused approach can be extended in other directions, like handling jagged arrays: e.g., "Jagged, ragged, awkward arrays" by Jim Pivarski (https://awkward-array.org), though that's not a compiler project but rather a conventional Pythonic library that wraps C++ kernels.
Right, an approach heavily relying on generated functions is not quite to my taste :) If I only need CFG manipulation, I'd just use higher-order functions (though they have their own pains, as you know). I find that easier to work with, since everything is plain Julia code with the full capacity of Julia. I think generated functions sit in a rather unfortunate middle ground: you can't quite play the role of the compiler (e.g., inter-procedural analysis for cost estimation), while also not being transparent to the compiler and the user about how the code is generated.
Do you mean what's possible with an approach focusing on einsum? What I had in mind is somewhat similar to

```julia
struct Axes{InputIndex, DimIndex} end

tacolike((I, J) -> A[I] * B[J], (A, B), (Axes{1, 1}(), Axes{2, 2}()), +, (Axes{1, 2}(),))
```

(and maybe first do […]). For example, for a dense-sparse dot product, we can iterate over the stored indices, wrap it into a special type […]. Of course, to really automate the scheduling, you'd need to analyze what the anonymous function and […]
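For what it's worth, here is one naive reading of that interface (a dense reference sketch; the signature and semantics are my guesses, and the reduced-axes argument is accepted but unused because this version reduces every loop, as in a full contraction):

```julia
struct Axes{InputIndex, DimIndex} end   # as above: tag = (which input, which dim)

function tacolike(f, inputs::Tuple, loops::Tuple, op, reduced::Tuple)
    # each loop tag names the input/dimension whose axis that loop runs over
    range_of(::Axes{I, D}) where {I, D} = axes(inputs[I], D)
    acc = nothing
    for idx in Iterators.product(map(range_of, loops)...)
        v = f(idx...)                   # one argument per loop index
        acc = acc === nothing ? v : op(acc, v)
    end
    return acc
end

# Dot product: a single shared loop over dimension 1, reduced with +.
a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
tacolike(i -> a[i] * b[i], (a, b), (Axes{1, 1}(),), +, (Axes{1, 1}(),))  # 32.0
```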
Forgive my ignorance here: if […], A and B have 4 axes, and we're only going to iterate over 3 of them. But which 3 would differ if (say) I wrote […] instead. Takafumi's version (if I'm reading it right) just passes tags which tell you which axis of which argument to look at. The reason Tullio doesn't do that is that it wants to allow e.g. […]. My question here is whether individual axes can contain enough information, or whether, to interpret them correctly, you've got to see the whole […]
Ah, I'm operating under the assumption that the first argument is an anonymous function. Though to fully complete the project I really do need to know what that anonymous function does, or at least I need a bunch of knobs that allow some way to specify that higher up: things like semirings and other operator properties. Edit: it seems then that I'd probably prefer to operate on the expression and the type information, since that's more or less a direct port of TACO, which frees me to work on other parts. But I'm not sure yet.
I don't think I necessarily need info about the whole of A at once, so long as I know which axes are currently being coiterated from A, B, C, etc. But there are lots of transformations that do use the entire iteration graph, and TACO is constructed with full information. I'd have to do a ton more research (time I'd rather spend on other extensions), rather than just porting their Expr -> IR -> Iteration Graph -> Imperative sequence.
Ah, that makes sense. I think there might be a way around this, but my impression is that the anonymous-function-based approach won't work anyway, since, e.g., you might need to look at […]
Good point. This is something Tullio does not try to do, at all, at the moment. Assuming that […] (if you write […]).
I hadn't thought much about optimizations like that yet, but doing that is definitely a goal.
What do you mean by this? Skimming through the papers again, I'm fairly convinced I need to know (1) the full expression and (2) the types of each axis […]. Further down the line I'll probably need the types of the operators as well, so users can define trait-like properties on them. There are other ways to do that part, though.
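The trait-like idea could look something like this sketch (names invented): ordinary functions a scheduler can query before deciding whether a reduction may be re-associated.

```julia
isassociative(::typeof(+)) = true
isassociative(::typeof(*)) = true
isassociative(f) = false          # conservative default for unknown operators

# A scheduler might consult the trait before choosing a strategy:
function safe_reduce(op, xs)
    if isassociative(op)
        return reduce(op, xs)     # free to tree-reduce or parallelise
    else
        return foldl(op, xs)      # must stay strictly left-to-right
    end
end
```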
I was wondering if there were any plans to specialize when sparse arrays are loaded/being used.
For example:
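(The original snippet did not survive in this copy of the issue; as an illustrative stand-in, something along these lines exercises Tullio on a sparse matrix.)

```julia
using Tullio, SparseArrays

A = sprand(1000, 1000, 0.01)       # sparse matrix with ~1% stored entries
x = rand(1000)

@tullio y[i] := A[i, j] * x[j]     # Tullio loops over every (i, j), stored or not
```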
and then benchmarking:
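(Likewise, the benchmark code was lost; a plausible shape, using BenchmarkTools, compares Tullio's generic loop against the specialised sparse kernel.)

```julia
using BenchmarkTools

@btime $A * $x;                            # SparseArrays' specialised kernel
@btime @tullio y[i] := $A[i, j] * $x[j];   # generic loop; far slower on sparse A
```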