Specializing for sparse arrays #114

Open · mkschleg opened this issue Jul 2, 2021 · 24 comments
Labels: enhancement (New feature or request)


mkschleg commented Jul 2, 2021

I was wondering if there were any plans to specialize when sparse arrays are loaded/being used.

For example:

using SparseArrays, LinearAlgebra, Tullio, BenchmarkTools, LoopVectorization

n = 10000
n2 = 150

x = spzeros(n);
x[rand(1:n, n2)] .= 1;
y = rand(n);

function sparsedot(x, y)
    s = zero(promote_type(eltype(x), eltype(y)))
    for (i, val) in zip(x.nzind, x.nzval)
        s += val * y[i]
    end
    s
end

function tulliodot(x, y)
    @tullio (+) s = x[i] * y[i]
end

and then benchmarking:

julia> @benchmark dot(x, y)
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     184.583 ns (0.00% GC)
  median time:      187.732 ns (0.00% GC)
  mean time:        190.062 ns (0.14% GC)
  maximum time:     1.585 μs (87.98% GC)
  --------------
  samples:          10000
  evals/sample:     671

julia> @benchmark sparsedot(x, y)
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     166.202 ns (0.00% GC)
  median time:      169.454 ns (0.00% GC)
  mean time:        184.200 ns (0.13% GC)
  maximum time:     1.419 μs (86.28% GC)
  --------------
  samples:          10000
  evals/sample:     759

julia> @benchmark tulliodot(x, y)
BenchmarkTools.Trial:
  memory estimate:  64 bytes
  allocs estimate:  3
  --------------
  minimum time:     80.558 μs (0.00% GC)
  median time:      81.338 μs (0.00% GC)
  mean time:        87.725 μs (0.00% GC)
  maximum time:     316.229 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

mcabbott commented Jul 29, 2021

Sorry about the delay, I meant to reply sooner.

There is at present no code at all for sparse arrays. They are just blindly indexed over like any AbstractArray, and being thousands of times slower than ideal is expected.
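Roughly, the kernel generated for tulliodot behaves like the loop below (a simplification, not the actual generated code, which adds threading and LoopVectorization machinery). Every element goes through getindex, and getindex on a SparseVector is a binary search over the stored indices, so the cost is on the order of n log(nnz) rather than nnz:

# rough sketch of the "blind" dense kernel, for illustration only
function blind_dot(x, y)
    s = zero(promote_type(eltype(x), eltype(y)))
    for i in eachindex(x, y)
        s += x[i] * y[i]   # on a SparseVector, each x[i] is a searchsortedfirst over nzind
    end
    s
end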

I'm not actively working on this. But I know Will Kimmerer is working on some related things, and maybe this will all get hooked up. For TensorOperations.jl, there is also https://github.com/Jutho/SparseArrayKit.jl which does some N-dim sparse stuff. A few years ago Peter Ahrens was working on some things too, https://www.youtube.com/watch?v=Rp7sTl9oPNI is the (very nice) talk, but as far as I know Tortilla.jl is gathering dust in someone's desk drawer. This was related to https://github.com/tensor-compiler/taco .


mkschleg commented Jul 30, 2021

Ok. Neat. From some of the talks at JuliaCon it does seem like sparse operations are still very much in flux in the ecosystem. I may think about this a little, especially experimenting to see whether I can make things faster within the current implementation of Tullio (i.e. using some of the extra parameters that @tullio accepts), and think about some documentation. But any kind of actual specialization should probably wait until later, as there seem to be other priorities for this project.


rayegun commented Aug 16, 2021

@mcabbott can you assign this issue to me?

I am beginning to make progress on my port of TACO. I'll probably be able to do a simple SpMV in ~2 weeks. Generating serial/unoptimized loops for everything TACO does can, I think, be ready early this fall; then I'll need to consult some 🧙s about the LoopVec/CUDA backends.

It'll start separate from Tullio, because honestly parts of Tullio are terrifying (and my interface will require several rounds of iteration), but the plan is to shove it all in here eventually.

Some things I'll drop here that I'm thinking about:

  • I will have an IR of some sort (actually several levels). Once it's solidified it might be good to dive deep into Tullio, and see how it should interact. In theory the Taco stuff works just fine for dense arrays as well. But, it's not optimized for it. It may be best for the sparse stuff to be sequestered by itself, but then sparse-dense stuff becomes sticky. A proper IR might make some existing Tullio optimizations easier as well though.
  • Related to IRs, is pluggable optimization at different stages (tensor notation stage, IR stage, loop stage).
  • Can we define good gradients over other reducers?
  • Masks
  • Matching Tullio's extensive support for all sorts of incredible indexing will probably be the most novel and difficult thing. It isn't clear how useful some indexing tools are when you're operating on sparse tensors, but I'll try to support everything.

And the big one, how much of this should go into Tullio? Data structures will be separate, but in TACO everything is fairly tightly integrated. Letting me/Tullio choose how you iterate is good and all, but until Tullio becomes part of Base :) there should be a way to operate on them outside. But I don't think anyone has ever figured out how to sync sparse iterators generally.

@AriMKatz

@Wimmerer This is exciting. Are you aware of or planning to iterate with the compiler plugin work?

Also CC @tkf who's also interested in structured loop representations, IRs and iteration of complex datastructures with JuliaFolds

Also MLIR might have some good ideas:
https://mlir.llvm.org/docs/Dialects/SparseTensorOps/ https://llvm.discourse.group/t/sparse-tensors-in-mlir/3389

For a concrete use case, I'm personally eventually interested in exploring stuff like

https://github.com/facebookresearch/SparseConvNet and https://pdfs.semanticscholar.org/5125/a16039cabc6320c908a4764f32596e018ad3.pdf

These aren't really covered well by other tensor frameworks


rayegun commented Aug 16, 2021

The compiler plug-in stuff is on my radar, but only after I finish the basic code generation stuff. I'm aiming to make several passes (haha get it) after I provide the basic TACO functionality to make it highly extensible etc.

The sparse convolution stuff is definitely on my radar, if it can be encoded with Tullio I'm hoping it just drops out of that capability. If not then I'll work on providing special functionality.

@mcabbott

This sounds great!

because honestly parts of Tullio are terrifying

Not just to you, trust me :)

This package is very much the result of me feeling out what's possible / useful / fast. The "several rounds of iteration" have mostly yet to happen, blame the pandemic for not lasting long enough, perhaps... I have some idea how a cleaner parser would look, and something slightly more like an IR instead of a giant (almost) NamedTuple.

The ideal future might have such a thing shared between various packages. I haven't spent the time to understand e.g. LoopVectorization's representation. But it might be too early to standardise -- from what little I know about TACO it sounds like this may need to keep more information about how indices relate to each other, which is irrelevant in the dense case. There is also the question of whether this representation knows about types, or just notation.

all sorts of incredible indexing

I am open, BTW, to removing particularly weird edge cases if desirable.

Can we define good gradients over other reducers?

I tried to write a gradient for reductions over * at some point, but if you want it to be correct when some elements may be zero, then it's tricky.
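For reference, the zero-safe rule for a plain prod reduction looks roughly like this (a minimal sketch of the standard trick, not Tullio's gradient code): dividing s by x[i] only works when no element is zero, with exactly one zero only that slot gets a nonzero gradient, and with two or more zeros every gradient vanishes.

# sketch of a zero-safe gradient for s = prod(x), given the output sensitivity ds
function prod_gradient(x::AbstractVector, ds)
    nzeros = count(iszero, x)
    dx = zero(x)
    if nzeros == 0
        dx .= ds .* prod(x) ./ x        # the usual s / x[i] shortcut
    elseif nzeros == 1
        k = findfirst(iszero, x)
        dx[k] = ds * prod((x[j] for j in eachindex(x) if j != k); init = one(eltype(x)))
    end                                 # two or more zeros: dx stays all zero
    dx
end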

It might be simpler if the gradients were done individually. Right now the same loop nest writes to dx[i,j] and dy[j,k] because I hoped that would be efficient. But I think separating them might be better, and allow more re-use... via this hypothetical IR. It might also allow 2nd derivatives more easily.

The gradient for max is also a bit tricky, and the present implementation doesn't play well with LoopVectorization (it is disabled).

how much of this should go into Tullio? Data structures will be separate, but

Yes, this we will have to see. Gaining thousands of lines of code I don't understand to maintain wouldn't be optimal. But winding up with a Tullio bus factor > 1 would also be a nice outcome.

One potential dividing line to think about: At present Tullio is almost entirely ignorant of types, it generates worker functions based on what packages are loaded, and only the runtime dispatch to call one of those looks at array types. This means it contains no generated functions, which greatly simplifies debugging. Might trying to keep this be a useful guide about what goes where, that array-neutral index-tracking can be done here, but the specialisation of that for particular storage happens elsewhere?


tkf commented Aug 16, 2021

But I don't think anyone has ever figured out how to sync sparse iterators generally.

Could you elaborate on this? Aren't sparse (or dense) computations described by einsum notation equivalent to the join operation in the SQL/relational algebra sense? Typical dot product algorithms for sparse arrays are essentially a sort-merge join (see Wikipedia's "Sort-merge join"). So, haven't DB people been doing this for ages?
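For concreteness, a sparse-sparse dot product as a merge join over the two sorted index vectors might look like this (a sketch over SparseVector internals, not code from any of the packages discussed here):

using SparseArrays

# sort-merge join over the sorted nzind vectors of two sparse vectors
function mergedot(x::SparseVector, y::SparseVector)
    s = zero(promote_type(eltype(x), eltype(y)))
    i, j = 1, 1
    while i <= length(x.nzind) && j <= length(y.nzind)
        if x.nzind[i] == y.nzind[j]
            s += x.nzval[i] * y.nzval[j]
            i += 1; j += 1
        elseif x.nzind[i] < y.nzind[j]
            i += 1          # advance whichever side is behind
        else
            j += 1
        end
    end
    s
end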

I have a somewhat generic lazy join implementation based on the JuliaFolds protocol, but it is accumulating dust in my ~/.julia/dev/ too :). BTW, when I played with it, it was easy to observe speed differences between algorithms based on galloping search and linear search depending on the distribution of the indices (which is, of course, expected). So, I guess it'd be nice for users to be able to specify algorithms manually, since the information about the index distribution is entirely in the data and not the code/IR. Or maybe another way to put it is that it's nice to have profile-guided optimization in mind.
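The galloping variant replaces the step-by-step advance with an exponential-then-binary search for the other side's next index, which is what pays off when one index list is much sparser than the other (again just a sketch, with made-up names):

# find the first position p >= lo with ind[p] >= target, assuming ind is sorted
function gallop(ind, target, lo)
    step, hi = 1, lo
    while hi <= length(ind) && ind[hi] < target
        lo = hi + 1                      # everything up to hi is still < target
        hi += step
        step *= 2                        # grow the jump exponentially
    end
    hi = min(hi, length(ind) + 1)
    lo - 1 + searchsortedfirst(view(ind, lo:hi-1), target)
end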

Might trying to keep this be a useful guide about what goes where, that array-neutral index-tracking can be done here, but the specialisation of that for particular storage happens elsewhere?

Did you consider going the other way around (a "type-erasing" approach)? I can imagine lowering each data structure to a canonical form which is a simple immutable struct that has (say) a pointer and the layout information (size, strides, sortedness, ...). Custom data structures can be lowered to the canonical form at the very beginning of the pipeline, so the actual compute function only has to be specialized against each set of canonicalized inputs.
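A toy version of such a canonical form might look like this (hypothetical names, just to make the idea concrete):

# hypothetical canonical layout: compute kernels would specialize on this struct
# rather than on every user-defined array type
struct CanonicalArray{T,N,L}
    data::Vector{T}            # flat storage (could equally be a raw pointer)
    size::NTuple{N,Int}
    strides::NTuple{N,Int}
    layout::L                  # extra traits: sortedness, compressed levels, ...
end

# lowering a dense Vector to the canonical form (sketch)
canonicalize(v::Vector{T}) where {T} = CanonicalArray(v, (length(v),), (1,), nothing)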


rayegun commented Aug 16, 2021

There is also the question of whether this representation knows about types, or just notation.

I definitely have to know about types: each axis is encoded as one of several types to allow for coiterating properly.

I am open, BTW, to removing particularly weird edge cases if desirable.

I'm hopeful I can support everything. If not then the sparse part can error if it finds something weird.

One potential dividing line to think about: At present Tullio is almost entirely ignorant of types, it generates worker functions based on what packages are loaded, and only the runtime dispatch to call one of those looks at array types. This means it contains no generated functions, which greatly simplifies debugging. Might trying to keep this be a useful guide about what goes where, that array-neutral index-tracking can be done here, but the specialisation of that for particular storage happens elsewhere?

This is, I think, the critical issue. I need to know quite a bit about the arguments. Chris Elrod recommended a macro which then creates a generated function. I don't believe there's any other way.

I do think that knowing more about types could be useful to the rest of Tullio, but it's also nice to just blindly (not in a bad way) use AbstractArray indexing.

Could you elaborate on this? Aren't sparse (or dense) computation described by einsum notations

That's what I mean. When described by einsum notation this is fairly easy. It'd be nice to make these data structures work more generally, but then you have to somehow help the user sync their own iterators. It's an interface problem really. @tullio will work great, but manually indexing will be ugly. I'd like users to be able to use things like Compressed Sparse Fibers, CSR, etc without Tullio, but I'm not sure there's a good interface there.

I aim to provide as much flexibility as possible when it comes to merging indices, but it'll still be done effectively internally. If users want to write their own loops it'll be painful in any case beyond 2 tensors of low dimensionality. But I suppose that's true to a lesser extent for dense tensors as well.

You could think of some function sync which takes in 2+ tensors and produces an iterator over them. That's essentially what TACO/Tullio will do, but inside out.


rayegun commented Aug 16, 2021

Did you consider something other way around ("type-erasing" approach)? I can imagine lowering each data structure to a canonical form which is a simple immutable struct that has (say) a pointer and the layout information (size, strides, sortedness, ...). Custom data structures can be lowered to the canonical form at the very beginning of the pipeline and so the actual compute function only has to be specialized against each set of canonicalized input.

TACO.jl + Tullio will have one primary data structure Tensor (or something). But it'll be parametrized by something like AbstractIndexType for each axis. A CSC Tensor would be something like Tensor{Dense, Compressed} (it needs to be varargs but whatever). A DCSC (hypersparse) would be Tensor{Compressed, Compressed}. A rank 3 COO tensor would be Tensor{Compressed, Singleton, Singleton}.

Background: Inside there would be something like Tensor.indices which is a tuple of these AbstractIndexTypes: Dense, Compressed, Range, Offset, Hash, Singleton, Bitmap. Each of these is essentially a different kind of map from positions/coordinates in the previous dimension to positions/coordinates in the next, with values being at the end of the last map.
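In rough Julia, the levelized encoding might look like this (hypothetical names, a Tuple parameter instead of varargs, and only two of the level kinds shown):

abstract type AbstractIndexType end

struct Dense <: AbstractIndexType        # every coordinate present; stores only the size
    size::Int
end

struct Compressed <: AbstractIndexType   # ptr maps parent positions to ranges into crd
    ptr::Vector{Int}
    crd::Vector{Int}                     # stored coordinates along this axis
end

struct Tensor{T, Levels<:Tuple}
    indices::Levels                      # one level descriptor per dimension
    values::Vector{T}                    # values live at the positions of the last level
end

# CSC ~ Tensor{T, Tuple{Dense, Compressed}}; DCSC ~ Tensor{T, Tuple{Compressed, Compressed}}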

This brings us to two questions:

  1. Can I do this with a functional interface so that I can support any AbstractArray that lets me do TypedIndex(A, i) for the i'th dimension?
  2. Do I even need these to be typed for dispatch? It's nice to use dispatch, but I could use manual dispatch instead.

Edit: http://groups.csail.mit.edu/commit/papers/2020/kjolstad-thesis.pdf is the thesis I'm basing this on. I have several extensions in mind, but at the start it's basically a port of that.


tkf commented Aug 17, 2021

Chris Elrod recommended a macro which then creates a generated function. I don't believe there's any other way.

It has been noted many times (and it feels rather painful to criticize everyone's favorite package), but LoopVectorization.jl's AST-based approach cannot escape the lack of inter-procedural optimization. To be a real compiler, the work has to be done at the correct layer, so that very basic abstractions like functions work. It's already been mentioned in this thread, but the compiler plugin-based approach seems to be a more promising direction.

Inside there would be something like Tensor.indices which is a tuple of these AbstractIndexTypes: Dense, Compressed, Range, Offset, Hash, Singleton, Bitmap.

These are exactly what I meant by layout information. The point of type-erasure is twofold: (1) you don't need to distinguish xs::Vector and @view xs[begin:end] for computation (this one is rather trivial); (2) it'd let users express their custom data structures based on the composition of primitive memory layouts.

You could think of some function sync which takes in 2+ tensors and produces an iterator over them.

Iterators are state machines. Complex state machines are not compiler friendly, and writing correct state machines is not very easy either. An alternative approach is to use higher-order functions, which turned out to be a very friendly interface to both the compiler and the users. This can probably be understood by recognizing that higher-order functions are essentially control-flow graph manipulation tools. They let users move around basic blocks in a safe way, without exposing any of the guts of the compiler.
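As a tiny illustration of the difference (not JuliaFolds code): instead of handing out a stateful iterator over the stored entries, the container exposes a fold, and the caller passes in the loop body and the reduction operator, so associativity stays visible to the implementation.

using SparseArrays

# higher-order style: the container drives the loop, the caller supplies the body;
# the reduction operator is explicit, so it could in principle be reassociated,
# split across threads, or vectorized by the implementation
function foldstored(op, f, init, x::SparseVector)
    acc = init
    for k in eachindex(x.nzind)
        acc = op(acc, f(x.nzind[k], x.nzval[k]))
    end
    acc
end

# sparse-dense dot product written against this interface
hofdot(x::SparseVector, y::AbstractVector) =
    foldstored(+, (i, v) -> v * y[i], zero(promote_type(eltype(x), eltype(y))), x)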

Another limitation in the iterator-based approach is that it is not straightforward to leverage the properties like associativity and commutativity in the reduction (which are essential for parallelism and vectorization).

(That said, I've been thinking about generating fold-equivalent code from type-level state machines. But this is tricky to do correctly. For example, it can generate invalid code on GPU if it is used with certain synchronization mechanisms.)

But the reason I didn't comment on the higher-order function approach initially was that it is much less structured than the computations describable by einsum. Since the key to enabling optimizations is to add more structure to the program, composing a pre-defined set of Tensor.indices like you have described sounds like a more fruitful approach in the domain where Tullio.jl shines.


rayegun commented Aug 17, 2021

@tkf this is very enlightening thank you.

What TACO does is take tensors whose layout is essentially [Tuple{AbstractIndex}, values], plus the user's einsum notation, and then move through a couple of stages.

  1. "Concretize" the einsum notation, just a simple rewrite to put indices into prefixes (forall_i, forall_j(a_i...) essentially)
  2. Scheduling transforms on the IR from step 1 (split/collapse/reorder/etc)
  3. Emit imperative loops.

Step 3 uses known properties of each AbstractIndex type to emit different access patterns for each dimension.
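For a concrete picture of what step 3 emits, the classic Dense-over-Compressed case lowers to the familiar compressed-storage matrix-vector loop nest. Hand-written here against Julia's column-major SparseMatrixCSC (so it computes y = A' * x), just to show the shape of the emitted loops, not TACO's actual output:

using SparseArrays

# Dense outer level (every column), Compressed inner level (only stored entries)
function dense_over_compressed_mul!(y, A::SparseMatrixCSC, x)
    fill!(y, zero(eltype(y)))
    for j in 1:size(A, 2)                          # Dense level
        for p in A.colptr[j]:(A.colptr[j+1] - 1)   # Compressed level: ptr -> crd range
            y[j] += A.nzval[p] * x[A.rowval[p]]
        end
    end
    y
end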

As I understand it you're essentially recommending that Tullio (eventually) accept AbstractArrays that are either already represented in terms of a tree* of AbstractIndexs, or can be lowered to that form?

*Note: By tree I mean that the coordinates of a tensor are represented as a "levelized" tree, where each level/dimension is an AbstractIndex which defines how you arrive at positions or coordinates in the next dimension/level of the tree.

On the iteration subject do you think I should just ignore "non-einsum" interfaces for now? It's a much harder problem, and probably better solved with a good foundation/experience in Tullio.

@AriMKatz

@Wimmerer have you seen https://arxiv.org/abs/2104.05372

@mcabbott

Did you consider going the other way around

I did briefly! The approach of https://github.com/shashi/ArrayMeta.jl was to lower index notation to types in the macro, and then do everything in generated functions. And my lesson from tinkering with it is that this is just much harder to work with, so for now Tullio is an experiment to see how far you can get just generating code you can look at. (It's possible this will eventually run out of steam, e.g. while it's easy to do first derivatives up front, if you want an N-th derivative maybe that has to be generated on demand.) But I'm not quite sure if that generated function approach is what you had in mind.

Right now even the tiling and multi-threading just works by dividing the longest index range in half, repeatedly, using runtime size but not type. That seems like the first place that it would pay to know the strides of dense arrays, but in examples where I tried to exploit this, I didn't see clear gains.

Of course using LoopVectorization to generate the innermost kernels means that for SIMD it does know the strides. I presume that Tullio could equally well hand this task to any more compiler-integrated replacement library.

Tensor.indices which is a tuple of these AbstractIndexTypes: Dense, Compressed, Range, Offset, Hash, Singleton, Bitmap.

If the index types of matrix A are returned by axes(A), which means they must be <:AbstractUnitRange but can store whatever else they wish, then it would be very easy for Tullio to propagate these through to... some function like tacolike((i,j,k) -> A[i,k]*B[k,j], (axes(A,1),axes(B,2)), +, (axes(A,2),)), perhaps? Would that capture enough information, or if not, what would it miss? The reduction index is at present just checked for equality, and axes(B,1) thrown away, but this could easily pass through some function which (for these special index types) records both.

For dense arrays, this is enough information. I did try using some map like (i,j,k) -> A[i,k]*B[k,j] with CartesianIndices, but struggled to get it as fast as simple loops.
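For what it's worth, a bare-bones dense-only version of that hypothetical tacolike, with the signature above, might be (slow on purpose, assuming 1-based axes; purely to pin down the calling convention):

# tacolike(f, outaxes, op, redaxes): loop the output axes, reduce over the
# contracted axes with op; a reference sketch of the interface, nothing more
function tacolike(f, outaxes::Tuple, op, redaxes::Tuple)
    map(CartesianIndices(outaxes)) do I
        acc = nothing
        for K in Iterators.product(redaxes...)
            v = f(Tuple(I)..., K...)
            acc = acc === nothing ? v : op(acc, v)
        end
        acc
    end
end

# C = tacolike((i,j,k) -> A[i,k] * B[k,j], (axes(A,1), axes(B,2)), +, (axes(A,2),))  # ≈ A * B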

the higher-order function approach [...] is much less structured than the computations describable by einsum

Can you expand on this a little? Are you picturing something like the tacolike thing I wrote, or something more general?

kjolstad-thesis.pdf is the thesis I'm basing this on

I see from a very quick glance at the thesis that, while it builds many ways to run and transform these calculations, it does not say much about how to automatically choose the best one. Tullio at present provides very few knobs to control such things, as crude heuristics (and LV's modelling) seem to work OK. But if this is to be user-specified, that's another level of interface to think about.


rayegun commented Aug 17, 2021

@AriMKatz I've seen it but not read it yet, but I will shortly :), you've provided quite a useful reading list!

it does not say much about how to automatically choose the best one.

That thesis does not do much optimization; it also doesn't use workspaces etc. There are 5+ more theses from other people in that lab in 2020 that provide some of that. I think maximizing the number of knobs available to Tullio is something this can help provide.


rayegun commented Aug 17, 2021

the work has to be done at the correct layer, so that very basic abstractions like function work

I'd be interested to have a call about this, and Tullio at some point in the next 1-2 weeks.

Essentially I have the following question:

How should type information be provided to my code? A generated function? Abuse of multiple dispatch? Is there a better way to do this now with compiler plugins @tkf? For generating the iteration graph I need at least types for the axes (and probably 5+ parameters for each for different settings like ordered) and types for operators (for operator properties).

For now I will probably verify that things work using multiple dispatch on a few trivial cases (CSR/CSC), and then move on to a function like taco(tullioExpr, args...), where args... are the actual arrays, that returns the imperative code as an expression. That's a reasonable PoC, and requires most of the iteration lattice/graph infrastructure anyway.

Stages:

  1. Tullio syntax
  2. Concrete IR: forall_i forall_j forall_k a_ij = ...
  3. Scheduling transforms (need type info here) -> stage 2
  4. Imperative loops / other backends (need more type info here)


AriMKatz commented Aug 17, 2021

I think it would be useful to loop in the compiler plugin people @aviatesk @femtomc and @Keno for the call as well, because IIUC those layers (to support the stages you mentioned) are still being developed, and I think the idea is for them to co-evolve with applications like this. They can also help answer those questions.

Also, Keno had interest in developing a similar TACO-like compiler.

Perhaps also the other folks working on structured loop representations/codegen such as @chriselrod and @vchuravy to share ideas and potentially infrastructure.

More prior art:

https://openai.com/blog/triton/ (super interesting approach)
https://github.com/JuliaGPU/GemmKernels.jl
https://github.com/JuliaLabs/Poly.jl

Though I think the dex "pointful" paper is really promising.


tkf commented Aug 17, 2021

@Wimmerer

I'm happy to join :) Please feel free to ping me on discourse/slack/zulip.

As I understand it you're essentially recommending that Tullio (eventually) accept AbstractArrays that are either already represented in terms of a tree* of AbstractIndexs, or can be lowered to that form?

Yes, that's what I was thinking!

On the iteration subject do you think I should just ignore "non-einsum" interfaces for now? It's a much harder problem, and probably better solved with a good foundation/experience in Tullio.

I guess it depends on your interest? I do think "non-einsum" iteration interface is interesting too and that's basically what I do in JuliaFolds. But I think focusing on einsum has large benefits. I wouldn't call einsum "easier." Certain things are simplified (e.g., you probably aren't going to support break) but then you have more optimization opportunities (e.g., SIMD, flexible tiling) and so it might require more machinery in your framework.

Also, an array-focused approach can be extended in other directions, like handling jagged arrays: e.g., "Jagged, ragged, awkward arrays" by Jim Pivarski - YouTube (https://awkward-array.org). Though it's not a compiler project and rather a conventional Pythonic library that wraps C++ kernels.

@mcabbott

But I'm not quite sure if that generated function approach is what you had in mind.

Right, an approach heavily relying on generated functions is not quite to my taste :) If I only need CFG manipulation, I'd just use higher-order functions (though they have their own pain, as you know). I find that easier to work with since everything is plain Julia code with the full capacity of Julia. I think generated functions are in a rather unfortunate middle ground: you can't quite play the role of the compiler (e.g., inter-procedural analysis for cost estimation), while also not being transparent to the compiler and the user about how the code is generated.

the higher-order function approach [...] is much less structured than the computations describable by einsum

Can you expand on this a little? Are you picturing something like the tacolike thing I wrote, or something more general?

Do you mean what's possible with an approach focusing on einsum? What I had in mind is somewhat similar to tacolike, but I'd tweak the lowering a little bit so that you can generate a more flexible kernel:

struct Axes{InputIndex, DimIndex} end
tacolike((I, J) -> A[I] * B[J], (A, B), (Axes{1, 1}(), Axes{2, 2}()), +, (Axes{1, 2}(),))

(and maybe first do LA = lower(A) or sth)

For example, for dense-sparse dot product, we can iterate over the stored indices, wrap it into a special type I, and dispatch it on getindex (or maybe use Tacolike.get(A, I) for easier ambiguity resolutions). The memory layout of A and B can be analyzed in tacolike using some trait(-ish) functions.
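A toy version of that wrapped-index protocol (all names hypothetical): the kernel walks the sparse argument's stored positions, wraps each one as a StoredIndex carrying both the logical index and the storage position, and getindex methods pick whichever is useful.

using SparseArrays

struct StoredIndex
    i::Int   # logical index
    p::Int   # position into the sparse vector's nzind / nzval
end

# the sparse owner of the stored positions reads its value directly ...
Base.getindex(x::SparseVector, I::StoredIndex) = x.nzval[I.p]
# ... anything else falls back to plain indexing at the logical index
Base.getindex(y::AbstractVector, I::StoredIndex) = y[I.i]

# dense-sparse dot product written against this protocol
function wrappeddot(x::SparseVector, y::AbstractVector)
    s = zero(promote_type(eltype(x), eltype(y)))
    for p in eachindex(x.nzind)
        I = StoredIndex(x.nzind[p], p)
        s += x[I] * y[I]
    end
    s
end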

Of course, to really automate the scheduling, you'd need to analyze what the anonymous function and getindex does internally. It requires dealing with Julia IR or lower. But being able to manually specify the scheduling easily is already very useful, as https://halide-lang.org/ has demonstrated. (Which requires an additional "scheduler" input to tacolike.)


rayegun commented Aug 17, 2021

Forgive my ignorance here: if tacolike is a function, why should I take axes as arguments at all? I have access to A and B, which must store these axes anyway (the Dense axis for instance just stores an Integer size, but the Compressed axis type stores ptr and crd vectors). And the first argument should make clear how I, J, etc. are associated with those axes.


mcabbott commented Aug 17, 2021

why should I take axes as arguments at all?

A and B have 4 axes, and we're only going to iterate over 3 of them. But which 3 would differ if (say) I wrote A' * B' as C[i,j] := A[k,i] * B[j,k]; the anonymous function would change but tacolike can't see inside that. My thinking (and Tullio's) is that if these are explicitly handed to the function, and the one being reduced over is kept separate, then the function can sub-divide them among threads, etc, recursively.

Takafumi's version (if I'm reading it right) just passes tags which tell you which axis of which argument to look at. The reason Tullio doesn't do that is that it wants to allow e.g. A[i+1] - A[i-1], for which it shifts & intersects these axes, so the iteration space doesn't correspond to any of the originals. In that scheme perhaps intersect etc. could be overloaded for these new special axis types. Although the range of allowed operations is not so large, they could also be encoded by generalising this Axes{1,1} marker a bit.
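Concretely, for y[i] := A[i+1] - A[i-1] the macro shifts each argument's axis by the offset and intersects, so the loop runs over a range that matches neither original axis (a plain-Julia illustration of the range arithmetic, not the real lowering):

A = collect(1.0:10.0)

# i+1 must be a valid index  =>  i in axes(A,1) .- 1
# i-1 must be a valid index  =>  i in axes(A,1) .+ 1
irange = intersect(axes(A, 1) .- 1, axes(A, 1) .+ 1)   # here 2:9

y = [A[i+1] - A[i-1] for i in irange]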

My question here is whether individual axes can contain enough information, or whether to interpret them correctly you've got to see the whole A. I mean re things like this:

Each of these is essentially a different kind of map from positions/coordinates in the previous dimension to positions/coordinates in the next, with values being at the end


rayegun commented Aug 17, 2021

Ah, I'm operating under the assumption that the first argument is an Expr, not an anonymous function. Essentially the anonymous function is what's being done within the loops? Ok.

Though to fully complete the project I really do need to know what that anonymous function does. Or at least I need a bunch of knobs that allow some way to specify that higher up. Things like semirings and other operator properties.

Edit: It seems then that tacolike would only have access to the axis types, their ordering, and which appear in the input/output. I'd have to think about how much TACO (the original project) does with the actual expression vs just that limited information.

I'd prefer probably to operate on the expression and the type information since that's more or less a direct port of TACO, which frees me to work on other parts. But I'm not sure yet.


rayegun commented Aug 17, 2021

My question here is whether individual axes can contain enough information, or whether to interpret them correctly you've got to see the whole A. I mean re things like this:

I don't think I necessarily need info about the whole of A at once, so long as I know which axes are currently being coiterated from A, B, C, ... But there are lots of transformations that do use the entire iteration graph, and TACO is constructed with full information. I'd have to do a ton more research (time I'd rather spend on other extensions) rather than just porting their Expr -> IR -> Iteration Graph -> Imperative sequence.


tkf commented Aug 17, 2021

The reason Tullio doesn't do that is that it wants to allow e.g. A[i+1] - A[i-1], for which it shifts & intersects these axes, so the iteration space doesn't correspond to any of the originals.

Ah, it makes sense.

I think there might be a way around this, but my impression is that the anonymous function-based approach won't work anyway since, e.g., you might need to look at A[i,j] * sin(x[i]) and pre-compute sin.(x) or use non-continuous access. (@Wimmerer Is this a part of your motivation for a more AST-based approach?)


mcabbott commented Aug 17, 2021

pre-compute sin.(x)

Good point. This is something Tullio does not try to do, at all, at the moment. Assuming that sin is pure & that evaluating it N^2 or N times (or 4N with 4 threads...) should be allowed seems fine. But implementing that properly... I can see why you'd want a more formal representation.

(If you write A[i,j] * (sin.(x))[i], BTW, it will pull that out & do the broadcast exactly once. But that's a bit of a hack.)


rayegun commented Aug 17, 2021

I hadn't thought much about optimizations like that yet, but doing that is definitely a goal.

non-continuous access

What do you mean by this?

Skimming through the papers again I'm fairly convinced I need to know 1. the full expression and 2. the types of each axis (A[i] and b[i] might be totally different axis types, so I really need the full types of A and b).

Further down the line I probably need the types of operators as well, so users can define trait-like properties on them. There's other ways to do that part though.
