Allowing custom allocators for CpuStorage #419
External allocators

Instead of allowing external allocators, my plan is to have tensors just be a ptr + length view (or openarray, if we get openarrays as values, nim-lang/RFCs#178). The idea is that in many cases users will have buffers in vector, matrix or even nd-array format, and we would be able to zero-copy between Arraymancer and NumPy, PyTorch, TensorFlow, ... This would also enable custom allocators. The work is actually done but was blocked by a showstopper bug 2 months ago (which has since been resolved): Arraymancer/src/laser/tensor/datatypes.nim, lines 12 to 30 in d3f078a
And an example allocator: Arraymancer/src/laser/tensor/allocator.nim Lines 11 to 30 in d3f078a
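For readers without the referenced snippets inline, here is a minimal sketch of the idea (assumed names and shapes, not the actual contents of datatypes.nim or allocator.nim): the storage only records a raw pointer, a length and whether it owns the memory, so foreign buffers can be wrapped without copying.

# Hedged sketch only: type and proc names are assumptions, not Arraymancer's API.
type
  RawView*[T] = object
    data*: ptr UncheckedArray[T]  # pointer into memory we may not own
    len*: int                     # number of elements
    isOwner*: bool                # only owned memory is ever freed by us

proc fromForeign*[T](p: pointer, len: int): RawView[T] =
  ## Zero-copy view over a buffer allocated by NumPy, PyTorch, a custom
  ## allocator, etc. The foreign side keeps ownership.
  RawView[T](data: cast[ptr UncheckedArray[T]](p), len: len, isOwner: false)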
Coarrays / PGAS / Distributed computing

I'm not familiar with Fortran coarrays. I did find a couple of libraries with PGAS while implementing Weave (https://github.com/mratsim/weave), in particular:
Unfortunately I don't have access to a cluster, but Weave was built from the ground up with an architecture that can support distributed computing. It uses channels and message passing even at the single-node level (and it's faster than OpenMP / Intel TBB).
Your approach to external allocators looks like exactly what I need. However, how would a user supply a custom allocator when constructing an array?

I've taken a look at Weave and what you have done is very impressive (although some of the theoretical background goes over my head). However, if I understand correctly, data parallelism is currently achieved by sending pointers to the captured variables to each thread. Clearly this would not work for distributed memory, as that location in memory on the remote processor would not contain the same data. Instead it would be necessary to actually perform copies by value. For large tasks this almost certainly would not be efficient; most of the memory could be kept permanently on the remote processor, with only small amounts needing to be communicated.

As an example, consider the heat-diffusion problem you use for benchmarking. You create a separate task for each row in the grid of data. This works just fine with shared memory, as threads can always read the heat values from adjacent rows when evolving the system forward in time. With distributed memory, it becomes necessary to perform a (potentially expensive) remote communication when reading from adjacent rows located on separate processors. Creating a very large number of tasks would thus be unsuitable (unless some very clever macro magic can be used to optimise the grouping of tasks/data). Copying the whole grid to each task/processor would also likely be too expensive.

The conventional way to solve this with MPI is to store a subsection of the grid on each processor and to communicate only the shared boundaries to adjacent processors. This requires one processor to send data and another to receive it, with care taken to avoid deadlocks or inefficient use of blocking communication. Coarrays (and other forms of PGAS) simplify things by making the communication "one-sided": either a processor "gets" some data from a remote one, or "puts" some data on it, without having to worry about the other processor making a corresponding call. Various types of barriers can be used to ensure processors remain sufficiently in sync. An example of coarrays in use can be found here:
As far as I could tell, none of the libraries you referenced use a PGAS model, although HPX uses something they call an "Active Global Address Space", the details of which I confess I'm a bit hazy on. It appears that these libraries handle a lot of the decisions about which processors to store data on, what to communicate, what to mirror, etc.

In any case, the framework I was planning to use is called OpenSHMEM (Open MPI provides a free implementation). This is one of the backends which has been used to implement coarrays in Fortran, and its functionality closely mirrors what I'd like to implement. I don't think I'll have time tonight, but I'll try to provide an example of what your heat-diffusion benchmark in Weave would look like if written to use coarrays as I would like to implement them in Nim.
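In the meantime, here is a rough sketch of the one-sided halo-exchange pattern described above (not the promised heat-diffusion port). The remoteGet/syncAll procs are placeholders standing in for real OpenSHMEM-style primitives and are declared here as no-op stubs; the names are assumptions, not an existing Nim binding.

# Stubs standing in for one-sided primitives (e.g. shmem_getmem / shmem_barrier_all).
proc remoteGet(dst: var seq[float], dstOffset: int,
               srcRank, srcOffset, count: int) = discard
proc syncAll() = discard

proc exchangeHalos(grid: var seq[float], nOwnedRows, nCols, myRank, nRanks: int) =
  ## Row-major layout per rank: [lower ghost row | nOwnedRows owned rows | upper ghost row].
  if myRank > 0:
    # Pull the left neighbour's last owned row into our lower ghost row,
    # without the neighbour posting a matching receive.
    remoteGet(grid, dstOffset = 0,
              srcRank = myRank - 1, srcOffset = nOwnedRows * nCols, count = nCols)
  if myRank < nRanks - 1:
    # Pull the right neighbour's first owned row into our upper ghost row.
    remoteGet(grid, dstOffset = (nOwnedRows + 1) * nCols,
              srcRank = myRank + 1, srcOffset = nCols, count = nCols)
  syncAll()  # barrier: every rank sees consistent ghost rows before the next time step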
Allocator interface
Ideally I would reuse what @Araq had in mind for custom allocators for sequences in Nim, but AFAIK that has been scrapped, so I have to think of something. Currently (assuming the branch is merged) you would allocate a buffer, and then I have to think about what kind of interface I want to offer for custom allocators for Arraymancer routines. Passing an allocator or a context to each proc would be very tedious and/or strange (reshaping might allocate, for example), but using a global might cause issues when we want to mix local tensors and distributed ones. If you have suggestions of libraries that have good mixed local + distributed APIs I'm interested; otherwise I will look into TensorFlow (which AFAIK has 3+ ways of doing the same thing ...) and PyTorch.
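One possible middle ground, sketched under assumed names (not a committed design): record the allocator inside the storage itself, so routines that allocate (e.g. a copying reshape) reuse the allocator of their input, and neither an extra parameter on every proc nor a process-wide global is needed.

# Hypothetical sketch: the Allocator/Storage names are illustrative only.
type
  Allocator* = object
    allocFn*: proc(byteSize: int): pointer {.nimcall.}
    deallocFn*: proc(p: pointer) {.nimcall.}

  Storage*[T] = object
    data*: ptr UncheckedArray[T]
    len*: int
    allocator*: Allocator        # remembered for later allocations / frees

proc newStorage*[T](n: int, a: Allocator): Storage[T] =
  Storage[T](data: cast[ptr UncheckedArray[T]](a.allocFn(n * sizeof(T))),
             len: n, allocator: a)

proc release*[T](s: var Storage[T]) =
  if not s.data.isNil:
    s.allocator.deallocFn(s.data)
    s.data = nil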
Weave - distributed sync

The distributed communication library would have to handle copies (it could be one internal to Weave, or MPI). Reduction operations which require synchronization for in-place updates would also need specific handling; I'm not sure how MPI deals with them. Note that Weave has an experimental generic reduction scheme that guarantees ordering and avoids requiring locks and/or atomics, and so can serve as a good basis for distributed reduction: https://github.com/mratsim/weave/blob/052ae40a/weave/parallel_reduce.nim#L240-L260

import weave

block:
  proc sumReduce(n: int): int =
    var waitableSum: Flowvar[int]

    # expandMacros:
    parallelReduceImpl i in 0 .. n, stride = 1:
      reduce(waitableSum):
        prologue:
          var localSum = 0
        fold:
          localSum += i
        merge(remoteSum):
          localSum += sync(remoteSum)
      return localSum

    result = sync(waitableSum)

  init(Weave)
  let sum1M = sumReduce(1000000)
  echo "Sum reduce(0..1000000): ", sum1M
  doAssert sum1M == 500_000_500_000
  exit(Weave)

To be clear, Weave is a scheduler. You give it tasks and constraints, either implicit (this task will spawn other tasks) or explicit (pledges/data dependencies, more below), and it will schedule them on the hardware to take the best advantage of the parallelism available and maximize your throughput. The core is built agnostic to the data synchronization scheme; it just requires that there are channels to send tasks and steal requests between workers, and that when a task is scheduled the data it requires is present (but movement of data can be triggered by sending a task to a thief).

Weave: as a base to build distributed parallelism models

Programming models like Legion's dependent partitioning or DACE, which make optimizing memory access on distributed and/or heterogeneous architectures a first-class concern, could generate the algorithm and the copy schedule (what to copy, where, and what the dependencies are before triggering a dependent task) for Weave to execute.

import os  # for sleep in the examples below

block: # Delayed computation
  proc echoA(pA: Pledge) =
    echo "Display A, sleep 1s, create parallel streams 1 and 2"
    sleep(1000)
    pA.fulfill()

  proc echoB1(pB1: Pledge) =
    echo "Display B1, sleep 1s"
    sleep(1000)
    pB1.fulfill()

  proc echoB2() =
    echo "Display B2, exit stream"

  proc echoC1() =
    echo "Display C1, exit stream"

  proc main() =
    echo "Sanity check 3: Dataflow parallelism"
    init(Weave)
    let pA = newPledge()
    let pB1 = newPledge()
    spawnDelayed pB1, echoC1()
    spawnDelayed pA, echoB2()
    spawnDelayed pA, echoB1(pB1)
    spawn echoA(pA)
    exit(Weave)

  main()
block: # Delayed computation with multiple dependencies
  proc echoA(pA: Pledge) =
    echo "Display A, sleep 1s, create parallel streams 1 and 2"
    sleep(1000)
    pA.fulfill()

  proc echoB1(pB1: Pledge) =
    echo "Display B1, sleep 1s"
    sleep(1000)
    pB1.fulfill()

  proc echoB2(pB2: Pledge) =
    echo "Display B2, no sleep"
    pB2.fulfill()

  proc echoC12() =
    echo "Display C12, exit stream"

  proc main() =
    echo "Sanity check 4: Dataflow parallelism with multiple dependencies"
    init(Weave)
    let pA = newPledge()
    let pB1 = newPledge()
    let pB2 = newPledge()
    spawnDelayed pB1, pB2, echoC12()
    spawnDelayed pA, echoB2(pB2)
    spawnDelayed pA, echoB1(pB1)
    spawn echoA(pA)
    exit(Weave)
  main()

So the distributed computation is only triggered when the data is ready, and tasks can notify when they are done with their computation.

Heat benchmark

Regarding the heat benchmark, it's the well-known parallel heat benchmark from Cilk (1996), so it probably already exists in Fortran somewhere.

Conclusion

If you have primitives that:
That said, that still doesn't solve Arraymancer's allocator API woes ;).
Custom allocators for

So, something like the following:

import math, arraymancer
const
  x = @[1, 2, 3, 4, 5]
  y = @[1, 2, 3, 4, 5]

var
  vandermonde: seq[seq[int]]
  row: seq[int]

vandermonde = newSeq[seq[int]]()

for i, xx in x:
  row = newSeq[int]()
  vandermonde.add(row)
  for j, yy in y:
    vandermonde[i].add(xx^yy)

# possibly optionally include a reallocator?
let foo = vandermonde.toTensor(myCustomAllocator, myCustomDeallocator)

# bar would use whatever the default allocator is (e.g., aligned memory)
let bar = vandermonde[1..3].toTensor()

# baz would use the allocators of the first argument (e.g., myCustomAllocator, etc.)
let baz = foo[3..5, _] + bar

A slight variation on this would be to define an object type which contains pointers to these procedures and to pass the object instead. This would reduce the number of arguments and ensure that consistent allocators/deallocators are always provided.
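A minimal sketch of that variation (the TensorAllocator type and the toTensor overload are hypothetical, not existing Arraymancer API; the shmem* procs are stand-ins backed by Nim's shared allocator for illustration):

type
  TensorAllocator* = object
    allocate*: proc(byteSize: int): pointer {.nimcall.}
    deallocate*: proc(p: pointer) {.nimcall.}

proc shmemAlloc(byteSize: int): pointer {.nimcall.} =
  # stand-in: a real implementation would call e.g. shmem_malloc from OpenSHMEM
  allocShared(byteSize)

proc shmemDealloc(p: pointer) {.nimcall.} =
  deallocShared(p)

let myAllocators = TensorAllocator(allocate: shmemAlloc, deallocate: shmemDealloc)
# hypothetical overload taking the object instead of two separate procs:
# let foo = vandermonde.toTensor(myAllocators)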
This isn't so much a feature request, more an attempt to gauge the level of interest in working on something like this (plus an offer to help with it).
I am currently thinking through how to implement something like Fortran coarrays for parallel programming in Nim.* Ideally these could build on top of Arraymancer tensors (although they'd be in an independent library). However, in order to achieve the "Partitioned Global Address Space" memory model, I would need to be able to allocate tensor memory using specialised routines provided by low-level communication libraries written in C.
So my question is whether there would be interest in providing some means for the user to specify a custom memory allocator for CpuStorage. I haven't completely thought through how to achieve this yet. Perhaps Tensor could be made generic on the storage type as well as the element type, although that might look rather messy for those just wanting to use the default allocator.

This could have uses well beyond a coarray library. It would allow experimentation with different types of allocators (as has been suggested in #112) and could allow the user to choose the one most appropriate for their task.
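For illustration, a rough sketch of what "generic on the storage type" could look like (the type names here are hypothetical and simplified; the real Tensor is generic only on its element type):

type
  DefaultStorage*[T] = object
    data*: seq[T]                    # default heap allocation

  ShmemStorage*[T] = object
    data*: ptr UncheckedArray[T]     # memory from a PGAS / OpenSHMEM allocator
    len*: int

  GenericTensor*[S] = object
    shape*: seq[int]
    strides*: seq[int]
    storage*: S

# The common case could be hidden behind an alias so default users never
# see the extra type parameter:
type DenseTensor*[T] = GenericTensor[DefaultStorage[T]]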
*While libraries already exist for multithreading (shared-memory parallelism), I don't know of any which provide for distributed-memory parallelism. Shared memory can currently only scale to on the order of 100 processors, but some tasks require thousands of cores and must therefore use distributed-memory parallel technologies. Of these, the most widely used is MPI, but this is a difficult programming model to work with. Partitioned Global Address Space models, such as Fortran's coarrays, are conceptually and syntactically much simpler.