0.2.0

TODO: FFT: clean up code, check that there are no warnings during compilation
TODO: FFT: check why non-power-of-2 performance is so bad
TODO: add memory packing functionality to OperationRecorder
TODO: cool feature: process the list and remove unnecessary allocations, replacing them by creating views
When doing 'pip uninstall' it shows the full list of files, when other modules just show a list of folders. Should be something connected to MANIFEST and package data.

0.3.0 (suitable for beclab)

TODO: add DHT
TODO: add MD5 randoms
TODO: Flatten kernel list before execution, and assign argument numbers
TODO: add "raises" sections to Computation/OperationRecorder methods
TODO: add custom variable names to Transformation constructor (inputs=['Vect1', 'Vect2'], outputs=1, scalars='term1')
DECIDE: move all "raw" computations to their own submodule?
DECIDE: do we need to specify "supports inplace" option for kernels? It seems that FFT does not really need it, because there is no cases when there's only one kernel in list, and it does dot support inplace, so one can safely pass same buffer as output and input. But it may be critical for other kernels.

0.4.0

TODO: move part of core.transformation to a template
TODO: add custom render keywords for transformations (will help e.g. in tigger.transformations)
TODO: create some elementwise computations derived from Elementwise
TODO: document _debug usage
TODO: add "dynamic regime"
TODO: run coverage tests and see if some functionality has to be tested, and check existing testcases for redundancy (fft and vsizes in particular)
TODO: run pylint
TODO: create "fallback" when if _construct_operations() does not catch OutOfResources, it is called again with reduced local size
DECIDE: how to handle cases when the name of a new endpoint requested by connect() is the same as one of the argument names in a nested computation (which the user is not supposed to know or care about)?

1.0.0 (production-quality version... hopefully)

Website/documentation:

TODO: extend starting page (link to issue tracker, quick links to guides, list of algorithms, quick example)

CLUDA:

DECIDE: does the forceful enabling of double precision in OpenCL somehow change the performance for single precision?
DECIDE: Is there a way to get number of shared memory banks and warp size from AMD device?
DECIDE: what are we going to do with OpenCL platforms that do not support intra-block interaction? (for example, Apple's implementation)
DECIDE: make dtypes.result_type() and dtypes.min_scalar_type() depend on device?
DECIDE: change type of id()/size() functions to size_t in case of CUDA?
TODO: find a way to get min_mem_coalesce_width for OpenCL
TODO: add a mechanism to select the best local size based on occupancy

Core:

DECIDE: drop strict positioning of outputs-inputs-params and just set argument types?
CHECK: check for errors in load/stores/param usage when connecting transformations? Alternatively, return more meaningful errors when accessing load/store/parameter with the wrong number.
CHECK: check for errors in load/stores/param usage in kernels? Need to see what errors look like in this case.
CHECK: check correctness of types in Computation.__call__() if _debug is on
CHECK: check that types of arrays passed to prepare_for()/received from _get_base_signature() after creating a basis are supported by GPU (eliminates the need to check it in every computation)
TODO: remove unnecessary whitespace from the transformation code (generated code will look better)
TODO: add a global DEBUG variable that will create all computations in debug mode by default
TODO: add usual transformations and derivation functions for convenience
TODO: take not only CLUDA context as a parameter for computation constructor, but also CommandQueue, opencl context, cuda stream and so on.
TODO: optimize allocation calls throughout the context
TODO: cache results of _construct_operations based on the basis, device_params, argnames and attached transformations
DECIDE: see if it's possible to reuse input/output parameters as bases for temporary allocations. In order to do that the computation (and initially the user) has to provide hints that the input array can be overwritten. Also one has to take into account possible padding and itemsize changes in the transformations. Actually, this looks too convoluted and with the support for shared temporary allocations in the single context it is almost useless (unless you want to FFT half of the video memory).

Computations:

CHECK: need to find a balance between creating more workgroups or making loops inside kernels (can be applied in elementwise kernels)
TODO: add DCMT and 123 randoms
TODO: add bitonic sort
TODO: add filter
TODO: add better block width finder for small matrices in matrixmul
TODO: add radix-3,5,7 for FFT

1.*

CLUDA:

TODO: add support for rational numbers (based on int2)

Core:

DECIDE: Some mechanism to merge together two successive Computation calls. Will require an API to tell tigger that certain computations are executed together, plus some way to determine if the computation is local and elementwise (otherwise the connection will require the change of code).

2.*

Computation provider

Library that by request (perhaps, from other languages) returns kernels and call signatures for algorithms, using Python as a templating engine. Namely, it returns:

A list of kernels to be executed in turn.
Signatures for each of the kernels (with named parameters and their types).
Call parameters for each of the kernels (grid, block, shared memory).
List of memory blocks to allocate and their names (which will be used to pass them to kernels when necessary according to signatures).

Problems:

More involved algorithms cannot be passed between languages this way (the ones that requires synchronization in the middle, like adaptive-step ODE solver, for example).
Need to somehow pass device/context to this library from the caller. The kernels will have to be compiled in order to find out the register usage.
How to pass type derivation lambdas? Possible solution: limit derivations to <same_as>(x), <definite_type>, <complex_for>(x), <real_for>(x) and create some micro-DSL for passing these as strings.

Transformation DSL

Currently transformation code is quite difficult to read and write. Perhaps some DSL can be devised to make it easier? Even better, if that DSL could be applied to kernels too. Take a look at: * Copperhead (Python-based DSL for GPGPU) * CodePy (Python -> AST transformer) * Clyther (subset of Python -> OpenCL code) * https://github.com/mdipierro/mdpcl (same)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TODO.rst

TODO.rst

0.2.0

0.3.0 (suitable for beclab)

0.4.0

1.0.0 (production-quality version... hopefully)

1.*

2.*

Computation provider

Transformation DSL

Files

TODO.rst

Latest commit

History

TODO.rst

File metadata and controls

0.2.0

0.3.0 (suitable for beclab)

0.4.0

1.0.0 (production-quality version... hopefully)

1.*

2.*

Computation provider

Transformation DSL