Low Level

ARC: Automatic Reference Counter

ARC describes a way of handling pointer references. Whenever a construct is initialized or a new reference to it is created, its counter increments by one. Whenever a reference is released, the counter decreases by one. If the counter reaches zero, the memory allocated for that object can safely be freed, because it is guaranteed that nothing references this memory any more (otherwise, the counter would not have reached zero).

Flint only uses 64 Bit pointers, but the size of the ARC varies per data structure: data uses a 24 Bit ARC, while entities use a 32 Bit ARC.

DIMA: Dynamic Incremental Memory Allocation

Flint does not use a garbage collector. Instead, it relies heavily on a custom memory allocation method called DIMA, which enables highly efficient memory allocation, deallocation, defragmentation and compaction. How and why DIMA works is described in this chapter.

For one single type of data, DIMA consists of one head and multiple blocks, which in turn consist of many slots. Flint differentiates between data and entities, but they work very similarly internally. Let's first focus on everything about data and then discuss the memory management utilities for entities.

Data

Data Slot

A data slot is a data structure which holds several pieces of information, from left to right:

Data Flags (8 Bit) ARC (24 Bit) The Data itself (variable)

The eight flag bits are, from left to right:

  • 1 Bit: Empty Flag
  • 1 Bit: Array Flag
  • 1 Bit: Shared Flag
  • 1 Bit: Immutable Flag
  • 4 Bit: Unused

Because the size of a data construct is known, the size of each slot is fixed as well.

The ARC for data slots (24 Bit) is smaller than the one used for entities (32 Bit) because it is safe to assume that the same data is rarely referenced by many different pointers. In fact, through the decoupling of data, func and entities, the ARC of data will probably never exceed 10 for most use cases. 16 Bit should also be plenty for both data and entities, but it was decided that this separation of ARC sizes is more future-proof for more applications.

The data itself is stored after the flags and the ARC. This means each data slot has a size of 4 Bytes plus the data's normal size. For example, the following data construct

data Vector3i:
    int x = 0;
    int y = 0;
    int z = 0;
    Vector3i(x, y, z);

will take up 4 Bytes + 3 × 4 Bytes = 16 Bytes of memory per slot.
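
To make this layout concrete, here is a minimal C++ sketch of such a slot, assuming a bit-field rendering of the header; the struct names are illustrative and not part of Flint:

#include <cstdint>

// Header of every data slot: 8 flag bits followed by a 24 Bit ARC.
struct DataSlotHeader {
    uint32_t flags : 8; // empty, array, shared, immutable, 4 unused bits
    uint32_t arc : 24;  // reference counter for the data in this slot
};

// A slot holding the Vector3i data construct from above.
struct Vector3iSlot {
    DataSlotHeader header; // 4 Bytes
    int32_t x, y, z;       // 12 Bytes of payload
};

static_assert(sizeof(Vector3iSlot) == 16, "4 Bytes header + 12 Bytes data");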

Empty Flag

The empty flag determines whether the slot is empty. This bit flag can also be used for Opt types, where the value within that type may be absent. The empty flag is also used by the block and head to initialize new data, find empty spots for arrays and much more. The empty bit is very important for the DIMA paradigm: because it is the very first bit of the slot, checking whether a slot is free only requires reading a single, predictably placed bit, which improves performance and reduces latency.

Array Flag

This flag determines whether the slot is part of an array or not. This is mostly important for defragmentation and array resizing.

Shared Flag

This flag determines whether the data inside the slot is shared amongst multiple execution threads. If this bit is set, memory accesses to the data inside this slot are only possible through mutex systems, which guarantee safe memory access and prevent race conditions.

Immutable Flag

This flag determines whether the data inside the slot is immutable, i.e. only readable but not writable.

Data Block

A block holds all the data slots and, just like the slot itself, carries some metadata.

"ARC" (32 Bit) Block Capacity (32 Bit) The Slots

The "ARC" (32 Bit) at the beginning of a block is similar to the ARC in a slot, but it is no actual ARC. It instead, holds the data of how many slots inside the block are occupied. This counter increases with each data allocation inside the block and decreases with every deallocation of slots. If this counter reaches zero, the whole block will be deallocated, similar to the ARC in slots, hence the name.

The Block Capacity (32 Bit) stores the maximum capacity of the block. This will become important later on, when we discuss the structure of heads.

Data Head

A data head is like an "allocation manager" and it is crucial to how DIMA works. First, let's discuss the structure of a data head.

Data Size (32 Bit) Block Count (32 Bit) "ARC" (32 Bit) Default Data (variable) Block Slots (variable)

The first number (Data Size) holds the size of the to-be-stored data in Bytes. This size determines the size of all later allocated data slots and makes the creation of new blocks much simpler.

Then, the total number of active blocks (32 Bit) is saved. This value is important for resizing the data head when new blocks are created.

After that, the "ARC" (32 Bit) is saved. It contains the total number of occupied data slots across all data blocks. This value is important for the automatic defragmentation of data.

Then, the default values for newly created slots are saved. The default values are stored only once in memory, right here, which makes them fast to access when allocating new memory of that data type.

Finally, all the block slots are defined. A block slot consists of a single empty bit, signifying the absence of a block, and a pointer to the block.
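
Putting the block and head layouts together, a rough C++ sketch could look as follows; all struct and field names are assumptions for illustration, only the widths follow the text above:

#include <cstdint>

struct DataBlock {
    uint32_t occupied; // the "ARC": number of occupied slots in this block
    uint32_t capacity; // maximum number of slots this block can hold
    // ...followed by `capacity` data slots of fixed size
};

struct BlockSlot {
    bool empty;       // conceptually a single empty bit
    DataBlock *block; // 64 Bit pointer to the block
};

struct DataHead {
    uint32_t data_size;   // size of one data construct in Bytes
    uint32_t block_count; // number of active blocks
    uint32_t arc;         // total occupied slots across all blocks
    // ...followed by the default data (data_size Bytes)
    // ...followed by `block_count` block slots
};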

DIMA explained

Now that the data structures are all clear, let's explain how DIMA actually works and how it is both simple and efficient. In the beginning, the data head is created. It does not reference any blocks yet, because no blocks exist yet. When a new memory allocation is queued, the data head goes through all its blocks from left to right.

Initially, the block count is 0. This means there are no blocks and no data can be allocated. Thus the data head creates a new block and references it at slot 0 of its block slots. The capacity of a block depends on its index inside the block slots. For the first slot, a block of capacity 16 is created. The formula to calculate the block capacity is as follows:

$$\text{block\_capacity} = 2^{4 + \text{slot\_id}}$$

Inside the newly created block, the data slots are scanned from left to right and the empty bit of each is checked. If the empty bit is set, the slot is "free". The data to be initialized is then saved into this slot, and its empty bit is cleared.

This behavior continues until the block becomes full (in this case, when 16 elements are stored). When the block is full, a new block is created at slot index 1; it will be able to hold 32 elements of the data.

This is where the Incremental in DIMA comes from. Now let's look at the case where all elements of a block become empty and the block is deallocated. First, the empty bit in the block slot is set, and then the memory the block occupied is completely freed. When the next block creation is queued, "old" blocks are re-created first: their empty bit is flipped back off, and so on. Through this incremental approach, Flint enables very fast and cache-efficient accesses for larger amounts of data. When only 10 data objects are accessed, it does not matter much whether they are contiguous or not. When 1,000 data objects are accessed, it matters a great deal.
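
As a sketch, the allocation walk could look like this in C++, reusing the DataHead sketch from above; first_empty_slot and create_block are hypothetical helpers:

#include <cstdint>

void *first_empty_slot(DataHead &head, uint32_t block_id); // hypothetical
void create_block(DataHead &head, uint32_t block_id);      // hypothetical

void *allocate(DataHead &head) {
    // Scan the blocks from left to right for a slot whose empty bit is set.
    for (uint32_t id = 0; id < head.block_count; ++id) {
        if (void *slot = first_empty_slot(head, id))
            return slot; // the empty bit is cleared, defaults are copied in
    }
    // Every block is full: create the next block (re-using an empty block
    // slot first, if one exists) with capacity 2^(4 + slot_id).
    uint32_t next = head.block_count; // capacity 16, 32, 64, ...
    create_block(head, next);
    return first_empty_slot(head, next);
}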

Defragmentation

As time goes on, the data becomes more and more fragmented across the many data blocks. To counter this, defragmentation is used. This is why the "ARC" counter in the data head is needed. A simple rule is applied: when there are fewer total allocated data objects than the biggest data block has space for, a defragmentation is initiated. The defragmentation packs the data slots more densely together, removes the empty spaces and then deletes the biggest block. Defragmentation happens automatically, so developers do not need to worry about it.
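
Expressed as a short C++ sketch under the same assumptions as above (and assuming the biggest block is the most recently created one), the trigger rule reads:

// Defragment when all live objects would fit even without the biggest block.
bool should_defragment(const DataHead &head) {
    if (head.block_count == 0)
        return false;
    uint32_t biggest_capacity = 1u << (4 + head.block_count - 1);
    return head.arc < biggest_capacity;
}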

Entities

An entity works similarly to data in memory, but its structure is generally different from data structures. Entities share functionality, but each entity has its own data. This makes the size of entities very predictable: the size of an entity is directly proportional to the number of used data constructs, i.e. data references. The entity does not store the data itself, it only stores the references to that data. Because a reference, i.e. a pointer, has a fixed size of 64 Bits in Flint, the actual size each entity occupies in memory is $8 \text{ Bytes} \times n$, where $n$ is the number of different data constructs used in the entity. Here is an example:

data Position:
    int x = 0;
    int y = 0;
    int z = 0;
    Position(x, y, z);

data Target:
    Position position;
    int distance = INF;
    Target(position, distance);

entity Player:
    data:
        Position, Target;
    func:
        ...
    Player(Position, Target);

In this case, the Position data itself has a size of 12 Bytes. The Target data has a size of 12 Bytes too (64 Bit pointer + 32 Bit integer). The entity itself, however, has a size of 16 Bytes in memory (2 × 64 Bit pointers).

Similar to data, entities are separated into slots, blocks and one head. The slot itself consists of 1 Bit for the empty flag, followed by 7 unused flag bits; the rest is the entity itself. This structure is crucial, as the CPU loads everything from memory in whole Bytes, not in individual Bits. This way, each slot stays byte-aligned and easy for the CPU to address.

The entity block works exactly the same as the data block and has the same bit pattern.

The entity head has a very different bit pattern from the one used for data. Instead of saving the default state of data, the entity head saves references to the funcs, because a single func can be used in multiple entities, not only in a single one. This ensures that each func is only saved once and referenced when needed. The bit pattern of a single entity head is described below:

Data Count (16 Bit) Func Count (16 Bit) Block Count (32 Bit) "ARC" (32 Bit) Func References (variable) Block Slots (variable)

The Data Count is a 16 Bit number saving how many different data objects are referenced by each entity. This number times 8 equals the number of Bytes needed to store a single entity in memory.

The Func Count is a 16 Bit number saving how many different func constructs are used by the entity. It is limited to 16 Bit just like the data count, because an entity referencing more than 65,536 data or func constructs... well, it is needless to say that at that point something went wrong.

The Block Count: just like in the data head, this is a 32 Bit number saving how many blocks are currently allocated and active.

The "ARC" variable: works just like in the data head. It stores how many entities are initialized in total.

After that come the func references, referencing the functionality which is shared amongst all entities.

And lastly, exactly like in the data head, come the block slots, where each slot is a single empty bit followed by the reference to the block.
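
An illustrative C++ view of this bit pattern, with assumed names, could be:

#include <cstdint>

struct EntityHead {
    uint16_t data_count;  // data constructs referenced by each entity
    uint16_t func_count;  // func constructs used by the entity
    uint32_t block_count; // number of active blocks
    uint32_t arc;         // total number of initialized entities
    // ...followed by `func_count` func references (64 Bit each)
    // ...followed by `block_count` block slots (empty bit + block pointer)
};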

Func

The func construct is highly reusable. It acts upon data and houses functions.

A func is stored in memory as follows:

Func Length (64 Bit) Function Count (16 Bit) Function Offsets (32 Bit * n) Data Count (16 Bit) Data Types (64 Bit * n) The Functions (variable)
  • Func Length (64 Bit): The total length of the func.
  • Function Count (16 Bit): The number of functions saved in the func.
  • Function Offsets (32 Bit * n): A collection of n offsets (32 Bit each), where n is the number of functions saved in the func. Each offset points to the starting block of its function inside the last section.
  • Data Count (16 Bit): The number of data fields the func acts upon. This number is directly determined by the number of data declarations in the requires(...) part.
  • Data Types (64 Bit * n): Here, the data types the func acts upon are declared explicitly. Each entry points to the head of the respective data type. This field's size is 64 Bit × the number of types.
  • The Functions (variable): Here, all the actual functions are stored.

Through the offset structure inside the func, a consistent call latency is achieved. Let's say a func has 3 different functions. If the second function is to be called, the func is called and passed a number which represents the index of the function. Then, the offset field at that index is inspected and finally, the function field itself is found. Because the function's size is saved at a fixed position in its header, the caller knows exactly which bytes to copy to get the complete function (arguments, such as the data references, are set after copying).

This procedure ensures that each function call takes the same amount of time, no matter how many functions are declared inside a given func. This is also the reason why the function count and the function offsets come before any of the other fields: this way, a "jump" within the func is only required once per function call. Additionally, the called index can be validated against the saved function count.
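
A C++ sketch of this constant-latency lookup might look as follows; the byte offsets follow the layout above, and treating each function offset as relative to the start of the func is an assumption:

#include <cstdint>
#include <cstring>

const uint8_t *find_function(const uint8_t *func, uint16_t index) {
    // The Function Count (16 Bit) sits right after the 64 Bit Func Length.
    uint16_t count;
    std::memcpy(&count, func + 8, sizeof(count));
    if (index >= count)
        return nullptr; // validate the index against the saved count
    // Read the 32 Bit offset for this index and "jump" exactly once.
    uint32_t offset;
    std::memcpy(&offset, func + 10 + 4u * index, sizeof(offset));
    return func + offset; // start of the function in the last section
}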

Functions

Functions are an interesting topic in Flint's design. In Flint, a function can return multiple values and can take multiple arguments. But, unlike in OOP, functions are shared between many entities, namely those which implement the func in which the function is contained.

Because functions are shared, Flint can preallocate the local variables of functions. The function is saved in memory together with these preallocated local variables. This makes accessing and mutating local variables of the function very cache-efficient, while not introducing too much memory overhead, as functions and func constructs are inherently designed to be reusable and modular.

A function is stored in memory as follows:

Call Address (64 Bit) Function Length (64 Bit) Code Length (64 Bit) Code Section (variable) Arguments (variable) Local Variables (variable) Return Values (variable)
  • Call Address (64 Bit): The pointer to the address the function was called from.
  • Function Length (64 Bit): The space the whole function structure takes up in memory.
  • Code Length (64 Bit): The length of the executable code. References to the arguments, variables and return elements are made via offsets within the instructions.
  • Code Section (variable): This houses the compiled machine code which is executed when the function is called.
  • Arguments (variable): The arguments for the function call. These can only be primitives or pointers to more complex structures, nothing else. Also, very importantly: the reference(s) to the data the function acts upon are passed implicitly as pointer references in the arguments.
  • Local Variables (variable): The space needed for local variables. Each local variable of the function has a fixed position inside the function structure. If it is a primitive, it is saved directly here. Complex data structures referenced within the function are saved as 64 Bit pointers.
  • Return Values (variable): Here the return values are saved. The first return value is always an error type, meaning a pointer. After that, the other return values are saved.

All functions are pre-allocated in Flint at program startup. When calling a function, the whole structure of the function is cloned. This makes concurrent calling of functions easier, safer and faster. If only primitive types are used as local variables inside a function, all operations inside the function can be executed without a single reference to anywhere outside the function, meaning that function calls are more or less inherently performant.

Through copying the structure, recursive calls of functions and concurrency are allowed by default.
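
A minimal sketch of this clone-on-call step, assuming the function layout from the previous section (the Function Length field sits right after the 64 Bit Call Address):

#include <cstdint>
#include <cstdlib>
#include <cstring>

uint8_t *clone_function(const uint8_t *fn) {
    uint64_t length;
    std::memcpy(&length, fn + 8, sizeof(length)); // Function Length field
    uint8_t *copy = static_cast<uint8_t *>(std::malloc(length));
    std::memcpy(copy, fn, length); // private code, arguments, locals, returns
    return copy; // each call, recursive or concurrent, works on its own copy
}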

Arrays

Flint focuses more on the use of rectangular arrays than on dynamic lists, although those exist too. When an array is created, the data head first checks whether a block has enough empty slots to store the array. When it has enough free slots, it then checks one slot after another, but only looks at the empty bit. By incrementing a counter every time a slot is consecutively empty, it can easily be checked whether the array fits inside the block.

Now, there are two possibilities. When the block has enough free slots, but they are not contiguous, either a defragmentation could be started before saving the array in the block, or the search could simply move on to the next block. If no block has enough free contiguous slots, a new block is created.

The array itself is saved as follows:

Block Reference (64 Bit) Array Size (32 Bit) Index Offset (32 Bit) Dimension Count (32 Bit) Slice Length (32 Bit)

In Flint, an array always has the same size, no matter what: 192 Bits, or 24 Bytes. Its components are listed below, followed by a short struct sketch:

  • Block Reference (64 Bit): The reference to the block the array is stored inside
  • Array Size (32 Bit): The size of the array (number of elements)
  • Index Offset (32 Bit): The index at which the array starts in the block
  • Dimension Count (32 Bit): The number of dimensions of the array
  • Slice Length (32 Bit): The length of each array slice
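
As a C++ sketch with assumed names, the descriptor packs into exactly 24 Bytes:

#include <cstdint>

struct FlintArray {
    void *block;         // 64 Bit reference to the containing block
    uint32_t size;       // number of elements
    uint32_t offset;     // slot index at which the array starts
    uint32_t dimensions; // number of dimensions
    uint32_t slice_len;  // length of each slice
};

static_assert(sizeof(FlintArray) == 24, "8 + 4 * 4 Bytes, no padding");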

Resizing an array also becomes very efficient, because only the data slot in front of or after the array would have to move to another slot, and boom, the slot becomes free for the array, making array resizes potentially even faster and more efficient than in C.

The edge case occurs when the array is saved at the very beginning or the very end of a block.

Inside the block, the array bit is flipped in every slot that is occupied by an array. One additional memory rule is that no arrays are allowed to be adjacent, i.e. between two arrays inside a block there has to be at least one non-array slot. This keeps separating the arrays inside a block simple.

Arrays are saved contiguously in row-major format.
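
Row-major means that the last index varies fastest: for a two-dimensional array with $d_2$ columns, the element $(i, j)$ is stored at the linear slot offset

$$\text{index}(i, j) = i \cdot d_2 + j$$

so each row occupies one contiguous run of slots.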

Array slots are only referenced by the array itself. Referencing individual array elements is not possible. When writing array[idx] = someEntity;, the entity's values are copied into the array. This means that no array can reference constructs outside the array. If referencing is needed, lists should be used. But by denying references, it can be assured that arrays are saved in row-major format while still allowing complex data types in arrays.

Weak References

A weak reference is a pointer to an object where the ARC is not incremented by the reference. This means that circular references are avoided when using weak references for things like linked lists. But because the reference does not increase the reference counter, it may well be that the object the reference points to has already been freed from memory, and a null pointer would be the result.

To handle this, Flint uses a clever trick based on two facts:

  1. Only data and entity constructs can be referenced by the programmer, as there are no raw pointers in the language.
  2. These constructs are stored inside slots because of the DIMA memory strategy.

Each slot has an empty bit, which is also used for Opt types. This means that the simple solution for circular references is to make it mandatory to wrap such a reference inside an Opt type.
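
As a sketch, a read through such an Opt-wrapped weak reference could first consult the target slot's empty bit; is_alive is a hypothetical helper reading that bit:

struct Slot; // a DIMA slot as described earlier

bool is_alive(const Slot *target); // hypothetical: reads the empty bit

void visit(const Slot *weak_ref) {
    if (weak_ref != nullptr && is_alive(weak_ref)) {
        // the slot is still occupied, the access is safe
    } else {
        // the target was freed: the Opt value evaluates to "none"
    }
}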

SIMD

SIMD (Single Instruction, Multiple Data) is a powerful computation model used primarily for vector and array operations. With SIMD, a single instruction is executed on multiple data elements simultaneously. The computation on many data elements happens within a single cycle in parallel, not in sequence.

As noted, these types of operations occur mostly in vector operations and array manipulations. Flint does provide (implicit) SIMD capabilities to execute SIMD instructions.

Flint provides two distinct syntactic constructs for which SIMD instructions can be executed.

The Grouping Operator

The grouping operator is inherently well suited for vectorized SIMD instructions without putting much load on the developers themselves.

Here is an example:

data Vector3:
    auto int x, y, z;
    Vector3(x, y, z);

int main():
    vec := Vector3(1, 2, 3);

    // Grouping operator '()'
    vec.(x, y, z) *= 3;

The grouping operator was initially created to simplify access to multiple elements of a data module; only afterwards was its potential for performance discovered. The line above, for example, executes the instruction to multiply each element in the group by three, and the results are then saved back to the three data fields of the data module.
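
As an illustration of what such a line could lower to, here is a hand-written SSE sketch in C++ (not Flint's actual code generation), assuming the three int fields lie contiguously in memory and are followed by one padding lane so that a full 128 Bit register may be loaded and stored:

#include <immintrin.h> // SSE4.1 for _mm_mullo_epi32
#include <cstdint>

void scale_by_three(int32_t *xyz) { // points at x; y, z and padding follow
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(xyz));
    v = _mm_mullo_epi32(v, _mm_set1_epi32(3)); // one multiply, four lanes
    _mm_storeu_si128(reinterpret_cast<__m128i *>(xyz), v);
}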

For SIMD to work on grouping operators, some conditions have to be met:

  1. All elements of the group are required to have the same primitive type
  2. Only primitive types are allowed
  3. Each field is only allowed to be referenced once (vec.(x, x) is invalid, for example)

With these rules, SIMD is applied automatically for grouping operators. Otherwise, the still very cache-efficient "normal" operation is performed. In this swap, for example, no arithmetic instruction is executed; the data is only moved around:

vec.(x, y, z) = vec.(y, z, x);

par_for

The second beneficial construct created for a different purpose is the par_for iteration loop. Unlike the "normal" for loop (for i := 0; i < ...; i++:), the par_for loop works exclusively on iterators. It has the exact same syntax as the extended for loop, but will execute the operations with SIMD if the following conditions are met:

  1. The loop is not allowed to write to variables outside the loop's scope. Everything outside the loop is considered immutable.
  2. No other element of the array is allowed to be accessed, neither for reading nor for writing.
  3. Only operations on primitive types are allowed.

If at least one of these conditions is not met, the par_for loop is executed concurrently, which is slower than SIMD but still faster than iterating sequentially. This means that if SIMD is not applicable, a concurrent fallback is provided.

Of the rules above, only the third one can actually be broken; only in this case does the fallback happen. This is because the first two conditions are crucial for concurrency too. An example could be an array of data modules, where each data module holds some more data. Accessing that inner data requires a pointer dereference, which means such a loop is no longer executable using SIMD, but only with the concurrent fallback.

When multiple instructions occur inside the loop, multiple SIMD calls are made, with the results of the previous instruction serving as inputs for the next instruction.
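
For example, a loop body like a[i] = a[i] * 3 + 1; could be chained per four-lane chunk roughly like this (again a hand-written C++ sketch, not Flint's code generation; n is assumed to be a multiple of 4):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void mul_add(int32_t *a, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        v = _mm_mullo_epi32(v, _mm_set1_epi32(3)); // first SIMD instruction
        v = _mm_add_epi32(v, _mm_set1_epi32(1));   // its result feeds the next
        _mm_storeu_si128(reinterpret_cast<__m128i *>(a + i), v);
    }
}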

Avoiding LLVM

Flint is becoming more and more distinct from other languages, and the similarities are becoming fewer and fewer. Within this page, I have talked about a lot of memory enhancements, changes and more for Flint as a whole. This makes Flint inherently different from other languages using LLVM, so compatibility with the LLVM backend will become increasingly hard as the project grows. This is why I think we should ditch LLVM completely. Below I give a few reasons why this would be beneficial:

  1. LLVM focuses on AST calls and AST optimizations. Through Flint's very nature of being modular, this optimization becomes difficult to implement. A function is pre-allocated and saved to memory directly, which means that the lowest level of optimization is on a per-function basis, not on an AST spanning multiple functions.
  2. Ditching the stack and heap entirely is hard with LLVM, as all languages and all LLVM functions are built around the idea of a stack and a heap.
  3. The custom memory management via DIMA is very hard to achieve with LLVM, also because of the second reason.
  4. The prototype and proof of concept can be achieved far more easily and quickly without LLVM, eliminating the need to learn a whole ecosystem that may not even fit our needs for Flint.
  5. Through the DIMA tree structure of memory, precompiled functions and other clever memory optimizations, a feature can be implemented to save the whole program state to disk. The "default" values of functions are already stored in the compiled program. This could lead to other innovations for savegames or for saving and resuming simulations. This idea will show its relevance throughout the development of the language.
  6. LLVM's strength is providing a unified language behind the scenes. As Flint is very different from other languages, this unification could very well become a major source of headache.

How to use LLVM nonetheless

I recommend creating a custom compiler for Flint which uses LLVM only sparingly. The only area where LLVM is, and can remain, beneficial is in optimizing and compiling the code to machine code. This means that LLVM could be used so that each function is compiled on its own to machine code, which is then saved in the function structure's "Code Section". This approach utilizes the benefits of LLVM while keeping the rest of the compilation process LLVM-free.

Even if the compiler is slow at first, it does not matter. The compiler itself will be written in C++ to make the compilation of functions with LLVM easier.

This needs to be discussed. Maybe we will find that LLVM can be used after all, differently than initially thought. But to create a quick prototype, LLVM could and should be ditched entirely (at least for now).