Simple DMA module #579

Closed
agamez opened this issue Apr 14, 2023 · 23 comments · Fixed by #593
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@agamez
Contributor

agamez commented Apr 14, 2023

Hi!

Hey @agamez!

I think a DMA would be a nice thing to have. I also thought about this in the past but I dropped the idea again because I could not see any relevant use case.

It's not very often that we see DMA controllers in small SoCs, truth be told. And when one exists it's typically used in conjunction with other internal peripherals. For example, why use a FIFO for SPI/I2C when you can just leave your data where it is in memory and have them read the memory directly via the DMA? This only makes sense for high-speed protocols, but maybe you want to run SPI at hundreds of MHz, who knows...

Anyway, here is a random example of why using DMA with SPI is sometimes useful: https://hackaday.com/2016/09/26/pic32-dma-spi/

There are some devices that are relatively high speed that would benefit from fast access to DMA, such as ADCs that are continuously streaming.

These would be processor-external. Wouldn't a large FIFO buffer + "fill level interrupt" be more suitable in this case?

Well, for a continuous stream, yes, that'd be the immediate solution. But a large FIFO is a lot of logic that you may not have available or that you would be dedicating exclusively to this purpose. If I write directly to the DMEM, I can make the buffer as large as fits on my device and not care about partitioning of resources (how much logic should I dedicate to the FIFO? Doesn't matter, decide it at run time via the DMA transfer size).

A simple dual-buffer style DMA that would allow the user code to read from one slot while the DMA is transferring to the second slot in a circular fashion would be really great.

The DMA transfers would still need to be initiated by the CPU, right?

In principle yes, but once set up, the DMA could be self-sustaining, initiating the next transfer after the last one has finished. That would make it look like an infinite FIFO, provided the software is able to read each slot before the DMA overwrites its contents.

So maybe just a few registers to set slot A address and size, slot B address and size, an interrupt output and a few bits for control and status (which slot is actively being written to, enable/disable, etc).
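
(To illustrate, such a register set might look like the following C sketch. This is just a hypothetical picture of the idea; all names and the base address are made up.)

#include <stdint.h>

// Hypothetical register map for a dual-buffer ("ping-pong") DMA.
// All names and the base address are made up for illustration.
typedef struct {
  volatile uint32_t SLOT_A_ADDR; // base address of slot A
  volatile uint32_t SLOT_A_SIZE; // size of slot A in bytes
  volatile uint32_t SLOT_B_ADDR; // base address of slot B
  volatile uint32_t SLOT_B_SIZE; // size of slot B in bytes
  volatile uint32_t CTRL;        // bit 0: enable, bit 1: active slot (read-only),
                                 // bit 2: interrupt enable, remaining bits: status
} dma_pingpong_t;

#define DMA_PP ((dma_pingpong_t*) 0xFFFFE000UL) // made-up base address

The interrupt would then fire whenever the active slot flips, telling software which buffer is now safe to read.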

Sounds interesting 😉 So just a minimal DMA with many/one-to-many/one transfer options that is directly programmed (i.e. not a descriptor-based DMA)?

The Xilinx DMA core allows two modes of configuration:

  1. Single transfer mode: 1 control register, 1 status register, 1 source/destination address register, 1 length register. You just define how much data you want to transfer and where it is, and an interrupt is raised when it's finished.
  2. Scatter-Gather: the source/destination and length registers are replaced by a register holding the address of a struct that defines the address and length of a DMA transaction and contains a pointer to the next struct defining the next transaction, kind of a singly linked list describing DMA transactions (see the sketch below). If you make the last struct point to the first one, you have a circular list that can work in continuous mode.
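
(For illustration, such a descriptor chain could look roughly like this in C. This is a generic sketch of the concept, not the actual Xilinx descriptor layout.)

#include <stdint.h>

// Generic scatter-gather descriptor (illustrative only). Each descriptor
// describes one transaction and points to the next one; making the last
// descriptor point back to the first yields a circular list for
// continuous mode.
typedef struct dma_desc {
  uint32_t         src_addr; // source address of this transaction
  uint32_t         dst_addr; // destination address of this transaction
  uint32_t         length;   // number of bytes to transfer
  struct dma_desc *next;     // next descriptor in the chain
} dma_desc_t;

The DMA core is then handed only the address of the first descriptor and walks the list by itself.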

My proposal for the neorv32 would be a mix of them, to keep it simple. So instead of having just a single DMA transfer that must be triggered manually every time (and thus makes you lose precious time), the DMA core itself would allow the definition of two different transfers that can be set to swap from one to another, thus allowing continuous transfer.

This core could be used not only by external devices such as the ADC I mention, but also by the SPI/UART/(future ethernet) cores. Of course, it should be bidirectional and allow transfers from and to DMEM. But in the SPI and UART cases it should definitely not be the default, as the DMA core would probably use more logic than a small FIFO which is probably enough most of the time.

Truth be told, even Xilinx's AXI Ethernet Lite does not use DMA and does a fine job running at nearly 100 Mbps with a small MicroBlaze, so maybe it's all overkill anyway...

Originally posted by @agamez in #9 (reply in thread)

@agamez
Contributor Author

agamez commented Apr 14, 2023

A first implementation could be just the basics, akin to the PIC32 DMA module for SPI (I2C too?) transfers, which would be able to replace the current size-selectable FIFO. It's very common for an SPI or I2C device to have a relatively large number of registers that need to be written (the last one I used is the CDCE6214, a clock generator that requires writing up to 86 registers), and having to 1) loop over the whole set of registers and 2) wait for each transaction to finish takes quite a long time, during which the CPU can seldom do anything else. Being able to just say "transfer 86 bytes from address X to the SPI module" would be great and save a lot of processing time. This is especially interesting if you need to reprogram an SPI/I2C device in an interrupt handler, as you don't want that to take much time.
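
(To make the difference concrete, here is a hypothetical C sketch. Neither helper function exists in the current NEORV32 API; both are made up to contrast the two programming models.)

#include <stdint.h>

void spi_write_blocking(uint8_t b);               // hypothetical blocking SPI helper
void dma_to_spi(const uint8_t *src, uint32_t n);  // hypothetical DMA kick-off

uint8_t regs[86]; // register values for the clock generator

void program_clock_gen_cpu(void) {
  // Today: the CPU loops and blocks on every single SPI transaction.
  for (int i = 0; i < 86; i++) {
    spi_write_blocking(regs[i]);
  }
}

void program_clock_gen_dma(void) {
  // With a DMA: one call kicks off the block transfer and the CPU
  // (or the interrupt handler) is immediately free again.
  dma_to_spi(regs, 86); // "transfer 86 bytes from address X to the SPI module"
}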

@stnolting
Owner

I just had a quick view at the PIC32 DMA data sheet. That module is a monster! 😅

This core could be used not only by external devices such as the ADC I mention, but also by the SPI/UART/(future ethernet) cores. Of course, it should be bidirectional and allow transfers from and to DMEM. But in the SPI and UART cases it should definitely not be the default, as the DMA core would probably use more logic than a small FIFO which is probably enough most of the time.

It would also be quite handy if you implement crypto functions or CRC functions within the custom functions subsystem. Then you can process an entire block of data without the CPU.

As I have already mentioned I really like the idea of adding a DMA. 👍 But before thinking about all the details, we should first think about some general things:

Where should we put the DMA? Right after the BUSMUX (adding another BUSMUX there)? Or between CPU and d-cache?

[image: neorv32_bus diagram]

How do we configure the DMA transfer(s)? Memory-mapped registers inside the DMA? Maybe providing two sets? Or using a descriptor that is placed in main memory and that the DMA fetches and executes by itself? This would also allow making "linked lists" of consecutive transfers but of course would make the hardware more complex.

What would the configuration structure (DMA-internal registers or descriptor in memory) look like?
Just some ideas:

  • source base address, 32-bit
  • destination base address, 32-bit
  • number of elements, 32-bit?
  • source address increment (0=constant, 1=4-byte increment); do we need byte/half-word offsets?
  • destination address increment (same as above)
  • status and control register

@akaeba
Collaborator

akaeba commented Apr 15, 2023

Hi Stephan and agamez,

I also like the idea of having a DMA. But please take into account making it completely removable via generics, because in my projects I'm at the limit of logic elements. If the core gets much bigger with no option of stripping it down, I will run into an FPGA overload.

The scatter/gather approach seems to be the more generic one. Perhaps it would also be a nice thing to give the DMA a dedicated RAM for descriptors only, mapped into the normal address region like a peripheral?
Some questions:

  • Arbitration: who will access the DMEM, the CPU or the DMA?
  • Is the complete DMEM DMA-capable or only some regions? Would these regions be realized as dual-port RAM? That could avoid a CPU lock-out by the DMA.
  • Introduce a Multilayer Peripheral Interface?

BR,
Andreas

@stnolting
Owner

But please take into account making it completely removable via generics.

Absolutely!

Arbitration: who will access the DMEM, the CPU or the DMA?

Both of them 😉
Both should have access to the same physical address space, but maybe the CPU should have priority during concurrent accesses.

Is the complete DMEM DMA-capable or only some regions? Would these regions be realized as dual-port RAM? That could avoid a CPU lock-out by the DMA.

The entire 32-bit address space (minus the internal IO region?) should also be accessible by the DMA. No need for dual-port memories. The central bus is just shared (time-multiplexed) between the CPU and the DMA. Because of the caches the CPU should not "notice" this (in terms of wait cycles).

Introduce a Multilayer Peripheral Interface?

That would help boost performance, but I think this is out of scope for now. Furthermore, it would make the bus system a real monster in terms of logic resources 😅

@stnolting stnolting added enhancement New feature or request help wanted Extra attention is needed labels Apr 15, 2023
@agamez
Contributor Author

agamez commented Apr 17, 2023

I just had a quick view at the PIC32 DMA data sheet. That module is a monster! 😅

I think it's more complex than it should be because it includes several submodules, like the CRC functions you mention below (Special Function Module (SFM) mode: LFSR CRC, IP header checksum), but I don't think they should be part of the DMA module itself.

Where should we put the DMA? Right after the BUSMUX (adding another BUSMUX there)? Or between CPU and d-cache?

[image: neorv32_bus diagram]

I think the proper place is connected directly to the bus. There must be some mechanism to invalidate the cache, either in hardware or in software. Although hardware management is preferred for performance, I don't think it's required for us, and having the software take care of that would be enough. Or mark the DMA region of the memory as non-cacheable.

How do we configure the DMA transfer(s)? Memory-mapped registers inside the DMA? Maybe providing two sets? Or using a descriptor that is placed in main memory and that the DMA fetches and executes by itself? This would also allow making "linked lists" of consecutive transfers but of course would make the hardware more complex.

The first method is the simplest; providing just one set is good for some applications (large SPI transfers, for example). But something that requires transferring unlimited data would more than likely require the scatter-gather approach, or at the very least a second set of descriptors to alternate between.

As a reference, the Xilinx DMA module is configurable for either case, the second one being much larger than the first. Maybe we can approach the simplest mode first, see how it goes, how much area it uses, solve the cache coherency problems and later on think about the scatter-gather approach. There could be two selectable modules via generics, depending on the use case.

What would the configuration structure (DMA-internal registers or descriptor in memory) look like? Just some ideas:

* source base address, 32-bit

* destination base address, 32-bit

* number of elements, 32-bit?

* source address increment (0=constant, 1=4-byte increment); do we need byte/half-word offsets?

* destination address increment (same as above)

* status and control register

Sounds reasonable. DMA usually requires aligned transfers; if you need to transfer 510 or 511 bytes you simply transfer 512 and let the software deal with the extra padding.

@stnolting
Owner

but I don't think they should be part of the DMA module itself.

I agree. Let's keep it simple (for now).

There must be some mechanism to invalidate the cache, either in hardware or in software.

I think this is a general problem the programmer has to be aware of. However, we already have i-cache and d-cache sync instructions. These would have to be issued manually in case they are needed - so no hardware-managed coherency stuff.

The first method is the simplest; providing just one set is good for some applications (large SPI transfers, for example). But something that requires transferring unlimited data would more than likely require the scatter-gather approach, or at the very least a second set of descriptors to alternate between.

Maybe we should collect some potential use cases. Then we can discuss what concept we really need.

@agamez
Contributor Author

agamez commented Apr 18, 2023

Maybe we should collect some potential use cases. Then we can discuss what concept we really need.

I see these applications:

  • Fast UART transfers. I liked the PIC32 SPI concept of "DMA until pattern", where the pattern would probably be '\0' so you can just send_uart_dma(string) and forget about it, sending a whole bunch of data (such as a help interface, or think of the output of a U-Boot-like program with a 'printenv' command that prints lots of data).
  • Fast SPI and I2C transfers. Programming some devices means sending a lot of data (I already used the example of the CDCE6214, which requires writing up to 84 registers). Fast SPI access would mean faster retrievals in XIP mode, which could leave the SPI available for the user for more of the time.
  • I'd really like to expose an IN/OUT interface for external modules. My use case is an ADC/DAC pair, but this would also be a great interface for large-data accelerators implemented externally to the processor, such as crypto functions, CRC calculations, and DSP functions: FFT, windowing, FIR filters, etc...

Both the UART and SPI modules already have a FIFO interface available; I2C doesn't (maybe it should?). So I believe that a FIFO-like I/O interface would be nice to have for the DMA module. How many (configurable) channels should the DMA module have? Only one, or up to 4? (UART, SPI, I2C, external)

@stnolting
Owner

stnolting commented Apr 19, 2023

Fast UART transfers. I liked the PIC32 SPI concept of "DMA until pattern", where the pattern would probably be '\0' so you can just send_uart_dma(string) and forget about it, sending a whole bunch of data (such as a help interface, or think of the output of a U-Boot-like program with a 'printenv' command that prints lots of data).

"Transfer until pattern" would be a cool feature, but I think this would also require a LOT of additional logic... 🙈

If the DMA supported "data alignment", i.e. reading bytes and storing them as zero- or sign-extended words, you could also send an entire string to the UART if you know the length of the string in advance (by setting the "number of elements to transfer" to the length of the string).
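
(In code, that could look like the following hypothetical sketch; the dma_start() helper and all flag/register names are made up.)

#include <stdint.h>
#include <string.h>

// Hypothetical sketch: stream a zero-terminated string to the UART TX
// register via DMA. All names besides strlen() are illustrative only.
void send_uart_dma(const char *s) {
  dma_start((uint32_t)s,                   // source: string in memory
            UART0_TX_REG_ADDR,             // destination: UART data register
            (uint32_t)strlen(s),           // number of elements = string length
            DMA_SRC_BYTE | DMA_SRC_INC |   // read bytes, increment source address
            DMA_DST_WORD | DMA_DST_CONST); // write zero-extended words, fixed destination
}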

Fast SPI and I2C transfers. Programming some devices means sending a lot of data (I already used the example of the CDCE6214, which requires writing up to 84 registers). Fast SPI access would mean faster retrievals in XIP mode, which could leave the SPI available for the user for more of the time.

👍

I'd really like to expose an IN/OUT interface for external modules. My use case is an ADC/DAC pair, but this would also be a great interface for large-data accelerators implemented externally to the processor, such as crypto functions, CRC calculations, and DSP functions: FFT, windowing, FIR filters, etc...

Maybe we should re-add something like the old SLINK interfaces?! 😅

Anyway, the DMA would also be able to access processor-external devices via the Wishbone interface port. So fetching data via the DMA from an external ADC would still be possible.

I see some additional use cases:

  • endianness conversion of an array
  • convert an array of bytes/half-words into a sign-/zero-extended array of half-words/words
  • move data arrays to a CRC or crypto core (implemented within the custom functions subsystem)

Both the UART and SPI modules already have a FIFO interface available; I2C doesn't (maybe it should?). So I believe that a FIFO-like I/O interface would be nice to have for the DMA module.

The current version of the I2C module requires a certain overhead for each individual I2C transaction (same for the onewire module). That's why there are no FIFOs yet (they would have to be very wide). But sure, this is something we could/should add in the future. 😉

How many (configurable) channels should the DMA module have? Only one, or up to 4? (UART, SPI, I2C, external).

From a hardware point of view there should be only a single channel (so only one DMA transfer can be in progress at once). Having several (parallel) channels would make the design pretty huge + all the bus access gateways would increase the critical path of the system.

At first we should start with a DMA that is configured via a single set of memory-mapped registers. Then we could replace this with memory-located descriptors that allow chaining of DMA jobs to "virtually" create an arbitrary number of channels (obviously, they cannot operate in parallel).


Here is my first proposal for the DMA configuration structure:

  • REG0: Source base address
  • REG1: Destination base address
  • REG2: Transfer type (see below)
  • REG3: Control and status register (see below)
-- transfer type register bits --
constant type_num_lo_c      : natural :=  0; -- r/w: Number of elements to transfer, LSB
constant type_num_hi_c      : natural := 23; -- r/w: Number of elements to transfer, MSB
constant type_src_qsel_lo_c : natural := 24; -- r/w: SRC data quantity, LSB, 00=byte, 01=half-word
constant type_src_qsel_hi_c : natural := 25; -- r/w: SRC data quantity, MSB, 10=word, 11=word
constant type_dst_qsel_lo_c : natural := 26; -- r/w: DST data quantity, LSB, 00=byte, 01=half-word
constant type_dst_qsel_hi_c : natural := 27; -- r/w: DST data quantity, MSB, 10=word, 11=word
constant type_src_inc_c     : natural := 28; -- r/w: SRC constants (0) or incrementing (1) address
constant type_dst_inc_c     : natural := 29; -- r/w: DST constants (0) or incrementing (1) address
constant type_sext_c        : natural := 30; -- r/w: Sign-extend sub-words when set
constant type_endian_c      : natural := 31; -- r/w: Convert Endianness when set
-- control and status register bits --
constant ctrl_en_c       : natural :=  0; -- r/w: DMA enable
constant ctrl_error_rd_c : natural := 29; -- r/-: error during read transfer
constant ctrl_error_wr_c : natural := 30; -- r/-: error during write transfer
constant ctrl_busy_c     : natural := 31; -- r/-: DMA transfer in progress
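
(To make this concrete, driving these four registers from C could look like the sketch below. The bit positions follow the VHDL constants above; the C names, the base address and the assumption that setting the enable bit starts the job are made up.)

#include <stdint.h>

typedef struct {
  volatile uint32_t SRC_BASE; // REG0: source base address
  volatile uint32_t DST_BASE; // REG1: destination base address
  volatile uint32_t TTYPE;    // REG2: transfer type
  volatile uint32_t CTRL;     // REG3: control and status
} dma_regs_t;

#define DMA ((dma_regs_t*) 0xFFFFE000UL) // made-up base address

// Copy 'num' words from 'src' to 'dst', both with incrementing addresses.
void dma_copy_words(uint32_t src, uint32_t dst, uint32_t num) {
  DMA->SRC_BASE = src;
  DMA->DST_BASE = dst;
  DMA->TTYPE = (num & 0x00FFFFFFu) // bits 23:0  - number of elements
             | (3u << 24)          // bits 25:24 - SRC quantity: word
             | (3u << 26)          // bits 27:26 - DST quantity: word
             | (1u << 28)          // bit  28    - incrementing SRC address
             | (1u << 29);         // bit  29    - incrementing DST address
  DMA->CTRL = 1u;                  // bit  0     - enable (assumed to start the job)
  while (DMA->CTRL & (1u << 31)) { // bit  31    - busy flag
    // wait for completion (or react to the DMA interrupt instead)
  }
}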

Any thoughts? 😅

@agamez
Contributor Author

agamez commented Apr 20, 2023

"Transfer until pattern" would be a cool feature, but I think this would also require a LOT of additional logic... see_no_evil

If the DMA supported "data alignment", i.e. reading bytes and storing them as zero- or sign-extended words, you could also send an entire string to the UART if you know the length of the string in advance (by setting the "number of elements to transfer" to the length of the string).

That would be more than enough; we usually already know the length of a string prior to sending it, and we can always strlen() it.

I'd really like to expose an IN/OUT interface for external modules. My use case is an ADC/DAC pair, but this would also be a great interface for large-data accelerators implemented externally to the processor, such as crypto functions, CRC calculations, and DSP functions: FFT, windowing, FIR filters, etc...

Maybe we should re-add something like the old SLINK interfaces?! 😅

Anyway, the DMA would also be able to access processor-external devices via the Wishbone interface port. So fetching data via the DMA from an external ADC would still be possible.

I think the DMA solution would be quite nice. But I'm not entirely sure how you intend to connect the DMA. Would it have just one connection to the Wishbone bus and use it both for accessing the RAM and the external device? Because that would probably be a big strain on the bus, wouldn't it?

How many (configurable) channels should the DMA module have? Only one, or up to 4? (UART, SPI, I2C, external).

From a hardware point of view there should be only a single channel (so only one DMA transfer can be in progress at once). Having several (parallel) channels would make the design pretty huge + all the bus access gateways would increase the critical path of the system.

👍

At first we should start with a DMA that is configured via a single set of memory-mapped registers. Then we could replace this with memory-located descriptors that allow chaining of DMA jobs to "virtually" create an arbitrary number of channels (obviously, they cannot operate in parallel).

👍

Here is my first proposal for the DMA configuration structure:

* REG0: Source base address

* REG1: Destination base address

Oh, so even if it's just one channel it would be a simultaneous read/write one? So I understand that you want to use the wishbone bus for both operations at the same time, reading from memory and writing, say, to the SPI peripheral using the same bus? That's unusual, I think.

* REG2: Transfer type (see below)

* REG3: Control and status register (see below)
-- transfer type register bits --
constant type_num_lo_c      : natural :=  0; -- r/w: Number of elements to transfer, LSB
constant type_num_hi_c      : natural := 23; -- r/w: Number of elements to transfer, MSB
constant type_src_qsel_lo_c : natural := 24; -- r/w: SRC data quantity, LSB, 00=byte, 01=half-word
constant type_src_qsel_hi_c : natural := 25; -- r/w: SRC data quantity, MSB, 10=word, 11=word
constant type_dst_qsel_lo_c : natural := 26; -- r/w: DST data quantity, LSB, 00=byte, 01=half-word
constant type_dst_qsel_hi_c : natural := 27; -- r/w: DST data quantity, MSB, 10=word, 11=word
constant type_src_inc_c     : natural := 28; -- r/w: SRC constants (0) or incrementing (1) address
constant type_dst_inc_c     : natural := 29; -- r/w: DST constants (0) or incrementing (1) address
constant type_sext_c        : natural := 30; -- r/w: Sign-extend sub-words when set
constant type_endian_c      : natural := 31; -- r/w: Convert Endianness when set
-- control and status register bits --
constant ctrl_en_c       : natural :=  0; -- r/w: DMA enable
constant ctrl_error_rd_c : natural := 29; -- r/-: error during read transfer
constant ctrl_error_wr_c : natural := 30; -- r/-: error during write transfer
constant ctrl_busy_c     : natural := 31; -- r/-: DMA transfer in progress

Any thoughts? 😅

All in all, I think my main doubt is related to the interconnection of DMA and peripherals, as per my two comments above. The reason you discarded more than one channel was bus access contention, but you are pretty much forcing it by having the DMA access peripherals through the same bus where the RAM is located. Which, I understand, is really the most straightforward way of doing it. It's only that I'm used to Xilinx-style DMA modules, which present a bus interface on one side and just look like a FIFO on the other, so you can, say, connect a VHDL cosine stream from a CORDIC generator directly to the DMA module and not strain the memory/peripheral bus with twice as much data as needed.

@stnolting
Owner

That would be more than enough; we usually already know the length of a string prior to sending it, and we can always strlen() it.

Good point!

I think the DMA solution would be quite nice. But I'm not entirely sure how you intend to connect the DMA. Would it have just one connection to the Wishbone bus and use it both for accessing the RAM and the external device? Because that would probably be a big strain on the bus, wouldn't it?

It would be connected to the processor-internal bus system just like the CPU. So the DMA would have access to the same (and entire) address space that is also accessible by the CPU.

The CPU / the caches would have prioritized access and the DMA needs to add sufficient "idle bus cycles" so the CPU can intervene at any time.

Oh, so even if it's just one channel it would be a simultaneous read/write one? So I understand that you want to use the wishbone bus for both operations at the same time, reading from memory and writing, say, to the SPI peripheral using the same bus? That's unusual, I think.

Maybe we should clarify the definition of the term "channel" 😅

The DMA should have a single read/write access port to the bus system. By this, the DMA can read an element, store it internally, make alignments and/or sign-extensions and then write it back to the bus system.

But only one configured job can be executed at any time. This is what I understood as "single channel" 😉

By this, the DMA would not need any additional data buffer / data memory as indicated by @akaeba.

@stnolting
Owner

This is the setup I would suggest:

[image: neorv32_bus diagram]

@akaeba
Collaborator

akaeba commented Apr 20, 2023

Hi Stephan, could the second bus mux not also be optional?

@agamez
Contributor Author

agamez commented Apr 21, 2023

Thanks for the diagram, that clarifies a few things for me!

I'm still unsure about this architecture. Let's say we want to transfer 128 bytes through SPI using the DMA. First off, the DMA will read X bytes from memory through the busmux, and then it will write those same X bytes through the same busmux to the register that receives data to send...

If this diagram is accurate, the PIC32 architecture has the DMA module directly access the internals of the peripherals (hence my question before about modules with FIFOs available). But it has the drawback that there is a separate DMA module for each peripheral, which I don't like either (because how many times are you going to be transferring large amounts of data from several peripherals at the same time?)

[image: PIC32 DMA block diagram]

My proposal would be quite different and probably more cumbersome to implement, to be honest, because it would entail exposing the internal FIFO interface from the SPI/UART/etc. modules and creating a separate bus with that interface that would be accessible by the DMA... But I understand you not wanting to do this; it sounds like a lot of work and maybe it isn't justified in terms of performance 😅

@stnolting
Owner

@akaeba

Hi Stephan, could the second bus mux not also be optional?

Correct. The second bus mux would also be replaced by a simple route-through if the DMA is omitted. I just dropped that from the diagram (there would be no generic for the second bus mux - just for the entire DMA complex).

@agamez

I'm still unsure about this architecture. Let's say we want to transfer 128 bytes through SPI using the DMA. First off, the DMA will read X bytes from memory through the busmux, and then it will write those same X bytes through the same busmux to the register that receives data to send...

The DMA would have no internal memory to buffer transfers, as burst accesses are not supported yet. Hence, the DMA would operate element-by-element (my proposal; see the sketch after this list):

  1. Read byte/half-word/word from SRC_ADDRESS
  2. Align/sign-extend/convert the data that has been read
  3. Write byte/half-word/word to DST_ADDRESS
  4. Decrement "number of elements" counter; optionally increment SRC_ADDRESS and DST_ADDRESS
  5. Go to 1. if "number of elements" counter not zero
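
(As a conceptual C model of these five steps; a software sketch of the hardware FSM, not actual RTL, with all helper names made up.)

#include <stdint.h>

uint32_t bus_read(uint32_t addr, uint32_t quantity);            // made-up bus helpers
void     bus_write(uint32_t addr, uint32_t data, uint32_t quantity);
uint32_t convert(uint32_t data);                                // align/sign-extend/endianness

void dma_engine_model(uint32_t src_addr, uint32_t dst_addr, uint32_t num_elements,
                      uint32_t src_quantity, uint32_t dst_quantity, // element sizes in bytes
                      int src_inc, int dst_inc) {
  while (num_elements != 0) {
    uint32_t data = bus_read(src_addr, src_quantity); // 1. atomic bus read
    data = convert(data);                             // 2. align/sign-extend/convert
    bus_write(dst_addr, data, dst_quantity);          // 3. atomic bus write
    if (src_inc) src_addr += src_quantity;            // 4. optional address increments,
    if (dst_inc) dst_addr += dst_quantity;            //    decrement element counter
    num_elements--;
  }                                                   // 5. repeat until the counter is zero
}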

If this diagram is accurate, the PIC32 architecture has the DMA module directly access the internals of the peripherals (hence my question before about modules with FIFOs available). But it has the drawback that there is a separate DMA module for each peripheral, which I don't like either (because how many times are you going to be transferring large amounts of data from several peripherals at the same time?)

My proposal would be quite different and probably more cumbersome to implement, to be honest, because it would entail exposing the internal FIFO interface from the SPI/UART/etc. modules and creating a separate bus with that interface that would be accessible by the DMA... But I understand you not wanting to do this; it sounds like a lot of work and maybe it isn't justified in terms of performance 😅

I am not sure what you are trying to point out here 🙈😅

The DMA would have access to the entire address space. So it would be no problem to - for example - load data from DMEM and store it to some SPI register since all the peripherals are also memory-mapped.

@agamez
Contributor Author

agamez commented Apr 21, 2023

The DMA would have no internal memory to buffer transfers, as burst accesses are not supported yet. Hence, the DMA would operate element-by-element (my proposal):

1. Read byte/half-word/word from SRC_ADDRESS

2. Align/sign-extend/convert the data that has been read

3. Write byte/half-word/word to DST_ADDRESS

4. Decrement "number of elements" counter; optionally increment SRC_ADDRESS and DST_ADDRESS

5. Go to 1. if "number of elements" counter not zero

If this diagram is accurate, the PIC32 architecture has the DMA module directly access the internals of the peripherals (hence my question before about modules with FIFOs available). But it has the drawback that there is a separate DMA module for each peripheral, which I don't like either (because how many times are you going to be transferring large amounts of data from several peripherals at the same time?)

My proposal would be quite different and probably more cumbersome to implement, to be honest, because it would entail exposing the internal FIFO interface from the SPI/UART/etc. modules and creating a separate bus with that interface that would be accessible by the DMA... But I understand you not wanting to do this; it sounds like a lot of work and maybe it isn't justified in terms of performance 😅

I am not sure what you are trying to point out here 🙈😅

Lol, sorry! I come from the DSP world and computer architecture is really not my thing at all, so maybe I'm just talking nonsense. I'm probably not explaining myself properly 😞

What I'm worried about is that you need steps (1) and (3) to happen, and thus you are using the bus twice per word: you read one word, maybe align/sign-extend/convert the data (or do nothing at all with it) and then write it back to a different address. That is two bus accesses for just one word. So, yes, we are freeing the CPU from doing those operations, but we are hindering its access to the memory and other peripherals because the DMA module would be using the bus twice per word. The DMA in the PIC32 diagram above only needs one bus access per word, because the DMA module is inside the peripheral itself and can fetch data from memory and feed it directly to the peripheral without using the bus again.

@stnolting
Owner

Lol, sorry! I come from the DSP world and computer architecture is really not my thing at all, so maybe I'm just talking nonsense. I'm probably not explaining myself properly 😞

No worries! I come from the electrical world playing with wires and transistors, so high-level / software concepts do not fit my mind very well 😅

What I'm worried about is that you need steps (1) and (3) to happen, and thus you are using the bus twice per word: you read one word, maybe align/sign-extend/convert the data (or do nothing at all with it) and then write it back to a different address. That is two bus accesses for just one word. So, yes, we are freeing the CPU from doing those operations, but we are hindering its access to the memory and other peripherals because the DMA module would be using the bus twice per word. The DMA in the PIC32 diagram above only needs one bus access per word, because the DMA module is inside the peripheral itself and can fetch data from memory and feed it directly to the peripheral without using the bus again.

That's true. The DMA would block the entire bus while doing its transactions. But this is a general problem with the DMA concept - especially when there is no multi-layer bus that would support multiple point-to-point connections in parallel.

The bus accesses triggered by the DMA would be no faster than the ones triggered by the CPU. But a DMA block transfer would still be more performant, as all the overhead (address increment, looping) would be done in hardware.

Of course there will be situations where the CPU has to wait for the DMA before it can fetch further instructions. Again, this is a general problem. But that should not hurt much, as the CPU has caches, so it could keep operating on the local instructions/data inside those caches while the DMA blocks the bus system.

Furthermore, the BUSMUX structure I showed above prioritizes bus accesses: requests from the CPU / the caches are served first.

The DMA should also split a single transfer into two atomic operations: an atomic bus read and an atomic bus write. The CPU/caches can take over bus control in between those atomic DMA accesses so CPU/cache requests would be further prioritized.

@agamez
Contributor Author

agamez commented Apr 21, 2023

No worries! I come from the electrical world playing with wires and transistors, so high-level / software concepts do not fit my mind very well 😅

I think I am also quite mind-polluted by all the Xilinx stuff, which is great for some things and not so great for others... Everyone has their own biases!

Of course there will be situations where the CPU has to wait for the DMA before it can fetch further instructions. Again, this is a general problem. But that should not hurt much, as the CPU has caches, so it could keep operating on the local instructions/data inside those caches while the DMA blocks the bus system.

Ah, you see, I had completely forgotten about caches even though they were clearly there in the diagram. So even in a BRAM-based design where RAM is exactly as fast as the cache, it could make sense to add a small cache just so the DMA can do its thing and leave the CPU alone, without it even needing to access the main bus. So, all in all, I think this looks promising! Thanks a lot for the discussion!

@stnolting
Owner

Ah, you see, I had completely forgotten about caches even though they were clearly there in the diagram. So even in a BRAM-based design where RAM is exactly as fast as the cache, it could make sense to add a small cache just so the DMA can do its thing and leave the CPU alone, without it even needing to access the main bus.

I think so too. Even if we have a setup without caches, the CPU still has a tiny instruction prefetch buffer, so it can process (non-memory) instructions without making any bus requests at all.

So, all in all, I think this looks promising! Thanks a lot for the discussion!

Thank you, too! 👍

Btw, I've started implementing a first version of the DMA in #593.

@agamez
Contributor Author

agamez commented Apr 21, 2023

Btw, I've started implementing a first version of the DMA in #593.

You are more productive on a random Friday than I am on any day of the year 😅

@stnolting
Owner

Day off here 😎

@stnolting
Owner

stnolting commented Apr 21, 2023

I am unsure about the flexibility of the scatter/gather concept and the corresponding data quantities. Do we need to support all possible combinations? If we leave out the sign-/zero-extension and endianness conversion options, the following basic transfer types (considering only a single element) would be possible. However, I cannot see use cases for all of them.

  • BYTE -> BYTE: fine-grained transfers; e.g. transferring an odd number of bytes (e.g. 99)
  • BYTE -> HALF: ???
  • BYTE -> WORD: writing a string to UART (IO data registers require word-level write access)
  • HALF -> BYTE: ???
  • HALF -> HALF: ???
  • HALF -> WORD: might be suitable for some CRC applications
  • WORD -> BYTE: ???
  • WORD -> HALF: ???
  • WORD -> WORD: block copies, endianness conversion

Especially the "greater to smaller" transfers like WORD-to-BYTE do not make sense to me 🤔

@akaeba
Collaborator

akaeba commented Apr 22, 2023

Hi Stephan, what came to my mind: at the moment 24 bits are reserved for the address-increment counter. Could it be a good idea to make this size a generic? For instance, 16 bits would lead to a 64k max chunk size. The result would be a smaller adder in terms of LE count. What do you think?

@stnolting
Owner

@akaeba

Hi Stephan, what came to my mind: at the moment 24 bits are reserved for the address-increment counter. Could it be a good idea to make this size a generic? For instance, 16 bits would lead to a 64k max chunk size. The result would be a smaller adder in terms of LE count. What do you think?

Reducing the size from 24 bits down to 16 bits might save about 2x8 FFs and maybe something like 2x8+3 LUTs. Sure, that would help reduce the total size of the processor, but only to a very small extent.

But here we have a general question: do we really need 24 bits for this register? Or can we just constrain the counter down to 16 bits? What would be the "average" maximum size for a DMA transfer? I know this is absolutely application-specific, but maybe there is a reasonable trade-off?! 😅
