Simple DMA module #579
A first implementation could be just the basics, akin to the PIC32 DMA module, for SPI (I2C too?) transfers, and would be able to replace the current size-selectable FIFO. It's very common for an SPI or I2C device to have a relatively large number of registers that need to be written (the latest one I'm using is the CDCE6214, a clock generator that requires writing up to 86 registers), and having to 1) loop over the whole set of registers and 2) wait for each transaction to finish takes quite a long time, during which the CPU can seldom do anything else. Being able to just say "transfer 86 bytes from address X to the SPI module" would be great and save a lot of processing time. This is especially interesting if you need to reprogram an SPI/I2C device in an interrupt handler, as you don't want that to take much time (see the sketch below). |
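To make the difference concrete, here is a minimal C sketch of both approaches. The blocking loop uses the NEORV32 SPI driver call; `dma_start()` and `SPI_DATA_REG_ADDR` are purely hypothetical placeholders, since no such DMA API exists yet.

```c
#include <stdint.h>
#include <neorv32.h>

// Hypothetical one-shot DMA API and SPI data register address -- placeholders,
// not part of the NEORV32 framework.
#define SPI_DATA_REG_ADDR 0xFFFFFF00U
extern void dma_start(uint32_t src_addr, uint32_t dst_addr, uint32_t num_bytes);

uint8_t cdce6214_regs[86]; // register image for the clock generator

void configure_clockgen_blocking(void) {
  for (int i = 0; i < 86; i++) {         // 1) loop over the whole set
    neorv32_spi_trans(cdce6214_regs[i]); // 2) CPU blocks until each byte is done
  }
}

void configure_clockgen_dma(void) {
  // "transfer 86 bytes from address X to the SPI module" -- CPU is free meanwhile
  dma_start((uint32_t)cdce6214_regs, SPI_DATA_REG_ADDR, 86);
}
```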
I just had a quick look at the PIC32 DMA data sheet. That module is a monster! 😅
It would also be quite handy if you implement crypto or CRC functions within the custom functions subsystem. Then you could process an entire block of data without the CPU. As I have already mentioned, I really like the idea of adding a DMA. 👍 But before thinking about all the details, we should first think about some general things:

- Where should we put the DMA? Right after the BUSMUX (adding another BUSMUX there)? Or between CPU and d-cache?
- How do we configure the DMA transfer(s)? Memory-mapped registers inside the DMA, maybe providing two sets? Or a descriptor that is placed in main memory, which the DMA fetches and executes by itself? This would also allow "linked lists" of consecutive transfers, but of course would make the hardware more complex.
- What would the configuration structure (DMA-internal registers or descriptor in memory) look like?
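For the descriptor-in-memory variant, a minimal sketch of what such a structure could look like in C -- all field names and the chaining scheme are assumptions for illustration, not a proposed final layout:

```c
#include <stdint.h>

// Hypothetical DMA descriptor placed in main memory; the DMA would fetch and
// execute it by itself, then follow 'next' to chain consecutive transfers.
typedef struct dma_desc {
  uint32_t src_addr;     // source base address
  uint32_t dst_addr;     // destination base address
  uint32_t num_elements; // number of elements to transfer
  uint32_t config;       // data quantity, address increment flags, ...
  struct dma_desc *next; // next descriptor in the linked list, or NULL to stop
} dma_desc_t;

// Chaining two jobs: the DMA executes desc0, then follows 'next' to desc1.
static dma_desc_t desc1 = { .src_addr = 0, .dst_addr = 0, .num_elements = 0,
                            .config = 0, .next = 0 };
static dma_desc_t desc0 = { .src_addr = 0, .dst_addr = 0, .num_elements = 0,
                            .config = 0, .next = &desc1 };
```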
|
Hi Stephan and agamez, I also like the idea of having a DMA. But please take into account making it completely removable via generics, because in my projects I'm at the limit of logic elements. If the core gets much larger with no option of stripping it down, I will run out of FPGA resources. The scatter/gather approach seems to be the more generic one. Perhaps it could also be a nice idea to give the DMA a dedicated RAM for descriptors only, which is mapped into the normal address region like a peripheral?
BR, |
Absolutely!
Both of them 😉
The entire 32-bit address space (minus the internal IO region?) should also be accessible by the DMA. No need for dual-port memories. The central bus is just shared (time-multiplex) between the CPU and the DMA. Because of the caches the CPU should not "notice" this (in terms of wait cycles).
That would help boost performance, but I think this is out of scope for now. Furthermore, this would make the bus system a real monster in terms of logic resources 😅 |
I think it's more complex than it should be because it includes several submodules, like the CRC function you mention below: Special Function Module (SFM) mode, LFSR CRC, IP header checksum. But I don't think those should be part of the DMA module itself.
I think the proper place is just connected to the bus. There must be some mechanism to invalidate the cache, either in hardware or in software. Although hardware management is preferred for performance, I don't think it's required for us, and having the software take care of that would be enough. Or mark the DMA region of the memory as non-cacheable.
The first method is the simplest; providing just one set is good for some applications (large SPI transfers, for example). But something that requires transferring unlimited data would more than likely require the scatter-gather approach, or at the very least a second set of descriptors to alternate between. As a reference, the Xilinx DMA module is configurable for either case, the second one being much larger than the first. Maybe we can first approach the simplest mode, see how it goes, how much area it uses, solve the cache coherency problems, and later on think about the scatter-gather approach. There could be two selectable modules via generics, depending on the use case.
Sounds reasonable. DMA usually requires aligned transfers; if you need to transfer 510 or 511 bytes, you simply transfer 512 and let the software deal with the extra padding. |
I agree. Let's keep it simple (for now).
I think this is a general problem the programmer has to be aware of. However, we already have i-cache and d-cache sync instructions. These would have to be issued manually in case they are needed - so no hardware-managed coherency stuff (see the sketch below).
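A minimal sketch of the software-managed variant, assuming the d-cache sync instruction mentioned above is issued as a RISC-V `fence`:

```c
#include <stdint.h>

// Sketch: software-managed coherency after a DMA transfer into 'buf'.
// Assumes the d-cache sync mentioned above maps to the RISC-V 'fence' instruction.
void read_dma_buffer(volatile uint8_t *buf, uint32_t len) {
  /* ... DMA has finished writing 'len' bytes into buf ... */
  asm volatile ("fence"); // sync d-cache so the CPU sees the DMA-written data
  for (uint32_t i = 0; i < len; i++) {
    uint8_t b = buf[i];   // now safe to read through the cache
    (void)b;
  }
}
```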
Maybe we should collect some potential use cases. Then we can discuss what concept we really need. |
I see these applications:
Both the UART and SPI modules already have a FIFO interface available; I2C doesn't (maybe it should?). So I believe that an I/O interface like that of a FIFO would be nice to have for the DMA module. How many (configurable) channels should the DMA module have? Only one, or up to 4? (UART, SPI, I2C, external) |
"Transfer until pattern" would be a cool feature, but I think this would also require a LOT of additional logic... 🙈 If the DMA would support "data alignment", i.e. reading bytes and storing them as zero- or sign-extended words, you could also send an entire string to the UART if you now the length of the string in advance (by setting the "number of elements to transfer" to the length of the string).
👍
Maybe we should re-add something like the old SLINK interfaces?! 😅 Anyway, the DMA would also be able to access processor-external devices via the Wishbone interface port. So fetching data via the DMA from an external ADC would still be possible. I see some additional use cases:
The current version of the I2C module requires a certain overhead for each individual I2C transaction (same for the onewire module). That's why there are no FIFOs yet (would require very wide FIFOs). But sure this is something we could/should add in the future. 😉
From a hardware point of view there should be only a single channel (so only one DMA transfer can be in progress at once). Having several (parallel) channels would make the design pretty huge, plus all the bus access gateways would increase the critical path of the system. At first we should start with a DMA that is configured via a single set of memory-mapped registers. Then we could replace this by memory-located descriptors that allow chaining of DMA jobs to "virtually" create an arbitrary number of channels (obviously, they cannot operate in parallel). Here is my first proposal for the DMA configuration structure:
```vhdl
-- transfer type register bits --
constant type_num_lo_c      : natural := 0;  -- r/w: Number of elements to transfer, LSB
constant type_num_hi_c      : natural := 23; -- r/w: Number of elements to transfer, MSB
constant type_src_qsel_lo_c : natural := 24; -- r/w: SRC data quantity, LSB, 00=byte, 01=half-word
constant type_src_qsel_hi_c : natural := 25; -- r/w: SRC data quantity, MSB, 10=word, 11=word
constant type_dst_qsel_lo_c : natural := 26; -- r/w: DST data quantity, LSB, 00=byte, 01=half-word
constant type_dst_qsel_hi_c : natural := 27; -- r/w: DST data quantity, MSB, 10=word, 11=word
constant type_src_inc_c     : natural := 28; -- r/w: SRC constant (0) or incrementing (1) address
constant type_dst_inc_c     : natural := 29; -- r/w: DST constant (0) or incrementing (1) address
constant type_sext_c        : natural := 30; -- r/w: Sign-extend sub-words when set
constant type_endian_c      : natural := 31; -- r/w: Convert Endianness when set

-- control and status register bits --
constant ctrl_en_c       : natural := 0;  -- r/w: DMA enable
constant ctrl_error_rd_c : natural := 29; -- r/-: error during read transfer
constant ctrl_error_wr_c : natural := 30; -- r/-: error during write transfer
constant ctrl_busy_c     : natural := 31; -- r/-: DMA transfer in progress
```

Any thoughts? 😅 |
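To sanity-check the proposal from the software side, here is how a driver might program these bits. The bit positions come from the VHDL constants above; the base address, register offsets, and names are placeholders made up for illustration, not a finalized interface:

```c
#include <stdint.h>

// Placeholder register map -- addresses and names are NOT a finalized interface.
#define DMA_BASE     0xFFFFE000U
#define DMA_SRC_ADDR (*(volatile uint32_t*)(DMA_BASE + 0x0))
#define DMA_DST_ADDR (*(volatile uint32_t*)(DMA_BASE + 0x4))
#define DMA_TYPE     (*(volatile uint32_t*)(DMA_BASE + 0x8))
#define DMA_CTRL     (*(volatile uint32_t*)(DMA_BASE + 0xC))

// Copy n bytes from memory to a fixed peripheral data register.
void dma_mem_to_reg(const uint8_t *src, uint32_t dst_reg, uint32_t n) {
  DMA_SRC_ADDR = (uint32_t)src;
  DMA_DST_ADDR = dst_reg;
  DMA_TYPE = (n & 0x00FFFFFFU)      // bits 23:0  - number of elements
           | (0U << 24)             // bits 25:24 - SRC quantity: 00 = byte
           | (0U << 26)             // bits 27:26 - DST quantity: 00 = byte
           | (1U << 28)             // bit 28     - incrementing SRC address
           | (0U << 29);            // bit 29     - constant DST address
  DMA_CTRL = (1U << 0);             // bit 0      - enable / start
  while (DMA_CTRL & (1U << 31)) { } // bit 31     - busy (or wait for an interrupt)
}
```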
That would be more than enough; we usually already know the length of a string prior to sending it, and we can always strlen() it.
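Reusing the hypothetical `dma_mem_to_reg()` helper sketched above, the string case then boils down to this (with `UART0_TX_REG_ADDR` again a placeholder for the UART's data register address):

```c
#include <string.h>
#include <stdint.h>

#define UART0_TX_REG_ADDR 0xFFFFFF10U // placeholder UART data register address
extern void dma_mem_to_reg(const uint8_t *src, uint32_t dst_reg, uint32_t n);

void dma_send_string(const char *s) {
  // "number of elements to transfer" = length of the string
  dma_mem_to_reg((const uint8_t *)s, UART0_TX_REG_ADDR, (uint32_t)strlen(s));
}
```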
I think the DMA solution would be quite nice. But I'm not entirely sure how you intend to connect the DMA. Would it have just one connection to the Wishbone bus and use it for both accessing the RAM and the external device? That would probably be a big strain on the bus, wouldn't it?
👍
👍
Oh, so even if it's just one channel, it would be a simultaneous read/write one? So I understand that you want to use the Wishbone bus for both operations at the same time, reading from memory and writing, say, to the SPI peripheral using the same bus? That's unusual, I think.
All in all, I think my main doubt is related to the interconnection of the DMA and the peripherals, as per my two comments above. The reason you discarded more than one channel was bus access contention, but you are pretty much forcing it by having the DMA access peripherals through the same bus where the RAM is located. Which, I understand, is really the most straightforward way of doing it. It's only that I'm used to Xilinx-style DMA modules, which on one side present a bus interface and on the other side just look like a FIFO, so you can connect, say, a VHDL cosine stream from a CORDIC generator directly to the DMA module and not strain the memory/peripheral bus with twice as much data as needed. |
Good point!
It would be connected to the processor-internal bus system just like the CPU. So the DMA would have access to the same (and entire) address space that is also accessible by the CPU. The CPU / the caches would have prioritized access and the DMA needs to add sufficient "idle bus cycles" so the CPU can intervene at any time.
Maybe we should clarify the definition of the term "channel" 😅 The DMA should have a single read/write access port to the bus system. By this, the DMA can read an element, store it internally, make alignments and/or sign-extensions and then write it back to the bus system. But only one configured job can be executed at any time. This is what I understood as "single channel" 😉 By this, the DMA would not need any additional data buffer / data memory as indicated by @akaeba. |
Hi Stephan, could the second bus mux not also be optional? |
Thanks for the diagram, that clarifies a few things for me! I'm still unsure about this architecture, though. Let's say we want to transfer 128 bytes through SPI using the DMA. First off, the DMA will read X bytes from memory through the BUSMUX, and then it will write those same X bytes through the same BUSMUX to register 1, the one that receives the data to send... If this diagram is accurate, the PIC32 architecture has the DMA module directly access the internals of the peripherals (hence my earlier question about modules with FIFOs available). But it has the drawback that there is a separate DMA module for each peripheral, which I don't like either (because how often are you going to be transferring large amounts of data from several peripherals at the same time?). My proposal would be quite different and, to be honest, probably more cumbersome to implement, because it would entail exposing the internal FIFO interface of the SPI/UART/etc. modules and creating a separate bus with that interface that would be accessible by the DMA... But I understand you not wanting to do this; it sounds like a lot of work and maybe it isn't justified in terms of performance 😅 |
Correct. The second bus mux would also be replaced by a simple route-through if the DMA is omitted. I just dropped that from the diagram (there would be no generic for the second bus mux - just for the entire DMA complex).
The DMA would have no internal memory to buffer transfers, as burst accesses are not supported yet. Hence, the DMA would operate element-by-element (my proposal):

1. read one element (byte/half-word/word) from the source address via the bus,
2. optionally align, sign-/zero-extend or Endianness-convert it in an internal register,
3. write the element to the destination address via the bus.
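Purely as an illustration of the intended behavior (the real thing would be a hardware state machine, not software), a C model of the element-by-element operation could look like this:

```c
#include <stdint.h>

// Behavioral model of the proposed DMA: one atomic read, optional processing,
// one atomic write -- repeated per element, with no internal buffer memory.
void dma_model(volatile uint32_t *src, volatile uint32_t *dst,
               uint32_t n, int src_inc, int dst_inc) {
  for (uint32_t i = 0; i < n; i++) {
    uint32_t elem = *src; // (1) atomic bus read of one element
    /* (2) optional alignment / sign-extension / Endianness conversion here */
    *dst = elem;          // (3) atomic bus write of the element
    if (src_inc) src++;   // address increments handled by the hardware
    if (dst_inc) dst++;   // (CPU/cache requests may slot in between elements)
  }
}
```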
I am not sure what you are trying to point out here 🙈😅 The DMA would have access to the entire address space. So it would be no problem to - for example - load data from DMEM and store it to some SPI register since all the peripherals are also memory-mapped. |
Lol, sorry! I come from the DSP world and computer architecture is really not my thing at all, so maybe I'm just talking nonsense. I'm probably not explaining myself properly 😞 What I'm worried about is that you need steps (1) and (3) to happen, and thus you are using the bus twice per word: you read one word, maybe align/sign-extend/convert the data (or do nothing at all with it), and then write it back to a different address. That is two bus accesses for just one word. So, yes, we are freeing the CPU from doing those operations, but we are hindering its access to the memory and other peripherals because the DMA module would be using the bus twice per word. The DMA in the PIC32 diagram above only needs one bus access per word, because the DMA module is inside the peripheral itself and can fetch data from memory and feed it directly to the peripheral without using the bus again. |
No worries! I come from the electrical world playing with wires and transistors, so high-level / software concepts do not fit my mind very well 😅
That's true. The DMA would block the entire bus while doing its transactions. But this is a general problem with the DMA concept - especially when there is no multi-layer bus that would support multiple point-to-point connections in parallel. The bus accesses triggered by the DMA would be no faster than the ones triggered by the CPU. But a DMA block transfer would still be more performant, as all the overhead (address increment, looping) would be done in hardware. Of course there will be situations where the CPU has to wait for the DMA before it can fetch further instructions. Again, this is a general problem. But that should not hurt much, as the CPU has caches, so it could keep operating on the local instructions/data inside those caches while the DMA blocks the bus system. Furthermore, the BUSMUX structure I showed above prioritizes bus accesses: requests from the CPU / the caches would be served first. The DMA would also split a single transfer into two atomic operations: an atomic bus read and an atomic bus write. The CPU/caches can take over bus control in between those atomic DMA accesses, so CPU/cache requests would be further prioritized. |
I think I am also quite mind-polluted by all the Xilinx stuff, which is great for some things and not so great for others... Everyone has their own biases!
Ah, you see, I had completely forgotten about the caches, even though they were clearly there in the diagram. So even in a BRAM-based design where the RAM is exactly as fast as a cache, it could make sense to add a small cache just so the DMA can do its thing while the CPU is left alone, without even needing to access the main bus. So, all in all, I think this looks promising! Thanks a lot for the discussion! |
I think so too. Even in a setup without caches, the CPU still has a tiny instruction prefetch buffer, so it can process (non-memory) instructions without making any bus requests at all.
Thank you, too! 👍 Btw, I've started implementing a first version of the DMA in #593. |
You are more productive on a random Friday than I am on any day of the year 😅 |
Day off here 😎 |
I am unsure about the flexibility of the scatter/gather concept and the corresponding data quantities. Do we need to support all possible combinations? If we leave out the sign-/zero-extension and Endianness conversion options, the following basic transfer types (considering only a single element) would be possible - every combination of source and destination quantity:

- BYTE-to-BYTE, BYTE-to-HALF, BYTE-to-WORD
- HALF-to-BYTE, HALF-to-HALF, HALF-to-WORD
- WORD-to-BYTE, WORD-to-HALF, WORD-to-WORD

However, I cannot see use cases for all of them.
Especially the "greater to smaller" transfers like WORD-to-BYTE do not make sense to me 🤔 |
Hi Stephan, here is what came to my mind: at the moment, 24 bits are reserved for the address increment counter. Could it be a good idea to make this size a generic? For instance, 16 bits would lead to a 64k max chunk size. The result would be a smaller adder in terms of logic elements. What do you think? |
Reducing the size from 24 bits down to 16 bits might save about 2x8 FFs and maybe something like 2x8+3 LUTs. Sure, that would help reduce the total size of the processor, but only to a very, very small extent. But here we have a general question: do we really need 24 bits for this register? Or can we just constrain the counter down to 16 bits? What would be the "average" maximum size for a DMA transfer? I know this is absolutely application-specific, but maybe there is a reasonable trade-off?! 😅 |
Hi!
It's not very often that we see DMA controllers in small SoCs, truth be told. And when one exists, it's typically used in conjunction with other internal peripherals. For example, why use a FIFO for SPI/I2C when you can just leave your data where it is in memory and have the peripheral read the memory directly via the DMA? This only makes sense for high-speed protocols, but maybe you want to run SPI in the hundreds of MHz, who knows...
Anyway, here is a random example of why using DMA with SPI is sometimes useful: https://hackaday.com/2016/09/26/pic32-dma-spi/
Well, for a continuous stream, yes, that'd be the immediate solution. But a large FIFO is a lot of logic that you may not have available, or that you would be dedicating exclusively to this purpose. If I write directly to the DMEM, I can make this buffer as large as fits on my device and not care about partitioning of resources (how much logic should I dedicate to the FIFO? It doesn't matter - decide it at runtime via the DMA transfer size).
In principle yes, but once set up, the DMA could be self-sustaining, initiating the next transfer after the last has finished, so that would make it look like an infinite FIFO, provided the software is able to read each slot before the DMA overwrites its contents.
The Xilinx DMA core allows two modes of configuration: a simple register-direct mode, where a single transfer is programmed via memory-mapped registers, and a scatter/gather mode, where the core follows descriptor chains placed in memory.
My proposal for the neorv32 would be a mix of them, to keep it simple. So instead of having just a single DMA transfer that must be triggered manually every time - and thus makes you lose precious time - the DMA core itself would allow the definition of two different transfers that can be set to swap from one to the other, thus allowing continuous transfer (see the ping-pong sketch below).
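A rough C sketch of how software could drive that ping-pong scheme; `dma_set_job()` and the completion interrupt handler are hypothetical names for the proposed second transfer set, not an existing API:

```c
#include <stdint.h>

extern void dma_set_job(const uint8_t *src, uint32_t n); // hypothetical re-arm call

static uint8_t buf_a[512], buf_b[512];
static volatile int active; // which buffer the DMA is currently draining

// Hypothetical DMA-done interrupt handler: swap buffers so the transfer
// stream never stops, then let software refill the buffer that just finished.
void dma_done_irq_handler(void) {
  active ^= 1;
  dma_set_job(active ? buf_b : buf_a, 512); // start draining the other buffer
  /* refill the just-finished buffer (buf_a if active==1, else buf_b) here */
}
```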
This core could be used not only by external devices such as the ADC I mentioned, but also by the SPI/UART/(future Ethernet) cores. Of course, it should be bidirectional and allow transfers from and to DMEM. But in the SPI and UART cases it should definitely not be the default, as the DMA core would probably use more logic than a small FIFO, which is probably enough most of the time.
Truth be told, even Xilinx's AXI Ethernet Lite does not use DMA and does a fine job with a small MicroBlaze running at near 100 Mbps, so maybe it's all overkill anyway...
Originally posted by @agamez in #9 (reply in thread)