DMA #13

Open
notro opened this issue Mar 20, 2015 · 122 comments

Comments

@notro

notro commented Mar 20, 2015

I have started to look at spi-bcm2835 and DMA. My FBTFT drivers are now included in raspberrypi/linux, but I need a DMA SPI driver there as well, so I can stop releasing my custom kernel.
To get DMA channels from Device Tree, I had to make a change to drivers/dma/bcm2708-dmaengine.c first. With that in place this patch gives me channels and the subsystem now calls into can_dma()

diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c
index 6916745..8e5e3b3c 100644
--- a/drivers/spi/spi-bcm2835.c
+++ b/drivers/spi/spi-bcm2835.c
@@ -26,6 +26,7 @@
 #include <linux/clk.h>
 #include <linux/completion.h>
 #include <linux/delay.h>
+#include <linux/dmaengine.h>
 #include <linux/err.h>
 #include <linux/interrupt.h>
 #include <linux/io.h>
@@ -298,6 +299,92 @@ out:
    return 0;
 }

+static bool bcm2835_spi_can_dma(struct spi_master *master,
+               struct spi_device *spi,
+               struct spi_transfer *transfer)
+{
+// struct bcm2835_spi *bs = spi_master_get_devdata(master);
+
+printk("%s\n", __func__);
+// if (transfer->len > XX)
+//     return true;
+   return false;
+}
+
+static void bcm2835_spi_dma_exit(struct spi_master *master)
+{
+printk("%s\n", __func__);
+   if (master->dma_rx) {
+       dma_release_channel(master->dma_rx);
+       master->dma_rx = NULL;
+   }
+
+   if (master->dma_tx) {
+       dma_release_channel(master->dma_tx);
+       master->dma_tx = NULL;
+   }
+}
+
+#define MAX_DMA_LEN    SZ_64K
+static int bcm2835_spi_dma_init(struct device *dev, struct spi_master *master)
+{
+   struct bcm2835_spi *bs = spi_master_get_devdata(master);
+   struct dma_slave_config slave_config = {};
+   int ret;
+
+printk("%s\n", __func__);
+   /* Prepare for TX DMA */
+   master->dma_tx = dma_request_slave_channel(dev, "tx");
+   if (!master->dma_tx) {
+       dev_err(dev, "cannot get the TX DMA channel!\n");
+       ret = -EINVAL;
+       goto err;
+   }
+
+   slave_config.direction = DMA_MEM_TO_DEV;
+   slave_config.dst_addr = (u32) (bs->regs + BCM2835_SPI_FIFO);
+   slave_config.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+   slave_config.slave_id = 6; /* DREQ channel 6 = SPI TX */
+// slave_config.dst_maxburst = 128; /* FIFO depth */
+   ret = dmaengine_slave_config(master->dma_tx, &slave_config);
+   if (ret) {
+       dev_err(dev, "error in TX dma configuration.\n");
+       printk("ret=%i\n", ret);
+       goto err;
+   }
+
+   /* Prepare for RX */
+   master->dma_rx = dma_request_slave_channel(dev, "rx");
+   if (!master->dma_rx) {
+       dev_dbg(dev, "cannot get the DMA channel.\n");
+       ret = -EINVAL;
+       goto err;
+   }
+
+   slave_config.direction = DMA_DEV_TO_MEM;
+   slave_config.src_addr = (u32) (bs->regs + BCM2835_SPI_FIFO);
+   slave_config.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+   slave_config.slave_id = 7; /* DREQ channel 7 = SPI RX */
+// slave_config.src_maxburst = 128; /* FIFO depth */
+   ret = dmaengine_slave_config(master->dma_rx, &slave_config);
+   if (ret) {
+       dev_err(dev, "error in RX dma configuration.\n");
+       goto err;
+   }
+
+// init_completion(&spi_imx->dma_rx_completion);
+// init_completion(&spi_imx->dma_tx_completion);
+   master->can_dma = bcm2835_spi_can_dma;
+   master->max_dma_len = MAX_DMA_LEN;
+// master->flags = SPI_MASTER_MUST_RX | SPI_MASTER_MUST_TX;
+
+   return 0;
+err:
+   bcm2835_spi_dma_exit(master);
+
+   return ret;
+}
+
 static int bcm2835_spi_probe(struct platform_device *pdev)
 {
    struct spi_master *master;
@@ -353,6 +440,9 @@ static int bcm2835_spi_probe(struct platform_device *pdev)
        goto out_clk_disable;
    }

+   if (bcm2835_spi_dma_init(&pdev->dev, master))
+       dev_err(&pdev->dev, "dma setup error, falling back to pio\n");
+
    /* initialise the hardware */
    bcm2835_wr(bs, BCM2835_SPI_CS,
           BCM2835_SPI_CS_CLEAR_RX | BCM2835_SPI_CS_CLEAR_TX);
@@ -382,6 +472,7 @@ static int bcm2835_spi_remove(struct platform_device *pdev)
           BCM2835_SPI_CS_CLEAR_RX | BCM2835_SPI_CS_CLEAR_TX);

    clk_disable_unprepare(bs->clk);
+   bcm2835_spi_dma_exit(master);

    return 0;
 }

I used this commit to get me going: raspberrypi/linux@f62cacc
This is also helpful: https://github.com/raspberrypi/linux/blob/rpi-3.18.y/drivers/mmc/host/bcm2835-mmc.c

There's of course a lot missing here, but since you are much more versed in SPI and DMA, I'm wondering how far down on your list DMA is now?

bcm2708-dmaengine.c is in need of a cleanup + patching to get this to work, and bcm2835-dma.c will need a patch to get slave config support, so there's a bit of work here before we are home with SPI DMA in mainline.

The nice thing about the can_dma feature is that the subsystem does the DMA mapping, which means we should be able to get DMA transfers with spidev as well, since every transfer is checked to see if it is eligible for DMA. http://lxr.free-electrons.com/ident?i=spi_map_buf

This will also fix my problem of sending a SPI message with a 1 byte transfer and a 4k transfer.
This currently fails in spi-bcm2708 dma (crashes driver), but when transfers are looked at individually I guess it will work. Currently I work around this byte prepending by copying to a new buffer.

@notro
Author

notro commented Mar 20, 2015

I just discovered that slave_id is set from Device Tree:

- #dma-cells: Must be <1>, the cell in the dmas property of the
        client device represents the DREQ number.

https://www.kernel.org/doc/Documentation/devicetree/bindings/dma/bcm2835-dma.txt

So this should probably be correct:

    fragment@0 {
        target = <&spi0>;
        __overlay__ {
            status = "okay";
            compatible = "brcm,bcm2835-spi";

            dmas = <&dma 7>,
                   <&dma 6>;
            dma-names = "rx", "tx";
        };
    };

@msperl
Owner

msperl commented Mar 20, 2015

How far down my list?
Well, for most parts I try to get the quick wins upstream first.

The first 4 patches are out, but not all are merged.

The next will be the optimization reducing the interrupt count:

  • push as much data into the fifo as possible
  • fill the fifo without interrupt on start of transfer

Then I will try to argue for some polling for short transfers (we got a lot of irq overhead and saving here could just as well be spent polling...)

Then we can discuss dma for big transfers.

I still doubt that we will ever get the fully dma pipelined version into upstream...

Martin

@msperl
Owner

msperl commented Mar 21, 2015

Having had a look, what you have started looks promising.
I will look into what it does and what it does not do, hopefully tomorrow...
One issue I see is that the "falling back to pio" message may make people complain: as of now there is no requirement for DMA, so we probably need to stay backwards compatible and keep messages to a minimum.

@msperl
Owner

msperl commented Mar 21, 2015

@notro: Things I want to understand from you and your use-case:

  • are your transfers mostly "single" transfers of big sizes? The code above states a limit of SZ_64K, but I guess that a TFT would be producing bigger transfers than 64k, so I wonder how you intend to achieve this
  • how often do you run these transfers/second?
  • are you running this already DMA mapped (so dma_tx set)?
  • are you running the same transfer repeatedly or are you reusing the transfer and the frame buffer? I wonder because the "checking" of a spi_message takes some time. In my case a message with 3 transfers takes 17us alone (most probably because of a cache miss on ARM for code+data), while a second message with 2 transfers then takes only 5.3us. So I wonder if the "optimize" option for repeated transfers might make any sense from your perspective... It is minor for long transfers, but it is still time that is unnecessarily spent.

Finally: I have an Adafruit ST7735 display with SD card (never used it), so I wonder how complex a simple test setup is, so that I can check on my own that everything is working? Any pointers?

@msperl
Owner

msperl commented Mar 21, 2015

In the meantime I found your wiki - will try to test it tomorrow.
With this (plus the enc28j60) I should have 4 distinct devices to test the spi-driver with...

@msperl
Owner

msperl commented Mar 21, 2015

Actually I see that the device supports 9-bit, so we might also try to investigate that one - but only if you want to help with the fb driver side of making things possible...
I do not know how things would work out. In particular I am not sure how DMA would play along: apart from the bit for LoSSI DMA this is not specified in the document, so it would be trial and error.

@notro
Author

notro commented Mar 21, 2015

are your transfers mostly "single" transfers of big sizes?

I do single transfers, but in some experimental code I have tried sending a control byte as one transfer and the payload as the next, both in one SPI message, and that crashes the spi-bcm2708 DMA driver.
The driver doesn't discriminate between transfers, it's all DMA or nothing on the SPI message level.

The pixels in the framebuffer are 16-bit rgb565 little endian on the Pi.
The LCD controllers are big endian. So the pixels are byteswapped and sent in a 4k DMA buffer.
This buffer size can be changed (txbuflen), but that is rarely done.
So the whole of video memory is not sent at once, it's sent in chunks, and only the parts that have changed (in PAGE_SIZE resolution).
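
The chunked, byte-swapped copy described above could look roughly like the following. This is a user-space sketch for illustration only, not the FBTFT code; `swap_chunk` and its parameters are made-up names, and it assumes the pixels are plain 16-bit values that must go out high byte first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy up to txbuf_len bytes worth of rgb565 pixels from vmem into txbuf,
 * emitting each pixel high byte first (big endian on the wire).
 * Returns the number of bytes written. Hypothetical helper.
 */
size_t swap_chunk(const uint16_t *vmem, size_t pixels_left,
                  uint8_t *txbuf, size_t txbuf_len)
{
    size_t max_pixels = txbuf_len / 2;
    size_t n = pixels_left < max_pixels ? pixels_left : max_pixels;

    assert(vmem && txbuf);
    for (size_t i = 0; i < n; i++) {
        txbuf[2 * i]     = (uint8_t)(vmem[i] >> 8);   /* high byte first */
        txbuf[2 * i + 1] = (uint8_t)(vmem[i] & 0xff); /* then low byte */
    }
    return n * 2;
}
```

A caller would invoke this repeatedly, advancing `vmem` by the returned byte count divided by two and sending `txbuf` after each call, until no pixels are left.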

Here's a test I did 2 years ago with different transmit buffer lengths. Numbers are full frame update time in ms.

Driver          vmem     1024    2048    4096 (std)   8192    16384   32768   65536   131072   Gain from standard
-----------------------------------------------------------------------------------------------------------------
adafruit22fb    77440    56.11   51.18   48.08        46.84   46.14   45.62   45.79   45.79    2.29 (5%)
sainsmart18fb   40960    17.98   16.58   15.84        15.32   15.18   15.08   14.92   -        0.92 (6.1%)

SPI controllers that support 16-bit width (BBB), will get the same buffer treatment.
These are some very Pi specific drivers, hence me trying to rewrite it bottom up.
No need to do byte swapping when native 16-bit is available.

how often do you run these transfers/second?

That depends. With a high fps= value and video playback, it runs almost continuously.
There is a minimum pause of 1 jiffy between updates, but updates go out as fast as the buffer copy and SPI allow.

are you running this already DMA mapped (so dma_tx set)?

yes.
https://github.com/raspberrypi/linux/blob/rpi-3.18.y/drivers/staging/fbtft/fbtft-io.c#L7

are you running the same transfer repeatedly or are you reusing the transfer and the frame buffer?

I reuse the DMA buffer and its mapping, but not the SPI message.

Actually I see that the device supports 9-bit, so we might also try to investigate that one - but only if you want to help with the fb driver side of making things possible...

If you run with the stock spi-bcm2708, it uses 9-bit natively. So a kernel acquired with: sudo rpi-update, will give you support for the display using the 9-bit mode.
If you use my kernel with the DMA spi-bcm2708: sudo REPO_URI=https://github.com/notro/rpi-firmware rpi-update, then you get 9-bit emulation:
https://github.com/raspberrypi/linux/blob/rpi-3.18.y/drivers/staging/fbtft/fbtft-core.c#L1446
https://github.com/raspberrypi/linux/blob/rpi-3.18.y/drivers/staging/fbtft/fbtft-io.c#L43

I think we should just drop LoSSI. It's not generic 9-bit and there is this other SPI controller available on the header now which supports 2-32 bits (but lacks a driver).
9-bit displays are rare, and controllers that support a 9-bit interface mode also support an 8-bit + DC pin mode.
Watterott had a display with the 9-bit configuration, but switched to 8-bit when I said that 9-bit would not be supported with DMA in FBTFT.
(I have done a rewrite of FBTFT that's more generic, and here 9-bit DMA is supported, but it's far from production quality.)
But back to 9-bit and the Pi: FBTFT emulates 9-bit by sending 9 bytes (eight 9-bit words) as the smallest unit. This can be done because zero is a no-op, and the display line width is divisible by 8.
It uses double buffering to do this, but for resolutions below 240x320 this isn't a problem.
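
To make the 9-bit emulation idea concrete, here is a minimal sketch of packing 9-bit words into an 8-bit byte stream. It is not taken from FBTFT: `pack9` is a hypothetical name, and the choice to pad with zero words at the end of the stream (relying on zero being a no-op) is an assumption; the real driver may pad differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pack 9-bit words (bit 8 = D/C, bits 7..0 = data) MSB first into bytes.
 * The stream is padded with all-zero words up to a multiple of 8 words,
 * which is a multiple of 9 bytes. Returns the number of bytes written;
 * out must hold at least ((n + 7) / 8) * 9 bytes.
 */
size_t pack9(const uint16_t *words, size_t n, uint8_t *out)
{
    size_t padded = (n + 7) / 8 * 8;   /* word count rounded up to 8 */
    size_t nbytes = padded / 8 * 9;    /* 8 words of 9 bits = 9 bytes */
    size_t bitpos = 0;

    assert(words && out);
    for (size_t i = 0; i < nbytes; i++)
        out[i] = 0;                    /* padding words stay zero */

    for (size_t i = 0; i < n; i++) {
        uint16_t w = words[i] & 0x1ff;
        for (int b = 8; b >= 0; b--, bitpos++) {
            if (w & (1u << b))
                out[bitpos / 8] |= (uint8_t)(0x80u >> (bitpos % 8));
        }
    }
    return nbytes;
}
```

Because 8 words of 9 bits are exactly 72 bits, every group lands on a byte boundary, so a plain 8-bit SPI controller can clock the result out unchanged.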

About DMA buffer size:
There was a case where someone had a PiTFT display and a usb audio card at the same time, and got distortions in audio when the display was updated every second (clock).
Changing txbuflen to 256 solved the audio problem.
USB also uses DMA, so there had to be competition for some resource.
Do you know what could have caused this behaviour?

Noralf.

@msperl
Owner

msperl commented Mar 21, 2015

Thanks for all that info.
So I will try to run the setup of the display tomorrow.

I was actually assuming you would be doing a full frame transfer, which would require transferring more bytes in one go.
So this essentially means you run 10 spi transfers to update a 160x120 display at 2 bytes per pixel, which amounts to 38400 bytes of payload.

At 30fps that would be 300 spi transfers/second and a corresponding amount of interrupts.
And as we have quite some irq overhead this means lots of cpu wasted.

As for usb latencies with audio: I guess it would again be related to interrupt latencies that block/delay usb transmits.
DMA itself should only be running in bursts of 16 bytes (because DREQ triggers when the spi-tx irq would trigger), so it would not occupy the bus for long periods of time.

But note again that the irq handler has about 10us overhead (assuming cache-miss for the interrupt code) so with 300 spi transfers we would have 3ms spent only with irq overhead.
So if some of those coincide with usb interrupts then it would become interesting and could trigger clicks...
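
The arithmetic behind these numbers, as a small sketch. It assumes one interrupt per SPI transfer, a 4096 byte transmit buffer and the roughly 10us per-interrupt overhead quoted above; the function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of SPI transfers needed to send frame_bytes in chunks of chunk_bytes. */
size_t transfers_per_frame(size_t frame_bytes, size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return (frame_bytes + chunk_bytes - 1) / chunk_bytes;
}

/*
 * Microseconds per second spent on interrupt overhead alone,
 * assuming one interrupt of irq_us microseconds per transfer.
 */
unsigned long irq_overhead_us_per_s(size_t frame_bytes, size_t chunk_bytes,
                                    unsigned fps, unsigned irq_us)
{
    return (unsigned long)transfers_per_frame(frame_bytes, chunk_bytes)
           * fps * irq_us;
}
```

With a 160x120 frame at 2 bytes per pixel (38400 bytes) and 4096 byte chunks this gives 10 transfers per frame, and at 30 fps with 10us per interrupt it gives 3000us, i.e. the 3ms per second mentioned above.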

But imo the irq story is still "ugly", so I am still surprised that the time-critical usb-irqhandling has not been moved to the rtos that is essentially the firmware.

Why? Because the videocore has a vectored irq-table, which is something that the arm does not have - it just has 2 interrupts and has to figure out what was actually triggering the issue. Then it has to mask that source to do its work, and afterwards unmask it.

If I understood correctly they now think of splitting the usb interrupts into separate fast interrupt handler and configure on each core a different handler, which really makes things faster...

One final thing: the fb modules also work in the 4.0rcX, so that I can test it there - we need that for upstreaming!

So in the end I shall be applying my logic analyzer against the lines and see what shows up...

@notro
Author

notro commented Mar 21, 2015

As for usb latencies with audio: I guess it would again be related to interrupt latencies that block/delay usb transmits.
DMA itself should only be running in bursts of 16 bytes (because DREQ triggers when the spi-tx irq would trigger), so it would not occupy the bus for long periods of time.

This is strange, because lowering the transmit buffer size to 256 bytes solved the problem, but this means 16 times as many SPI interrupts as with a 4K buffer.
So it doesn't look like an interrupt congestion problem?
I didn't know that DMA was run in bursts of 16 bytes (or is it 16 32-bit words?). I assume the same goes for USB?
Maybe this is where the 16 byte FIFO depth in the datasheet comes from.

Why? Because the videocore has a vectored irq-table, which is something that the arm does not have - it just has 2 interrupts and has to figure out what was actually triggering the issue. Then it has to mask that source to do its work, and afterwards unmask it.

Clearly this processor was never meant to be used in a general purpose computer :-)

So in the end I shall be applying my logic analyzer against the lines and see what shows up...

Nice to be able to get that ultimate proof, takes away a lot of guessing.

Here's some performance numbers: https://github.com/notro/fbtft/wiki/Performance
I must say that ~12MB/s over SPI is quite good.

@msperl
Owner

msperl commented Mar 21, 2015

So I misread your last message.
The only thing I could think of is that only a few DMAs can be active at one time - but the manual does not read that way.
The other option would be that the DMA is using too much of the AXI bus bandwidth, limiting the time the cpu has to respond.
It would be interesting to replicate this and instrument the kernel to see where we waste time...

But let us first get to a situation where we have again something we can use - also with the rpi2...

I will try to figure out the dt for my display device and then I can try to start the dma for real.
Actually your use-case is fairly easy, but I still would leave out 9bit for now...

@msperl
Owner

msperl commented Mar 22, 2015

BTW: with the test you mentioned that introduced "jitter" on USB audio - did that user also have the HDMI output running? I wonder because, the way I understand it, the HDMI output is driven via DMA as well (just by the firmware). So these multiple sources may produce congestion on the connection to SRAM, which would impact all code paths that are not in L1/L2, hence the "latencies".
But again: that is something that I am guessing - reproducing is the only thing that may really shed light on the situation.

Gathering more details would be helpful for running these kinds of tests in the future...

@notro
Author

notro commented Mar 22, 2015

I have heard of only two instances of this. See: raspberrypi/linux#888 (comment)
My comment further down has a forum link.

@msperl
Owner

msperl commented Mar 22, 2015

OK - got it working with DT alone.
Here are the things I needed to make the DT work on the 4.0rc3 kernel:

&gpio {
        /* node name assumed; the label must match pinctrl-0 below */
        fb0_pins: fb0_pins {
                brcm,pins = <13 12>;
                brcm,function = <1>; /* output */
        };
};
&spi {
        fb@1 {
                reg = <1>;
                status = "okay";
                compatible = "spi,fb_st7735r";
                pinctrl-names = "default";
                pinctrl-0 = <&fb0_pins>;
                spi-max-frequency = <8000000>;
                reset-gpios = <&gpio 12 1>;
                dc-gpios = <&gpio 13 0>;
                buswidth = <8>;
        };
};

Here is also the screenshot of the "unpatched" driver:
[image: tft-stock-kernel]

This shows a full refresh including all the interrupt gaps and overheads.
Whenever D20-handle is high, we are inside an interrupt.
You also see those chunks of 12 bytes followed by a 3.5us delay.
This is obviously not "optimal".

@msperl
Owner

msperl commented Mar 22, 2015

BTW: the "simple" optimization of writing as many bytes as possible immediately results in 64 bytes getting written before you get an interrupt:
[image: tft-stock-kernel write_as_much_as_possible]

It still means we spend a lot of time in the IRQ handler, but it is an improvement already...

Continuing towards DMA...

@notro
Author

notro commented Mar 22, 2015

With very small transfers of just a couple of bytes, would it be faster to use polling instead of a completion interrupt?

Some details about a full frame display update:

First set_addr_win is called.
This sets the window in chip GRAM that will get the update.
SPI transfers:
0x2A with DC=0
4 bytes with DC=1
0x2B with DC=0
4 bytes with DC=1
0x2C with DC=0

Then the framebuffer is sent in 4k chunks with DC=1
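
The command sequence above can be sketched as a small table builder. This is only an illustration: `spi_step` and `build_addr_win` are hypothetical names, and the layout of the 4 data bytes (start and end coordinate, each 16-bit big endian) is an assumption based on how column/row address commands usually work on these controllers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct spi_step {
    int dc;          /* D/C line level: 0 = command, 1 = data */
    size_t len;      /* valid bytes in buf */
    uint8_t buf[4];
};

static void put_cmd(struct spi_step *s, uint8_t cmd)
{
    s->dc = 0;
    s->len = 1;
    s->buf[0] = cmd;
    s->buf[1] = s->buf[2] = s->buf[3] = 0;
}

static void put_range(struct spi_step *s, uint16_t start, uint16_t end)
{
    s->dc = 1;
    s->len = 4;
    s->buf[0] = (uint8_t)(start >> 8);
    s->buf[1] = (uint8_t)(start & 0xff);
    s->buf[2] = (uint8_t)(end >> 8);
    s->buf[3] = (uint8_t)(end & 0xff);
}

/* Fill the five steps that precede the pixel data; returns the step count. */
size_t build_addr_win(uint16_t xs, uint16_t ys, uint16_t xe, uint16_t ye,
                      struct spi_step steps[5])
{
    assert(steps && xs <= xe && ys <= ye);
    put_cmd(&steps[0], 0x2A);      /* column range follows */
    put_range(&steps[1], xs, xe);
    put_cmd(&steps[2], 0x2B);      /* row range follows */
    put_range(&steps[3], ys, ye);
    put_cmd(&steps[4], 0x2C);      /* pixel data follows with DC=1 */
    return 5;
}
```

Each step maps to one short SPI transfer with the D/C GPIO set accordingly, which is why these tiny transfers are candidates for polling rather than interrupts.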

Continuing towards DMA...

can_dma won't work in mainline because bcm2835-dma lacks device_prep_slave_sg. bcm2708-dmaengine has this.
When we have a spi-bcm2835 that works with DMA, then I can patch bcm2835-dma, because I will have a known working client to test with.

I'm hoping to get together a PR for bcm2708-dmaengine next week to make it hand out DMA channels from Device Tree.
I hope to get bcm2835-mmc working with Device Tree as well.

@msperl
Owner

msperl commented Mar 22, 2015

Yes - for very short requests (total transfer time <10us) it is beneficial to run the driver in polling mode. The interrupt overhead is on the order of 10us, so avoiding it helps speed up responses...
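
A rough sketch of such a polling decision, assuming the wire time can be estimated as 8 clock cycles per byte with no gaps between bytes (the real controller may need more) and using the 10us figure as the cut-off. `xfer_time_ns`, `should_poll` and `POLL_LIMIT_NS` are names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POLL_LIMIT_NS 10000u   /* ~10us, the interrupt overhead quoted above */

/* Estimated wire time in ns, assuming 8 clock cycles per byte and no gaps. */
uint64_t xfer_time_ns(size_t len, uint32_t speed_hz)
{
    assert(speed_hz > 0);
    return (uint64_t)len * 8u * 1000000000ull / speed_hz;
}

/* Poll when the whole transfer is shorter than one interrupt round trip. */
bool should_poll(size_t len, uint32_t speed_hz)
{
    return xfer_time_ns(len, speed_hz) < POLL_LIMIT_NS;
}
```

At 8 MHz a 4 byte transfer takes about 4us on the wire and would be polled, while a 4096 byte chunk takes about 4ms and would not.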

@msperl
Owner

msperl commented Mar 25, 2015

Note that I now also have patches for polling and for filling the FIFO prior to enabling interrupts, which cut down on latencies.

I have also figured out that spi_sync is now quite optimized and no longer wakes the spi worker thread when there is nothing in the queue. So it has become more efficient than with 3.12, which means "spi_async" is no longer an absolute necessity for high-performance drivers...
This also means the CPU accounting has moved to the driver thread itself or to the process that triggered the spi activity...

What is still missing is the consolidation of handling all transfers in interrupt context, with a wakeup ONLY at the end of the message - not at the end of EACH spi_transfer inside a spi_message.

@msperl
Owner

msperl commented Mar 25, 2015

one other thing: would it not make sense to run the Command and Data as separate SPI-devices, where the core would schedule an extra GPIO up/down?

@notro
Author

notro commented Mar 25, 2015

No, almost all of the transfer time is for pixel data and the gpio stays high.
6 SPI messages for a full display update:
1 byte
4 bytes
1 byte
4 bytes
1 byte
2 bytes x resolution

@msperl
Owner

msperl commented Mar 28, 2015

I received a USB sound card yesterday for testing the latency issues, and I have to tell you that the card works with no jitter on the foundation kernel, but with lots of minimal jitter on the upstream kernel, which runs 8k interrupts/s on the USB side.

Note that there is no accounting for fast-interrupts, so I can not say if fiq are running 8000 times/second.

Tests were done using an identical ModelB+ with SPI drivers not installed - only mpg123 was running.

To me it looks as if this is related to interrupt-latencies, that become more pronounced with the upstream kernel.

Maybe one interpretation of why this could impact us negatively with big DMA transfers:
I see big differences in execution/reaction times between the first time some code is used and a subsequent execution of the same code (at least during SPI transfers). This is most likely related to the code not sitting in the CPU cache, so the CPU has to fetch the code from SRAM (at slower speeds).
The longer the DMA transfer takes, the more likely the code will have been evicted from the CPU cache. As a consequence the SPI interrupt takes longer to run, which may delay USB interrupt handling, and that may mean jitter.

The USB interrupts also evict SPI interrupt code from cache and vice versa.

So the shorter the DMA the more likely the irq-codepath is still in cache and the shorter the response-time.

Now let us get to the DMA case and then we can look at it in detail how it affects the interrupt latencies with my instrumented kernel pulling GPIOs low/high.

@notro
Author

notro commented Mar 28, 2015

The mainline usb driver doesn't use fiq's last time I checked.

@msperl
Owner

msperl commented Mar 28, 2015

It is (at least on the RPI2 with 3.18.8-v7+):

root@rasp2:~# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
 16:          0          0          0          0   ARMCTRL  16  bcm2708_fb dma
 24:       1230          0          0          0   ARMCTRL  24  DMA IRQ
 25:      15394          0          0          0   ARMCTRL  25  DMA IRQ
 32:   14598357          0          0          0   ARMCTRL  32  dwc_otg, dwc_otg_pcd, dwc_otg_hcd:usb1
 49:          0          0          0          0   ARMCTRL  49  3f200000.gpio:bank0
 50:          0          0          0          0   ARMCTRL  50  3f200000.gpio:bank1
 65:         95          0          0          0   ARMCTRL  65  ARM Mailbox IRQ
 66:          6          0          0          0   ARMCTRL  66  VCHIQ doorbell
 75:          1          0          0          0   ARMCTRL  75
 79:          0          0          0          0   ARMCTRL  79  3f804000.i2c
 83:          7          0          0          0   ARMCTRL  83  uart-pl011
 84:     102299          0          0          0   ARMCTRL  84  mmc0
 99:     352608     618454     100002      61597   ARMCTRL  99  arch_timer
FIQ:              usb_fiq
IPI0:          0          0          0          0  CPU wakeup interrupts
IPI1:          0          0          0          0  Timer broadcast interrupts
IPI2:     256492      73903     307575     301522  Rescheduling interrupts
IPI3:         12         14         19         12  Function call interrupts
IPI4:         17          4         25        226  Single function call interrupts
IPI5:          0          0          0          0  CPU stop interrupts
IPI6:          0          0          0          0  IRQ work interrupts
IPI7:          0          0          0          0  completion interrupts
Err:          0

@msperl
Owner

msperl commented Apr 12, 2015

Most other stuff got merged, so we can start addressing this. As soon as a transfer reaches 64 bytes it starts to make sense to use DMA...

@msperl
Owner

msperl commented Apr 27, 2015

Ok, I have now started on the dma implementation and have found a few things that may be of interest:

A) my basic setup code is similar to what you have posted, but one thing is strange: I can only get both the rx and tx channels on the first load of the module; a reload only gives me the first channel requested. I need to investigate why this happens, because requiring a reboot is not really acceptable... @notro: have you seen something like this?
B) mapping the data requires about 6us for a single spi_transfer and another 2-3us for the unmapping, so there is some expensive overhead that should maybe be hidden behind spi transfers that run in parallel. But that would potentially require a change in the spi framework to make it possible... So this means that is_dma_mapped probably still gives an advantage.

@notro
Author

notro commented Apr 27, 2015

I can only get both the rx and tx channels on the first load of the module; a reload only gives me the first channel requested.

I haven't done things like this except for the patch at the start of this issue, so no I haven't seen it.

So this means that is_dma_mapped probably still gives an advantage.

Yes, in the case where the client has already mapped it.
But spidev, for instance, would jump up and down for joy if the SPI core would map its buffers :-)

Is it possible to support both cases?

can_dma() checks the flag and returns false if is_dma_mapped.
The driver checks the flag and does a DMA transfer if set.
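
A user-space model of that split, just to make the two checks explicit. Nothing here is the actual driver: `xfer_model`, `model_can_dma` and `model_use_dma` are hypothetical, the flag is modelled per transfer for simplicity (in the kernel it sits on the message, as far as I recall), and the 64 byte minimum is taken from the remark earlier in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DMA_MIN_LEN 64u   /* assumption: below this, DMA setup is not worth it */

struct xfer_model {
    size_t len;
    bool is_dma_mapped;   /* client already mapped the buffers */
};

/* Models can_dma(): should the SPI core map the buffers for us? */
bool model_can_dma(const struct xfer_model *x)
{
    assert(x);
    if (x->is_dma_mapped)
        return false;             /* already mapped, core must not map again */
    return x->len >= DMA_MIN_LEN;
}

/* Models the driver's own decision whether to run the transfer via DMA. */
bool model_use_dma(const struct xfer_model *x)
{
    assert(x);
    return x->is_dma_mapped || model_can_dma(x);
}
```

The point of the split is that a premapped client buffer still goes out via DMA, while the core only spends time mapping buffers that are long enough to benefit.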

@notro
Author

notro commented Apr 27, 2015

If/when you have tested my dmaengine patch, it would be great if you replied with a Tested-by.
It's been 10 days since I sent it and no reply yet.

@msperl
Owner

msperl commented Apr 28, 2015

Had been busy elsewhere - now I have hit an issue during registration.
As soon as I get something where I can do DMA I will sign off on it.

In principle it is possible to support both cases - for the is_dma_mapped you only have to create the corresponding scatter/gather list with that premapped entry...

As for spidev: the biggest thing here is that spidev still does a copy to/from userspace into a bounce-buffer of (by default) 4096 bytes. So it would be better served by adding is_dma_mapped to that code instead...

@msperl
Owner

msperl commented Apr 28, 2015

seems as if that issue is resolved - now continuing to work on the DMA implementation itself.

@notro
Author

notro commented May 12, 2015

I didn't look into the why.
There is one display that runs at 128MHz: https://github.com/notro/fbtft/wiki/Performance#mz61581-pi-ext
I couldn't get it to work yesterday, but I didn't spend any time to check that the display is still working with the fbtft kernel.
So I tried the rpi-display at 64MHz instead, which it could handle.

You can run at any speed with nothing connected to test the software side.

How do I get those statistics to dmesg?

$ echo "32" | sudo tee /sys/class/graphics/fb1/debug

Ref: https://github.com/notro/fbtft/wiki/Debug

@msperl
Owner

msperl commented May 13, 2015

As for the time taken - I thought: first get the basics right (which lots of people will be using) before we continue with the DMA portion.
On top of that, the tiny patches hopefully gave Mark a way to get to know my coding and to accept bigger patches more easily after this "getting to know each other" phase...

@msperl
Owner

msperl commented May 13, 2015

quick test:

48MHz:

[   66.307823] graphics fb1: fb_st7735r frame buffer, 128x160, 40 KiB video memory, 32 KiB DMA buffer memory, fps=20, spi32766.4 at 48 MHz
[  105.478838] fb_st7735r spi32766.4: Display update: 3684 kB/s (10.855 ms), fps=0 (0.000 ms)
[  117.199022] fb_st7735r spi32766.4: Display update: 3670 kB/s (10.896 ms), fps=0 (11720.039 ms)
[  117.275178] fb_st7735r spi32766.4: Display update: 2348 kB/s (17.028 ms), fps=14 (70.022 ms)
[  117.356701] fb_st7735r spi32766.4: Display update: 2156 kB/s (18.549 ms), fps=12 (80.001 ms)
[  117.435060] fb_st7735r spi32766.4: Display update: 2371 kB/s (16.865 ms), fps=12 (80.042 ms)
[  117.515214] fb_st7735r spi32766.4: Display update: 2338 kB/s (17.096 ms), fps=12 (79.923 ms)
[  117.595655] fb_st7735r spi32766.4: Display update: 2283 kB/s (17.519 ms), fps=12 (80.001 ms)

64MHz:

[   95.278095] graphics fb1: fb_st7735r frame buffer, 128x160, 40 KiB video memory, 32 KiB DMA buffer memory, fps=20, spi32766.4 at 64 MHz
[  281.408188] fb_st7735r spi32766.4: Display update: 4656 kB/s (8.589 ms), fps=0 (0.000 ms)
[  327.948564] fb_st7735r spi32766.4: Display update: 4660 kB/s (8.582 ms), fps=0 (46539.985 ms)
[  328.008728] fb_st7735r spi32766.4: Display update: 4588 kB/s (8.716 ms), fps=16 (60.028 ms)
[  328.088729] fb_st7735r spi32766.4: Display update: 4587 kB/s (8.717 ms), fps=12 (79.999 ms)
[  328.168718] fb_st7735r spi32766.4: Display update: 4594 kB/s (8.705 ms), fps=12 (79.999 ms)
[  328.248745] fb_st7735r spi32766.4: Display update: 4587 kB/s (8.717 ms), fps=12 (80.014 ms)

At 64MHz no image showed on the display (but it might be too fast for that display).

In both cases I first ran fbi to display an image.

So it seems to me as if the overhead for DMA becomes expensive!
So the best you can IMO do right now is:

  • create bigger dma-buffers
  • if they are above 64kB, split the message into multiple transfers of <65536 bytes.
    This avoids some of the round trips.
    If necessary we can still look into how we could enable transfers >65535 bytes with dma - but that comes on a per-need basis.
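
Splitting a large buffer into transfers below the 64k limit could be sketched like this. `split_transfer` and `MAX_DMA_CHUNK` are illustrative names only; a real client would probably also want to keep chunk sizes even so that a 16-bit pixel is never split across two transfers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DMA_CHUNK 65535u   /* largest single transfer, i.e. <65536 bytes */

/*
 * Split total bytes into chunks of at most MAX_DMA_CHUNK bytes.
 * Stores up to max_chunks chunk lengths in lens and returns the number
 * of chunks that are needed in total.
 */
size_t split_transfer(size_t total, size_t *lens, size_t max_chunks)
{
    size_t n = 0;

    assert(lens || max_chunks == 0);
    while (total > 0) {
        size_t chunk = total < MAX_DMA_CHUNK ? total : MAX_DMA_CHUNK;

        if (n < max_chunks)
            lens[n] = chunk;
        n++;
        total -= chunk;
    }
    return n;
}
```

All chunks would then be queued as transfers of a single SPI message, so the round trip through the SPI core is paid once per frame instead of once per chunk.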

For now I want to remove that memory remapping overhead and report spi statistics via sysfs...

l1k added a commit to RevolutionPi/linux that referenced this issue May 11, 2019
The BCM2835 DMA controller is capable of synthesizing zeroes instead of
copying them from a source address. The feature is enabled by setting
the SRC_IGNORE bit in the Transfer Information field of a Control Block:

"1 = Do not perform source reads.
     In addition, destination writes will zero all the write strobes.
     This is used for fast cache fill operations."
https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf

The feature is only available on 8 of the 16 channels. The others are
so-called "lite" channels with a limited feature set and performance.

Make use of the feature if a cyclic transaction copies from the zero
page. This reduces traffic on the memory bus.

A forthcoming use case is the BCM2835 SPI driver, which will cyclically
copy from the zero page to the TX FIFO. The idea to use SRC_IGNORE was
taken from an ancient GitHub conversation between Martin and Noralf:
msperl/spi-bcm2835#13 (comment)

Signed-off-by: Lukas Wunner <[email protected]>
Cc: Frank Pavlic <[email protected]>
Cc: Martin Sperl <[email protected]>
Cc: Noralf Trønnes <[email protected]>
Cc: Florian Kauer <[email protected]>
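As a rough illustration of what the commit message describes (not the actual patch), the sketch below sets the SRC_IGNORE bit in a Control Block when the source is known to be all zeroes and the channel is a full, non-lite one. The bit position (bit 11 of Transfer Information) and the eight-word Control Block layout follow the BCM2835 ARM Peripherals datasheet; `struct dma_cb`, `TI_SRC_IGNORE` and `cb_set_source()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Transfer Information bit 11: do not perform source reads. */
#define TI_SRC_IGNORE (UINT32_C(1) << 11)

/* Control Block as laid out in the datasheet: eight 32-bit words. */
struct dma_cb {
	uint32_t ti;          /* Transfer Information */
	uint32_t source_ad;   /* source bus address */
	uint32_t dest_ad;     /* destination bus address */
	uint32_t txfr_len;    /* transfer length in bytes */
	uint32_t stride;      /* 2D mode stride */
	uint32_t nextconbk;   /* bus address of next Control Block */
	uint32_t reserved[2];
};

/*
 * Record the source address and decide whether reads can be skipped.
 * Lite channels lack the feature, so they always read the source.
 */
static void cb_set_source(struct dma_cb *cb, uint32_t src_bus_addr,
			  bool src_is_zero_page, bool chan_is_lite)
{
	cb->source_ad = src_bus_addr;

	if (src_is_zero_page && !chan_is_lite)
		cb->ti |= TI_SRC_IGNORE;   /* controller synthesizes zeroes */
	else
		cb->ti &= ~TI_SRC_IGNORE;  /* normal copy from source */
}
```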
l1k added a commit to l1k/linux that referenced this issue May 11, 2019
l1k added a commit to RevolutionPi/linux that referenced this issue May 11, 2019
l1k added a commit to l1k/linux that referenced this issue May 13, 2019
The BCM2835 DMA controller is capable of synthesizing zeroes instead of
copying them from a source address. The feature is enabled by setting
the SRC_IGNORE bit in the Transfer Information field of a Control Block:

"Do not perform source reads.
 In addition, destination writes will zero all the write strobes.
 This is used for fast cache fill operations."
https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf

The feature is only available on 8 of the 16 channels. The others are
so-called "lite" channels with a limited feature set and performance.

Make use of the feature if a cyclic transaction copies from the zero
page. This reduces traffic on the memory bus.

A forthcoming use case is the BCM2835 SPI driver, which will cyclically
copy from the zero page to the TX FIFO. The idea to use SRC_IGNORE was
taken from an ancient GitHub conversation between Martin and Noralf:
msperl/spi-bcm2835#13 (comment)

Signed-off-by: Lukas Wunner <[email protected]>
Cc: Frank Pavlic <[email protected]>
Cc: Martin Sperl <[email protected]>
Cc: Noralf Trønnes <[email protected]>
Cc: Florian Kauer <[email protected]>
l1k added a commit to RevolutionPi/linux that referenced this issue May 22, 2019
l1k added a commit to l1k/linux that referenced this issue Jun 10, 2019
l1k added a commit to l1k/linux that referenced this issue Jun 12, 2019
The BCM2835 DMA controller is capable of synthesizing zeroes instead of
copying them from a source address. The feature is enabled by setting
the SRC_IGNORE bit in the Transfer Information field of a Control Block:

"Do not perform source reads.
 In addition, destination writes will zero all the write strobes.
 This is used for fast cache fill operations."
https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf

The feature is only available on 8 of the 16 channels. The others are
so-called "lite" channels with a limited feature set and performance.

Make use of the feature if a cyclic transaction copies from the zero
page. This reduces traffic on the memory bus.

A forthcoming use case is the BCM2835 SPI driver, which will cyclically
copy from the zero page to the TX FIFO. The idea to use SRC_IGNORE was
taken from an ancient GitHub conversation between Martin and Noralf:
msperl/spi-bcm2835#13 (comment)

Signed-off-by: Lukas Wunner <[email protected]>
Cc: Martin Sperl <[email protected]>
Cc: Noralf Trønnes <[email protected]>
Cc: Florian Kauer <[email protected]>
l1k added a commit to RevolutionPi/linux that referenced this issue Jun 16, 2019
l1k added a commit to l1k/linux that referenced this issue Jun 16, 2019
l1k added a commit to l1k/linux that referenced this issue Jun 27, 2019
l1k added a commit to l1k/linux that referenced this issue Jun 27, 2019
l1k added a commit to l1k/linux that referenced this issue Jun 28, 2019
The BCM2835 DMA controller is capable of synthesizing zeroes instead of
copying them from a source address. The feature is enabled by setting
the SRC_IGNORE bit in the Transfer Information field of a Control Block:

"Do not perform source reads.
 In addition, destination writes will zero all the write strobes.
 This is used for fast cache fill operations."
https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf

The feature is only available on 8 of the 16 channels. The others are
so-called "lite" channels with a limited feature set and performance.

Enable the feature if a cyclic transaction copies from the zero page.
This reduces traffic on the memory bus.

A forthcoming use case is the BCM2835 SPI driver, which will cyclically
copy from the zero page to the TX FIFO. The idea to use SRC_IGNORE was
taken from an ancient GitHub conversation between Martin and Noralf:
msperl/spi-bcm2835#13 (comment)

Signed-off-by: Lukas Wunner <[email protected]>
Cc: Martin Sperl <[email protected]>
Cc: Noralf Trønnes <[email protected]>
Cc: Florian Kauer <[email protected]>
l1k added a commit to l1k/linux that referenced this issue Jun 28, 2019
l1k added a commit to l1k/linux that referenced this issue Jun 28, 2019
l1k added a commit to l1k/linux that referenced this issue Jul 3, 2019
l1k added a commit to RevolutionPi/linux that referenced this issue Jul 3, 2019
l1k added a commit to l1k/linux that referenced this issue Aug 3, 2019
The BCM2835 DMA controller is capable of synthesizing zeroes instead of
copying them from a source address. The feature is enabled by setting
the SRC_IGNORE bit in the Transfer Information field of a Control Block:

"Do not perform source reads.
 In addition, destination writes will zero all the write strobes.
 This is used for fast cache fill operations."
https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf

The feature is only available on 8 of the 16 channels. The others are
so-called "lite" channels with a limited feature set and performance.

Enable the feature if a cyclic transaction copies from the zero page.
This reduces traffic on the memory bus.

A forthcoming use case is the BCM2835 SPI driver, which will cyclically
copy from the zero page to the TX FIFO. The idea to use SRC_IGNORE was
taken from an ancient GitHub conversation between Martin and Noralf:
msperl/spi-bcm2835#13 (comment)

Tested-by: Nuno Sá <[email protected]>
Signed-off-by: Lukas Wunner <[email protected]>
Cc: Martin Sperl <[email protected]>
Cc: Noralf Trønnes <[email protected]>
Cc: Florian Kauer <[email protected]>
l1k added a commit to l1k/linux that referenced this issue Aug 24, 2019
The BCM2835 DMA controller is capable of synthesizing zeroes instead of
copying them from a source address. The feature is enabled by setting
the SRC_IGNORE bit in the Transfer Information field of a Control Block:

"Do not perform source reads.
 In addition, destination writes will zero all the write strobes.
 This is used for fast cache fill operations."
https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf

The feature is only available on 8 of the 16 channels. The others are
so-called "lite" channels with a limited feature set and performance.

Enable the feature if a cyclic transaction copies from the zero page.
This reduces traffic on the memory bus.

A forthcoming use case is the BCM2835 SPI driver, which will cyclically
copy from the zero page to the TX FIFO. The idea to use SRC_IGNORE was
taken from an ancient GitHub conversation between Martin and Noralf:
msperl/spi-bcm2835#13 (comment)

Tested-by: Nuno Sá <[email protected]>
Tested-by: Noralf Trønnes <[email protected]>
Signed-off-by: Lukas Wunner <[email protected]>
Acked-by: Vinod Koul <[email protected]>
Acked-by: Stefan Wahren <[email protected]>
Acked-by: Martin Sperl <[email protected]>
Cc: Florian Kauer <[email protected]>
l1k added a commit to RevolutionPi/linux that referenced this issue Aug 28, 2019
l1k added a commit to l1k/linux that referenced this issue Sep 10, 2019
torvalds pushed a commit to torvalds/linux that referenced this issue Sep 16, 2019
The BCM2835 DMA controller is capable of synthesizing zeroes instead of
copying them from a source address. The feature is enabled by setting
the SRC_IGNORE bit in the Transfer Information field of a Control Block:

"Do not perform source reads.
 In addition, destination writes will zero all the write strobes.
 This is used for fast cache fill operations."
https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf

The feature is only available on 8 of the 16 channels. The others are
so-called "lite" channels with a limited feature set and performance.

Enable the feature if a cyclic transaction copies from the zero page.
This reduces traffic on the memory bus.

A forthcoming use case is the BCM2835 SPI driver, which will cyclically
copy from the zero page to the TX FIFO. The idea to use SRC_IGNORE was
taken from an ancient GitHub conversation between Martin and Noralf:
msperl/spi-bcm2835#13 (comment)

Tested-by: Nuno Sá <[email protected]>
Tested-by: Noralf Trønnes <[email protected]>
Signed-off-by: Lukas Wunner <[email protected]>
Acked-by: Vinod Koul <[email protected]>
Acked-by: Stefan Wahren <[email protected]>
Acked-by: Martin Sperl <[email protected]>
Cc: Florian Kauer <[email protected]>
Link: https://lore.kernel.org/r/b2286c904408745192e4beb3de3c88f73e4a7210.1568187525.git.lukas@wunner.de
Signed-off-by: Mark Brown <[email protected]>
l1k added a commit to RevolutionPi/linux that referenced this issue Oct 3, 2019
starnight pushed a commit to endlessm/linux that referenced this issue Nov 13, 2019
l1k added a commit to RevolutionPi/linux that referenced this issue Dec 15, 2019
l1k added a commit to RevolutionPi/linux that referenced this issue Apr 8, 2020
l1k added a commit to RevolutionPi/linux that referenced this issue Apr 9, 2020