mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty 2-CH CAN FD HAT Rev2.1 #5083

Open
DavidBoJ opened this issue Jul 4, 2022 · 29 comments

Comments

@DavidBoJ

DavidBoJ commented Jul 4, 2022

Describe the bug

In my application the /var/log/syslog is filled up with:
mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty
After a while the disk is full and the system can crash. Is there any way I can disable the logging from the CAN bus?

I have two slaves on my network to which it is difficult to establish a connection. With two other slaves it seems to work (but I didn't check the logs).
I have controlled the two slaves with Pican2 in the past, but not with Bullseye.
The physical network is 1.2 m long.
Any suggestions @marckleinebudde https://github.com/marckleinebudde ?

Steps to reproduce the behaviour

It is difficult to reproduce the exact same result every time.
The setup has 2 slaves: Nanotec motor drivers [CL4-E-2-12-5VDI] with node IDs 1 and 2.
An application initializes these with an SDO.
The configuration of the HAT is shown under the system description.

Device (s)

Raspberry Pi 4 Mod. B

System

2-CH CAN FD HAT Rev2.1

ip -d link show dev can0
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
link/can promiscuity 0 minmtu 0 maxmtu 0
can state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
bitrate 250000 sample-point 0.875
tq 25 prop-seg 69 phase-seg1 70 phase-seg2 20 sjw 1
mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
clock 40000000 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
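
The bit-timing values in the `ip -d link show` output above can be cross-checked by hand. This is a minimal sketch, assuming the usual SocketCAN convention that one bit consists of a 1-tq sync segment followed by prop-seg, phase-seg1 and phase-seg2; the function name `bit_timing` is made up for illustration.

```python
def bit_timing(tq_ns, prop_seg, phase_seg1, phase_seg2, sync_seg=1):
    """Return (bitrate in bit/s, sample point) from SocketCAN-style timing values."""
    # Total number of time quanta in one bit, including the sync segment.
    tq_per_bit = sync_seg + prop_seg + phase_seg1 + phase_seg2
    bit_time_ns = tq_per_bit * tq_ns
    bitrate = 1_000_000_000 / bit_time_ns
    # The sample point sits at the end of phase-seg1.
    sample_point = (sync_seg + prop_seg + phase_seg1) / tq_per_bit
    return bitrate, sample_point


# Values taken from the can0 output above.
bitrate, sample_point = bit_timing(tq_ns=25, prop_seg=69, phase_seg1=70, phase_seg2=20)
print(bitrate, sample_point)  # 250000.0 0.875
```

With 160 time quanta of 25 ns each, the bit time is 4000 ns, which agrees with the reported bitrate of 250000 and sample point of 0.875.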

uname -a
Linux cilix-19 5.15.32-v7l+ #1538 SMP Thu Mar 31 19:39:41 BST 2022 armv7l GNU/Linux
On a Raspberry Pi4

pi@cilix-19:~ $ cat /etc/rpi-issue
Raspberry Pi reference 2022-04-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 226b479f8d32919c9fe36dd5b4c20c02682f8180, stage2
pi@cilix-19:~ $ vcgencmd version
Mar 24 2022 13:19:26
Copyright (c) 2012 Broadcom
version e5a963efa66a1974127860b42e913d2374139ff5 (clean) (release) (start)

Logs

No response

Additional context

No response

@marckleinebudde
Contributor

How often do you see this event?

mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty

The driver throws this message if the chip doesn't behave as the driver expects. It's unclear if this is a bug in the driver or in the chip. It doesn't happen that often (during my testing), the driver recovers, and I haven't had time to debug this issue. Can you describe your use case? Maybe that sheds some light on the problem.

You can make the driver silent by changing the netdev_info() to a netdev_dbg():

--- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-tef.c
+++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-tef.c
@@ -72,7 +72,7 @@ mcp251xfd_handle_tefif_recover(const struct mcp251xfd_priv *priv, const u32 seq)
                return -ENOBUFS;
        }
 
-       netdev_info(priv->ndev,
+       netdev_dbg(priv->ndev,
                    "Transmit Event FIFO buffer %s. (seq=0x%08x, tef_tail=0x%08x, tef_head=0x%08x, tx_head=0x%08x).\n",
                    tef_sta & MCP251XFD_REG_TEFSTA_TEFFIF ?
                    "full" : tef_sta & MCP251XFD_REG_TEFSTA_TEFNEIF ?

@DavidBoJ
Author

DavidBoJ commented Jul 5, 2022

I have problems as soon as I connect two slaves (CANopen devices not supporting FD) to my network with node IDs 1 and 2.
I think it works with only one slave. I also had two other slaves which worked reasonably stable (However I did not do any longtime tests).
In other words, a specially crafted network seems to cause the error. I do not exclude that one of the slaves is faulty.
My (CODESYS) application tries to initialize the slaves by sending/receiving a SDO and when that fails it tries again and again and it very rarely gets over the initialization. Maybe CODESYS does not access the driver properly or maybe somehow bypasses it?
It is the first time I use Bullseye and this 2-CH CAN FD HAT.
I have two identical CAN FD HATs, and I have problems with both. If they are defective, then either a production batch error has caused it, or the faulty network has damaged the chip or corrupted the driver.
Tomorrow, I will set up a python test between ch0 and ch1 and see if the driver still is valid and I will try to switch it on and off several times. By the way, I didn't follow the waveshare instruction to install the bcm2835 library since the bcm2835 library already is part of Bullseye.

@marckleinebudde
Contributor

The bcm2835 library is not needed by the kernel driver for the mcp251xfd.

How often do you get the mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty events?

My (CODESYS) application tries to initialize the slaves by sending/receiving a SDO and when that fails it tries again and again and it very rarely gets over the initialization.

What exactly happens when the init fails? Is there a timeout? Does the bus go into bus-off? Can you send me the output of candump -l any,0~0,#FFFFFFFF when the application fails?

Maybe CODESYS does not access the driver properly or maybe somehow bypasses it?

Do you know if CODESYS uses the regular can0 network interface?

@DavidBoJ
Author

DavidBoJ commented Jul 6, 2022

The first thing I did was to simplify my CODESYS application so that only SDO initialization takes place, and PDO rx/tx is only possible after the initialization. All other code has been removed.
Next I stopped the application before the flash got full. And a closer look in /var/log/syslog gave the following:

Jul 6 13:29:13 cilix-19 kernel: [ 465.249474] Disabling IRQ #82
Jul 6 13:30:57 cilix-19 kernel: [ 569.069795] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.070135] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 6 13:30:57 cilix-19 kernel: [ 569.179548] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.289559] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.399558] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.509334] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.619364] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.729539] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.839605] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 569.949563] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.059586] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.169766] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.279505] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.389495] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.499618] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.609629] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.719642] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.829666] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 570.939639] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.049636] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.159558] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.269237] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.379625] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.489790] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.599422] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.719633] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.829616] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 571.939431] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 572.049609] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 572.159617] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 572.269666] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 572.379428] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 572.489630] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 572.599608] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:02 cilix-19 kernel: [ 573.929815] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:02 cilix-19 kernel: [ 574.039658] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:02 cilix-19 kernel: [ 574.149618] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:02 cilix-19 kernel: [ 574.259624] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:03 cilix-19 kernel: [ 575.479874] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:03 cilix-19 kernel: [ 575.589693] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:03 cilix-19 kernel: [ 575.699641] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:03 cilix-19 kernel: [ 575.809891] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:05 cilix-19 kernel: [ 577.238413] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 6 13:31:06 cilix-19 kernel: [ 578.139924] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:06 cilix-19 kernel: [ 578.249926] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:06 cilix-19 kernel: [ 578.468439] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.468677] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 6 13:31:06 cilix-19 kernel: [ 578.469061] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.469330] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.469608] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.469876] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.470142] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).

From here on it goes fast and the flash is filled up. It seems
Jul 6 13:30:57 cilix-19 kernel: [ 569.069795] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
creates a domino effect.
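
To see how the messages are distributed before the flash fills up, the syslog excerpt above can be summarized per message type. This is only a sketch: the regular expression is written against the line format shown in this excerpt, and the helper name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Matches kernel lines from the mcp251xfd driver as they appear in the excerpt above.
LINE_RE = re.compile(
    r"kernel: \[\s*(?P<ts>\d+\.\d+)\] mcp251xfd \S+ (?P<dev>\S+): (?P<msg>.*)$"
)


def summarize(lines):
    """Return (count per message, kernel timestamp of first occurrence per message)."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not an mcp251xfd message, e.g. "Disabling IRQ #82"
        # Drop the variable "(seq=..., ...)" suffix so repeated messages group together.
        msg = re.sub(r"\s*\(seq=.*$", "", m.group("msg")).rstrip(".")
        counts[msg] += 1
        first_seen.setdefault(msg, float(m.group("ts")))
    return counts, first_seen


# A few lines copied from the log above.
sample = [
    "Jul 6 13:29:13 cilix-19 kernel: [ 465.249474] Disabling IRQ #82",
    "Jul 6 13:30:57 cilix-19 kernel: [ 569.069795] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.",
    "Jul 6 13:30:57 cilix-19 kernel: [ 569.070135] mcp251xfd spi0.0 can0: CRC write command format error.",
    "Jul 6 13:30:57 cilix-19 kernel: [ 569.179548] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.",
    "Jul 6 13:31:06 cilix-19 kernel: [ 578.468439] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).",
]

counts, first_seen = summarize(sample)
for msg, n in counts.most_common():
    print(f"{n:6d}  first at {first_seen[msg]:.6f}s  {msg}")
```

Running it over the full /var/log/syslog (for example with `summarize(open("/var/log/syslog"))`) would show which message type dominates and when each one first appeared.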

When I do a warm reset in CODESYS (it clears all variables and stops the application) I get
Message from syslogd@cilix-19 at Jul 6 13:21:54 ...
kernel:[ 25.987276] Disabling IRQ #82

Is that acceptable?
I have.
cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
26: 0 0 0 0 GICv2 29 Level arch_timer
27: 201651 123206 44138 7212 GICv2 30 Level arch_timer
30: 0 0 0 0 GICv2 107 Level fe004000.txp
31: 440 0 0 0 GICv2 65 Level fe00b880.mailbox
34: 6770 0 0 0 GICv2 153 Level uart-pl011
35: 0 0 0 0 GICv2 150 Level fe204000.spi
36: 0 0 0 0 GICv2 125 Level fe215080.spi
37: 0 0 0 0 GICv2 129 Level vc4 hvs
40: 342 0 0 0 GICv2 114 Level DMA IRQ
42: 8 0 0 0 GICv2 116 Level DMA IRQ
43: 0 0 0 0 GICv2 117 Level DMA IRQ
44: 0 0 0 0 GICv2 118 Level DMA IRQ
45: 0 0 0 0 GICv2 119 Level DMA IRQ
47: 0 0 0 0 GICv2 141 Level vc4 crtc
48: 0 0 0 0 GICv2 142 Level vc4 crtc, vc4 crtc
49: 0 0 0 0 GICv2 133 Level vc4 crtc
50: 0 0 0 0 GICv2 138 Level vc4 crtc
51: 0 0 0 0 interrupt-controller@7ef00100 0 Edge vc4 hdmi cec tx
52: 0 0 0 0 interrupt-controller@7ef00100 1 Edge vc4 hdmi cec rx
55: 0 0 0 0 interrupt-controller@7ef00100 4 Edge vc4 hdmi hpd connected
56: 0 0 0 0 interrupt-controller@7ef00100 5 Edge vc4 hdmi hpd disconnected
57: 0 0 0 0 interrupt-controller@7ef00100 8 Edge vc4 hdmi cec tx
58: 0 0 0 0 interrupt-controller@7ef00100 7 Edge vc4 hdmi cec rx
61: 0 0 0 0 interrupt-controller@7ef00100 10 Edge vc4 hdmi hpd connected
62: 0 0 0 0 interrupt-controller@7ef00100 11 Edge vc4 hdmi hpd disconnected
63: 73 0 0 0 GICv2 66 Level VCHIQ doorbell
64: 11201 0 0 0 GICv2 158 Level mmc1, mmc0
65: 0 0 0 0 GICv2 48 Level arm-pmu
66: 0 0 0 0 GICv2 49 Level arm-pmu
67: 0 0 0 0 GICv2 50 Level arm-pmu
68: 0 0 0 0 GICv2 51 Level arm-pmu
71: 843 0 0 0 GICv2 189 Level eth0
72: 31 0 0 0 GICv2 190 Level eth0
78: 0 0 0 0 GICv2 106 Level v3d
79: 0 0 0 0 GICv2 175 Level PCIe PME
80: 38 0 0 0 BRCM STB PCIe MSI 524288 Edge xhci_hcd
82: 100001 0 0 0 pinctrl-bcm2835 25 Level spi0.0
IPI0: 0 0 0 0 CPU wakeup interrupts
IPI1: 0 0 0 0 Timer broadcast interrupts
IPI2: 174 158 197 164 Rescheduling interrupts
IPI3: 3947 122297 217123 215392 Function call interrupts
IPI4: 0 0 0 0 CPU stop interrupts
IPI5: 726 135 186 132 IRQ work interrupts
IPI6: 0 0 0 0 completion interrupts
Err: 0

You requested the result of "candump -l any,0~0,#FFFFFFFF" here it is:
less candump-2022-07-06_163011.log
(1657121412.229819) can0 20000004#0008000000007F00
(1657121414.649739) can0 20000004#0040000000005F00
(1657121417.292012) can0 20000004#0001000000000000
(1657121417.401347) can0 20000004#0001000000000000
(1657121418.722186) can0 20000004#0001000000000000
(1657121418.831834) can0 20000004#0001000000000000
(1657121418.941893) can0 20000004#0001000000000000
(1657121419.051251) can0 20000004#0001000000000000
(1657121419.161551) can0 20000004#0001000000000000
(1657121419.271232) can0 20000004#0001000000000000
(1657121419.381527) can0 20000004#0001000000000000
(1657121419.491764) can0 20000004#0001000000000000
(1657121419.601958) can0 20000004#0001000000000000
(1657121419.711429) can0 20000004#0001000000000000

I still think that a specially crafted network creates the fault, and I do not exclude that my Nanotec motor driver is faulty.

@marckleinebudde
Contributor

marckleinebudde commented Jul 7, 2022

Jul 6 13:29:13 cilix-19 kernel: [ 465.249474] Disabling IRQ #82
Jul 6 13:30:57 cilix-19 kernel: [ 569.069795] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.070135] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 6 13:30:57 cilix-19 kernel: [ 569.179548] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.

[...]

Jul 6 13:31:03 cilix-19 kernel: [ 575.809891] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:05 cilix-19 kernel: [ 577.238413] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 6 13:31:06 cilix-19 kernel: [ 578.139924] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:06 cilix-19 kernel: [ 578.249926] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:06 cilix-19 kernel: [ 578.468439] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.468677] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 6 13:31:06 cilix-19 kernel: [ 578.469061] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.469330] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.469608] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.469876] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.470142] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).

From here on it goes fast and the flash is filled up. It seems
Jul 6 13:30:57 cilix-19 kernel: [ 569.069795] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
creates a domino effect.

Yes, that's at least the first message from the driver itself. But the message directly before this is more important:

Jul 6 13:29:13 cilix-19 kernel: [ 465.249474] Disabling IRQ #82

From /proc/interrupts we see that IRQ 82 is...

82: 100001 0 0 0 pinctrl-bcm2835 25 Level spi0.0

...the interrupt line between the MCP2518FD chip and the raspi. That's not good.
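
The lookup from IRQ number to device can also be scripted. This is a minimal sketch that assumes the /proc/interrupts layout shown in this thread (a header row naming the CPUs, then one row per interrupt); `find_irq` is a hypothetical helper name.

```python
def find_irq(text, irq):
    """Return (per-CPU counts, description) for one IRQ number, or None if absent."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == f"{irq}:":
            counts = [int(c) for c in fields[1:1 + ncpu]]
            return counts, " ".join(fields[1 + ncpu:])
    return None


# Excerpt copied from the /proc/interrupts output in this thread.
sample = """CPU0 CPU1 CPU2 CPU3
80: 38 0 0 0 BRCM STB PCIe MSI 524288 Edge xhci_hcd
82: 100001 0 0 0 pinctrl-bcm2835 25 Level spi0.0
IPI0: 0 0 0 0 CPU wakeup interrupts"""

print(find_irq(sample, 82))
```

On a live system the same function could be fed with `open("/proc/interrupts").read()`.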

  • Are there more error messages directly before the Disabling IRQ #82?
  • Which hardware are you exactly using?
  • What's your config.txt entry?
  • In your use case, is can1 configured and up and running, too?

When I do a warm reset in CODESYS (it clears all variables and stops the application) I get
Message from syslogd@cilix-19 at Jul 6 13:21:54 ...

kernel:[ 25.987276] Disabling IRQ #82

Is that acceptable?

No, see above.

I have.

cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
26: 0 0 0 0 GICv2 29 Level arch_timer
27: 201651 123206 44138 7212 GICv2 30 Level arch_timer
30: 0 0 0 0 GICv2 107 Level fe004000.txp
31: 440 0 0 0 GICv2 65 Level fe00b880.mailbox
34: 6770 0 0 0 GICv2 153 Level uart-pl011
35: 0 0 0 0 GICv2 150 Level fe204000.spi
36: 0 0 0 0 GICv2 125 Level fe215080.spi
37: 0 0 0 0 GICv2 129 Level vc4 hvs
40: 342 0 0 0 GICv2 114 Level DMA IRQ
42: 8 0 0 0 GICv2 116 Level DMA IRQ
43: 0 0 0 0 GICv2 117 Level DMA IRQ
44: 0 0 0 0 GICv2 118 Level DMA IRQ
45: 0 0 0 0 GICv2 119 Level DMA IRQ
47: 0 0 0 0 GICv2 141 Level vc4 crtc
48: 0 0 0 0 GICv2 142 Level vc4 crtc, vc4 crtc
49: 0 0 0 0 GICv2 133 Level vc4 crtc
50: 0 0 0 0 GICv2 138 Level vc4 crtc
51: 0 0 0 0 interrupt-controller@7ef00100 0 Edge vc4 hdmi cec tx
52: 0 0 0 0 interrupt-controller@7ef00100 1 Edge vc4 hdmi cec rx
55: 0 0 0 0 interrupt-controller@7ef00100 4 Edge vc4 hdmi hpd connected
56: 0 0 0 0 interrupt-controller@7ef00100 5 Edge vc4 hdmi hpd disconnected
57: 0 0 0 0 interrupt-controller@7ef00100 8 Edge vc4 hdmi cec tx
58: 0 0 0 0 interrupt-controller@7ef00100 7 Edge vc4 hdmi cec rx
61: 0 0 0 0 interrupt-controller@7ef00100 10 Edge vc4 hdmi hpd connected
62: 0 0 0 0 interrupt-controller@7ef00100 11 Edge vc4 hdmi hpd disconnected
63: 73 0 0 0 GICv2 66 Level VCHIQ doorbell
64: 11201 0 0 0 GICv2 158 Level mmc1, mmc0
65: 0 0 0 0 GICv2 48 Level arm-pmu
66: 0 0 0 0 GICv2 49 Level arm-pmu
67: 0 0 0 0 GICv2 50 Level arm-pmu
68: 0 0 0 0 GICv2 51 Level arm-pmu
71: 843 0 0 0 GICv2 189 Level eth0
72: 31 0 0 0 GICv2 190 Level eth0
78: 0 0 0 0 GICv2 106 Level v3d
79: 0 0 0 0 GICv2 175 Level PCIe PME
80: 38 0 0 0 BRCM STB PCIe MSI 524288 Edge xhci_hcd
82: 100001 0 0 0 pinctrl-bcm2835 25 Level spi0.0
IPI0: 0 0 0 0 CPU wakeup interrupts
IPI1: 0 0 0 0 Timer broadcast interrupts
IPI2: 174 158 197 164 Rescheduling interrupts
IPI3: 3947 122297 217123 215392 Function call interrupts
IPI4: 0 0 0 0 CPU stop interrupts
IPI5: 726 135 186 132 IRQ work interrupts
IPI6: 0 0 0 0 completion interrupts
Err: 0

You requested the result of "candump -l any,0~0,#FFFFFFFF" here it is:
less candump-2022-07-06_163011.log

(1657121412.229819) can0 20000004#0008000000007F00
(1657121414.649739) can0 20000004#0040000000005F00
(1657121417.292012) can0 20000004#0001000000000000
(1657121417.401347) can0 20000004#0001000000000000
(1657121418.722186) can0 20000004#0001000000000000
(1657121418.831834) can0 20000004#0001000000000000
(1657121418.941893) can0 20000004#0001000000000000
(1657121419.051251) can0 20000004#0001000000000000
(1657121419.161551) can0 20000004#0001000000000000
(1657121419.271232) can0 20000004#0001000000000000
(1657121419.381527) can0 20000004#0001000000000000
(1657121419.491764) can0 20000004#0001000000000000
(1657121419.601958) can0 20000004#0001000000000000
(1657121419.711429) can0 20000004#0001000000000000

Ok, there are some error messages from the controller, but I forgot to give you the command line to let candump decode the error message, sorry. Try this one instead:

candump any,0~0,#FFFFFFFF -exdtA

I still think that a specially crafted network creates the fault, and I do not exclude that my Nanotec motor driver is faulty.

There are some CRC write errors in the log:

Jul 6 13:31:06 cilix-19 kernel: [ 578.468677] mcp251xfd spi0.0 can0: CRC write command format error.

That means the SPI message from the raspi to the mcp2518fd got corrupted somehow. Is it possible that your motor driver creates EMI and destroys the SPI message? Are you using a shared power supply for the raspi and the motors?

Marc

@DavidBoJ
Author

DavidBoJ commented Jul 7, 2022

My system is very simple, as described in my first post. I have only one HAT, the CAN bus controller, and I am only using can0.
I have updated my first post so the firmware version can be seen. I am aware of the EMI problems, and the power supply to the Pi is not the same as the one to the motor drivers, and the motors are of course not yet energized.
I swear that the following worked before I installed codesys and did the CANopen initializing.
pi@cilix-19:~ $ sudo ip link set can1 up type can bitrate 250000
Cannot find device "can1"

Something must have corrupted the CANbus driver. I see the following options:

  1. My image has not been stable from the start. (Is there a way to verify the image or installed driver?)
  2. My CODESYS application, before I simplified it, corrupted the image.
  3. CODESYS has caused it
  4. My can bus network has generated faulty signals which somehow caused it
  5. The heavy logging, filling up the flash until the system crashes, combined with the above could cause it
  6. The hardware CANbus chip is faulty

Does my config.txt look alright? I have added the start of the syslog, so you can see what happens before the disabling.
config.txt
syslog.txt

candump any,0~0,#FFFFFFFF -exdtA
gave the following result:
(1657180670.772316) can0 20000004#0008000000007F00 R
(1657180671.433302) can0 20000004#0040000000005B00 R
(1657180671.881853) can0 20000004#0020000000008300 R
(1657180672.433488) can0 20000004#0008000000007A00 R
(1657180672.873334) can0 20000004#0040000000005900 R
(1657180675.634347) can0 20000004#0001000000000000 R
(1657180675.744045) can0 20000004#0001000000000000 R
(1657180675.853950) can0 20000004#0001000000000000 R
(1657180675.963906) can0 20000004#0001000000000000 R
(1657180676.074172) can0 20000004#0001000000000000 R
(1657180676.183875) can0 20000004#0001000000000000 R
(1657180676.294112) can0 20000004#0001000000000000 R
(1657180676.403747) can0 20000004#0001000000000000 R
(1657180676.513908) can0 20000004#0001000000000000 R
(1657180676.623605) can0 20000004#0001000000000000 R
(1657180676.733859) can0 20000004#0001000000000000 R
(1657180676.843880) can0 20000004#0001000000000000 R
(1657180676.954108) can0 20000004#0001000000000000 R
(1657180677.064234) can0 20000004#0001000000000000 R
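
The raw error frames above can be decoded by hand. The sketch below uses the error-class and controller-status bit values from the Linux `linux/can/error.h` header; the label strings are descriptive names chosen for this example, and reading data[6] and data[7] as TX/RX error counters is an assumption about what the driver reports in those bytes.

```python
CAN_ERR_FLAG = 0x20000000  # marks a frame as an error frame in SocketCAN
CAN_ERR_CRTL = 0x004       # error class: controller problem, details in data[1]

# Error class bits in the CAN ID.
ERR_CLASSES = {
    0x001: "tx-timeout",
    0x002: "lost-arbitration",
    0x004: "controller-problem",
    0x008: "protocol-violation",
    0x010: "transceiver-status",
    0x020: "no-ack",
    0x040: "bus-off",
    0x080: "bus-error",
    0x100: "restarted",
}

# Controller status bits in data[1].
CTRL_STATUS = {
    0x01: "rx-overflow",
    0x02: "tx-overflow",
    0x04: "rx-error-warning",
    0x08: "tx-error-warning",
    0x10: "rx-error-passive",
    0x20: "tx-error-passive",
    0x40: "back-to-error-active",
}


def decode_error_frame(entry):
    """Decode the 'ID#DATA' field of a candump log line for an error frame."""
    can_id_hex, data_hex = entry.split("#")
    can_id = int(can_id_hex, 16)
    data = bytes.fromhex(data_hex)
    if not can_id & CAN_ERR_FLAG:
        raise ValueError("not an error frame")
    classes = [name for bit, name in ERR_CLASSES.items() if can_id & bit]
    controller = []
    if can_id & CAN_ERR_CRTL:
        controller = [name for bit, name in CTRL_STATUS.items() if data[1] & bit]
    # Assumption: data[6] and data[7] hold the TX and RX error counters.
    return {"classes": classes, "controller": controller,
            "txerr": data[6], "rxerr": data[7]}


# Frames copied from the candump output above.
for frame in ("20000004#0008000000007F00",
              "20000004#0020000000008300",
              "20000004#0001000000000000"):
    print(frame, decode_error_frame(frame))
```

Under these assumptions the first frames show the controller crossing the TX error-warning and error-passive thresholds, and the repeated `0001...` frames are RX overflow notifications, which would match the "RX-0: FIFO overflow" lines in the kernel log.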

@marckleinebudde
Contributor

Can you try to disable CODESYS altogether and/or flash a new µSD card with a fresh system?

Jul  7 08:44:57 cilix-19 kernel: [   20.976582] can: controller area network core
Jul  7 08:44:57 cilix-19 kernel: [   20.976667] NET: Registered PF_CAN protocol family
Jul  7 08:44:57 cilix-19 kernel: [   20.986872] can: raw protocol
Jul  7 08:44:58 cilix-19 kernel: [   22.046365] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
Jul  7 08:44:59 cilix-19 kernel: [   22.261509] mcp251xfd spi0.0 can0: CRC read error at address 0x001c (length=4, data=00 00 00 00, CRC=0x0000) retrying.
Jul  7 08:44:59 cilix-19 kernel: [   22.261608] mcp251xfd spi0.0 can0: CRC write command format error.
Jul  7 08:44:59 cilix-19 kernel: [   22.361504] mcp251xfd spi0.0 can0: CRC read error at address 0x001c (length=4, data=00 00 00 00, CRC=0x0000) retrying.
Jul  7 08:44:59 cilix-19 kernel: [   22.361598] mcp251xfd spi0.0 can0: CRC write command format error.

From this log we see that the SPI controller doesn't read anything from the mcp2518fd controller, as the data and CRC are both 00. Please make sure that no other component touches the chip select and the MISO/MOSI pins.

@DavidBoJ
Author

DavidBoJ commented Jul 7, 2022

pi@cilix-19:~ $ uname -a
Linux cilix-19 5.15.32-v7l+ #1538 SMP Thu Mar 31 19:39:41 BST 2022 armv7l GNU/Linux

pi@cilix-19:~ $ cat /etc/rpi-issue
Raspberry Pi reference 2022-04-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 226b479f8d32919c9fe36dd5b4c20c02682f8180, stage2

pi@cilix-19:~ $ vcgencmd version
Mar 24 2022 13:19:26
Copyright (c) 2012 Broadcom
version e5a963efa66a1974127860b42e913d2374139ff5 (clean) (release) (start)

I did apt-get upgrade and sudo apt-get --with-new-pkgs upgrade and apt-get install can-utils
Here is what I get in syslog:

Jul 7 12:29:49 cilix-19 systemd[1]: Starting Permit User Sessions...
Jul 7 12:29:49 cilix-19 systemd[1]: Finished Save/Restore Sound Card State.
Jul 7 12:29:49 cilix-19 kernel: [ 7.795561] spi_master spi0: will run message pump with realtime priority
Jul 7 12:29:49 cilix-19 systemd[1]: Started /etc/rc.local Compatibility.
Jul 7 12:29:49 cilix-19 systemd[1]: Finished Permit User Sessions.
Jul 7 12:29:49 cilix-19 kernel: [ 7.817746] mcp251xfd spi0.0 can0: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized.
Jul 7 12:29:49 cilix-19 kernel: [ 7.818144] spi_master spi1: will run message pump with realtime priority
Jul 7 12:29:49 cilix-19 systemd[1]: Started User Login Management.
Jul 7 12:29:49 cilix-19 systemd[1]: Condition check resulted in Manage Sound Card State (restore and store) being skipped.
Jul 7 12:29:49 cilix-19 systemd[1]: Reached target Sound Card.
Jul 7 12:29:49 cilix-19 systemd[1]: Started Getty on tty1.
Jul 7 12:29:49 cilix-19 systemd[1]: Reached target Login Prompts.
Jul 7 12:29:49 cilix-19 systemd[1]: Starting Load/Save RF Kill Switch Status...
Jul 7 12:29:49 cilix-19 kernel: [ 7.855262] mcp251xfd spi1.0 (unnamed net_device) (uninitialized): Failed to detect MCP251xFD (osc=0x00000000).
Jul 7 12:29:49 cilix-19 kernel: [ 7.861660] brcmfmac: brcmf_cfg80211_set_power_mgmt: power save enabled

Clearly my HAT is faulty, especially for can1. That is very impressive since I have never used can1; nothing has been connected to it. I have no application using can1. Am I wrong or do you see something?

I had another device (always buy 2 when you need one) and from dmesg:
[ 6.538390] Registered IR keymap rc-cec
[ 6.561491] CAN device driver interface
[ 6.575168] spi_master spi0: will run message pump with realtime priority
[ 6.604233] rc rc0: vc4 as /devices/platform/soc/fef00700.hdmi/rc/rc0
[ 6.604836] input: vc4 as /devices/platform/soc/fef00700.hdmi/rc/rc0/input1
[ 6.645419] mcp251xfd spi0.0 can0: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized.
[ 6.646480] spi_master spi1: will run message pump with realtime priority
[ 6.795480] vc4-drm gpu: bound fe400000.hvs (ops vc4_hvs_ops [vc4])
[ 6.798587] Registered IR keymap rc-cec
[ 6.838292] rc rc0: vc4 as /devices/platform/soc/fef00700.hdmi/rc/rc0
[ 6.895713] input: vc4 as /devices/platform/soc/fef00700.hdmi/rc/rc0/input2
[ 6.945243] mcp251xfd spi1.0 can1: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized.

So it seems that the last is working but not the first. However I have one more test to do, because I am not fully convinced.

@DavidBoJ
Author

DavidBoJ commented Jul 7, 2022

I waited a while then I changed controller back to the first and now that one is also working as seen from dmesg

[ 6.714808] CAN device driver interface
[ 6.754550] spi_master spi0: will run message pump with realtime priority
[ 6.812408] random: crng init done
[ 6.812430] random: 7 urandom warning(s) missed due to ratelimiting
[ 6.881806] mcp251xfd spi0.0 can0: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized.
[ 6.893529] spi_master spi1: will run message pump with realtime priority
[ 6.924229] mcp251xfd spi1.0 can1: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized.

What we have is this:

  1. Something happens which makes the CAN controller faulty
  2. The faulty state is remembered, so no reboot or power off/on changes that state, and can1 is not registered
  3. Replace CAN controller with a new one
  4. The new system works with the new CANbus controller
  5. Switch back to the first CANbus controller
  6. The system works again with the first CANbus controller

Can you explain that?

@marckleinebudde
Contributor

marckleinebudde commented Jul 7, 2022

Do you use the same SD card? First let's get both CAN interfaces detected properly, then do some tests between can0 and can1, and finally add CODESYS.

To test between can0 and can1, connect both, make sure the bus is terminated, then:

canfdtest -v can0

and on another terminal:

canfdtest -vg can1

That should run without problems. Use Ctrl+c to abort after 1 hour or so.

Another test would be:

cansequence -rv can1

On another terminal (Edit: fixed interface name):

cangen can0 -Di -L1 -I2 -p10 -g 1

The -g parameter specifies the gap between CAN frames. You can decrease the number (e.g. to -g 0.1 or even -g 0) to increase the load. If you restart the cangen process, the receiving cansequence will print a single error message; that's OK. But there should be no other errors.

@DavidBoJ
Author

DavidBoJ commented Jul 7, 2022

I created as described a new image on a new SD card.
The first CANbus controller gave the disabling interrupt message so I concluded it indeed is faulty. Next I started the tests with the good controller. The first test with canfdtest worked fine. The next test I changed your command to (since I suppose I am not going to use a virtual network):

pi@cilix-19:~ $ cangen can0 -Di -L1 -I2 -p10 -g 1

I get the following without errors, and none on the receiving interface either:
sequence wrap around ..
I suppose everything is alright. Next, I want to see if my troublesome network kills the controller, and before I install CODESYS I just want to connect my network. My motor driver will only send BOOT messages, I suppose. I have no application handling the messages, so I assume the controller's message buffer will fill up, so maybe I will get some error messages, maybe also CRC error messages, but nothing should be destroyed. I fear the problem is caused when I switch on the motor drivers, so I will do that multiple times.

@marckleinebudde
Contributor

I created as described a new image on a new SD card.
The first CANbus controller gave the disabling interrupt message so I concluded it indeed is faulty.

...or the config.txt is not correct. I think we'll soon have a proper waveshare overlay file for the rev2.1 boards, too.

Next I started the tests with the good controller. The first test with canfdtest worked fine. The next test I changed your command to (since I suppose I am not going to use a virtual network):

Doh! Right, I've fixed that.

pi@cilix-19:~ $ cangen can0 -Di -L1 -I2 -p10 -g 1

I get the following without errors, and no errors on the receiving interface either:
sequence wrap around ..

Fine - sequence wrap around comes every 256 rx'ed CAN messages.
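The wrap-around is expected because the test frames carry a one-byte incrementing payload (-L1 and -Di on the cangen command line), so the counter rolls over after 256 frames. A minimal Python sketch of such a check, assuming the payload byte is the sequence number; check_sequence is a made-up name and this is not the actual cansequence code:

```python
def check_sequence(payload_bytes):
    """Return (wraps, errors) for a stream of 8-bit sequence numbers."""
    expected = None
    wraps = 0
    errors = 0
    for seq in payload_bytes:
        if expected is not None and seq != expected:
            errors += 1              # a frame was lost or reordered
        expected = (seq + 1) & 0xFF  # 8-bit counter
        if expected == 0:
            wraps += 1               # counter rolled over from 255 to 0
    return wraps, errors


# 600 frames without loss: only wrap-arounds are reported, no errors.
print(check_sequence([i & 0xFF for i in range(600)]))
# One missing frame (3 is skipped) counts as a single error.
print(check_sequence([0, 1, 2, 4, 5]))
```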

I suppose everything is all right. Next, I want to see if my troublesome network kills the controller, and before I install CODESYS I just want to connect my network. My motor driver will only send BOOT messages, I suppose. I have no application handling the messages, so I assume the controller's message buffer will fill up, and maybe I will get some error messages, maybe also CRC error messages,

You should not get any CRC error messages from the driver in the kernel log. Maybe when you connect the CAN bus....

but nothing should be destroyed. I fear the problem is caused when I switch on the motor drivers, so I will do that multiple times.

OK - on my desk I have a setup that doesn't like it when I plug one of my CAN-USB adapters into the USB port; this results in CRC errors in the SPI communication.

Anyhow - if the driver in your setup reproducibly goes into the "Transmit Event FIFO buffer not empty" loop, we can think of a workaround. For proper debugging we need 2 CAN interfaces on the same bus; the 2nd one is for sniffing the bus.

@DavidBoJ
Author

DavidBoJ commented Jul 7, 2022

I switched my motor driver on and off several times, and my network doesn't seem to create any errors. I do not get "Transmit Event ..."
I didn't get any CRC errors either.
You can see my config.txt in one of the previous posts. I will move on with CODESYS, but without any application.
Could I use can1 for sniffing? I have no other interface at the moment. I will order a proper USB CAN bus interface tomorrow.

@DavidBoJ
Author

DavidBoJ commented Jul 7, 2022

I know we are using bcm2835, so I am concerned about these warnings seen with dmesg. I do not know all these modules; are they in any way related to CAN?

[ 4.663823] snd_bcm2835: module is from the staging directory, the quality is unknown, you have been warned.
[ 4.677381] videodev: Linux video capture interface: v2.00
[ 4.684988] bcm2835_vc_sm_cma_probe: Videocore shared memory driver
[ 4.685020] [vc_sm_connected_init]: start
[ 4.687470] [vc_sm_connected_init]: installed successfully
[ 4.693615] bcm2835_audio bcm2835_audio: card created with 8 channels
[ 4.743658] bcm2835_mmal_vchiq: module is from the staging directory, the quality is unknown, you have been warned.
[ 4.753904] bcm2835_mmal_vchiq: module is from the staging directory, the quality is unknown, you have been warned.
[ 4.755651] bcm2835_codec: module is from the staging directory, the quality is unknown, you have been warned.
[ 4.763710] bcm2835_mmal_vchiq: module is from the staging directory, the quality is unknown, you have been warned.
[ 4.765499] bcm2835_isp: module is from the staging directory, the quality is unknown, you have been warned.
[ 4.774756] bcm2835-codec bcm2835-codec: Device registered as /dev/video10
[ 4.774808] bcm2835-codec bcm2835-codec: Loaded V4L2 decode
[ 4.788810] bcm2835-codec bcm2835-codec: Device registered as /dev/video11
[ 4.788861] bcm2835-codec bcm2835-codec: Loaded V4L2 encode
[ 4.799079] bcm2835_v4l2: module is from the staging directory, the quality is unknown, you have been warned.

@marckleinebudde
Contributor

All unrelated to CAN. Should be no problem.

@DavidBoJ
Author

DavidBoJ commented Jul 7, 2022

I have now installed my very simple application (no code, no GUI).
In dmesg I have:

[ 18.650204] can: controller area network core
[ 18.650280] NET: Registered PF_CAN protocol family
[ 18.658379] can: raw protocol
[ 19.131987] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[ 19.839699] mcp251xfd spi0.0 can0: CRC write command format error.
[ 31.832449] cam-dummy-reg: disabling

I have no CANopen SYNC enabled.

The Pi, which is the master, has real problems receiving a BOOTUP message from the slaves, but eventually node 2 got into OPERATIONAL mode. For node 1 the master gets a timeout during the initialization with the SDOs it writes to the node.
The two nodes are identical except for their ID (same SDOs and PDO).

candump any,0~0,#FFFFFFFF -exdtA
(2022-07-07 18:38:59.790332) can0 RX - - 20000004 [8] 00 08 00 00 00 00 60 00 ERRORFRAME
controller-problem{tx-error-warning}
error-counter-tx-rx{{96}{0}}
(2022-07-07 18:38:59.790340) can0 RX - - 20000004 [8] 00 20 00 00 00 00 80 00 ERRORFRAME
controller-problem{tx-error-passive}
error-counter-tx-rx{{128}{0}}
(2022-07-07 18:38:59.790344) can0 RX - - 20000004 [8] 00 08 00 00 00 00 7F 00 ERRORFRAME
controller-problem{tx-error-warning}
error-counter-tx-rx{{127}{0}}
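For reference, the error frames above can be decoded by hand. The following Python sketch follows the layout documented in linux/can/error.h (controller problem flags in data[1], TX and RX error counters in data[6] and data[7]). It reads the first payload as 00 08 00 00 00 00 60 00, which matches the decoded TX counter of 96. The function name decode_error_frame is made up for illustration, and only the controller-problem class is handled.

```python
CAN_ERR_FLAG = 0x20000000   # marks an error message frame
CAN_ERR_CRTL = 0x00000004   # controller problems, details in data[1]

# Bits of data[1], named the way candump prints them.
CRTL_BITS = {
    0x01: "rx-overflow",
    0x02: "tx-overflow",
    0x04: "rx-error-warning",
    0x08: "tx-error-warning",
    0x10: "rx-error-passive",
    0x20: "tx-error-passive",
}


def decode_error_frame(can_id, data):
    """Decode the controller-problem part of a SocketCAN error frame."""
    if not can_id & CAN_ERR_FLAG:
        raise ValueError("not an error frame")
    problems = []
    if can_id & CAN_ERR_CRTL:
        problems = [name for bit, name in CRTL_BITS.items() if data[1] & bit]
    # data[6] and data[7] carry the TX and RX error counters.
    return {"controller": problems, "tx": data[6], "rx": data[7]}


print(decode_error_frame(0x20000004, bytes.fromhex("0008000000006000")))
print(decode_error_frame(0x20000004, bytes.fromhex("0020000000008000")))
```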

However, after a long time trying to boot node 1, I get the following (unfortunately I didn't have any candump running):
Jul 7 18:23:33 cilix-19 bthelper[837]: Changing power off succeeded
Jul 7 18:23:33 cilix-19 bthelper[648]: Changing power on succeeded
Jul 7 18:23:33 cilix-19 kernel: [ 19.839699] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 7 18:23:50 cilix-19 systemd-timesyncd[735]: Initial synchronization to time server 152.115.59.245:123 (0.debian.pool.ntp.org).
Jul 7 18:23:53 cilix-19 dhcpcd[737]: eth0: no IPv6 Routers available
Jul 7 18:23:59 cilix-19 kernel: [ 31.832449] cam-dummy-reg: disabling
Jul 7 18:24:03 cilix-19 systemd[1]: systemd-fsckd.service: Succeeded.
Jul 7 18:24:12 cilix-19 systemd[1]: systemd-hostnamed.service: Succeeded.
Jul 7 18:25:39 cilix-19 systemd[1]: Started Session 3 of user pi.
Jul 7 18:29:12 cilix-19 kernel: [ 344.405975] IPv6: ADDRCONF(NETDEV_CHANGE): can1: link becomes ready
Jul 7 18:31:52 cilix-19 systemd[1]: Started Session 4 of user pi.
Jul 7 18:35:27 cilix-19 kernel: [ 719.634487] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 7 18:35:27 cilix-19 kernel: [ 719.635126] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635218] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635307] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635396] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635485] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635574] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635736] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635825] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).

And so on. However, my controller didn't seem to be faulty. I stopped CODESYS, deleted the logs, and I can still start a can1 network.
It seems that CODESYS's repeated attempts to initialize a CAN node bring the driver into a state which eventually generates so many "Transmit Event FIFO" messages that the Pi stops functioning normally, and maybe this even destroys the controller in some cases?
That makes it difficult to debug; it would be good if the logging could somehow be limited.
Tomorrow I will do your test with cangen to be sure it is still working.
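This does not limit the kernel logging itself, but for reading an already flooded syslog it can help to collapse the repeats. A minimal Python sketch that extracts the four fields from the message format shown above and counts consecutive identical lines; the names PATTERN and summarize are made up for illustration:

```python
import re
from itertools import groupby

# Matches the driver message as it appears in the log excerpt above.
PATTERN = re.compile(
    r"Transmit Event FIFO buffer not empty\. "
    r"\(seq=(0x[0-9a-f]+), tef_tail=(0x[0-9a-f]+), "
    r"tef_head=(0x[0-9a-f]+), tx_head=(0x[0-9a-f]+)\)"
)


def summarize(lines):
    """Collapse consecutive identical messages into ((fields), count)."""
    parsed = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            # Order: seq, tef_tail, tef_head, tx_head
            parsed.append(tuple(int(v, 16) for v in match.groups()))
    return [(key, len(list(group))) for key, group in groupby(parsed)]


sample = [
    "kernel: [ 719.635126] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. "
    "(seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).",
    "kernel: [ 719.635218] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. "
    "(seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).",
    "kernel: [ 719.634487] mcp251xfd spi0.0 can0: CRC write command format error.",
]
print(summarize(sample))
```

In the excerpt above the fields never change between lines, which suggests the driver keeps reporting the same stuck state rather than making progress.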

@marckleinebudde
Contributor

Does this happen during boot?

[ 18.650204] can: controller area network core
[ 18.650280] NET: Registered PF_CAN protocol family
[ 18.658379] can: raw protocol
[ 19.131987] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[ 19.839699] mcp251xfd spi0.0 can0: CRC write command format error.
[ 31.832449] cam-dummy-reg: disabling

Can you try this config.txt?

@DavidBoJ
Author

DavidBoJ commented Jul 7, 2022

Yes, but remember I now have CODESYS installed, and my application sets up can0 automatically at boot. I do not know if that has an impact on what dmesg shows. Interesting config.txt file; I will try that tomorrow.

@DavidBoJ
Author

DavidBoJ commented Jul 8, 2022

My controller is still working; I tested it with cangen.
However, I have another system: a Waveshare carrier board with a CM4 and isolated CAN 2.0B
https://www.waveshare.com/wiki/Template:Compute_Module_4_PoE_4G_Module_Spec
Here is a snippet of /boot/config.txt:
dtparam=i2c_arm=on
#dtparam=i2s=on
dtparam=spi=on
#CAN bus settings
dtoverlay=mcp2515-can0,oscillator=16000000,interrupt=25
dtoverlay=spi-bcm2835-overlay

It works right out of the box. But it is not Bullseye.
pi@raspberrypi:/var/log$ uname -a
Linux raspberrypi 5.4.51-v7l+ #1327 SMP Thu Jul 23 11:04:39 BST 2020 armv7l GNU/Linux

What do you think? Is mcp2515 more tolerant/robust than mcp251xfd?
Is it Bullseye with its new socketCAN which causes the problem?

@marckleinebudde
Contributor

Please post your complete config.txt.

It works right out of the box. But it is not Bullseye.

Please post the error message.

What do you think? Is mcp2515 more tolerant/robust than mcp251xfd?

No

Is it Bullseye with its new socketCAN which causes the problem?

No

@DavidBoJ
Author

DavidBoJ commented Jul 8, 2022

Here it is. I renamed the file so we don't get confused with the mcp251xfd one.
config_cm4.txt
What error messages? There are no errors in the syslog related to mcp2515 or can0, and CODESYS with my simplified application works.

@marckleinebudde
Contributor

Okay - now try my config.txt from #5083 (comment)

@DavidBoJ
Author

DavidBoJ commented Jul 8, 2022

I want to point out that the carrier board has good lightning and ESD protection, I think better than the CAN FD HAT, and I am still concerned about faulty signals from the motor driver.

In syslog I have:

Jul 8 11:17:36 cilix-19 kernel: [ 663.705295] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 8 11:17:36 cilix-19 kernel: [ 663.803021] mcp251xfd spi0.0 can0: CRC read error at address 0x0744 (length=28, data=00 00 00 08 00 00 00 00 81 00 00 00 08 00 00 00 77 e7 e4 4e 00 00 00 00 00 00 00 00, CRC=0x0000) retrying.
Jul 8 11:17:36 cilix-19 kernel: [ 663.805184] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 8 11:17:36 cilix-19 kernel: [ 664.003097] mcp251xfd spi0.0 can0: CRC read error at address 0x0698 (length=200, data=81 00 00 00 08 00 00 00 ef 9a 53 4f 00 00 00 08 00 00 00 00 81 00 00 00 08 00 00 00 84 f1 53 4f 00 00 00 08 00 00 00 00 81 00 00 00 08 00 00 00 4f 9f 54 4f 00 00 00 08 00 00 00 00 81 00 00 00, CRC=0x0000) retrying.
Jul 8 11:17:36 cilix-19 kernel: [ 664.005392] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 8 11:17:36 cilix-19 kernel: [ 664.104171] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 8 11:17:37 cilix-19 kernel: [ 664.303960] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 8 11:17:37 cilix-19 kernel: [ 664.367022] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 8 11:17:37 cilix-19 kernel: [ 664.403547] mcp251xfd spi0.0 can0: CRC read error at address 0x04e0 (length=252, data=81 00 00 00 08 00 00 00 a0 71 52 50 00 00 00 08 00 00 00 00 81 00 00 00 08 00 00 00 35 c8 52 50 00 00 00 08 00 00 00 00 81 00 00 00 08 00 00 00 00 76 53 50 00 00 00 08 00 00 00 00 81 00 00 00, CRC=0x0000) retrying.
Jul 8 11:17:37 cilix-19 kernel: [ 664.404910] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 8 11:17:37 cilix-19 kernel: [ 664.549939] mcp251xfd spi0.0 can0: RX-0: FIFO overflow
...
...
Jul 8 11:20:06 cilix-19 kernel: [ 813.814254] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.814365] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.814477] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.814588] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.814700] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.814811] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.814923] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.815034] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).

and so on

@marckleinebudde
Contributor

Which config.txt have you used for that?

@DavidBoJ
Author

DavidBoJ commented Jul 8, 2022

Your suggested config.txt on Pi 4

@marckleinebudde
Contributor

OK, next try: please use this in the config.txt:

dtoverlay=spi1-1cs-overlay,cs0_spidev=false
dtoverlay=mcp251xfd,spi0-0,interrupt=25,speed=10000000
dtoverlay=mcp251xfd,spi1-0,interrupt=24,speed=10000000

Please send your boot log, including the

MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized.

line.

Do you have a scope? Can you measure the frequency of the SPI-CLK line?

@DavidBoJ
Author

DavidBoJ commented Jul 8, 2022

I have not tried out your last proposal (speed=10000000) because I discovered that your config.txt file didn't allow me to enable a can1 network. I get:
pi@cilix-19:~ $ sudo ip link set can1 up type can bitrate 250000
Cannot find device "can1"
So I reverted to the original setup.
Is it possible that the mcp251xfd driver somehow expects something related to FD even though I set up the network without FD? The motor drivers do not support FD.
I think I will modify my carrier board so it runs Bullseye with mcp2515; then I will do a cangen test against the Pi 4, which uses mcp251xfd.
What do you think?

@DavidBoJ
Author

DavidBoJ commented Jul 9, 2022

I have now flashed my carrier board with Bullseye. I have not installed CODESYS, so it is as simple as possible.
Then I connected my Pi 4 with the FD HAT to the carrier board (via the CAN bus).
Next I tried to do a canfdtest:
Pi 4
canfdtest -v can0
Carrier board
canfdtest -vg can0

It didn't work; I got a lot of NNNNNNN...
I used
sudo ip link set can0 up type can bitrate 250000
to set up the network on the Pi and the carrier board. Conclusion:

For the time being, the 2-CH CAN FD HAT Rev2.1 with the present driver cannot communicate with a device that uses an MCP2515.
Can it actually comply with CAN 2.0B?

@marckleinebudde
Contributor

FYI: See https://lore.kernel.org/all/[email protected]/ for a discussion on setting the sjw to 50% of phase-seg2.
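To put numbers on that, here is a small Python sketch using the bit timing from the ip -d link show output at the top of this issue (tq 25 ns, prop-seg 69, phase-seg1 70, phase-seg2 20, sjw 1). The function name bit_timing is made up for illustration, and sjw = 10 is simply 50% of phase-seg2 as proposed in the linked discussion; whether it helps in this particular setup is untested.

```python
def bit_timing(tq_ns, prop_seg, phase_seg1, phase_seg2, sjw):
    """Return (tq per bit, bitrate in bit/s, sample point, sjw fraction)."""
    sync_seg = 1  # the sync segment is always one time quantum
    tq_per_bit = sync_seg + prop_seg + phase_seg1 + phase_seg2
    bitrate = 1e9 / (tq_per_bit * tq_ns)
    sample_point = (sync_seg + prop_seg + phase_seg1) / tq_per_bit
    # Share of one bit time that a single resynchronization can correct.
    sjw_fraction = sjw / tq_per_bit
    return tq_per_bit, bitrate, sample_point, sjw_fraction


# Values as reported for can0 in this issue (sjw 1).
print(bit_timing(25, 69, 70, 20, sjw=1))
# Same timing with sjw set to 50% of phase-seg2 (assumption for illustration).
print(bit_timing(25, 69, 70, 20, sjw=10))
```

With sjw 1 the controller can only correct 1 out of 160 time quanta per resynchronization, so it tolerates very little clock deviation between nodes; sjw 10 raises that to 10 out of 160.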
