Receiving concurrently with two mcp2515 is unstable #1490

Closed
hummlbach opened this issue May 24, 2016 · 51 comments

@hummlbach

hummlbach commented May 24, 2016

We're facing a problem using two mcp2515 concurrently (with the PiCAN DUO board): while receiving on both CAN channels (in parallel, from two separate buses), one CAN interface stops working after some time. More precisely, one of the CAN interfaces stops receiving (no more interrupts are counted in /proc/interrupts), but the chip still acknowledges the frames on the bus. After bringing the interface down and up again, it works for a short while before it stops again (and so on).
This C program should reproduce the problem:
https://gist.github.com/hummlbach/5e9be81f27db91220cec0b919ebb9fec
The following quick and dirty fix in pinctrl-bcm2835.c, figured out by another customer of the PiCAN DUO, is supposed to work around the problem (I haven't tried it myself):

    irq_mapped = irq_linear_revmap(pc->irq_domain, gpio);
    if( irq_mapped == 190 || irq_mapped == 191)
    {
        generic_handle_irq(190);
        generic_handle_irq(191);
    } else {
        generic_handle_irq(irq_linear_revmap(irq_mapped);             
    }

where 190 and 191 are the IRQ numbers of the two MCP2515s.

@pelwell
Contributor

pelwell commented May 24, 2016

That fragment isn't syntactically correct - can you check it again, or point me at your source?

@hummlbach
Author

hummlbach commented May 24, 2016

Oh, you're right - a closing parenthesis is missing in the second-to-last line.
I'm sorry, I received it like that from Sukkin Pang, who received it from another customer...

@pelwell
Contributor

pelwell commented May 25, 2016

It's also incorrect because irq_linear_revmap takes two parameters, but the second invocation only gives it one.

@hummlbach
Author

Okay... If I had to guess, I would say it could originally have been... :)

static int bcm2835_gpio_irq_handle_bank(struct bcm2835_pinctrl *pc,
                    unsigned int bank, u32 mask)
{
    unsigned long events;
    unsigned offset;
    unsigned gpio;
    unsigned int type;
    int irq_mapped;

    events = bcm2835_gpio_rd(pc, GPEDS0 + bank * 4);
    events &= mask;
    events &= pc->enabled_irq_map[bank];
    for_each_set_bit(offset, &events, 32) {
        gpio = (32 * bank) + offset;
        type = pc->irq_type[gpio];
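        irq_mapped = irq_linear_revmap(pc->irq_domain, gpio);
        /*
         * Workaround: 190 and 191 are the hard-coded IRQ numbers of the
         * two MCP2515s. When either one fires, run both handlers.
         */
        if (irq_mapped == 190 || irq_mapped == 191)
        {
            generic_handle_irq(190);
            generic_handle_irq(191);
        }
        else
        {
            /* Any other GPIO interrupt is dispatched as usual. */
            generic_handle_irq(irq_mapped);
        }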
    }

    return (events != 0);
}

(so whenever one of the two mcp2515 interrupts has fired, both are handled...(?))

@pelwell
Contributor

pelwell commented May 25, 2016

That version makes some kind of sense, but there's no way I'd apply a patch like that. Apart from the hard-coded numbers that will probably only work for the PiCAN DUO, I think the patch is in the wrong driver. If this is an mcp2515 problem then the fix should be in the mcp2515 driver - likewise for SPI. If it really is a general pinctrl interrupt problem then this patch won't fix it.

I don't have a PiCAN DUO or any SPI devices at the moment, so I can't experiment. Do you know if the devices actually stop requesting interrupts? If you can work out which channel stops, you can read the state of the pins while it's in the stuck state using raspi-gpio get 24 or raspi-gpio get 25.

@pelwell
Contributor

pelwell commented May 26, 2016

If I wanted to test this out on a PiCAN DUO, what is the minimum test setup?

@hummlbach
Author

That version makes some kind of sense, but there's no way I'd apply a patch like that.

Yes obv, it was just meant as a first hint to the cause of the problem.

Do you know if the devices actually stop requesting interrupts? If you can work out which channel stops, you can read the state of the pins while it's in the stuck state using raspi-gpio get 24 or raspi-gpio get 25.

Unfortunately I'm on vacation now, but I could check it the week after next.

If I wanted to test this out on a PiCAN DUO, what is the minimum test setup?

I've reproduced it with two PiCAN DUO connected to each other (can0 to can0 and can1 to can1).
But actually I would be surprised, if you were not able to reproduce it with one PiCAN DUO connected can0 to can1. (You need to set a "jumper" to enable the resistor. I had the baudrate set to 500000.)

@hummlbach
Author

hummlbach commented Jun 10, 2016

Unfortunately, I was not able to reproduce the issue using just one PiCAN DUO with can0 connected to can1. So I think one needs at least two PiCAN DUOs, or one PiCAN DUO and two PiCANs (single), or any other devices, such that can0 and can1 receive independently on different buses.

Here is the output of raspi-gpio in the stuck state:

pi@raspberrypi:~ $ raspi-gpio get 25
GPIO 25: level=1 fsel=0 func=INPUT
pi@raspberrypi:~ $ raspi-gpio get 22
GPIO 22: level=0 fsel=0 func=INPUT

What puzzles me is that /proc/interrupts tells me that there are no more interrupts on GPIO 22:

pi@raspberrypi:~ $ cat /proc/interrupts
...
502:     777226          0          0          0  pinctrl-bcm2835  22 Edge      mcp251x
505:    3303211          0          0          0  pinctrl-bcm2835  25 Edge      mcp251x
...
pi@raspberrypi:~ $ cat /proc/interrupts
...
502:     777226          0          0          0  pinctrl-bcm2835  22 Edge      mcp251x
505:    3551295          0          0          0  pinctrl-bcm2835  25 Edge      mcp251x
...

but can1 stopped working

pi@raspberrypi:~ $ ifconfig
can0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          UP RUNNING  MTU:16  Metric:1
          RX packets:2754446 errors:89 dropped:0 overruns:0 frame:89
          TX packets:1072295 errors:2 dropped:2 overruns:0 carrier:2
          collisions:0 txqueuelen:10
          RX bytes:21916359 (20.9 MiB)  TX bytes:8578360 (8.1 MiB)

can1      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          UP RUNNING  MTU:16  Metric:1
          RX packets:524279 errors:40 dropped:0 overruns:0 frame:40
          TX packets:253184 errors:4 dropped:4 overruns:0 carrier:4
          collisions:0 txqueuelen:10
          RX bytes:4176095 (3.9 MiB)  TX bytes:2025472 (1.9 MiB)

and:

pi@raspberrypi:~ $ ifconfig
can0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          UP RUNNING  MTU:16  Metric:1
          RX packets:3091347 errors:97 dropped:0 overruns:0 frame:97
          TX packets:1235421 errors:2 dropped:2 overruns:0 carrier:2
          collisions:0 txqueuelen:10
          RX bytes:24593273 (23.4 MiB)  TX bytes:9883368 (9.4 MiB)

can1      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          UP RUNNING  MTU:16  Metric:1
          RX packets:524279 errors:40 dropped:0 overruns:0 frame:40
          TX packets:253184 errors:4 dropped:4 overruns:0 carrier:4
          collisions:0 txqueuelen:10
          RX bytes:4176095 (3.9 MiB)  TX bytes:2025472 (1.9 MiB)

while the following is set in /boot/config.txt:

dtoverlay=mcp2515-can0,oscillator=16000000,interrupt=22
dtoverlay=mcp2515-can1,oscillator=16000000,interrupt=25

(If I switch the interrupt pins in /boot/config.txt [so that it's actually wrong] we get many errors [as expected?] but the interfaces don't get stuck anymore. Actually this is now a CanBerry Dual board...)

@albertodesouza

I am experiencing the same problem reported by hummlbach at the beginning of this issue report. Does the change in pinctrl-bcm2835.c reported above work? I believe this is the same problem reported in this thesis as well (Section 6.2.2, page 50): http://cta.physik.uzh.ch/public/theses/files/2014-WidmerTimothy-Bachelor.pdf

@hummlbach
Author

Yes, the workaround works, at least. I'm not experiencing a "system crash" as they mention. But yes, the "interrupt gets lost and is not cleared" (so that we never receive anything again on one of the interfaces) sounds like the same problem.
Two things I would like to point out once more:

  • The problem does not show up if the interrupts occur synchronously
  • In my example interrupt 22 (belonging to can0) is not cleared so does not fire anymore but can1 stops receiving?!?

@pelwell
Contributor

pelwell commented Nov 16, 2016

I've tried emulating this with a hacked-up driver that registers for the interrupt and does nothing (or just waits a bit) in the handler, and a bit of VPU code that writes to two other GPIOs wired to the interrupt pins with a chosen delay between them, and so far I've not seen a problem.

Somebody smarter than me (of which there are many) may be able to just read the code and figure it out, but the Linux interrupt handling, especially the threaded model, made my head hurt when I skimmed it. The first step would be to understand the state it gets into - why the interrupts aren't getting through. Is it that the software is expecting them but the hardware has got into a bad state (not acknowledging one of the detected transitions, etc.), or has the hardware not been primed at all?

@pelwell
Contributor

pelwell commented Nov 18, 2016

To help answer the question above I've added a raw mode to raspi-gpio that dumps out the GPIO register map in hex. You can download a compiled version here.

I would be grateful if somebody who can reproduce the stall with two MCP2515s can get their Pi into that state, run ./raspi-gpio raw, and upload the output. If you can get results from more than one run that would be even better.

@hummlbach
Author

hummlbach commented Nov 19, 2016

Hello Phil,
I think you accidentally compiled for the wrong architecture:
pi@raspberrypi:~ $ file raspi-gpio
raspi-gpio: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.24, BuildID[sha1]=1caa5f031588df0e362b725e5787abe08c89181c, not stripped

@pelwell
Contributor

pelwell commented Nov 19, 2016

You're right - try now.

@hummlbach
Author

Here are some raspi-gpio raw outputs taken when one of the interfaces got stuck.

can1 "stopped":

pi@raspberrypi:~ $ ./raspi-gpio get 25
GPIO 25: level=1 fsel=0 func=INPUT
pi@raspberrypi:~ $ ./raspi-gpio get 22
GPIO 22: level=0 fsel=0 func=INPUT
pi@raspberrypi:~ $ ./raspi-gpio raw
00: 21200000 00064064 00000009 01000048
10: 2422400c 00000924 00000000 6770696f
20: 6770696f 6770696f 6770696f 6770696f
30: 6770696f b200c1ff 003e5c1c 00000000
40: 02000000 00000000 00000000 00000000
50: 00000000 00000000 02400000 00000000
60: 00000000 00000000 00000000 00000000
70: 00000000 00000000 00000000 00000000
80: 00000000 00000000 00000000 00000000
90: 00000000 00000002 00000000 00000000

can0 "stopped":

pi@raspberrypi:~ $ ./raspi-gpio get 25
GPIO 25: level=0 fsel=0 func=INPUT
pi@raspberrypi:~ $ ./raspi-gpio get 22
GPIO 22: level=1 fsel=0 func=INPUT
pi@raspberrypi:~ $ ./raspi-gpio raw
00: 21200000 00064064 00000009 01000048
10: 2422400c 00000924 00000000 6770696f
20: 6770696f 6770696f 6770696f 6770696f
30: 6770696f b040c1ff 003e4c1c 00000000
40: 00000000 00000000 00000000 00000000
50: 00000000 00000000 02400000 00000000
60: 00000000 00000000 00000000 00000000
70: 00000000 00000000 00000000 00000000
80: 00000000 00000000 00000000 00000000
90: 00000000 00000002 00000000 00000000

can0 "stopped":

pi@raspberrypi:~ $ ./raspi-gpio get 25
GPIO 25: level=0 fsel=0 func=INPUT
pi@raspberrypi:~ $ ./raspi-gpio get 22
GPIO 22: level=1 fsel=0 func=INPUT
pi@raspberrypi:~ $ ./raspi-gpio raw
00: 21200000 00064064 00000009 01000048
10: 2422400c 00000924 00000000 6770696f
20: 6770696f 6770696f 6770696f 6770696f
30: 6770696f b040c1ff 003e5c1c 00000000
40: 00000000 00000000 00000000 00000000
50: 00000000 00000000 02400000 00000000
60: 00000000 00000000 00000000 00000000
70: 00000000 00000000 00000000 00000000
80: 00000000 00000000 00000000 00000000
90: 00000000 00000002 00000000 00000000

can1 "stopped":

pi@raspberrypi:~ $ ./raspi-gpio get 25
GPIO 25: level=1 fsel=0 func=INPUT
pi@raspberrypi:~ $ ./raspi-gpio get 22
GPIO 22: level=0 fsel=0 func=INPUT
pi@raspberrypi:~ $ ./raspi-gpio raw
00: 21200000 00064064 00000009 01000048
10: 2422400c 00000924 00000000 6770696f
20: 6770696f 6770696f 6770696f 6770696f
30: 6770696f b200c1ff 003e4c1c 00000000
40: 00000000 00000000 00000000 00000000
50: 00000000 00000000 02400000 00000000
60: 00000000 00000000 00000000 00000000
70: 00000000 00000000 00000000 00000000
80: 00000000 00000000 00000000 00000000
90: 00000000 00000002 00000000 00000000

@pelwell
Contributor

pelwell commented Nov 19, 2016

Thanks - that's just what I was hoping for. I can see differences, I'll just need some time to analyse them.

@albertodesouza

I have solved the problem using a bit of hardware. The circuit in the attached image allows only one interrupt at a time. The other one, if it arrives at the same time (or nearly so), waits to be served. I have tested it for hours and it worked OK.

circuito_interrupcao

@pelwell
Contributor

pelwell commented Nov 19, 2016

That's also useful - I need to change my test setup to ensure that the interrupts are overlapping as well as close.

@albertodesouza

Sorry. I forgot to add the resistors to the circuit (my version of the circuit is hand-drawn on a piece of paper in my lab. :) ). They are necessary because the 74HC11 is powered by 3.3V (because it is connected to the Raspberry Pi, whose GPIOs run at 3.3V), but the MCP2515s are powered by 5V.

circuito_interrupcao

@pelwell
Contributor

pelwell commented Nov 22, 2016

Here's a quick decode of the significant fields in those raw dumps. I'm assuming that gpio 22 is can0 and gpio 25 is can1, as per your earlier post:

A) can1 stopped:
 GPLEV0 b200c1ff  (GPIO22 low, GPIO 25 high)
 GPEDS0 02000000  (bit 25 set) - can1 event detected
 GPFEN0 02400000  (Falling edge events enabled on 22 and 25)

This suggests that the interrupt line for can1 has been pulled low then returned (it is currently high), which caused the falling edge event to be latched, but the driver doesn't seem to be acting on it. Meanwhile, can0 seems to be trying to interrupt (gpio22 is low).

B) can0 stopped:
 GPLEV0 b040c1ff  (GPIO22 high, GPIO 25 low)
 GPEDS0 00000000
 GPFEN0 02400000  (Falling edge events enabled on 22 and 25)

The interrupt lines are reversed, so can1 seems to be trying to interrupt while can0 isn't.
Meanwhile there are no events, so no interrupt will be delivered.

C) can0 stopped:
 GPLEV0 b040c1ff  (GPIO22 high, GPIO 25 low)
 GPEDS0 00000000
 GPFEN0 02400000  (Falling edge events enabled on 22 and 25)

The same as B.

D) can1 stopped:
 GPLEV0 b200c1ff  (GPIO22 low, GPIO 25 high)
 GPEDS0 00000000
 GPFEN0 02400000  (Falling edge events enabled on 22 and 25)

Like A, but without the pending edge event.

I'm bothered by the fact that the first of these dumps is different (the pending falling edge event), and puzzled that the "stopped" channel is not asserting its interrupt - I had expected the opposite. Perhaps failing to handle an earlier interrupt in a timely manner has caused the channel to give up.

I can't really say I'm any closer to a solution, but it's good to have some hard evidence to eliminate the impossibilities.
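The decode above can be reproduced mechanically. The following is a minimal sketch in C, assuming the BCM2835 register offsets GPLEV0 = 0x34, GPEDS0 = 0x40 and GPFEN0 = 0x58 (these match where the values sit in the raspi-gpio raw dumps) and assuming the dump has already been loaded into an array of 32-bit words. The struct and function names are made up for illustration and are not part of raspi-gpio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte offsets into the GPIO register block (assumed; they match the
 * positions of the values in the raw dumps). */
enum { GPLEV0 = 0x34, GPEDS0 = 0x40, GPFEN0 = 0x58 };

/* Hypothetical helper type: interrupt-related state of one GPIO. */
struct gpio_irq_state {
    bool level_high;     /* GPLEV0: current pin level */
    bool event_latched;  /* GPEDS0: event detected and not yet cleared */
    bool falling_enable; /* GPFEN0: falling-edge detection enabled */
};

/* regs[] holds the dump as 32-bit words, so word index = byte offset / 4.
 * Only bank 0 (GPIO 0..31) is handled in this sketch. */
struct gpio_irq_state decode_gpio(const uint32_t *regs, unsigned gpio)
{
    assert(gpio < 32);
    uint32_t bit = UINT32_C(1) << gpio;
    struct gpio_irq_state s = {
        .level_high     = (regs[GPLEV0 / 4] & bit) != 0,
        .event_latched  = (regs[GPEDS0 / 4] & bit) != 0,
        .falling_enable = (regs[GPFEN0 / 4] & bit) != 0,
    };
    return s;
}
```

Fed with dump A (GPLEV0 = b200c1ff, GPEDS0 = 02000000, GPFEN0 = 02400000), this reports GPIO 22 as low with no latched event and GPIO 25 as high with a latched event, with falling-edge detection enabled on both.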

@hummlbach
Author

hummlbach commented Nov 23, 2016

Yes since my /boot/config.txt contains these lines

dtoverlay=mcp2515-can0-overlay,oscillator=16000000,interrupt=22
dtoverlay=mcp2515-can1-overlay,oscillator=16000000,interrupt=25

I would assume that can0 has irq 22 and can1 has irq 25, but printk tells a different story:

[ 2309.918542] can can1, irq 502, buffer 0
[ 2380.406143] can can0, irq 505, buffer 0
[ 2380.411112] can can1, irq 502, buffer 0
[ 2380.413662] can can0, irq 505, buffer 0
[ 2380.417985] can can1, irq 502, buffer 0
[ 2380.421192] can can0, irq 505, buffer 0
[ 2380.425527] can can1, irq 502, buffer 0
[ 2380.425531] can can0, irq 505, buffer 0
[ 2380.433856] can can1, irq 502, buffer 0
[ 2380.433880] can can1, irq 502, buffer 0
[ 2380.442047] can can1, irq 502, buffer 1
[ 2380.442077] can can0, irq 505, buffer 1

Doesn't it?

(I've added this line printk(KERN_DEBUG "can %s, irq %u, buffer 0", net->name, spi->irq); and another one to mcp251x_can_ist)

@hummlbach
Author

hummlbach commented Nov 24, 2016

If I configure only can0

dtoverlay=mcp2515-can0-overlay,oscillator=16000000,interrupt=22
#dtoverlay=mcp2515-can1-overlay,oscillator=16000000,interrupt=25

the output is as expected:

[   54.821568] can can0, irq 502
[   54.821568] can can0, irq 502
[   54.826535] can can0, irq 502
[   54.831784] can can0, irq 502
[   54.836591] can can0, irq 502
[   54.841539] can can0, irq 502

or at least as i would expect it...

@pelwell
Contributor

pelwell commented Nov 25, 2016

Is it possible that the can bus numbering is based on order of initialisation rather than SPI CS? What happens if you just use the -can1 overlay? Do you get any relevant output in the dmesg log in the -can0, -can1 and both cases?

@pelwell
Contributor

pelwell commented Nov 25, 2016

What does ls -l /sys/class/net/can* show?

@hummlbach
Author

hummlbach commented Nov 28, 2016

Okay, yes it seems that something depends on the order of initialisation here...
I changed the order of the overlays to:

dtoverlay=mcp2515-can1-overlay,oscillator=16000000,interrupt=25
dtoverlay=mcp2515-can0-overlay,oscillator=16000000,interrupt=22

and now the irq maps to the interfaces as I would expect it from the lines above... ;)

I'm bothered by the fact that the first of these dumps is different (the pending falling edge event), and puzzled that the "stopped" channel is not asserting its interrupt - I had expected the opposite.

So, since the irqs did not map to the channels as I assumed, the "stopped" channel actually is asserting its interrupt and the driver doesn't react to it, right?

@mbr

mbr commented Dec 12, 2016

I believe we're hitting the same issue in production here, I've even mailed the poor socketcan maintainer about it.

Is there a working software fix available or do I have to solder @albertodesouza's solution? Is interchanging the overlay-order actually enough or was that just for the printk output?

@pelwell
Contributor

pelwell commented Dec 12, 2016

I can't give a date when it will be understood and fixed, especially at this time of year - you will have to make a call on the best course of action.

If anyone can get one of these systems into a stuck state, does momentarily pulling the IRQ line of the stuck device high wake it up? In other words, are we simply missing the falling edge interrupt due to a timing-related bug?

@mbr

mbr commented Dec 12, 2016

I can't give a date when it will be understood and fixed, especially at this time of year - you will have to make a call on the best course of action.

The circuit above keeps getting more attractive; as an alternative we're considering (no offense!) swapping in a Beagle Bone Black, which does not use SPI to connect its CAN-ICs. I would assume the issue is confined at least to this specific setup (Pi + SPI + PICAN)?

If anyone can get one of these systems into a stuck state, does momentarily pulling the IRQ line of the stuck device high wake it up? In other words, are we simply missing the falling edge interrupt due to a timing-related bug?

We can somewhat reliably reproduce the issue (it takes between 5 minutes and an hour during normal operation), but we have no special setup to influence any of the pins at the moment. If shorting two pins on the header is enough and this input is valuable, I'll see what I can do.

@albertodesouza

The circuit I posted is labelled 74HC11 (3 NAND gates), but I actually used the 74HC00 (4 NAND gates). My mistake... Sorry about that.

@albertodesouza

The correct hardware (with the name of the chip changed).
hardware_adicional

@JamesH65
Contributor

@pelwell This still seems to be current - any thoughts?

@JamesH65 JamesH65 added the Waiting for internal comment Waiting for comment from a member of the Raspberry Pi engineering team label May 18, 2017
@pelwell
Contributor

pelwell commented May 18, 2017

It's one of several complex issues that may take a significant amount of time to fully understand and will (it appears) benefit a very small number of people. All the code involved is open source, so there's nothing stopping anybody else from having a go.

Having said that, leave it open in case I run out of things to do.

@JamesH65 JamesH65 added Assigned for implementation/action and removed Waiting for internal comment Waiting for comment from a member of the Raspberry Pi engineering team labels May 19, 2017
@mbr

mbr commented May 19, 2017

@pelwell This still seems to be current - any thoughts?

FWIW, we've switched to a BBB and a custom laid out CAN cape. The built-in CAN interfaces worked a lot better than the MCPs, since they also come with a larger buffer. Porting the code has been easy -- it pretty much ran unmodified.

@bggardner

Is this really two separate issues, with the first being the instability and the second being the order of initialization? While I can't comment on the instability, I've been pulling my hair out over the ordering issue, and can confirm that the first mcp2515-can* listing gets assigned to spi0.1/CE1 and the second gets assigned to spi0.0/CE0. So you could run into an instance where can0 is assigned to spi0.1/CE1 (and likewise for can1), which is neither intuitive nor intended. Anyway... should a separate issue be opened just for the ordering issue, and are there other overlays that may have the same problem?

@hummlbach
Author

I'm not quite sure anymore, but I think I fixed the order in the configuration so that the interfaces were assigned as expected, and the instability problem remained.

@bggardner

By "fixed", I assume you mean the workaround of putting mcp2515-can1 first, as the ordering problem still exists in the latest release. I should also mention that the overlays README specifically states can0 will be assigned to spi0.0 and can1 will be assigned to spi0.1, which is not guaranteed, as we have shown.

@MrDadaGuy

MrDadaGuy commented Aug 4, 2017

I also have this issue - with two concurrent CAN-Bus connections, one stops responding after a couple of minutes.

Is this information still relevant (http://www.elinux.org/RPi_CANBus)? It hasn't been changed since 2014. It recommends I build a custom kernel, but even without this (Raspbian Jessie), I am able to get the PiCAN Duo to work at least for a time. In fact, lsmod shows mcp251x and can_dev

Also, elinux recommends an asynchronous driver for the MCP2515, but in the Widmer thesis PDF linked above the author says this makes things worse (this is also from 2014, so things may have changed?):

Interrupt conflict: The most serious and not yet solved problem is the interrupt
conflict in multi bus mode as described in section 6.2.2. Reducing
SPI speed and pause between transmission increases stability, but does not
solve the problem. Updating raspian to version 3.12.28+ and using the improved
kernel modules (see section 4.2) mcp2515a.ko (instead mcp251x.ko)
and spi-bcm2835dma.ko (instead spi-bcm2708.ko) didn’t improve the system.
These new kernel modules were written because people faced some latency
problems when using high speed CAN together with the Raspberry Pi. The
new modules do not only not improve the system, CAN frames were corrupted
using them. Data and identifier are not transferred correctly anymore.

@ahmedawad1

I encountered the same problem as well. This time I connected one mcp2515 to SPI0 and another one to SPI1: two separate SPI buses, each with its own CAN bus. Is this a problem with the mcp2515 driver only, or is it general to any two or more devices that generate interrupts on the GPIO?

@pelwell
Contributor

pelwell commented Aug 10, 2017

I'm inclined to think it is a bug in the mcp2515 driver - I'm not aware of any similar issues using other SPI devices. Your test is interesting because it also uses two different interrupt lines, so the only real points of commonality between the two mcp2515s that aren't standard Linux components are the interrupt controller driver and the mcp2515 driver (SPI1 and SPI2 use the spi-bcm2835aux driver, while SPI0 uses spi-bcm2835). Given how much the INTC driver gets used I'd be very surprised if there were a bug there, which leaves us with only one option.

@ahmedawad1

ahmedawad1 commented Aug 21, 2017

I did another test in which I did not include any SPI devices. It is a pure test of the GPIO. I connected 5 clock signal generators to 5 corresponding pins on the GPIO header; each signal has a period of 20 msec with a 50% duty cycle. The test simply counts the rising edges on these pins when they are configured as interrupts. I let the test run for more than 2 days, and at the end I found that:

BCM7 lost 18 ticks from the clock generator.
BCM3 lost 10 ticks from the clock generator.
BCM22 lost 8 ticks from the clock generator.
BCM2 lost 11 ticks from the clock generator.
BCM0 lost 7 ticks from the clock generator.

I think this problem lies within the GPIO interrupt handler itself and has nothing to do with mcp2515 or the SPI. However, my CPU load was less than 6%, so there is no chance that my cpu was overloaded that it can miss an interrupt which happens each 20 msec. I used wiringPi for setting the interrupts on the pins at rising edges.
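To put those counts in proportion, the loss rate can be worked out from the period and the run time. The following is a minimal sketch in C; the function names are made up for illustration, and the 48-hour duration used in the note below is an assumption, since the post only says "more than 2 days".

```c
#include <assert.h>
#include <stdint.h>

/* Number of rising edges a free-running clock produces in a given time. */
uint64_t expected_ticks(uint64_t duration_ms, uint64_t period_ms)
{
    assert(period_ms > 0);
    return duration_ms / period_ms;
}

/* Lost ticks expressed in parts per million of the expected count. */
double lost_ppm(uint64_t lost, uint64_t expected)
{
    assert(expected > 0);
    return (double)lost * 1e6 / (double)expected;
}
```

With an assumed 48 hours (172,800,000 ms) at a 20 ms period, each pin should see 8,640,000 rising edges, so the 18 ticks lost on BCM7 correspond to roughly 2 ppm. A loss that rare is easy to miss in short test runs.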

@pelwell
Contributor

pelwell commented Aug 21, 2017

That's an interesting test and set of results, but I have a problem with this statement:

my CPU load was less than 6%, so there is no chance that my cpu was overloaded that it can miss an interrupt which happens each 20 msec

CPU loads are measured across intervals measured in seconds, not milliseconds. A 20ms spike only amounts to 2% of a second, so it wouldn't stand out in a CPU usage trace.

I like the idea behind your test, and I think it could be improved by instrumenting the GPIO interrupt handler itself. With a square wave input signal at a constant frequency it should be easy to spot and log any irregularities. Your mention of wiringPi makes me think your test runs in userspace - by putting the logging in the driver we could see at which point the signal was dropped.

It's interesting that you are using five different signal generators (if I have understood you correctly) - have you tried using the same signal source for all pins, or deliberately using different frequencies?

@ahmedawad1

if I have understood you correctly

Yes correct, I used 5 signal generators.

have you tried using the same signal source for all pins, or deliberately using different frequencies?

These new test cases are worth looking at:
1- 2 clock generators with a 2 msec period (50% duty) connected to different GPIO pins. The GPIO pins lost some counts.
2- 2 clock generators, one with a 2 msec period and another with a 1 msec period, both 50% duty. The GPIO pins lost ticks as well.
3- 1 clock generator with a 1 msec period (50% duty) connected to both GPIO pins. Both pins lost counts.

Your mention of wiringPi makes me think your test runs in userspace

in the last 3 mentioned test cases, i didn't use wiringPi. I exported the pins as interrupts from /sys/class/gpio . I used some thing similar to Link

If only one clock source (1 msec period) is connected to only one pin, then the GPIO counts it correctly and I don't lose any ticks.

@pelwell
Contributor

pelwell commented Aug 21, 2017

i didn't use wiringPi. I exported the pins as interrupts from /sys/class/gpio . I used some thing similar to [ link to C application ]

But this is still user-space code, so one can imagine how an interrupt could be decoded by the driver and yet not make it through to the application. Have you tried reading /proc/interrupts before and after the tests?

@ahmedawad1

Have you tried reading /proc/interrupts before and after the tests?

Yes, I rebooted the Pi to reset /proc/interrupts, then started the user-space program once again. 2 signal generators with a 1 msec period are connected to 2 GPIO pins. After some time I lost ticks, and the value in /proc/interrupts equals the value counted by the user-space program. There is no difference between the interrupts the kernel counts and the values read by the user-space program!

@pelwell
Contributor

pelwell commented Aug 21, 2017

Thanks - that's all useful information. Your use-case is a lot simpler than the dual mcp2515 setup, so there's more of a chance it will get looked at.

@P33M
Contributor

P33M commented Aug 29, 2017

@ahmedawad1 Can you modify your test slightly?

Would it be possible to get your signal generators to generate two same-frequency signals with a variable phase relationship between them?

The idea is to vary the time between edges on each generator such that you can sweep the phase through 360°. You may find that a particular phase relationship causes many more edges to be "lost" as you enter some critical timing window.

Losing ~11 edges in 2 days is a bug that's going to be impossible to track down, so deliberately provoking the bug with a "bad" combination of inputs is something desirable.
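One way to plan such a sweep: for a period T and a phase offset of φ degrees, the edges of the second signal trail those of the first by T·φ/360. The following is a minimal sketch in C that turns a phase step into a list of edge delays. The function names and the step parameter are hypothetical and not tied to any particular signal generator.

```c
#include <assert.h>
#include <stdint.h>

/* Delay (in microseconds) between corresponding edges of two signals of
 * the same frequency, for a phase offset in degrees in [0, 360). */
uint32_t phase_to_delay_us(uint32_t period_us, uint32_t phase_deg)
{
    assert(phase_deg < 360);
    /* 64-bit intermediate avoids overflow for long periods. */
    return (uint32_t)(((uint64_t)period_us * phase_deg) / 360u);
}

/* Fill delays[] with one delay per sweep step and return the step count.
 * step_deg is the phase increment between test runs (illustrative). */
unsigned build_phase_sweep(uint32_t period_us, uint32_t step_deg,
                           uint32_t *delays, unsigned max_steps)
{
    assert(step_deg > 0);
    unsigned n = 0;
    for (uint32_t phase = 0; phase < 360 && n < max_steps; phase += step_deg)
        delays[n++] = phase_to_delay_us(period_us, phase);
    return n;
}
```

For a 1 ms period swept in 90° steps this yields delays of 0, 250, 500 and 750 µs. Logging the lost-edge count per delay would show whether the losses cluster around a particular timing window.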

@ahmedawad1

ahmedawad1 commented Aug 30, 2017

Yes, I can modify it to have 2 signals generated from the same source with different phases. The challenge is to record the phase difference at the moment the signal is lost on the Pi side. I will try to do this as soon as possible.
As you can see from the results I reported to @pelwell:

one clock generator with 1 msec (Duty 50%) which is connected to both GPIO pins and both pins lost counts.

This case has a phase shift of 0, which means that the interrupts happen at the same time. I also want to mention that I recently tested the dirty fix mentioned by @hummlbach, and it worked with 2 mcp2515 on the same SPI bus (SPI0); I had a CAN bus load of 108%, that is, transmitting CAN messages every 1 msec from each CAN bus. I will redo the last test of the 2 GPIO pins connected to the same source as above and see whether it loses frames or not.

@BuFran

BuFran commented Jan 10, 2018

Confirmed problem.

In my setup, with two MCP2515s connected over SPI0, custom CE pins and a bus load of ~80%, one CAN bus hung after 5 minutes with Debian's packaged kernel 4.9.41 (obtained with apt-get update/upgrade).

After updating the kernel with sudo rpi-update to the latest 4.9.75+, which is already patched with #2175, the problem is gone. Uptime is 25 hours without a CAN bus hang.

my setup:

dtoverlay=spi0-cs,cs0_pin=22,cs1_pin=27
dtoverlay=mcp2515-can0,oscillator=16000000,interrupt=5
dtoverlay=mcp2515-can1,oscillator=16000000,interrupt=6

Thanks !

@pelwell
Contributor

pelwell commented Jan 10, 2018

Thanks for the feedback. For any other watchers, the patch in question is actually #2267, and it changes the interrupt type to be level-triggered to match the datasheet and to avoid potentially missing an interrupt.
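Why the trigger type matters can be illustrated with a toy model. The sketch below is in C and rests on stated assumptions: the device holds its interrupt line low for as long as any flag is pending, and a second event arrives while the handler is still dealing with the first. It is deliberately simplified and is not the mcp251x driver's actual logic; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum trigger { TRIGGER_FALLING_EDGE, TRIGGER_LEVEL_LOW };

/* Toy device: the INT line is low for as long as any flag is pending. */
struct toy_mcp { uint8_t flags; };

bool int_line_low(const struct toy_mcp *dev)
{
    return dev->flags != 0;
}

/* One handler pass: snapshot the flags, let `late` flags arrive while the
 * pass is in progress, then clear only the flags that were seen. */
void handler_pass(struct toy_mcp *dev, uint8_t late)
{
    uint8_t seen = dev->flags;
    dev->flags |= late;
    dev->flags &= (uint8_t)~seen;
}

/* Returns the flags left unserviced once the system settles. */
uint8_t simulate_race(enum trigger mode, uint8_t first, uint8_t late)
{
    struct toy_mcp dev = { .flags = 0 };

    assert(first != 0);
    dev.flags |= first;        /* line goes high -> low: both modes fire */
    handler_pass(&dev, late);  /* a second event lands mid-pass */

    /* Level mode: the handler is re-run while the line is still low.
     * Edge mode: a new run needs a fresh high -> low transition, which
     * cannot happen while the line stays low. */
    if (mode == TRIGGER_LEVEL_LOW)
        while (int_line_low(&dev))
            handler_pass(&dev, 0);

    return dev.flags;
}
```

In this model, simulate_race(TRIGGER_FALLING_EDGE, 0x01, 0x02) leaves flag 0x02 pending with the line stuck low, which resembles the stalled state seen in the dumps, while the level-triggered variant drains every flag.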

@pelwell pelwell closed this as completed Jan 10, 2018
@leforban

Hello, sorry for digging up such an old post, but I would like to have news about this issue. We experienced this problem in summer 2018 and decided not to use the second CAN bus. However, we would now like to activate it, and today I performed some tests with a fresh Ubuntu MATE 18.04.4 environment running a 4.15 kernel. Fortunately, it has been working like a charm for 3-4 hours. Does this mean the issue has been solved? If yes, how can I find out when it was solved and included in the kernel?

@pelwell
Contributor

pelwell commented Mar 17, 2020

Just read the previous two posts - they answer your question. Clicking on #2267 (or just hovering over the link) shows it was fixed in rpi-4.9.y. A bit more digging would tell you that it was accepted upstream as of Linux 5.4.
