Receiving concurrently with two mcp2515 is unstable #1490
Comments
That fragment isn't syntactically correct - can you check it again, or point me at your source? |
Oh, you're right - a closing parenthesis is missing in the second-to-last line. |
It's also incorrect because irq_linear_revmap takes two parameters, but the second invocation only gives it one. |
Okay... If I should guess, I would say originally it could have been... :)
(so whenever one of the two mcp2515 interrupts has fired, both are handled...(?)) |
That version makes some kind of sense, but there's no way I'd apply a patch like that. Apart from the hard-coded numbers that will probably only work for the PiCAN DUO, I think the patch is in the wrong driver. If this is an mcp2515 problem then the fix should be in the mcp2515 driver - likewise for SPI. If it really is a general pinctrl interrupt problem then this patch won't fix it. I don't have a PiCAN DUO or any SPI devices at the moment, so I can't experiment. Do you know if the devices actually stop requesting interrupts? If you can work out which channel stops, you can read the state of the pins while it's in the stuck state using |
If I wanted to test this out on a PiCAN DUO, what is the minimum test setup? |
Yes obv, it was just meant as a first hint to the cause of the problem.
Unfortunately I'm on vacation now, but I could check it the week after next.
I've reproduced it with two PiCAN DUO connected to each other (can0 to can0 and can1 to can1). |
Unfortunately I was not able to reproduce the issue by using just one PiCAN DUO and connecting can0 to can1. So I think one needs at least two PiCAN DUO, or one PiCAN DUO and two PiCAN (single), or any other devices such that can0 and can1 receive independently on different buses. Here is the output of raspi-gpio in the stuck state:
What puzzles me is that
but can1 stopped working
and:
while the following is set in
(If I switch the interrupt pins in |
I am experiencing the same problem reported by hummlbach at the beginning of this issue report. Does the change in pinctrl-bcm2835.c reported above work? I believe it is the same problem reported in this thesis as well (Section 6.2.2, page 50): http://cta.physik.uzh.ch/public/theses/files/2014-WidmerTimothy-Bachelor.pdf |
Yes the workaround works at least. I'm not experiencing a "system crash" as they mention. But yes the "interrupt gets lost and is not cleared" (so that we never receive anything again on one of the interfaces) sounds to be the same problem.
|
I've tried emulating this with a hacked-up driver that registers for the interrupt and does nothing (or just waits a bit) in the handler, and a bit of VPU code that writes to two other GPIOs wired to the interrupt pins with a chosen delay between them, and so far I've not seen a problem. Somebody smarter than me (of which there are many) may be able to just read the code and figure it out, but the Linux interrupt handling, especially the threaded model, made my head hurt when I skimmed it. The first step would be to understand the state it gets into - why the interrupts aren't getting through. Is it that the software is expecting them but the hardware has got into a bad state (not acknowledging one of the detected transitions, etc.), or has the hardware not been primed at all? |
To help answer the question above I've added a raw mode to raspi-gpio that dumps out the GPIO register map in hex. You can download a compiled version here. I would be grateful if somebody who can reproduce the stall with two MCP2515s can get their Pi into that state, run |
Hello Phil, |
You're right - try now. |
Here some raspi-gpio raw outputs when one of the interfaces got stuck. can1 "stopped":
can0 "stopped":
can0 "stopped":
can1 "stopped":
|
Thanks - that's just what I was hoping for. I can see differences, I'll just need some time to analyse them. |
That's also useful - I need to change my test setup to ensure that the interrupts are overlapping as well as close. |
Here's a quick decode of the significant fields in those raw dumps. I'm assuming that gpio 22 is can0 and gpio 25 is can1, as per your earlier post:
This suggests that the interrupt line for can1 has been pulled low then returned (it is currently high), which caused the falling edge event to be latched, but the driver doesn't seem to be acting on it. Meanwhile, can0 seems to be trying to interrupt (gpio22 is low).
The interrupt lines are reversed, so can1 seems to be trying to interrupt while can0 isn't.
The same as B.
Like A, but without the pending edge event. I'm bothered by the fact that the first of these dumps is different (the pending falling edge event), and puzzled that the "stopped" channel is not asserting its interrupt - I had expected the opposite. Perhaps failing to handle an earlier interrupt in a timely manner has caused the channel to give up. I can't really say I'm any closer to a solution, but it's good to have some hard evidence to eliminate the impossibilities. |
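For anyone who wants to repeat this kind of decode on their own dumps, here is a minimal sketch in C. It assumes the dump has been loaded into an array of 32-bit words indexed by register offset / 4; the register offsets are the bank-0 ones from the BCM2835 peripherals datasheet. The names `pin_state` and `decode_pin` are made up for this example, and the exact layout of the raspi-gpio raw output is not assumed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte offsets of the bank-0 GPIO registers (GPIO 0-31), per the BCM2835 datasheet. */
enum {
    GPLEV0 = 0x34, /* current pin level */
    GPEDS0 = 0x40, /* event detect status (latched until cleared) */
    GPREN0 = 0x4c, /* rising edge detect enable */
    GPFEN0 = 0x58, /* falling edge detect enable */
    GPHEN0 = 0x64, /* high level detect enable */
    GPLEN0 = 0x70  /* low level detect enable */
};

/* Hypothetical container for the fields that matter in this thread. */
struct pin_state {
    bool level_high;      /* line is currently high */
    bool event_pending;   /* an event has been latched and not yet cleared */
    bool rising_enabled;
    bool falling_enabled;
    bool high_enabled;
    bool low_enabled;
};

/* regs: register block as 32-bit words, indexed by byte offset / 4. */
static struct pin_state decode_pin(const uint32_t *regs, unsigned pin)
{
    assert(pin < 32); /* this sketch only covers bank 0 */
    uint32_t mask = UINT32_C(1) << pin;
    struct pin_state s = {
        .level_high      = (regs[GPLEV0 / 4] & mask) != 0,
        .event_pending   = (regs[GPEDS0 / 4] & mask) != 0,
        .rising_enabled  = (regs[GPREN0 / 4] & mask) != 0,
        .falling_enabled = (regs[GPFEN0 / 4] & mask) != 0,
        .high_enabled    = (regs[GPHEN0 / 4] & mask) != 0,
        .low_enabled     = (regs[GPLEN0 / 4] & mask) != 0,
    };
    return s;
}
```

Calling `decode_pin(regs, 22)` and `decode_pin(regs, 25)` on a stuck-state dump gives the level and pending-event bits discussed above, assuming those really are the two interrupt pins.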
Yes since my
I would assume that can0 has irq 22 and can1 has irq 25, but printk tells something different:
Doesn't it? (I've added this line |
If I configure only can0
the output is as expected:
or at least as i would expect it... |
Is it possible that the can bus numbering is based on order of initialisation rather than SPI CS? What happens if you just use the -can1 overlay? Do you get any relevant output in the dmesg log in the -can0, -can1 and both cases? |
What does |
Okay, yes it seems that something depends on the order of initialisation here...
and now the irq maps to the interfaces as I would expect it from the lines above... ;)
So since the irq did not map to the channels as I assumed, the "stopped" channel actually is asserting its interrupt and the driver doesn't react to it, right? |
I believe we're hitting the same issue in production here, I've even mailed the poor socketcan maintainer about it. Is there a working software fix available or do I have to solder @albertodesouza's solution? Is interchanging the overlay-order actually enough or was that just for the printk output? |
I can't give a date when it will be understood and fixed, especially at this time of year - you will have to make a call on the best course of action. If anyone can get one of these systems into a stuck state, does momentarily pulling the IRQ line of the stuck device high wake it up? In other words, are we simply missing the falling edge interrupt due to a timing-related bug? |
The circuit above keeps getting more attractive; as an alternative we're considering (no offense!) swapping in a Beagle Bone Black, which does not use SPI to connect its CAN-ICs. I would assume the issue is confined at least to this specific setup (Pi + SPI + PICAN)?
We can somewhat reliably reproduce the issue (takes between 5 minutes and an hour during normal operation), but we have no special setup to influence any of the pins at the moment. If shorting two pins on the header is enough and this input is valuable, I'll see what I can do. |
In the circuit I have posted it is written 74HC11 (3 NAND gates), but I have used the 74HC00 (4 NAND gates). My mistake... Sorry about that. |
@pelwell This still seems to be current - any thoughts? |
It's one of several complex issues that may take a significant amount of time to fully understand and will (it appears) benefit a very small number of people. All the code involved is open source, so there's nothing stopping anybody else from having a go. Having said that, leave it open in case I run out of things to do. |
FWIW, we've switched to a BBB and a custom laid out CAN cape. The built-in CAN interfaces worked a lot better than the MCPs, since they also come with a larger buffer. Porting the code has been easy -- it pretty much ran unmodified. |
Is this really two separate issues, with the first being the instability, and the second being the order of initialization? While I can't comment on the instability, I've been pulling my hair out on the ordering issue, and can confirm the first |
I'm not quite sure anymore, but I think I fixed the order in the configuration such that the interfaces were assigned as expected, and the instability problem remained. |
By "fixed", I assume you mean the workaround of putting |
I also have this issue - with two concurrent CAN bus connections, one stops responding after a couple of minutes. Is this information still relevant (http://www.elinux.org/RPi_CANBus)? It hasn't been changed since 2014. It recommends I build a custom kernel, but even without this (Raspbian Jessie), I am able to get the PiCAN Duo to work at least for a time. elinux also recommends an asynchronous driver for the MCP2515, but the Widmer thesis PDF linked above says this makes things worse (this is also from 2014, so things may have changed?):
|
I encountered the same problem as well. This time I connected one mcp2515 on SPI0 and another one on SPI1: two separate SPI buses, with 2 different CAN buses. Is this a problem with the mcp2515 driver only, or is it general to any two or more devices generating interrupts on the GPIO? |
I'm inclined to think it is a bug in the mcp2515 driver - I'm not aware of any similar issues using other SPI devices. Your test is interesting because it also uses two different interrupt lines, so the only real points of commonality between the two mcp2515s that aren't standard Linux components are the interrupt controller driver and the mcp2515 driver (SPI1 and SPI2 use the |
I did another test that did not include any SPI devices; it is a pure test of the GPIO. I connected 5 clock signal generators to 5 corresponding pins on the GPIO header, each signal having a 20 ms period with a 50% duty cycle. The test simply counts the rising edges on these pins when they are configured as interrupts. I let the test run for more than 2 days, and at the end I found that: BCM7 lost 18 ticks from the clock generator. I think this problem lies within the GPIO interrupt handler itself and has nothing to do with the mcp2515 or SPI. My CPU load was less than 6%, so there is no chance that the CPU was so overloaded that it could miss an interrupt that happens every 20 ms. I used wiringPi to set up the interrupts on the pins on rising edges. |
That's an interesting test and set of results, but I have a problem with this statement:
CPU loads are measured across intervals measured in seconds, not milliseconds. A 20ms spike only amounts to 2% of a second, so it wouldn't stand out in a CPU usage trace. I like the idea behind your test, and I think it could be improved by instrumenting the GPIO interrupt handler itself. With a square wave input signal at a constant frequency it should be easy to spot and log any irregularities. Your mention of wiringPi makes me think your test runs in userspace - by putting the logging in the driver we could see at which point the signal was dropped. It's interesting that you are using five different signal generators (if I have understood you correctly) - have you tried using the same signal source for all pins, or deliberately using different frequencies? |
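As a rough illustration of how such irregularities could be spotted, here is a sketch in C that scans a series of edge timestamps from a constant-frequency signal and counts the edges that appear to be missing. It is not instrumentation for the real GPIO interrupt handler; how the timestamps get collected is left open, and the function name `count_missing_edges` is invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count edges that appear to be missing from a list of edge timestamps
 * (nanoseconds, ascending) taken from a signal with a constant period.
 * A gap of roughly k periods between neighbours means k - 1 edges were lost.
 */
static uint64_t count_missing_edges(const uint64_t *ts_ns, size_t n,
                                    uint64_t period_ns)
{
    uint64_t missing = 0;

    assert(period_ns > 0);
    for (size_t i = 1; i < n; i++) {
        assert(ts_ns[i] >= ts_ns[i - 1]); /* timestamps must be ascending */
        uint64_t gap = ts_ns[i] - ts_ns[i - 1];
        /* Round the gap to the nearest whole number of periods. */
        uint64_t periods = (gap + period_ns / 2) / period_ns;
        if (periods > 1)
            missing += periods - 1;
    }
    return missing;
}
```

Rounding to the nearest period tolerates jitter of up to half a period, which is an arbitrary choice for this sketch; a tighter tolerance would also flag late (rather than lost) edges.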
Yes correct, I used 5 signal generators.
These new test cases are worth looking at:
In the last 3 test cases mentioned, I didn't use wiringPi. I exported the pins as interrupts from /sys/class/gpio. I used something similar to Link. If only one clock source (1 ms period) is connected to only one pin, then the GPIO counts it correctly and I didn't lose any ticks. |
But this is still user-space code, so one can imagine how an interrupt could be decoded by the driver and yet not make it through to the application. Have you tried reading /proc/interrupts before and after the tests? |
Yes, I rebooted the Pi to reset /proc/interrupts, then started the user-space program once again. Two signal generators with a 1 ms period were connected to 2 GPIO pins. After some time, I lost ticks, and the value in /proc/interrupts equals the value counted by the user-space program. There is no difference between the interrupts the kernel counts and the values read by the user-space program! |
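For reference, the before/after comparison can be automated with something like the sketch below, which sums the per-CPU counters on one line of /proc/interrupts. The number of CPU columns has to be supplied by the caller, and `sum_irq_counts` is a made-up helper name. Reading the file and picking the right line (for example by IRQ number) is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on a single /proc/interrupts line.
 * The line is expected to look like "<irq>: <count cpu0> <count cpu1> ... <rest>".
 * Returns the total over the first ncpus columns, or -1 if the line
 * does not contain that many numeric columns.
 */
static long long sum_irq_counts(const char *line, int ncpus)
{
    assert(line != NULL);
    const char *p = strchr(line, ':');
    if (p == NULL)
        return -1;
    p++; /* skip the colon after the IRQ label */

    long long total = 0;
    for (int cpu = 0; cpu < ncpus; cpu++) {
        char *end;
        unsigned long long n = strtoull(p, &end, 10);
        if (end == p) /* no digits where a counter was expected */
            return -1;
        total += (long long)n;
        p = end;
    }
    return total;
}
```

Subtracting the total taken before the run from the total taken after gives the number of interrupts the kernel counted, which can then be compared with both the expected number of edges and the user-space count.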
Thanks - that's all useful information. Your use-case is a lot simpler than the dual mcp2515 setup, so there's more of a chance it will get looked at. |
@ahmedawad1 Can you modify your test slightly? Would it be possible to get your signal generators to generate two same-frequency signals with a variable phase relationship between them? The idea is to vary the time between edges on each generator such that you can sweep the phase through 360°. You may find that a particular phase relationship causes many more edges to be "lost" as you enter some critical timing window. Losing ~11 edges in 2 days is a bug that's going to be impossible to track down, so deliberately provoking the bug with a "bad" combination of inputs is desirable. |
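The arithmetic behind such a sweep is simple: for a signal with period T, a phase offset of φ degrees corresponds to a delay of T · φ / 360 between the two edges. A small sketch in C, with invented helper names and integer nanoseconds, in case it helps to plan the sweep:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Delay between the two signals' edges for a given phase offset in degrees. */
static uint64_t phase_to_delay_ns(uint64_t period_ns, unsigned phase_deg)
{
    return period_ns * (phase_deg % 360) / 360;
}

/*
 * Fill out[] with the delays for phases 0, step, 2*step, ... below 360 degrees.
 * Returns the number of entries written (at most cap).
 */
static size_t fill_sweep(uint64_t period_ns, unsigned step_deg,
                         uint64_t *out, size_t cap)
{
    size_t n = 0;

    assert(step_deg > 0);
    for (unsigned phase = 0; phase < 360 && n < cap; phase += step_deg)
        out[n++] = phase_to_delay_ns(period_ns, phase);
    return n;
}
```

With a 1 ms period and 90° steps, this yields delays of 0, 250, 500 and 750 µs. A finer step near any phase that loses edges would narrow down the critical window.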
Yes, I can modify it to have 2 signals generated from the same source with different phases. The challenge is to record the phase difference when the signal is lost on the Pi side. I will try to do it as soon as possible.
This case has a phase shift of 0, which means that the interrupts happen at the same time. I also want to mention that I recently tested the dirty fix mentioned by @hummlbach, and it worked with 2 mcp2515s on the same SPI bus (SPI0) at a CAN bus load of 108%, that is, transmitting CAN messages every 1 ms from each CAN bus. I will redo the last test of the 2 GPIO pins connected to the same source as above and see whether it loses frames or not. |
Problem confirmed. In my setup, with two MCP2515s connected over SPI0 using custom CE pins and a bus load of ~80%, one CAN bus hung after 5 minutes with Debian's packaged kernel 4.9.41 (get with After updating kernel with my setup:
Thanks ! |
Thanks for the feedback. For any other watchers, the patch in question is actually #2267, and it changes the interrupt type to be level-triggered to match the datasheet and to avoid potentially missing an interrupt. |
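To illustrate why the trigger type matters, here is a toy model in C. It is not the mcp251x driver or the pinctrl code, and it does not claim to reproduce the exact race behind this issue; every name in it is invented. It assumes only that the interrupt line is active low and stays low while any flag is pending, and that a handler clears just the flags it sampled when it started.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum trigger { TRIGGER_FALLING_EDGE, TRIGGER_LOW_LEVEL };

/* One time step: flag bits raised before and during a possible handler run. */
struct step {
    unsigned before; /* raised before the interrupt controller looks at the line */
    unsigned during; /* raised after the handler has sampled the flags */
};

struct line {
    unsigned flags;    /* pending flag bits; the line is low while nonzero */
    bool edge_latched; /* a high-to-low transition has been seen */
};

static void raise_flags(struct line *l, unsigned bits)
{
    if (l->flags == 0 && bits != 0)
        l->edge_latched = true; /* line goes from high to low */
    l->flags |= bits;
}

static int count_bits(unsigned v)
{
    int n = 0;
    for (; v != 0; v >>= 1)
        n += (int)(v & 1u);
    return n;
}

/* Returns how many pending flags the handler got to service. */
static int simulate(enum trigger mode, const struct step *steps, size_t n)
{
    struct line l = { 0, false };
    int serviced = 0;

    assert(steps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        raise_flags(&l, steps[i].before);
        bool fire = (mode == TRIGGER_LOW_LEVEL) ? (l.flags != 0)
                                                : l.edge_latched;
        l.edge_latched = false;                /* the latched edge is consumed */
        unsigned snapshot = fire ? l.flags : 0; /* flags the handler sees */
        raise_flags(&l, steps[i].during);      /* may land mid-handler */
        l.flags &= ~snapshot;                  /* handler clears what it saw */
        serviced += count_bits(snapshot);
    }
    return serviced;
}
```

With the steps `{1, 2}, {0, 0}, {1, 0}` (a second flag is raised while the first is being handled), the edge-triggered model services 1 of 3 flags and then never fires again, because the line never returns high. The level-triggered model services all 3. When the flags do not overlap, both modes behave the same.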
Hello, sorry for digging up such an old post, but I would like to have news about this issue. We experienced this problem in summer 2018 and decided not to use the second CAN bus. However, we would now like to activate it, and I have performed some tests today with a fresh Ubuntu MATE 18.04.4 environment running a 4.15 kernel. Fortunately, it has been working like a charm for 3-4 hours. Does this mean the issue has been solved? If yes, how can I find out when it was solved and included in the kernel? |
Just read the previous two posts - they answer your question. Clicking on #2267 (or just hovering over the link) shows it was fixed in rpi-4.9.y. A bit more digging would tell you that it was accepted upstream as of Linux 5.4. |
We're facing a problem using two mcp2515 concurrently (with the PiCAN DUO board): while receiving on both CAN channels (in parallel, from two separate buses), one CAN interface stops working after some time. More precisely, one of the CAN interfaces stops receiving - no interrupts seen anymore in /proc/interrupts - but the chip still acknowledges the frames on the bus. After putting the interface down and up again, it continues to work for a short amount of time before it stops working again (and so on).
This C program should reproduce the problem:
https://gist.github.com/hummlbach/5e9be81f27db91220cec0b919ebb9fec
The following quick and dirty fix in pinctrl-bcm2835.c figured out by another customer of the PiCAN DUO is supposed to work around the problem (haven't tried it myself):
where 190 and 191 are the MCP2515s interrupts.