-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
I2C Broadcom bug workaround. #254
Comments
If a workaround in the driver makes it work better for the majority of users, then sure it should be included. |
No. Not yet.
Speaking about workarounds: The workaround could also be triggered by the SDA line still being low when we are trying to initiate a new transaction. That is clearly a "problem" situation. The "it goes wrong" case might be rare for "normal" cases. Many people will have a hardware-i2c module on the I2C-bus. Back when I2C was invented "10us" might have been a "very short" timeframe for some chips to handle the requests, so clock stretching was necessary. Nowadays any hardware I2C implementation should be able to handle "fast I2C" up to even faster. This combines with the way the atmel guys implemented their async modules. Instead of having the module run on the externally provided clock (i2c-slave or SPI-slave), the module still runs on the processor clock. And it synchronizes all incoming signals by passing them through a bunch of filpflops. Good hardware design. The alternative is IMHO better: run the module off the external clock, and synchronize when the data passes to the other (cpu-) clock domain. This would for example allow the SPI module to function at 20MHz, even though the CPU runs at only 8. Anyway. Enough hardware rambling. Suggested workaround: Issue an extra clock cycle when at the beginning of a transaction the SDA line is still low. (it should start out high to be able to issue a "start" condition). |
Firstly, I know little about I2C. I have spoken to Gert and others about workarounds. There are two problems with I2C known about:
For 1, Gert's view is there is no foolproof workaround. I don't know if your proposal will work. I'd be interested to hear if it reduces the problem, or fixes it completely. Gert says he has a desciption of the bug that he'll try and get released. For 2, in the GPU I2C driver we have some code working around a problem with state machine:
I'm not sure if people are hitting this issue, but the pasted could may be helpful if you are. |
It is not possible to completely eradicate this hardware bug. When it happens, the master and slave disagree on the number of clock cycles. Therefore, the data, as interpreted by the slave and the data as the master (= raspberry pi = driver/application) intended differ. You might be writing a completely bogus value in a register somewhere. Highly annoying. My suggested workaround would at least cause the next transaction to work. The impact would be halved: only one transaction botched instead of two. It would not eliminate the problem.
If the I2C slave doesn't require clock stretching....
Yes. I have 10 devices "in the field" at one location. Apparently the clock frequency (RC clock) of three of the modules is different in such a way that they end up "done" with the clock stretching in exactly the wrong moment. Reducing the clock frequency fixed the problems for those modules. If the clock stretching delay is NOT fixed, but, say, varies by more than 5 microseconds, there is a more or less statistical chance of hitting the problem. In my case the slave requires 0.25 microseconds of clock width. So clock stretching ends in the 5% of the time leading up to the clock changing moment for the BCM, then things go wrong. So, in this case, about 1 in 20 transfers would go wrong. (specs say min 250ns for a clock pulse. That in fact doesn't mean that it won't be seen if it is shorter. I think the chances of it being seen goes linearly from 100% at > 250ns to 0% at < 125ns. ) I've spent yesterday stretching the clock stretching period in such a way that it would fall around the 2.5 microsecond mark in the 5 microsecond window-of-opportunity. If I'd aim for the start of the window and my RC clock runs bit fast, I'd hit the "0" which triggers the bug. If I aim for 5, same thing. So now I'm aiming in the middle, away from the bad places. But change the I2C clock to 80kHz, and BAM, I'm aiming right for the sensitive spot... (and if it doesn't happen with 80kHz, it'll happen with some value between 80 and 100). |
Here is Gert's description of the I2C bug: |
Haha! I have instrumented my test-board in exactly the same way. I've cut the SCL trace and mounted a 0.1" jumper connector on there. While not measuring there is a normal jumper there, or a female connector with a 100 ohm resistor when I am. I don't think it only happens on the first clock. That would mean I could work around the problem by either NOT clock stretching at all (as most hardware I2C chips probably do), or by causing the clock stretch to last at least a full cycle. I have instrumented my i2c slave code in such a way that I can cause the clock stretch to change by 0.5 microsecond increments. I can command it to change the clock stretch to XXX over I2C and then it works like that for 50 miliseconds before changing back to the default values. (in case it doesn't work we need it working again. This way I can quickly scan a whole range of values for these settings, while monitoring the SDA and SCL lines with a logic analyzer. Sending a "use delay XXX" and then sending a sting takes 25 miliseconds. THen incrementing a counter in the script and sending the "next delay" takes the shell script about 100ms. So I can scan 100 possible values in about 10 seconds. I HAVE seen the bug happening on longer delays. So the theory: "it only happens on the first" is not correct. As to the "super efficient" interrupt handler: Yes, I can do that too. The thing is: The clock stretching feature allows me to put the "handle this byte" code in the interrupt routine and not worry about it taking a few microseconds or a few miliseconds. In theory the clock stretching will make sure that the master waits for me to finish handling this byte before continuing with the next one. That plan is totally busted with the bug in the 2835. I now just cause clock stretching to stop, and then hope to finish handling the byte in the 70 microseconds that follow. I have to "cycle count" to aim for the middle of the 5 microsecond half-clock for me to release the clock stretching. If I'd aim for the start (to cause a 4.9 microsecond clock-high-period) the clock on my slave might be running a few percent faster and cause the bug to trigger a half-cycle earlier. So both ends of the half-clock cycle are dangerous, I have to aim for a short, but valid enough clock cycle. The change in the module is IMHO simple: When SCL is "driven high" (i.e. not driven at all), stop the I2C module master clock. This has the side effect of allowing much longer I2C buses because automatic clock stretching occurs when the bus capacitance is above spec. |
This: http://comments.gmane.org/gmane.linux.kernel.rpi/268 and this: http://news.gmane.org/gmane.linux.kernel.rpi can help? |
Thanks for trying to help, iz8mbw, but those have nothing to do with this (hardware) bug. I've measured my I2C clock and it comes out as 100kHz which the driver thinks it should be. The driver has a "clock source" object which it queries for the frequency and it uses that. I have not yet instrumented my driver to print out the frequency. |
Just for information for people who want to SEE the bug in action I have some "proof" in the form of a logic analsyer dump. In the image LA_with_bug_shot1.png you'll see the short burst of activity: "set delay to XXX" then a small burst that tries to write some 10 bytes, but it gets aborted due to a NACK. On the right you see a proper 10-byte-packet. The SDA line remains low until the next burst that is misaligned (all bits are shifted, and in my slave the "xth-byte-in-packet" counter has not been reset by the non-existant start condition at the start of that packet. In the image LA_with_bug_shot2.png I've zoomed in to the section where the short pulse happens. This the longest pulse that I've seen go wrong: 290ns. (above datasheet-spec for the Atmel!) |
Oh so THAT's where that came from! I've got a project I'm trying to control with I2C from a raspberry pi and I'm honestly close to just giving up and either switching to a beagleboard or having the Pi talk to an FPGA via SPI which then does the I2C stuff. |
I and others are certainly hitting the repeated start issue. I have a userspace hack at http://cpansearch.perl.org/src/MDOOTSON/HiPi-0.26/BCM2835.xs in proc hipi_i2c_read_register_rs. This seems to work for the users who have reported trying it or incorporating the C code ( though that's only 5 people ), even though I had not figured out correctly how to determine that the TX stage had completed. I suppose I need to learn to compile kernel modules and run existing tests against code above. |
Couldn't we just force the kernel to ignore the I2C hardware and use bitbanged GPIO-based I2C on the same pins? I would think the overhead for that would be more acceptable then the bugs in the hardware itself. |
What overhead do you expect? The current driver defaults to 100kHz clock. So reading 10 bytes from an i2c intertial sensor takes about 1ms. If you bitbang, you're likely to be churning CPU cycles all that time.... |
My preference would be for the driver to optionally support bitbanging i2c through a module parameter (or maybe a separate module if that is more convenient). Many devices don't use clock stretching (or clock stretch by a bounded amount so can work correctly at reduced speed), so the hardware I2C driver is still preferable in those cases. |
The kernel itself has a driver for bitbanged I2C on GPIO pins, this driver has much lower overhead than bitbangging from userspace, and exposes the same API as the hardware I2C drivers. My suggestion would be to compile both that driver and the hardware I2C driver as modules, then load the hardware driver by default. |
@Ferroin |
Haven't used the driver on the Pi, but I have a friend who uses it successfully on a BeagleBoard. |
The i2c-gpio module itself does not provide any way to pass parameters to it. So you need another module that registers platform data and configures at least bus number, SDA and SCL pins. I've written proof of concept of such a module and it even worked on pins 0 and 1 but without i2c-bcm2708 loaded (pcf8574 was detected by i2cdetect, I was able to drive some LED with gpio-pcf857x module). But then I burned my pcf8574 when changing the pins (and I don't have any other i2c chips right now) so I couldn't really test it. Here's the code |
@KADAMSI |
That is less trivial than it looks: It expects to be passed a platformdata structure all the way. On the other hand, the platform does not dictate which GPIOs you are going to use today, so putting in the work to change it does sound reasonable. (normally the platform data specifies: "How is /this/ computer wired". ) |
@popcornmix |
@kadamski So, if you produce a new driver that looks harmless for people not using it (i.e. built as a module, and doesn't nothing until loaded), and there's some evidence (i.e. some confirming user reports) it is working correctly, then I'm happy to accept a PR. |
commit 7d3d43d (net: In unregister_netdevice_notifier unregister the netdevices.) makes pktgen crashing at module unload. [ 296.820578] BUG: spinlock bad magic on CPU#6, rmmod/3267 [ 296.820719] lock: ffff880310c38000, .magic: ffff8803, .owner: <none>/-1, .owner_cpu: -1 [ 296.820943] Pid: 3267, comm: rmmod Not tainted 3.4.0-rc5+ raspberrypi#254 [ 296.821079] Call Trace: [ 296.821211] [<ffffffff8168a715>] spin_dump+0x8a/0x8f [ 296.821345] [<ffffffff8168a73b>] spin_bug+0x21/0x26 [ 296.821507] [<ffffffff812b4741>] do_raw_spin_lock+0x131/0x140 [ 296.821648] [<ffffffff8169188e>] _raw_spin_lock+0x1e/0x20 [ 296.821786] [<ffffffffa00cc0fd>] __pktgen_NN_threads+0x4d/0x140 [pktgen] [ 296.821928] [<ffffffffa00ccf8d>] pktgen_device_event+0x10d/0x1e0 [pktgen] [ 296.822073] [<ffffffff8154ed4f>] unregister_netdevice_notifier+0x7f/0x100 [ 296.822216] [<ffffffffa00d2a0b>] pg_cleanup+0x48/0x73 [pktgen] [ 296.822357] [<ffffffff8109528e>] sys_delete_module+0x17e/0x2a0 [ 296.822502] [<ffffffff81699652>] system_call_fastpath+0x16/0x1b Hold the pktgen_thread_lock while splicing pktgen_threads, and test pktgen_exiting in pktgen_device_event() to make unload faster. Signed-off-by: Eric Dumazet <[email protected]> Cc: Eric W. Biederman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Has the HW been fixed on BCM2837 (in RPi 3)? |
Well. It depends on what your slave detects as a valid clock signal. And how much delay the RC on your bus introduces. In my case, the AVR I use as a slave considers pulses less than two clock cycles to be a glitch. With an internal RC clock of 8MHz (+/-10%!!!), the pulse needs to be at least 250ns to be detected. I2C specifications require 5 microseconds, so there is a factor of 20 "margin" between what is officially required and what no longer works. At first I thought I'd increase the clock stretch in my slave produces a bit to say 12us. That works until the bus capacitance adds enough delay to cause a problem again. So it is very difficult to work around. Annoying to have to tell your clients not to use the standard 100kHz i2c. |
Sorry, but it I needed some time to write a full implementaion and test scenario.
Therefore the MEMS and the PCF accept a clock higher then the standard 100 kHz. |
I'm playing with this issue now on a RPi 3 Model B+, and I'm seeing different results than the Advamation article describes. The silicon wasn't revved recently, was it? The short version is, things seem to work if I stretch at read-preACK, write-preACK, or write-postACK; the only time I need to be sure to stretch either long or not-at-all is read-postACK. I still need to clean up my experiment before I can say anything certain, though. First, a note about terminology. The Advamation article says:
The term "I2C-read-ACK-phase" seems somewhat ambiguous. If it's an I2C read command (i.e., the post-address R/W̅ bit is set), the slave would be writing the A̅C̅K̅ bit, not reading it, contrary to the parenthetical. When I say "read-postACK" here, I mean that the R/W̅ bit was set right after the address (and presumably not changed by protocol like SMBus does), so the slave has just written its A̅C̅K̅ response. As has been hinted at around here, it seems that if the RPi senses that SCL is being held low by a slave, it will delay its internal SCL clock by one half-cycle (5μS). The bug is triggered if the slave holds SCL low for almost one half-cycle. If that happens, then the RPi will still let SCL go high, and then pull it down immediately afterwards when its SCL clock triggers. This results in a runt clock pulse, which some devices might see as a pulse while others don't. (For reference: the I2C spec implies a nominal SCL high time of 5μS, and establishes a minimum time of 4.0μS at 100kHz, which is the default speed of the RPi's I2C interface.) In my experiments, in the read-preACK, write-preACK, and write-postACK phases, the RPi would ensure that the maximum SCL high time it would allow was 4μS or so. It seemed willing to push its SCL clock back in those phases. I suspect that the RPi sampled the SCL line shortly after the it released it, and if SCL was still low, it pushed its SCL clock back a half-phase. However, in the read-postACK phase, I did see that the RPi would emit runt clocks (<4μS, often 2μS or less) if the SCL line was held low for about 6–9μS. I suspect that in this phase, the Broadcom engineers follow a different path that tries to make it less likely that they will have to delay a half-phase on a high-capacitance line. Sadly — as we've seen — it falls short. Therefore, my current code needs to stretch SCL beyond ½ cycle in the read-postACK phase of reads that may take a while (such as ADC conversions). If I fail to The current tests that I'm running use a PIC16F1619 as a slave device. The slave device uses a loop to stretch the clock in both read and write commands, both pre- and postACK. The length of this loop plus overhead varies from 3μS (less than the time the RPi holds SCL low) to 90μS in 0.35μS increments. I have a Python program on the RPi that will different values to two registers, then read them back, and assert that the read-back values are correct. The values being used |
Piquan, |
@rewolff Well, there may be something to their interpretation. If, by “first clock stretch” they mean “the first time in a given stage that SCL is stretched by more than one clock translation” then their description is accurate: if you stretch by more than one clock cycle, then yeah: in all phases, if you stretch by >.5 cycle, you’re in good shape. At least, in the vague experiments I’ve done so far; more precise experiments are yet to come. I’m happy to conduct some better experiments to demonstrate when the RPi’s I2C can and can’t handle clock stretches, and get repros that BCM can send to the engineers. Sadly, there’s a very significant silicon rev they’d need to fix this issue; they need to not tick the SCL divider clock when SCL is held low (by an internal or external source). That’s probably going to require a silicon-level change, which is expensive; most ECOs are metal-level, which is… well, not cheap, but way cheaper than silicon. My main goal here is to characterize the types of slaves that would and wouldn’t work with the RPi, with some experimental evidence. It sounds like we need more solid data, such as repro scenarios, to convince Broadcom to re-engineer the I2C controller. Even with solid data, the engineers will have to convince management that it’s worth revising silicon for the next revision. That said, I’ll put some time into designing reproducible scenarios so the BCM engineers can clearly see and test the problem. I’m going to depend on the community members here to communicate my findings to Broadcom, though; I don’t have a channel to them. |
I should clarify something about my previous comment for the community, about silicon vs. metal changes. You know how, if you get a board from SparkFun or Adafruit, it often has traces you can cut or solder together to change its behavior? Most I2C boards have traces you can change to change the address, enable or disable the pull-up resistors, etc. Microchip designers do the same thing. There’s lots of changes they can make on the metal layers of a microchip, so that they don’t have to change the silicon layers. That’s because changing silicon layers is hard and expensive (like changing the chip on a board), while metal layers are a lot easier to change (like changing the traces on a board). Many useful changes can be made in the metal layers, because designers think ahead and make metal connections (jumpers, essentially) that change the behavior of a chip. But if there’s a change they didn’t plan for, and can’t figure out how to make by rerouting traces within the chip, then they have to change the silicon. Silicon needs a lot more planning, verification, and masking, so it’s hard to change. Many companies, if they have a silicon layer that’s “good enough”, will use it as long as possible. They’ll reuse some parts of the silicon layer in multiple revisions of the chip. I suspect that’s where BCM is at: they need good evidence that this bug can’t be fixed in drivers or metal before they can justify re-engineering this part of the silicon layer from scratch. |
It is extremely unlikely that there will be a respin of the exisiting silicon just to fix this issue (got a spare $1M?). Future chips may well have a fix though, who knows. |
On Wed, Sep 05, 2018 at 02:13:19AM -0700, Joel Ray Holveck wrote:
It sounds like we need more solid data, such as repro scenarios, to
convince Broadcom to re-engineer the I2C controller. Even with
solid data, the engineers will have to convince management that
it’s worth revising silicon for the next revision.
They already have had six years and 3 silicon revisions where they
have demonstrated their decision: "won't fix".
if you stretch by >.5 cycle, you~re in good shape.
I do not think so. Every time you release CLK in the "just before the
broadcom looks again" time period, you get short pulses. I can easily
delay the release of the clock signal by say 5 useless microseconds to
always be out of the 0.5 cycle. But that did not work.
|
On Wed, Sep 05, 2018 at 02:28:53AM -0700, James Hughes wrote:
It is extremely unlikely that there will be a respin of the
exisiting silicon just to fix this issue (got a spare $1M?). Future
chips may well have the fix though.
I doubt it. We went from BCM2835 with KNOWN bug to the BCM2836 with
the same bug to the BCM2837 with the exact same bug to the BCM2837B0.
Now why do you think the next revision will be different?
BCM has a target market: Phones and TV-sticks for these chips. In
their target market nobody is doing software-I2C on general purpose
microcontrollers like AVRs and PICs. they are NOT going to fix this.
|
It is a shame this investigative effort is required, great though it is. When I asked before for an erratum that covers this it was thought there wasn't one. Has that now changed and there's a URL? Ideally, Broadcom would help the community by precisely describing the fault and when it occurs so workarounds or evaluations of compatible components would be easier to work out with confidence. Even to the extent of the snippet of relevant HDL if they don't want to attempt to translate that accurately into a description. |
The suggestion to halt the half-bit-clock-divisor when allowing clock stretching is a good one. It would work. The problem with that is that part of the I2C module would then need to run at "full clock speed". Now the I2C module is a nice low-power one because it runs normally not faster than 200kHz. |
The main reason for my post was what I said at the beginning of my original post:
If we had significant changes to the I2C hardware or driver after the 2013 Advamation writeup, and the community knows how they may affect this bug, then I should make appropriate changes to my experiments. I know the fundamental problem is in hardware, but for all I know, somebody may have found by experiment that changing a seemingly-unrelated register in the driver actually changes this bug's behavior. I also haven't confirmed that the BCM2837 has the same I2C module as the RPi 2 (the most recent time I could find definitive confirmation that the I2C hardware was unchanged since this problem was first reported). I also haven't found any information about the any of the BCM283x steppings; for all I know, there could have been a metal change in the BCM2836 era. I'm not interested in bashing Broadcom, speculating how they may change future revisions, or anything like that. My primary goal here is to document clearly the current behavior; it looks like it's changed over time. I have a secondary goal to allow future users to reproduce my experiments to see if the results are still valid, or to make improvements or test other conditions. I have another secondary goal to allow Broadcom engineers to reproduce the tests easily if they want to. Both of these mean I need to publish my experiments, which also helps: because of my own hubris, if I'm publishing my experiments, I'll be more driven to make them rigorous. My current experimental plan is to use a PIC16F1619 as the slave implementation; it's fast and versatile enough to do the clock stretching experiments, easy enough to work with, and common enough that other experimenters can reproduce my results. If the results are ambiguous, I might also build an experiment with an FPGA. I might see if I can get reasonable granularity using another Raspberry Pi in a bit-bashing configuration as the slave, but I suspect I'll need something without a whole OS on it to do real-time testing. To address another couple of things that came up: Sorry that my original post was incomplete; I accidentally posted an early draft. I'll leave it as-is, since the changes I made are pretty minor. I don't claim originality on the idea of halting the clock divisor while SCL was externally held low. I read that somewhere else, but don't remember where; I thought it was from Gert van Loo in the I2C clock stretching thread on raspberrypi.org, but I can't find it now, so I must have seen that elsewhere. I'm not convinced that it would change the power profile significantly, |
The I2C HW block has not been fixed/changed, and is unlikely to be, in the bcm2835/6/7 as that would require respins of the chip which is simply much too expensive. I do not know who Advamation are, but they are unrealted to the RPF(T) group, so I cannot guarantee the articles accuracy. |
Dear all, I got a very strange phainomenon on my Pi 2B and 3, I try to explain: |
@hmf2015 This issue is about a very specific I2C problem. Problems with the SD card are not caused by this I2C problem. If your SD card is not working right, you should try a new SD card. Some cards are not good, and stop working quickly. My SD card stopped working last month, and I had to replace it. Usually, I2C ghosts are not because of this problem. They most often happen because:
I said that we're talking about a particular problem. Some devices might make ghosts because of this problem, but not many devices. If devices make ghosts because of this problem, they don't make many ghosts. I think your problem is different. I don't know what you mean when you say that you "noted leaks on the sd card". That's an important problem, but it is not a problem that we talk about in this web page's issue. This conversation is about a particular bug. That bug does not affect SD cards. I do not think that recent Raspbian changed how I2C works. We think that Raspbian cannot fix this bug. We might be wrong. Thank you for telling us about this. If you are only telling us this information to help us, then we thank you. If you are telling us this information to ask for help, you might want to ask at the Raspberry Pi forums. The forums talk about many things, like I2C, SD cards, GPIO extenders, and many more things. The things you are seeing may be the same things we are talking about, but they may be different things. The people at the forums can help you to know which problems you are seeing. Thank you for telling us this information! |
Thanks for the answer, but the sd card problem was produced by the O/S, I checked the card on my (Unix) host, and it was ok, and now the same cad works perfekt after new installation. |
Hi hmf2015, |
@popcornmix |
@sergey-suloev When i2cdetect does not continue scanning after 8 or 16 addresses, I would say that this is unrelated to this issue. The issue pops up when devices do clock stretching to indicate that they are not yet ready for the next clock cycle. When I2C was designed almost 40 years ago, that might have been conceivable for hardware implementations too. But nowadays, hardware is always ready in time to handle the next clockcycle because 10 microseconds is ages of time for something implemented in hardware. |
Both drd and gadget interrupt handler use the struct cdns3 pointer as dev_id, it causes devm_free_irq at cdns3_gadget_exit doesn't free gadget's interrupt handler, it freed drd's handler. So, when the host interrupt occurs, the gadget's interrupt hanlder is still called, and causes below oops. To fix it, we use gadget's private data priv_dev as interrupt dev_id for gadget. Unable to handle kernel NULL pointer dereference at virtual address 0000000000000380 Mem abort info: ESR = 0x96000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=0000000971d79000 [0000000000000380] pgd=0000000971d6f003, pud=0000000971d6e003, pmd=0000000000000000 Internal error: Oops: 96000006 [#1] PREEMPT SMP Modules linked in: mxc_jpeg_encdec crct10dif_ce fsl_imx8_ddr_perf CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-03486-g69f4e7d9c54a-dirty #254 Hardware name: Freescale i.MX8QM MEK (DT) pstate: 00000085 (nzcv daIf -PAN -UAO) pc : cdns3_device_irq_handler+0x1c/0xb8 lr : __handle_irq_event_percpu+0x78/0x2c0 sp : ffff800010003e30 x29: ffff800010003e30 x28: ffff8000129bb000 x27: ffff8000126e9000 x26: ffff0008f61b5600 x25: ffff800011fe1018 x24: ffff8000126ea120 x23: ffff800010003f04 x22: 0000000000000000 x21: 0000000000000093 x20: ffff0008f61b5600 x19: ffff0008f5061a80 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 003d090000000000 x13: 00003d0900000000 x12: 0000000000000000 x11: 00003d0900000000 x10: 0000000000000040 x9 : ffff800012708cb8 x8 : ffff800012708cb0 x7 : ffff0008f7c7a9d0 x6 : 0000000000000000 x5 : ffff0008f7c7a910 x4 : ffff8008ed359000 x3 : ffff800010003f40 x2 : 0000000000000000 x1 : ffff0008f5061a80 x0 : ffff800010161a60 Call trace: cdns3_device_irq_handler+0x1c/0xb8 __handle_irq_event_percpu+0x78/0x2c0 handle_irq_event_percpu+0x40/0x98 handle_irq_event+0x4c/0xd0 handle_fasteoi_irq+0xbc/0x168 generic_handle_irq+0x34/0x50 __handle_domain_irq+0x6c/0xc0 gic_handle_irq+0xd4/0x174 el1_irq+0xb8/0x180 arch_cpu_idle+0x3c/0x230 default_idle_call+0x38/0x40 do_idle+0x20c/0x298 cpu_startup_entry+0x28/0x48 rest_init+0xdc/0xe8 arch_call_rest_init+0x14/0x1c start_kernel+0x48c/0x4b8 Code: aa0103f3 aa1e03e0 d503201f f9409662 (f941c040) ---[ end trace 091dcf4dee011b0e ]--- Kernel panic - not syncing: Fatal exception in interrupt SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x0002,2100600c Memory Limit: none ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Fixes: 7733f6c ("usb: cdns3: Add Cadence USB3 DRD Driver") Cc: <[email protected]> #v5.4 Signed-off-by: Peter Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit af58e1f upstream. Both drd and gadget interrupt handler use the struct cdns3 pointer as dev_id, it causes devm_free_irq at cdns3_gadget_exit doesn't free gadget's interrupt handler, it freed drd's handler. So, when the host interrupt occurs, the gadget's interrupt hanlder is still called, and causes below oops. To fix it, we use gadget's private data priv_dev as interrupt dev_id for gadget. Unable to handle kernel NULL pointer dereference at virtual address 0000000000000380 Mem abort info: ESR = 0x96000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=0000000971d79000 [0000000000000380] pgd=0000000971d6f003, pud=0000000971d6e003, pmd=0000000000000000 Internal error: Oops: 96000006 [#1] PREEMPT SMP Modules linked in: mxc_jpeg_encdec crct10dif_ce fsl_imx8_ddr_perf CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-03486-g69f4e7d9c54a-dirty #254 Hardware name: Freescale i.MX8QM MEK (DT) pstate: 00000085 (nzcv daIf -PAN -UAO) pc : cdns3_device_irq_handler+0x1c/0xb8 lr : __handle_irq_event_percpu+0x78/0x2c0 sp : ffff800010003e30 x29: ffff800010003e30 x28: ffff8000129bb000 x27: ffff8000126e9000 x26: ffff0008f61b5600 x25: ffff800011fe1018 x24: ffff8000126ea120 x23: ffff800010003f04 x22: 0000000000000000 x21: 0000000000000093 x20: ffff0008f61b5600 x19: ffff0008f5061a80 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 003d090000000000 x13: 00003d0900000000 x12: 0000000000000000 x11: 00003d0900000000 x10: 0000000000000040 x9 : ffff800012708cb8 x8 : ffff800012708cb0 x7 : ffff0008f7c7a9d0 x6 : 0000000000000000 x5 : ffff0008f7c7a910 x4 : ffff8008ed359000 x3 : ffff800010003f40 x2 : 0000000000000000 x1 : ffff0008f5061a80 x0 : ffff800010161a60 Call trace: cdns3_device_irq_handler+0x1c/0xb8 __handle_irq_event_percpu+0x78/0x2c0 handle_irq_event_percpu+0x40/0x98 handle_irq_event+0x4c/0xd0 handle_fasteoi_irq+0xbc/0x168 generic_handle_irq+0x34/0x50 __handle_domain_irq+0x6c/0xc0 gic_handle_irq+0xd4/0x174 el1_irq+0xb8/0x180 arch_cpu_idle+0x3c/0x230 default_idle_call+0x38/0x40 do_idle+0x20c/0x298 cpu_startup_entry+0x28/0x48 rest_init+0xdc/0xe8 arch_call_rest_init+0x14/0x1c start_kernel+0x48c/0x4b8 Code: aa0103f3 aa1e03e0 d503201f f9409662 (f941c040) ---[ end trace 091dcf4dee011b0e ]--- Kernel panic - not syncing: Fatal exception in interrupt SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x0002,2100600c Memory Limit: none ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Fixes: 7733f6c ("usb: cdns3: Add Cadence USB3 DRD Driver") Cc: <[email protected]> #v5.4 Signed-off-by: Peter Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
Add various tests to check maximum number of supported programs being attached: # ./vmtest.sh -- ./test_progs -t tc_opts [...] ./test_progs -t tc_opts [ 1.185325] bpf_testmod: loading out-of-tree module taints kernel. [ 1.186826] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel [ 1.270123] tsc: Refined TSC clocksource calibration: 3407.988 MHz [ 1.272428] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fc932722, max_idle_ns: 440795381586 ns [ 1.276408] clocksource: Switched to clocksource tsc #252 tc_opts_after:OK #253 tc_opts_append:OK #254 tc_opts_basic:OK #255 tc_opts_before:OK #256 tc_opts_chain_classic:OK #257 tc_opts_chain_mixed:OK #258 tc_opts_delete_empty:OK #259 tc_opts_demixed:OK #260 tc_opts_detach:OK #261 tc_opts_detach_after:OK #262 tc_opts_detach_before:OK #263 tc_opts_dev_cleanup:OK #264 tc_opts_invalid:OK #265 tc_opts_max:OK <--- (new test) #266 tc_opts_mixed:OK #267 tc_opts_prepend:OK #268 tc_opts_replace:OK #269 tc_opts_revision:OK Summary: 18/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
Add a new test case which performs double query of the bpf_mprog through libbpf API, but also via raw bpf(2) syscall. This is testing to gather first the count and then in a subsequent probe the full information with the program array without clearing passed structs in between. # ./vmtest.sh -- ./test_progs -t tc_opts [...] ./test_progs -t tc_opts [ 1.398818] tsc: Refined TSC clocksource calibration: 3407.999 MHz [ 1.400263] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fd336761, max_idle_ns: 440795243819 ns [ 1.402734] clocksource: Switched to clocksource tsc [ 1.426639] bpf_testmod: loading out-of-tree module taints kernel. [ 1.428112] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel #252 tc_opts_after:OK #253 tc_opts_append:OK #254 tc_opts_basic:OK #255 tc_opts_before:OK #256 tc_opts_chain_classic:OK #257 tc_opts_chain_mixed:OK #258 tc_opts_delete_empty:OK #259 tc_opts_demixed:OK #260 tc_opts_detach:OK #261 tc_opts_detach_after:OK #262 tc_opts_detach_before:OK #263 tc_opts_dev_cleanup:OK #264 tc_opts_invalid:OK #265 tc_opts_max:OK #266 tc_opts_mixed:OK #267 tc_opts_prepend:OK #268 tc_opts_query:OK <--- (new test) #269 tc_opts_replace:OK #270 tc_opts_revision:OK Summary: 19/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
Add a new test case to query on an empty bpf_mprog and pass the revision directly into expected_revision for attachment to assert that this does succeed. ./test_progs -t tc_opts [ 1.406778] tsc: Refined TSC clocksource calibration: 3407.990 MHz [ 1.408863] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fcaf6eb0, max_idle_ns: 440795321766 ns [ 1.412419] clocksource: Switched to clocksource tsc [ 1.428671] bpf_testmod: loading out-of-tree module taints kernel. [ 1.430260] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel #252 tc_opts_after:OK #253 tc_opts_append:OK #254 tc_opts_basic:OK #255 tc_opts_before:OK #256 tc_opts_chain_classic:OK #257 tc_opts_chain_mixed:OK #258 tc_opts_delete_empty:OK #259 tc_opts_demixed:OK #260 tc_opts_detach:OK #261 tc_opts_detach_after:OK #262 tc_opts_detach_before:OK #263 tc_opts_dev_cleanup:OK #264 tc_opts_invalid:OK #265 tc_opts_max:OK #266 tc_opts_mixed:OK #267 tc_opts_prepend:OK #268 tc_opts_query:OK #269 tc_opts_query_attach:OK <--- (new test) #270 tc_opts_replace:OK #271 tc_opts_revision:OK Summary: 20/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
… rules rx_create no longer allocates a modify_hdr instance that needs to be cleaned up. The mlx5_modify_header_dealloc call will lead to a NULL pointer dereference. A leak in the rules also previously occurred since there are now two rules populated related to status. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 109907067 P4D 109907067 PUD 116890067 PMD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 484 Comm: ip Not tainted 6.9.0-rc2-rrameshbabu+ #254 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014 RIP: 0010:mlx5_modify_header_dealloc+0xd/0x70 <snip> Call Trace: <TASK> ? show_regs+0x60/0x70 ? __die+0x24/0x70 ? page_fault_oops+0x15f/0x430 ? free_to_partial_list.constprop.0+0x79/0x150 ? do_user_addr_fault+0x2c9/0x5c0 ? exc_page_fault+0x63/0x110 ? asm_exc_page_fault+0x27/0x30 ? mlx5_modify_header_dealloc+0xd/0x70 rx_create+0x374/0x590 rx_add_rule+0x3ad/0x500 ? rx_add_rule+0x3ad/0x500 ? mlx5_cmd_exec+0x2c/0x40 ? mlx5_create_ipsec_obj+0xd6/0x200 mlx5e_accel_ipsec_fs_add_rule+0x31/0xf0 mlx5e_xfrm_add_state+0x426/0xc00 <snip> Fixes: 94af50c ("net/mlx5e: Unify esw and normal IPsec status table creation/destruction") Signed-off-by: Rahul Rameshbabu <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
… rules [ Upstream commit 16d66a4 ] rx_create no longer allocates a modify_hdr instance that needs to be cleaned up. The mlx5_modify_header_dealloc call will lead to a NULL pointer dereference. A leak in the rules also previously occurred since there are now two rules populated related to status. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 109907067 P4D 109907067 PUD 116890067 PMD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 484 Comm: ip Not tainted 6.9.0-rc2-rrameshbabu+ #254 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014 RIP: 0010:mlx5_modify_header_dealloc+0xd/0x70 <snip> Call Trace: <TASK> ? show_regs+0x60/0x70 ? __die+0x24/0x70 ? page_fault_oops+0x15f/0x430 ? free_to_partial_list.constprop.0+0x79/0x150 ? do_user_addr_fault+0x2c9/0x5c0 ? exc_page_fault+0x63/0x110 ? asm_exc_page_fault+0x27/0x30 ? mlx5_modify_header_dealloc+0xd/0x70 rx_create+0x374/0x590 rx_add_rule+0x3ad/0x500 ? rx_add_rule+0x3ad/0x500 ? mlx5_cmd_exec+0x2c/0x40 ? mlx5_create_ipsec_obj+0xd6/0x200 mlx5e_accel_ipsec_fs_add_rule+0x31/0xf0 mlx5e_xfrm_add_state+0x426/0xc00 <snip> Fixes: 94af50c ("net/mlx5e: Unify esw and normal IPsec status table creation/destruction") Signed-off-by: Rahul Rameshbabu <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
… rules [ Upstream commit 16d66a4 ] rx_create no longer allocates a modify_hdr instance that needs to be cleaned up. The mlx5_modify_header_dealloc call will lead to a NULL pointer dereference. A leak in the rules also previously occurred since there are now two rules populated related to status. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 109907067 P4D 109907067 PUD 116890067 PMD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 484 Comm: ip Not tainted 6.9.0-rc2-rrameshbabu+ #254 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014 RIP: 0010:mlx5_modify_header_dealloc+0xd/0x70 <snip> Call Trace: <TASK> ? show_regs+0x60/0x70 ? __die+0x24/0x70 ? page_fault_oops+0x15f/0x430 ? free_to_partial_list.constprop.0+0x79/0x150 ? do_user_addr_fault+0x2c9/0x5c0 ? exc_page_fault+0x63/0x110 ? asm_exc_page_fault+0x27/0x30 ? mlx5_modify_header_dealloc+0xd/0x70 rx_create+0x374/0x590 rx_add_rule+0x3ad/0x500 ? rx_add_rule+0x3ad/0x500 ? mlx5_cmd_exec+0x2c/0x40 ? mlx5_create_ipsec_obj+0xd6/0x200 mlx5e_accel_ipsec_fs_add_rule+0x31/0xf0 mlx5e_xfrm_add_state+0x426/0xc00 <snip> Fixes: 94af50c ("net/mlx5e: Unify esw and normal IPsec status table creation/destruction") Signed-off-by: Rahul Rameshbabu <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
… rules [ Upstream commit 16d66a4 ] rx_create no longer allocates a modify_hdr instance that needs to be cleaned up. The mlx5_modify_header_dealloc call will lead to a NULL pointer dereference. A leak in the rules also previously occurred since there are now two rules populated related to status. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 109907067 P4D 109907067 PUD 116890067 PMD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 484 Comm: ip Not tainted 6.9.0-rc2-rrameshbabu+ #254 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014 RIP: 0010:mlx5_modify_header_dealloc+0xd/0x70 <snip> Call Trace: <TASK> ? show_regs+0x60/0x70 ? __die+0x24/0x70 ? page_fault_oops+0x15f/0x430 ? free_to_partial_list.constprop.0+0x79/0x150 ? do_user_addr_fault+0x2c9/0x5c0 ? exc_page_fault+0x63/0x110 ? asm_exc_page_fault+0x27/0x30 ? mlx5_modify_header_dealloc+0xd/0x70 rx_create+0x374/0x590 rx_add_rule+0x3ad/0x500 ? rx_add_rule+0x3ad/0x500 ? mlx5_cmd_exec+0x2c/0x40 ? mlx5_create_ipsec_obj+0xd6/0x200 mlx5e_accel_ipsec_fs_add_rule+0x31/0xf0 mlx5e_xfrm_add_state+0x426/0xc00 <snip> Fixes: 94af50c ("net/mlx5e: Unify esw and normal IPsec status table creation/destruction") Signed-off-by: Rahul Rameshbabu <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
The broadcom BCM2835 has a (hardware) bug in the I2C module.
The BCM implements "clock stretching" in a bad way:
Instead of ensuring a minimum "hightime" for the clock, the BCM lets the I2C clock internally run on, and when it's time to flip the clock again, it checks wether or not the clock has become high, and if it is low, it does a "clock stretch". However if just in the few nanoseconds before this check the slave releases the clock signal indicating it is ready for a new clock, the clock pulse may become as short as 40ns. (that's the lowest I'm able to measure, and that of course is quite rare). With a 100kHz (10 microseconds) I2C clock, the "bad" moments to release the clock are 5 microseconds apart.
My slaves require a clock pulse of at least 500ns. (which should be a factor of ten shorter than what the master should generate when it is configured for 100kHz operation).
That is a hardware bug that we can't do much about at the moment. ( Tip for the hardware guys: instead of implementing clock stretching in this way: inhibit the counting of the CLOCKDIV counter while the clock remains low. This will probably clock-stretch a few fast-clocks on every cycle, but standards conformance will improve substantially).
When this occurs, the BCM will send a byte, with a very short first clock pulse. The slave then doesn't see the first clock pulse, so when at the end of that byte it is time to ACK that byte, the slave is still waiting for the 8th bit. The BCM interprets this as a "NAK" and aborts the I2C transfer. However at that point in time the slave sees its 8th clock pulse and issues an "ACK" (Pulls the SDA line low). This in turn means that the BCM cannot issue a proper "START" condition for the next transfer.
Suggested software workaround to reduce the impact of this hardware bug: When a transfer is aborted due to a "NAK" for a byte, issue a single extra clock cycle before the STOP to synchronize the slave. This probably has to be done in software.
This only happens in the "error path" when the slave doesn't ACK a byte. Maybe it's best to only do this when it is NOT the first byte in a transfer, as the bug cannot happen on the first byte of a transfer.
Does this software fix belong in the driver? Yes. It's a driver for a piece of hardware that has a bug. It is the duty of the driver to make this buggy hardware work the best it can.
The text was updated successfully, but these errors were encountered: