Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] OLED over I2C on RP2040 causes secondary half of split keyboard to slowdown and crash #17720

Closed
3 tasks
CodyMathis123 opened this issue Jul 19, 2022 · 31 comments · Fixed by ChibiOS/ChibiOS-Contrib#329

Comments

@CodyMathis123
Copy link

CodyMathis123 commented Jul 19, 2022

Describe the Bug

When OLED screens are enabled and used on a KB2040 (Assuming this affects other RP2040 as well) over I2C, the secondary side of the keyboard becomes slow, missing keystrokes, and eventually unresponsive.

System Information

Keyboard: Corne / CRKBD
Revision (if applicable): v3
Operating system: Windows 10
qmk doctor output:

Ψ QMK Doctor is checking your environment.
Ψ CLI version: 1.1.0
Ψ QMK home: C:/Users/<username>/source/keebs/rp2040/qmk_firmware
Ψ Detected Windows 10 (10.0.19043).
Ψ Git branch: develop
Ψ Repo version: 0.11.53
⚠ Git has unstashed/uncommitted changes.
Ψ All dependencies are installed.
Ψ Found arm-none-eabi-gcc version 10.1.0
Ψ Found avr-gcc version 8.4.0
Ψ Found avrdude version 6.4
Ψ Found dfu-util version 0.11
Ψ Found dfu-programmer version 0.7.2
Ψ Submodules are up to date.
Ψ QMK is ready to go, but minor problems were found

Any keyboard related software installed?

  • AutoHotKey (Windows)
  • Karabiner (macOS)
  • Other:

Additional Context

I am not sure if this only impacts split keyboards or not, but it only seems to impact the secondary side of the split.

@CodyMathis123 CodyMathis123 changed the title [Bug] OLED over I2C on RP2040 causes (split?) keyboard slowdown and crash [Bug] OLED over I2C on RP2040 causes secondary half of split keyboard to slowdown and crash Jul 19, 2022
@daskygit
Copy link
Member

This isn't an issue I've experienced using my corne with sparkfun rp2040s. I'm assuming you have the OLEDs fitted? Do you have a branch with the code you are using to cause this issue?

@CodyMathis123
Copy link
Author

CodyMathis123 commented Jul 19, 2022

@daskygit

https://github.com/CodyMathis123/qmk_firmware/tree/develop/keyboards/crkbd/keymaps/codymathis123

Here you go! And yes, I do have the OLED fitted on both sides.

@Jpe230
Copy link
Contributor

Jpe230 commented Jul 19, 2022

I have experienced this behavior in boards (CRKBDs and TidBit) with either the KB2040 and PM-RP2040.

In my case, it was due to i2c errors thanks to some poorly solder joints.

I suggest you to enable debug and see if any failed transaction occur :)

@CodyMathis123
Copy link
Author

I can reproduce this across two separate keyboards, and it does not happen on an elite-c, for example.

I will look at enabling debug though!

@CodyMathis123
Copy link
Author

CodyMathis123 commented Jul 19, 2022

Set

CONSOLE_ENABLE = yes

in rules.mk

and added the below to my keymap.c

void keyboard_post_init_user(void) {
  // Customise these values to desired behaviour
  debug_enable=true;
  debug_matrix=true;
  //debug_keyboard=true;
  //debug_mouse=true;
}

The attached file is the output from QMK toolbox. The issue was occurring at the time of capture. Toolbox would stop outputting when keypresses would fail on the secondary side.

crkbd-debug.txt

@Jpe230
Copy link
Contributor

Jpe230 commented Jul 19, 2022

Interesting...

Does the bug present itself when using the other half as master? Or even when only using just 1 half without the other connected?

@CodyMathis123
Copy link
Author

It is always the secondary side that stops responding to keypresses. So I can plug in either side, and the secondary will have the issue.

The 'main' half always types as expected.

@amadea-system
Copy link

I have also experienced this issue on my crkbd w/ KB2040.
And just like @CodyMathis123, it only ever happens on the secondary side.
When I swap which side is Primary/Secondary, the issue always moves over to the 'new' Secondary side.
The issue does not seem to happen immediately upon power up, but a random amount of time later. Anywhere from a minute to an hour I would say.

When it happens on my keyboard I notice the following symptoms, all exclusively on the Secondary side.

  • Keypresses are still recognized, but not well.
  • The OLED stops updating.
  • RGB Matrix updates at 1-2 fps.

@Jpe230
Copy link
Contributor

Jpe230 commented Jul 19, 2022

Interesting, just flashed @CodyMathis123 keymap in my Crkbd with KB2040s and it locks up after a minute or two.

Funny enough my keymap which uses a simple oled_write_P for the slave doesn't present it. Gonna try later with my PM-RP2040 just to see if this is a behavior only present in KB2040s.

Update: using an animation taken from my AVR CRKBD also causes lock up...

@Jpe230
Copy link
Contributor

Jpe230 commented Jul 20, 2022

Just tested with both PM-RP2040 and KB2040, either as a slave or master, and in all cases the slave dies the same.

I'm using this to reliably crash my slave half:

void render_slave_oled() {
	static int current_block = 0;

	current_block = (current_block + 1) % 528;

	char white_bg[528] = {0};
	memset(white_bg, 0, 528);
	memset(white_bg, 0xff, current_block);
	oled_write_raw(white_bg, 528);
}

@CodyMathis123 @amadea-system, can you confirm that this is the exact behavior as yours?

Img.1456-1.mp4

(KB2040 as Master and PM-RP2040 as slave)

@tzarc
Copy link
Member

tzarc commented Jul 20, 2022

Shouldn't they be 512 bytes? (128*32/8)=512?
Also perhaps try making the white_bg a static -- if it's on the stack instead of the heap the compiler sometimes has a fit.

@Jpe230
Copy link
Contributor

Jpe230 commented Jul 20, 2022

Oh shoot! My maths did wrong to me.

But sadly even by changing the size and making it static still locks it up hmmm

Update: It makes sense that changing the size doesn't affect the reproducibility of the bug thanks to the sanity check in the oled driver for dumb mistakes like mine...

@daskygit
Copy link
Member

Yeah I can confirm the issue, the RGB matrix and key issues aren't as apparent when you have a slow OLED_INTERVAL defined

Using the following code Jpe230 provided triggers the issue pretty quickly, thanks for that.

	static int current_block = 0;

	current_block = (current_block + 1) % 512;

	static char white_bg[512] = {0};
	memset(white_bg, 0, 512);
	memset(white_bg, 0xff, current_block);
    oled_write_raw(white_bg, 512);

From some quick initial testing calling i2c_stop on an I2C error seems to resolve the problem. Do you mind giving this a test please?

diff --git a/platforms/chibios/drivers/i2c_master.c b/platforms/chibios/drivers/i2c_master.c
index 21e064b1dc..cbf3b84d9f 100644
--- a/platforms/chibios/drivers/i2c_master.c
+++ b/platforms/chibios/drivers/i2c_master.c
@@ -115,6 +115,7 @@ static i2c_status_t chibios_to_qmk(const msg_t* status) {
             return I2C_STATUS_TIMEOUT;
         // I2C_BUS_ERROR, I2C_ARBITRATION_LOST, I2C_ACK_FAILURE, I2C_OVERRUN, I2C_PEC_ERROR, I2C_SMB_ALERT
         default:
+            i2c_stop();
             return I2C_STATUS_ERROR;
     }
 }

@CodyMathis123
Copy link
Author

This does seem to resolve the issue! Greatly appreciated. I will keep an eye on this and see if it reoccurs in any way.

I am curious about what I2C errors are happening? And should they be happening?

@KarlK90
Copy link
Member

KarlK90 commented Jul 20, 2022

@Jpe230 @CodyMathis123 @daskygit I'm not sure on the root cause of the problem (yet), but I was able to reproduce the lockup easily. The cause was an I2C IRQ tail chaining of the TX FIFO EMPTY event, after a transmission was done. My fix is to disable all IRQ sources after a successful transmission (just like in the zephyr driver). That solved the lockups for me.

Changes are already in a PR -> ChibiOS/ChibiOS-Contrib#329

@jaygreco
Copy link
Contributor

I was going to open an issue after discovering and digging into the same thing -- I just wanted to add my anecdata to the mix. I have seen in most frequently on the slave side while testing the SNAP, but have also seen the same lockups on the master side albeit with a lower rate of occurrence. The problem seems specific to RP2040 as I haven't seen any I2C errors with AVR on the same board.

@KarlK90
Copy link
Member

KarlK90 commented Jul 21, 2022

@jaygreco could you test the ChibiOS-contrib PR and see if this fixes your lockups as well?

@jaygreco
Copy link
Contributor

@jaygreco could you test the ChibiOS-contrib PR and see if this fixes your lockups as well?

No problem. I'll grab your PR and test with my setup and share results.

@jaygreco
Copy link
Contributor

As of this afternoon, I was able to reproduce the issue on both master and slave of my test board with your ChibiOS-contrib patch pulled in. Is there any additional debug info I can pull during the failure that might help?

@Jpe230
Copy link
Contributor

Jpe230 commented Jul 22, 2022

Chiming in...the bug is still reproducible with the linked PR.

I now managed to reliably crash the master half by spamming I2C commands to the OLED display

@KarlK90
Copy link
Member

KarlK90 commented Jul 22, 2022

Interesting, could both of you post your configs and the code that make it crash?

@Jpe230
Copy link
Contributor

Jpe230 commented Jul 22, 2022

Sure!

My config is here: https://github.com/Jpe230/Jpe230sKeebs/blob/main/keyboards/crkbd/keymaps/jpe230_rp/config.h

And to crash the board I use this:

diff --git a/drivers/oled/ssd1306_sh1106.c b/drivers/oled/ssd1306_sh1106.c
index 30cfeb5648..1c1453e9a2 100644
--- a/drivers/oled/ssd1306_sh1106.c
+++ b/drivers/oled/ssd1306_sh1106.c
@@ -295,17 +295,8 @@ void oled_render(void) {
         return;
     }
 
-    // Do we have work to do?
-    oled_dirty &= OLED_ALL_BLOCKS_MASK;
-    if (!oled_dirty || oled_scrolling) {
-        return;
-    }
-
     // Find first dirty block
     uint8_t update_start = 0;
-    while (!(oled_dirty & ((OLED_BLOCK_TYPE)1 << update_start))) {
-        ++update_start;
-    }
 
     // Set column & page position
     static uint8_t display_start[] = {I2C_CMD, COLUMN_ADDR, 0, OLED_DISPLAY_WIDTH - 1, PAGE_ADDR, 0, OLED_DISPLAY_HEIGHT / 8 - 1};
@@ -319,6 +310,8 @@ void oled_render(void) {
     if (I2C_TRANSMIT(display_start) != I2C_STATUS_SUCCESS) {
         print("oled_render offset command failed\n");
         return;
+    }else{
+        print("oled_render offset command succeded\n");
     }
 
     if (!HAS_FLAGS(oled_rotation, OLED_ROTATION_90)) {
@@ -346,7 +339,9 @@ void oled_render(void) {
     }
 
     // Turn on display if it is off
-    oled_on();
+    //oled_on();
+    oled_active = true;
+    oled_off();
 
     // Clear dirty flag
     oled_dirty &= ~((OLED_BLOCK_TYPE)1 << update_start);

@KarlK90
Copy link
Member

KarlK90 commented Jul 22, 2022

Thanks!

@jaygreco @Jpe230 could you try out this contrib patch on top of my pr? It is hacky atm (and the poll logic is wrong oops) as I'm away until monday to clean it up but forces a i2c abort if it can't disable the peripheral. It solved the lockups on my side.

i2c_abort.patch.txt

@Jpe230
Copy link
Contributor

Jpe230 commented Jul 22, 2022

45 min with the patch applied and no lockups -- Gonna leave it plugged for the rest of the day but so far it seems that it has resolved the lockups 👯

@jaygreco
Copy link
Contributor

Thanks @KarlK90 -- happy to test over the weekend. No worries there. I'll post some updates in a few days.

@jaygreco
Copy link
Contributor

It's looking good on my end -- no OLED hangs so far.

@KarlK90
Copy link
Member

KarlK90 commented Jul 26, 2022

PRs:

ChibiOS-contrib: ChibiOS/ChibiOS-Contrib#329
QMK: #17798

I have mostly rewritten the rp2040 i2c driver in the last couple of days and pushed to current state of affairs to chibios-contrib.

The driver now fully utilizes the rx/tx fifos for sending and receiving data and has proper error handling. With this the driver withstands a split keyboard with oleds running a derived torture test by @Jpe230 and can still read out a cirque trackpad connected on the secondary half. QMK needed another update in the handling of i2c errors, which is a continuation of @daskygit work.

Please try out these changes and leave your feedback in the PRs if this fixes your problems.

Torture test:

void housekeeping_task_user(void) {
	static int current_block = 0;

	current_block = (current_block + 1) % 512;

	static char white_bg[512] = {0};
	memset(white_bg, 0, 512);
	memset(white_bg, 0xff, current_block);
    oled_write_raw(white_bg, 512);
}

@0xcharly
Copy link
Contributor

Looking good as well on my end: currently running the Dilemma with OLED on the slave side, Cirque trackpad on the master side, and I haven't had any issues in the last 6 hours that the keyboard has been plugged in after being flashed with the patches referenced in the comment above. The same configuration used to lock up in just a few minutes before those changes.

Thanks, @KarlK90!

@KarlK90
Copy link
Member

KarlK90 commented Jul 27, 2022

With #17817 merged the fixes should now be available on the latest develop branch.

@Jpe230
Copy link
Contributor

Jpe230 commented Jul 27, 2022

A late reply:

I torture-tested one extra display (using my old code + rendering it each ms) in the course of the night and happy to report that I couldn't find any problem with the aforementioned PRs!

Amazing job as always and thanks for fix, @KarlK90!

@KarlK90
Copy link
Member

KarlK90 commented Jul 27, 2022

Nice, thanks for the feedback everyone. I'll close this issue then.

@KarlK90 KarlK90 closed this as completed Jul 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants