-
Notifications
You must be signed in to change notification settings - Fork 13.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce the IRAM usage of I2C code by 600-1500 bytes #6326
Conversation
The I2C code takes a large chunk of IRAM space. Attempt to reduce the size of the routines without impacting functionality. First, remove the `static` classifier on the sda/scl variables in the event handlers. The first instructions in the routines overwrite the last value stored in them, anyway, and their addresses are never taken.
Where it doesn't make a functional difference, make global variables ints and not unit8_t. Bytewide updates and extracts require multiple instructions and hence increase IRAM usage as well as runtime.
Before any changes, building
|
After 5ba2dd3 the code is at:
a savings of 100 bytes of IRAM and a loss of 32 bytes of heap. |
Sketch uses 270855 bytes (25%) of program storage space. Maximum is 1044464 bytes. Global variables use 27940 bytes (34%) of dynamic memory, leaving 53980 bytes for local variables. Maximum is 81920 bytes. ./xtensa-lx106-elf/bin/xtensa-lx106-elf-objdump -t -j .text1 /tmp/arduino_build_9615/*elf | sort -k1 | head -20 401000cc l F .text1 00000014 twi_delay 401000ec l F .text1 00000020 twi_reply$part$1 4010010c g F .text1 00000035 twi_reply 4010014c g F .text1 00000052 twi_stop 401001a0 g F .text1 0000003b twi_releaseBus 40100204 g F .text1 000001e6 twi_onTwipEvent 40100404 l F .text1 000001f7 onSdaChange 40100608 l F .text1 000002fd onSclChange 40100908 l F .text1 0000003b onTimer
If SCL is low then all branches of the case are no-ops, so factor that portion outo to remove some redundant logic in each case. Sketch uses 270843 bytes (25%) of program storage space. Maximum is 1044464 bytes. Global variables use 27944 bytes (34%) of dynamic memory, leaving 53976 bytes for local variables. Maximum is 81920 bytes. 401000cc l F .text1 00000014 twi_delay 401000ec l F .text1 00000020 twi_reply$part$1 4010010c g F .text1 00000035 twi_reply 4010014c g F .text1 00000052 twi_stop 401001a0 g F .text1 0000003b twi_releaseBus 40100204 g F .text1 000001e6 twi_onTwipEvent 40100404 l F .text1 000001e7 onSdaChange 401005f8 l F .text1 000002fd onSclChange 401008f8 l F .text1 0000003b onTimer 0x0000000040107468 _text_end = ABSOLUTE (.)
I was hoping for better results, but as-is it's saving 128 bytes of IRAM with a cost of 32 heap bytes. |
twi_reply is a chunk of code that can be inlined and actually save IRAM space because certain conditions acan be statically evaluated by gcc. Sketch uses 270823 bytes (25%) of program storage space. Maximum is 1044464 bytes. Global variables use 27944 bytes (34%) of dynamic memory, leaving 53976 bytes for local variables. Maximum is 81920 bytes. 401000cc l F .text1 00000014 twi_delay 401000f4 g F .text1 00000052 twi_stop 40100148 g F .text1 0000003b twi_releaseBus 401001b0 g F .text1 00000206 twi_onTwipEvent 401003d0 l F .text1 000001e7 onSdaChange 401005c4 l F .text1 000002fd onSclChange 401008c4 l F .text1 0000003b onTimer 40100918 g F .text1 00000085 millis 401009a0 g F .text1 0000000f micros 401009b0 g F .text1 00000022 micros64 401009d8 g F .text1 00000013 delayMicroseconds 401009f0 g F .text1 00000034 __digitalRead 401009f0 w F .text1 00000034 digitalRead 40100a3c g F .text1 000000e4 interrupt_handler 40100b20 g F .text1 0000000f vPortFree 0x0000000040107434 _text_end = ABSOLUTE (.)
Sketch uses 270799 bytes (25%) of program storage space. Maximum is 1044464 bytes. Global variables use 27944 bytes (34%) of dynamic memory, leaving 53976 bytes for local variables. Maximum is 81920 bytes. 401000cc l F .text1 00000014 twi_delay 401000f4 w F .text1 0000003b twi_releaseBus 4010015c g F .text1 00000246 twi_onTwipEvent 401003bc l F .text1 000001e7 onSdaChange 401005b0 l F .text1 000002f9 onSclChange 401008ac l F .text1 0000003b onTimer 0x000000004010741c _text_end = ABSOLUTE (.)
GCC won't use a lookup table for the TWI state machine, so it ends up using a series of straight line compare-jump, compare-jumps to figure out which branch of code to execute for each state. For branches that have multiple states that call them, this can expand to a lot of code. Short-circuit the whole thing by converting the FSM to a 1-hot encoding while executing it, and then just and-ing the 1-hot state with the bitmask of states with the same code. Sketch uses 270719 bytes (25%) of program storage space. Maximum is 1044464 bytes. Global variables use 27944 bytes (34%) of dynamic memory, leaving 53976 bytes for local variables. Maximum is 81920 bytes. 401000cc l F .text1 00000014 twi_delay 401000f4 w F .text1 0000003b twi_releaseBus 4010015c g F .text1 00000246 twi_onTwipEvent 401003c0 l F .text1 000001b1 onSdaChange 40100580 l F .text1 000002da onSclChange 4010085c l F .text1 0000003b onTimer 0x00000000401073cc _text_end = ABSOLUTE (.) Saves 228 bytes of IRAM vs. master, uses 32 additional bytes of heap.
Thx for working on IRAM usage. 128 bytes more IRAM DOES count!!! |
@earlephilhower I have noticed that, if loose variables in a module are pooled together into a structure, the compiler generates fewer instructions to access. For example: int mod_a = 24;
int mod_b = 42;
int mod_c = 83;
int just4Fun(void) {
return mod_a + mod_b + mod_c;
} .literal .LC0, mod_a
.literal .LC1, mod_b
.literal .LC2, mod_c
.align 4
.global _Z8just4Funv
.type _Z8just4Funv, @function
_Z8just4Funv:
l32r a2, .LC0
l32i.n a3, a2, 0
l32r a2, .LC1
l32i.n a2, a2, 0
add.n a3, a3, a2
l32r a2, .LC2
l32i.n a2, a2, 0
add.n a2, a3, a2
ret.n With a structure: struct _FUN {
int a;
int b;
int c;
} mod = {24, 42, 83};
int just4Fun(void) {
return mod.a + mod.b + mod.c;
} .literal .LC0, mod
.align 4
.global _Z8just4Funv
.type _Z8just4Funv, @function
_Z8just4Funv:
l32r a3, .LC0
l32i.n a2, a3, 4
l32i.n a4, a3, 0
add.n a4, a4, a2
l32i.n a2, a3, 8
add.n a2, a4, a2
ret.n In addition to removing two 3 byte |
Good note! Globals need a literal load for each one's base address since they can be placed anywhere in the linker stage, but structs have only one literal base address so if you still have the address handy you can access with a simple load+offset. I'll add it to the to-do list. I don't expect every change to make it in to a PR (the 1-hot encoding seems heroic) but we'll see what we can do. There are two flash_qd* functions I also want to take a look at (~800 bytes). The source in Espressif's repo looks like they could be a single one because they share a lot of code. But there's no guarantee the code in their GH repo is what's actually in their binary .a, and I'm not sure how to test any changes are equivalent (unless you know of some formal verification stuff that's free for open source). |
@earlephilhower Sorry I am relatively new to open source tools. |
No sweat, @mhightower83 . The "best" way to make the change happen is to make it an object, then you'll have a |
twi_status is set immediately before an event handler is called, resulting in lots of duplicated code. Set the twi_status flag inside the handler itself. Saves an add'l ~100 bytes of IRAM from prior changes, for a total of ~340 bytes. earle@server:~/Arduino/hardware/esp8266com/esp8266/tools$ ./xtensa-lx106-elf/bin/xtensa-lx106-elf-objdump -t -j .text1 /tmp/arduino_build_849115/*elf | sort -k1 | head -20 401000cc l F .text1 00000014 twi_delay 401000f4 w F .text1 0000003b twi_releaseBus 40100160 g F .text1 0000024e twi_onTwipEvent 401003c8 l F .text1 00000181 onSdaChange 40100558 l F .text1 00000297 onSclChange
Alright, now we're getting somewhere. @mhightower83 's suggestion (I left it as C, but might consider doing it as a C++ class to make it cleaner) freed another 168 bytes of IRAM. The grand total now is 550 IRAM bytes saved, not too shabby. It is completely untested, though, so don't take that number to the bank yet. |
2741d46
to
618ec7b
Compare
@earlephilhower Just did a short test with Tasmota. The two I2C sensors on the same bus do still work :-). Elf file elf.zip |
Great, thanks for the update! |
Thanks to the suggestion from @mhightower83, move all global objects into a struct. This lets a single base pointer register to be used in place of constantly reloading the address of each individual variable. This might be better expressed by moving this to a real C++ implementaion based on a class object (the twi.xxxx would go back to the old xxx-only naming for vars), but there would then need to be API wrappers since the functionality is exposed through a plain C API. Saves 168 additional code bytes, for a grand total of 550 bytes IRAM. earle@server:~/Arduino/hardware/esp8266com/esp8266/tools$ ./xtensa-lx106-elf/bin/xtensa-lx106-elf-objdump -t -j .text1 /tmp/arduino_build_849115/*elf | sort -k1 | head -20 401000cc l F .text1 00000014 twi_delay 401000e8 w F .text1 00000032 twi_releaseBus 40100128 g F .text1 00000217 twi_onTwipEvent 4010034c l F .text1 00000149 onSdaChange 4010049c l F .text1 00000267 onSclChange 40100704 l F .text1 00000028 onTimer
618ec7b
to
b975996
Compare
Make the TWI states enums and not #defines, in the hope that it will allow GCC to more easily flag problems and general good code organization. 401000cc l F .text1 00000014 twi_delay 401000e8 w F .text1 00000032 twi_releaseBus 40100128 g F .text1 00000217 twi_onTwipEvent 4010034c l F .text1 00000149 onSdaChange 4010049c l F .text1 00000257 onSclChange 401006f4 l F .text1 00000028 onTimer Looks like another 16 bytes IRAM saved from the prior push. Sketch uses 267079 bytes (25%) of program storage space. Maximum is 1044464 bytes. Global variables use 27696 bytes (33%) of dynamic memory, leaving 54224 bytes for local variables. Maximum is 81920 bytes.
7b44bf8
to
52bf8ac
Compare
Convert the entire file into a C++ class (with C wrappers to preserve the ABI). This allows for setting individual values of the global struct(class) in-situ instead of a cryptic list at the end of the struct definition. It also removes a lot of redundant `twi.`s from most class members. Clean up the code by converting from `#defines` to inline functions, get rid of ternarys-as-ifs, use real enums, etc. For slave_receiver.ino, the numbers are: GIT Master IRAM: 0x723c This push IRAM: 0x6fc0 For a savings of 636 total IRAM bytes (note, there may be a slight flash text increase, but we have 1MB of flash to work with and only 32K of IRAM so the tradeoff makes sense.
Since the C++ version has significant text differences anyway, now is a good time to clean up the mess of spaces, tabs, and differing cuddles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the state machine really worth it? it seems like the switch + fallthrough parts are being replaced by a bunch of if-elses and || operations. Is it worth it?
Some of the code in the state machine doesn't look quite the same as the current, e.g.: changes to state-status, some lines missing, etc. I'm not sure if it's cleanup resulting from the transformation, oversight, or something else. I suggest going over it again with great care, because in deeply nested logic trees like this it's very easy to miss some detail.
Other than that, a big question: why go with a singleton instance object twi? It bothers me that the static methods of the object access the global instance, I've cleaned up similar code elsewhere. How about making the data members static (highlights the fact that the class is meant to have a single context only, and if there are multiple instances they all share the same internal context), no global instance twi, and use Twi::blah to access the static members? This better reflects your intent.
I'll look at the other comments in better depth, but for
It's actually a bunch of The switch gets implemented by GCC as a bunch of ifs in the original code: (p-code)
By using 1-hot and or-ing the states, these IFs turn into
So the if-elseif of many states turns into a simple binary was actually a series of 9 compare/jumps which becomes a single and/beq.z |
I didn't mean wrt performance, I meant wrt readability: the switch is rather easier to understand, while the if-else blocks with all the conditions are rather complex, especially when trying to determine which condition cases are covered and and which aren't. But keep the if-else blocks, this is IRAM code, so smaller == better. if(state1)
handleState1();
else if(state2)
handleState2();
else if(myCondition)
handleMyCondition();
... or along those lines. Does it make sense to do that, or is there too much bin size overhead added? |
I agree on readability. 1-hot is a "heroic" optimization, definitely, and not to be taken lightly. It can definitely be backed out of the changes. As long as the function is |
Some thoughts about this feature? As long as we need to rely on gpio-user-interrupts I only see limited (maybe annoying?) use of I2C-slave mode on ESP8266. Why that? I2C bus specification UM10204 by NXP defines standard mode from 0-100kHz. So ESP8266 I2C-slave is inside the specs, but sadly is only compatible to other ESP8266 (with assigned-address-master-mode). This leads to annoying not-compatible-with-other-I2C devices problems. Ideas:
|
Those are very good points. I never used I2C slave and if I would then using a separate class for it would be no problem. |
It's not a clean break to go to slave as a separate library. There will be significant redundancies in the code, and it will break existing apps which use the (not so hot) slave mode so that might be a 3.0.x move. I fear Slave is always going to be a kludge given the complete lack of HW support and the very tight timing required by the protocol. :( |
@devyte, that was a conscious choice to reduce code size. When you have an normal object variable, you simply do a |
Per review comments
There were multiple places where the code was waiting for a slave to finish stretching the clock. Factor them out to an *inline* function to reduce code smell.
@devyte, I've gone through the code once more by hand and, with the above mentioned change to We talked about moving the code inside the case blocks in the event handlers to their own inline functions. I can do that later, but it's easier to compare the logic without that refactoring so I left it the way it is for now. If you see something specific that doesn't match, please do tell me! |
Add a new twi_setSlaveMode call which actually attached the interrupts to the slave pin change code onSdaChenge/onSclChange. Don't attach interrupts in the main twi_begin. Because slave mode is only useful should a onoReceive or onRequest callback, call twi_setSlaveMode and attach interrupts on the Wire setters. This allows GCC to not link in slave code unless slave mode is used, saving over 1,000 bytes of IRAM in the common, master-only case.
Nice, can confirm. Test Tasmota build saved 968 bytes :-) |
I also can confirm that this works. Please merge. Thanks. |
No functional changes, only formatting.
Hi, sorry to wake up an old thread but this seems to be the best place to start:
I wanted to use two I2C master interfaces (two slaves with same IDs), and upon realizing that the default implementation of Wire library calls to this code (no significant changes has been made to Twi since this PR if I am correct), which is also made to handle only one master at a time, I had to touch the core source files to make multiple masters viable. What I did is create class hierarchy, where TwiMaster (a subset of the original Twi class) does not have any static members, callbacks or IRS, and is exposed in the twi.h header file. TwiMasterSlave derives from it and adds all the remaining functionality of the original Twi class, and the C interfaces is rerouted from Twi singleton to TwiMasterSlave singleton (I could also create a typedef Twi = TwiMasterSlave, but Twi itself was not exposed in any header file before). Do you thing it would be compatible with the original IRAM reduction goals @earlephilhower? There is already an issue #7894 asking to support multiple I2C instances (with a kinda crude but probably working solution - turning all the static dependencies into 5-tuples so that up-to 5 master/slave I2C interfaces can be used). |
The I2C code takes a large chunk of IRAM space. Attempt to reduce the
size of the routines without impacting functionality.
First, remove the
static
classifier on the sda/scl variables in theevent handlers. The first instructions in the routines overwrite the
last value stored in them, anyway, and their addresses are never taken.