Convert hsb_to_rgb to use Integers Instead of Floats to Improve Performance #926

Open
wants to merge 5 commits into
base: main

@coltontcrowe commented Aug 29, 2021

Hello! I've implemented this change suggested in #65 (RGB Underglow Improvements). As the title suggests, this converts the function hsb_to_rgb to use Integers instead of Floats in the hope that it improves performance. Overall, I think the change is mostly self-explanatory, but there are a few points I want to dive into.

  1. There is a slight increase in complexity in this version. In the previous version, the variable v could be reused in the calculation of p, q, and t. Because integer division truncates, this is no longer possible: the division by BRT_MAX must be performed after all of the multiplications. While this step isn't strictly necessary, it prevents a large number of inconvenient off-by-one rounding errors that would otherwise be introduced.

  2. On the topic of rounding errors, even with my best efforts, a few did slip through the cracks. That said, there are very few. I wrote a script to compare all values produced with the old function to the values produced with the new function. Here are all of the differences:

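| H | S | B | Old R | New R | Equal | Old G | New G | Equal | Old B | New B | Equal |
|---|---|---|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 213 | 099 | 055 | 002 | 002 | ✔️ | 064 | 063 | ❌ | 138 | 138 | ✔️ |
| 221 | 067 | 071 | 060 | 060 | ✔️ | 098 | 097 | ❌ | 179 | 179 | ✔️ |
| 239 | 061 | 079 | 078 | 078 | ✔️ | 081 | 080 | ❌ | 199 | 199 | ✔️ |
| 239 | 083 | 097 | 043 | 043 | ✔️ | 047 | 046 | ❌ | 244 | 244 | ✔️ |
| 261 | 075 | 049 | 064 | 063 | ❌ | 031 | 031 | ✔️ | 123 | 123 | ✔️ |
| 261 | 075 | 098 | 128 | 127 | ❌ | 063 | 063 | ✔️ | 247 | 247 | ✔️ |
| 327 | 099 | 090 | 227 | 227 | ✔️ | 004 | 004 | ✔️ | 126 | 127 | ❌ |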

No single channel differs by more than 1, so the impact is pretty small. That said, I wasn't able to find a way around these. Still, of the 3,672,360 possible combinations, only these 7 were different. (See the footnote at the bottom for the script I used to generate this.)
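
To illustrate why the division has to wait until after the multiplications (point 1), here is a minimal sketch. It is not the code from this PR: the helper names `p_divide_early` and `p_divide_late` are hypothetical, and it assumes SAT_MAX and BRT_MAX of 100 and a p term of the form brightness × (1 − saturation) scaled to 0..255. For S = 3, B = 49 the exact value is 0.49 × 0.97 × 255 = 121.2015, so truncation should give 121.

```c
#include <stdint.h>

/* Assumed ranges for this sketch only. */
#define SAT_MAX 100
#define BRT_MAX 100
#define RGB_MAX 255

/* Divide early: scale brightness to 0..255 first, then reuse that value.
 * The result is truncated twice, so the error can accumulate. */
static uint8_t p_divide_early(uint8_t sat, uint8_t bri) {
    uint32_t v = (uint32_t)bri * RGB_MAX / BRT_MAX;  /* first truncation */
    return (uint8_t)(v * (SAT_MAX - sat) / SAT_MAX); /* second truncation */
}

/* Divide late: do every multiplication first, then a single truncating
 * division by the combined denominator. */
static uint8_t p_divide_late(uint8_t sat, uint8_t bri) {
    uint32_t num = (uint32_t)bri * RGB_MAX * (SAT_MAX - sat);
    return (uint8_t)(num / (BRT_MAX * SAT_MAX));
}
```

With these helpers, `p_divide_early(3, 49)` yields 120 while `p_divide_late(3, 49)` yields 121, which is the kind of off-by-one the delayed division avoids.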

  3. That brings me to another point that may seem a bit odd: I said 3,672,360 combinations rather than the 3,600,000 you might expect. That is because, in the current implementation, saturation and brightness each range from 0 to 100 inclusive, so each has 101 possible values (360 × 101 × 101 = 3,672,360). I don't know whether it's desirable to change this, so I decided to leave it be for now.

  4. I was a little concerned about overflow from the large multiplications, but no overflow showed up when I ran the comparison script.

  5. Lastly, I cannot confirm whether this actually improves performance. That will likely depend on the hardware, but I can't imagine floating-point math being the more efficient option very often. Still, some testing would be in order to determine that for sure.
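
The combination count in point 3 and the overflow concern in point 4 can both be checked with a little arithmetic. The sketch below uses the constants from the comparison script in the footnote. It assumes, hypothetically, that the largest intermediate value in the integer version is the product RGB_MAX × BRT_MAX × SAT_MAX × DEG_SEG; the actual expressions in the PR may differ, and the function names are made up for illustration.

```c
#include <stdint.h>

#define HUE_MAX 360
#define SAT_MAX 101
#define BRT_MAX 101
#define RGB_MAX 255
#define NUM_SEG 6
#define DEG_SEG (HUE_MAX / NUM_SEG)

/* Assumed worst case: every factor at its maximum, multiplied together
 * before any division happens. Computed in 64 bits so the check itself
 * cannot overflow. */
static uint64_t worst_case_product(void) {
    return (uint64_t)RGB_MAX * BRT_MAX * SAT_MAX * DEG_SEG;
}

/* Number of (hue, saturation, brightness) inputs when saturation and
 * brightness each take `levels` distinct values. */
static uint32_t num_combinations(uint32_t levels) {
    return (uint32_t)HUE_MAX * levels * levels;
}

/* The assumed worst case fits in a signed 32-bit integer but not in 16 bits. */
_Static_assert((uint64_t)RGB_MAX * BRT_MAX * SAT_MAX * DEG_SEG <= INT32_MAX,
               "worst-case product must fit in int32_t");
_Static_assert((uint64_t)RGB_MAX * BRT_MAX * SAT_MAX * DEG_SEG > UINT16_MAX,
               "worst-case product does not fit in uint16_t");
```

Under that assumption the worst case is 156,075,300, comfortably below INT32_MAX (2,147,483,647), so 32-bit intermediates would be enough but 16-bit ones would not.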

And that's pretty much it. If you have any feedback, please let me know. Thanks!


Footnote: Code used to compare old and new algorithms (hsb_to_rgb_old and hsb_to_rgb_new are not included here)

#include <sys/param.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>

#define HUE_MAX 360
#define SAT_MAX 101
#define BRT_MAX 101
#define RGB_MAX 255
#define NUM_SEG 6
#define DEG_SEG (HUE_MAX / NUM_SEG)

struct led_rgb {
    uint8_t r;
    uint8_t g;
    uint8_t b;
};

struct zmk_led_hsb {
    uint16_t h;
    uint8_t s;
    uint8_t b;
};

// DECLARATION
static struct led_rgb hsb_to_rgb_old(struct zmk_led_hsb hsb);
static struct led_rgb hsb_to_rgb_new(struct zmk_led_hsb hsb);

// DEFINITION
int main()
{
    FILE *outfile = fopen("output.md", "w");
    if (!outfile) {
        fprintf(stderr, "unable to open file for printing\n\n");
        return 1; // bail out instead of writing to a NULL stream
    }

    fprintf(outfile, "| H | S | B | | Old R | New R | Equal | | Old G | New G | Equal | | Old B | New B| Equal |\n");
    fprintf(outfile, "|---|---|---|-|-------|-------|-------|-|-------|-------|-------|-|-------|------|-------|\n");
    uint32_t num_total = 0;
    uint32_t num_correct = 0;
    uint8_t largest_err = 0;


    for (uint16_t hue = 0; hue < 360; hue+=1) {
        for (uint8_t sat = 0; sat < 101; sat+=1) {
            for (uint8_t bri = 0; bri < 101; bri+=1) {
                struct zmk_led_hsb in_hsb = {hue, sat, bri};

                struct led_rgb old_rgb = hsb_to_rgb_old(in_hsb);
                struct led_rgb new_rgb = hsb_to_rgb_new(in_hsb);
                
                char *eq_r = (old_rgb.r == new_rgb.r) ? ":heavy_check_mark:" : ":x:";
                char *eq_g = (old_rgb.g == new_rgb.g) ? ":heavy_check_mark:" : ":x:";
                char *eq_b = (old_rgb.b == new_rgb.b) ? ":heavy_check_mark:" : ":x:";

                num_total += 1;
                if ((old_rgb.r == new_rgb.r) && (old_rgb.g == new_rgb.g) && (old_rgb.b == new_rgb.b)) {
                    num_correct += 1;
                } else {
                  // Widen to int before subtracting so the difference cannot wrap around
                  uint8_t err_r = abs((int)old_rgb.r - (int)new_rgb.r);
                  uint8_t err_g = abs((int)old_rgb.g - (int)new_rgb.g);
                  uint8_t err_b = abs((int)old_rgb.b - (int)new_rgb.b);

                  largest_err = MAX(largest_err, err_r);
                  largest_err = MAX(largest_err, err_g);
                  largest_err = MAX(largest_err, err_b);

                  fprintf(outfile, "| %.3d | %.3d | %.3d | | %.3d | %.3d | %-18s | | %.3d | %.3d | %-18s | | %.3d | %.3d | %-18s |\n",
                          in_hsb.h, in_hsb.s, in_hsb.b,
                          old_rgb.r, new_rgb.r, eq_r,
                          old_rgb.g, new_rgb.g, eq_g,
                          old_rgb.b, new_rgb.b, eq_b);
                }
            }
        }
    }

    fprintf(outfile, "\n(%u/%u) = %.4f%% correct\n", num_correct, num_total, 100*(float)num_correct/num_total);
    fprintf(outfile, "largest error = %d\n", largest_err);
    fclose(outfile);
    return 0;
}

@joelspadin
Collaborator

Without benchmarking, it's impossible to say if this code is faster. It looks like you're doing more integer operations than the old code did float operations, and it's also possible the compiler is using very large integer types to avoid overflow which might not perform as well. Floating point math on ARM processors is also probably faster than you'd expect if you're used to AVR, though my experience is mostly with more powerful ARM SoCs that can run Linux, so maybe that isn't true of lower power ones.

I couldn't find any built-in utilities for general benchmarking in Zephyr, but it's pretty easy to write your own benchmarking code, for example:

// Taken from https://github.com/google/benchmark
#define DO_NOT_OPTIMIZE(value) \
    asm volatile("" : : "r,m"(value) : "memory");

// Increase this number until each benchmark takes long enough to get good data
#define ITERATIONS 1000

void benchmark_old(void) {
  const int64_t start_ticks = k_uptime_ticks();

  for (int i = 0; i < ITERATIONS; i++) {
    for (uint16_t hue = 0; hue < 360; hue++) {
      for (uint8_t sat = 0; sat < 101; sat++) {
        for (uint8_t bri = 0; bri < 101; bri++) {
          struct zmk_led_hsb hsb = {hue, sat, bri};
          struct led_rgb rgb = hsb_to_rgb_old(hsb);
          DO_NOT_OPTIMIZE(rgb);
        }
      }
    }
  }

  const int64_t elapsed = k_uptime_ticks() - start_ticks;
  // elapsed is 64-bit, so print it with a matching format specifier
  LOG_INF("Old function: %lld ticks", (long long)elapsed);
}

void benchmark_new(void) {
  const int64_t start_ticks = k_uptime_ticks();

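  // Same loop as benchmark_old(), but calling hsb_to_rgb_new()
  for (int i = 0; i < ITERATIONS; i++) {
    for (uint16_t hue = 0; hue < 360; hue++) {
      for (uint8_t sat = 0; sat < 101; sat++) {
        for (uint8_t bri = 0; bri < 101; bri++) {
          struct zmk_led_hsb hsb = {hue, sat, bri};
          struct led_rgb rgb = hsb_to_rgb_new(hsb);
          DO_NOT_OPTIMIZE(rgb);
        }
      }
    }
  }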

  const int64_t elapsed = k_uptime_ticks() - start_ticks;
  // elapsed is 64-bit, so print it with a matching format specifier
  LOG_INF("New function: %lld ticks", (long long)elapsed);
}

// Ideally you run this in a standalone app that isn't doing any interrupt handling,
// but if that's too hard to set up, just run it several times and make sure you're
// getting consistent numbers.
void run_benchmark(void) {
  benchmark_old();
  benchmark_new();
}

If you do run a benchmark, make sure you have CONFIG_FPU enabled, as it looks like it is currently not being enabled on some boards like nice!nano.

Maybe even run it both with and without the FPU enabled to see how much of a difference that makes. Suppose (probably unlikely) that we support some boards without an FPU, that the float version is faster with the FPU enabled while the integer version is faster with it disabled, and that we expect this function to be called frequently by some future RGB animation code. In that case we might actually want both versions of the function so we can pick the faster one based on CONFIG_FPU.
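
Picking an implementation based on CONFIG_FPU could look roughly like the sketch below. Everything in it is an assumption for illustration: neither branch is the code from this PR or from ZMK, both are textbook HSB-to-RGB conversions, `pick_channels` is a hypothetical helper, SAT_MAX and BRT_MAX are taken to be 100, and inputs are assumed to be in range (hue below 360, saturation and brightness at most 100). CONFIG_FPU is treated as a plain preprocessor define.

```c
#include <stdint.h>

#define HUE_MAX 360
#define SAT_MAX 100
#define BRT_MAX 100
#define RGB_MAX 255
#define NUM_SEG 6
#define DEG_SEG (HUE_MAX / NUM_SEG)

struct led_rgb {
    uint8_t r;
    uint8_t g;
    uint8_t b;
};

struct zmk_led_hsb {
    uint16_t h;
    uint8_t s;
    uint8_t b;
};

/* Hypothetical helper: map the hue segment and v/p/q/t onto r, g, b. */
static struct led_rgb pick_channels(uint8_t seg, uint8_t v, uint8_t p, uint8_t q, uint8_t t) {
    switch (seg % NUM_SEG) {
    case 0: return (struct led_rgb){v, t, p};
    case 1: return (struct led_rgb){q, v, p};
    case 2: return (struct led_rgb){p, v, t};
    case 3: return (struct led_rgb){p, q, v};
    case 4: return (struct led_rgb){t, p, v};
    default: return (struct led_rgb){v, p, q};
    }
}

#ifdef CONFIG_FPU
/* Float variant, compiled only when the FPU is enabled. */
static struct led_rgb hsb_to_rgb(struct zmk_led_hsb hsb) {
    uint8_t seg = hsb.h / DEG_SEG;
    float v = hsb.b / (float)BRT_MAX;
    float s = hsb.s / (float)SAT_MAX;
    float f = (hsb.h % DEG_SEG) / (float)DEG_SEG; /* position within the segment */
    float p = v * (1 - s);
    float q = v * (1 - f * s);
    float t = v * (1 - (1 - f) * s);
    return pick_channels(seg, (uint8_t)(v * RGB_MAX), (uint8_t)(p * RGB_MAX),
                         (uint8_t)(q * RGB_MAX), (uint8_t)(t * RGB_MAX));
}
#else
/* Integer variant: multiply first, divide once per channel. */
static struct led_rgb hsb_to_rgb(struct zmk_led_hsb hsb) {
    uint8_t seg = hsb.h / DEG_SEG;
    uint32_t f = hsb.h % DEG_SEG;               /* 0..59 within the segment */
    uint32_t scale = (uint32_t)hsb.b * RGB_MAX; /* at most 25,500 */
    uint32_t full = SAT_MAX * DEG_SEG;          /* 6,000 */
    uint32_t den = BRT_MAX * SAT_MAX * DEG_SEG; /* 600,000 */
    uint8_t v = scale / BRT_MAX;
    uint8_t p = scale * (full - hsb.s * DEG_SEG) / den;
    uint8_t q = scale * (full - hsb.s * f) / den;
    uint8_t t = scale * (full - hsb.s * (DEG_SEG - f)) / den;
    return pick_channels(seg, v, p, q, t);
}
#endif
```

Callers use `hsb_to_rgb()` either way, so the choice stays a build-time detail. Whether maintaining two variants is worth it would still depend on the benchmark results.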

@coltontcrowe
Author

I like that idea for sure. I'm not sure I can manage a standalone app unless there's a template I could work from. Maybe I could set up a branch where the benchmark is run every time I reset the board? I'm a little concerned that running the benchmark at startup could be different than running it after the board's been going a while. Maybe I'm overthinking it though.

@joelspadin
Collaborator

You could maybe use a delayed work queue item to run the benchmark after a delay so it isn't affected by other init code.
