Implement Thumb-2 optimized memcpy/memset #67

Open
jserv opened this issue Feb 6, 2014 · 10 comments
@jserv
Member

jserv commented Feb 6, 2014

The kernel/lib directory contains implementations of memcpy and memset, but they are too generic. We can use several ARM Cortex-M3/M4-specific features to optimize them:

  • Thumb-2
    • use 32-bit aligned copies in the inner loop. Cortex-M3/M4 does not require this, but it could help external memory access, depending on the memory controller.
  • unaligned memory access
  • PLD instruction to preload cache with memory source
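
As a rough illustration of the aligned inner-loop idea above, here is a minimal portable C sketch. The name memcpy_word is hypothetical, and this is not the routine from kernel/lib or from lk. It copies bytes until the destination is 4-byte aligned, uses 32-bit copies only when the source is then aligned as well, and finishes with bytes. It also assumes the compiler tolerates the uint32_t aliasing, as freestanding kernels commonly arrange.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name, chosen to avoid clashing with the libc memcpy. */
void *memcpy_word(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: copy bytes until the destination is 4-byte aligned. */
    while (len > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        len--;
    }

    /* Inner loop: 32-bit copies, taken only if the source is aligned too.
     * Assumption: the build tolerates this uint32_t aliasing. */
    if (((uintptr_t)s & 3u) == 0) {
        uint32_t *dw = (uint32_t *)d;
        const uint32_t *sw = (const uint32_t *)s;
        while (len >= 4) {
            *dw++ = *sw++;
            len -= 4;
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Tail, or the whole remainder when the source is misaligned. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
    return dst;
}
```

A version tuned for Cortex-M3/M4 could instead keep the word loop for a misaligned source, since those cores support unaligned word loads; the sketch stays conservative so that it also behaves on targets that do not.
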
@jserv
Member Author

jserv commented Mar 13, 2014

lk implements arm-m optimized memcpy and memset routines in git commit littlekernel/lk@33b94d9

@gapry

gapry commented Jul 29, 2014

@jserv The profile result:

  1. unalignment: [image: unalignment]
  2. alignment: [image: alignment]

@jserv
Member Author

jserv commented Jul 29, 2014

It looks so weird. Can you explain?

@gapry

gapry commented Jul 29, 2014

@jserv The implementation is in this branch:
https://github.com/gapry/f9-kernel/blob/benchmark_memcpy/benchmark/benchmark.c

My approach is to measure each case, aligned and unaligned, five times and take the average time. Assuming my approach is correct, the data imply that, after the optimization, the unaligned case is faster than the aligned case on the STM32F407.

@jserv
Member Author

jserv commented Jul 29, 2014

@gapry To make the performance gain clear, please compare the optimized memcpy routines with a plain byte-oriented C version.

@gapry

gapry commented Jul 29, 2014

@jserv What does plain byte-oriented mean?

@jserv
Member Author

jserv commented Jul 29, 2014

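The simplest, least efficient implementation of memcpy:

void *memcpy(void *dst, const void *src, size_t len)
{
    char *d = (char *)dst;              /* destination cursor */
    const char *s = (const char *)src;  /* source cursor */
    while (len--) *d++ = *s++;          /* copy one byte per iteration */
    return dst;
}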

@gapry

gapry commented Jul 29, 2014

@jserv For now, I use DWT to measure the elapsed clock cycles. You can check the commit: https://github.com/gapry/f9-kernel/commit/33e58dfcb1105140365132269c596763531e9ede

and the complete implementation: https://github.com/gapry/f9-kernel/blob/benchmark_memcpy/benchmark/benchmark.c

The profile result:
unalignment: [image: dwt_unalign]
alignment: [image: dwt_align]
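
For readers unfamiliar with DWT-based timing, the following is a minimal sketch of how the cycle counter is typically enabled and read on ARMv7-M. It is not the code from the linked commit. The helper names (cycles_elapsed, cyccnt_enable, cyccnt_read) are hypothetical, and the register block is compiled only when the toolchain targets ARMv7-M, so the arithmetic helper can still be built on a host.

```c
#include <stdint.h>

/* Unsigned subtraction stays correct across one wrap of the 32-bit counter. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
/* ARMv7-M debug registers (addresses per the architecture manual). */
#define DEMCR              (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL           (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT         (*(volatile uint32_t *)0xE0001004u)
#define DEMCR_TRCENA       (1u << 24)  /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA (1u << 0)   /* enables the cycle counter */

/* Hypothetical helper: turn on the cycle counter and reset it to zero. */
static inline void cyccnt_enable(void)
{
    DEMCR |= DEMCR_TRCENA;
    DWT_CYCCNT = 0;
    DWT_CTRL |= DWT_CTRL_CYCCNTENA;
}

/* Hypothetical helper: read the free-running cycle count. */
static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT;
}
#endif
```

On the target, a measurement would read the counter before and after the call under test and pass both values to cycles_elapsed. Interrupts firing in between add cycles, which is one possible source of run-to-run differences.
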

@jserv
Member Author

jserv commented Jul 29, 2014

@gapry I don't think your benchmarking is valid since it doesn't represent the variance. There must be something wrong.
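
One possible way to report the variance, sketched under the assumption that each run yields one 32-bit cycle count: keep every sample rather than only the average, then summarize the minimum, maximum, mean, and sample variance. The struct and function names here are hypothetical and do not come from the benchmark in the linked branch.

```c
#include <stddef.h>
#include <stdint.h>

struct cycle_stats {
    uint32_t min;
    uint32_t max;
    double mean;
    double variance;   /* sample variance, divisor n - 1 */
};

/* Hypothetical helper: summarize n cycle-count samples in two passes. */
struct cycle_stats summarize_cycles(const uint32_t *samples, size_t n)
{
    struct cycle_stats st = {0, 0, 0.0, 0.0};
    if (n == 0)
        return st;

    /* First pass: range and mean. */
    double sum = 0.0;
    st.min = st.max = samples[0];
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < st.min) st.min = samples[i];
        if (samples[i] > st.max) st.max = samples[i];
        sum += (double)samples[i];
    }
    st.mean = sum / (double)n;

    /* Second pass: squared deviations from the mean. */
    if (n > 1) {
        double sq = 0.0;
        for (size_t i = 0; i < n; i++) {
            double d = (double)samples[i] - st.mean;
            sq += d * d;
        }
        st.variance = sq / (double)(n - 1);
    }
    return st;
}
```

With only five runs per case the variance estimate is coarse, so more iterations would make the aligned and unaligned comparison easier to judge.
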
