Performance Stats + Instructions #284
base: master
Conversation
Extra compiler flags for better output. Added max(), min() macros. Heavy video.c refactoring.
Removed tile palette code, re-added
rebase to indigodarkwolf
Changes to render_line_bitmap and render_line_tile
Remove lambda in render_line_tile as Mac and WebAsm weren't happy.
render_line_tile - Now reads props->tilew for "tile update skipping" step width.
use MHZ constant
The branches cannot be predicted very well. oldpc is also currently globally scoped, requiring a memory read and write; making it local removes this. (oldpc appears not to be used globally.)
oldpc is now a local stack variable where needed, removing the globally scoped variable
Make bra() instruction branchless
Personally, the avenue I'm looking into next for optimization is to move VERA work into smaller batches and closer to the point where 65C02 instructions would cause changes, leaning heavily into added buffers and prerendered assets so that we don't have to expand or calculate a variety of things; we can just blit.

Some of this is very straightforward: every time we make a change to any of the layer properties, we refresh all of that layer's properties in that 65C02 instruction. Obviously we could set a dirty flag instead and only update the layer's properties once, when we need to draw a line. This especially makes sense in the common case of someone changing a bunch of layer properties at once during VBLANK.

Some of this seems like it has obvious wins: if we keep a buffer of uncompressed color indices for a layer's tiles, we don't have to expand it on-the-fly for each line, it's fairly trivial to keep this data in sync, and it should require no extra work. This seems like pure savings after the initial expansion of VRAM into a tile image backbuffer (and that initial expensive step would be done only in response to refreshing dirty layer properties). If the cost of that refresh proves excessive, such as in cases where someone is using line IRQs to swap layer properties around a lot (though this seems like it would be rare), we could optimize it by keeping a pool of previous settings: paying a little bit of cost on 65C02 instructions that write to VRAM in order to speed up swaps between multiple layer settings, by keeping old settings cached that we can choose to reuse. Then we just need to find a decent hash for the layer properties, to try and ensure we don't needlessly clobber the cache, and how hard could it be to find a decent scattering of 3 bytes' worth of data?

The part where it's harder to anticipate whether it'll be a win or a loss is prerendering whole layers onto backbuffers, so that line draws only have to deal with scaling and translation. This is something I'd look into last, but it potentially gives us the biggest wins on the hot path. It's somewhere I'd be more comfortable working in DirectX rather than SDL2, but hey, opportunity to learn. Though actually, to accomplish this with maximum speed in a cross-platform fashion, it looks like OpenGL would really be the way to go, so that a tilemap could be described as a series of UV coordinates onto a mesh, and we could render it to a backbuffer with a single draw call. Still, SDL2 provides an accelerated bit-blitting function that can perform horizontal and vertical flips (SDL_RenderCopyEx), so that may prove to be a win.

The way that prerendering a whole layer could easily spiral out of control, though, is if someone is updating the tile data after the layer has been rendered. Perhaps a better approach would be to use SDL2 calls to render individual rows of tiles into a backbuffer, so we're paying for the draw once every 8 or 16 unscaled lines, then can trivially blit individual pixels - again, unless or until the layer's properties or VRAM data is changed, invalidating the backbuffer.

Does this make sense, or am I sounding like a madman?
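To make the dirty-flag idea above concrete, a rough sketch (all names are illustrative stand-ins, not the emulator's actual symbols):

```c
#include <stdbool.h>

// Hypothetical: stands in for whatever expensive per-layer
// recalculation the emulator performs today.
void refresh_layer_properties(int layer);

static bool layer_properties_dirty[2] = { true, true };

// Called from the 65C02 instruction that writes a layer register:
// just record that a refresh is needed instead of doing it now.
void on_layer_reg_write(int layer)
{
    layer_properties_dirty[layer] = true;
}

// Called once per rendered line: pay the refresh cost at most once per
// batch of writes (e.g. a burst of property changes during VBLANK).
void begin_render_line(void)
{
    for (int layer = 0; layer < 2; ++layer) {
        if (layer_properties_dirty[layer]) {
            refresh_layer_properties(layer);
            layer_properties_dirty[layer] = false;
        }
    }
}
```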
In related news, my RPi4 arrived recently; the emulator runs at... 36%. Very "oof", much room for improvement. But now I have a platform that isn't a $3500 gaming PC that I can use for testing performance tweaks. And there seems to be no shortage of interest in running the emulator on RPis...
Well, that avenue is not a bust, but even once you get the rendering down to just "copy lines from backbuffers and then replace with final palette colors", an RPi4 only goes from 36% to 40%, and there's a substantial "warm-up" time, naturally, from building up the cached data. Still, it's about half of the way to -warp mode's 45% speed (which, in retrospect, I should have tried sooner, but... eh, it was fun to try anyways). One thing I really liked from my efforts, though, was that the refactor to drawing backbuffers creates trivial opportunities to insert multithreading, and I did some experimenting with OMP. I don't have measured numbers because I knew I wouldn't be comfortable committing any multithreaded code at this point, but OMP substantially reduced the cost of warming up the video cache - not quite to the point where the warm-up cost at boot time was unnoticeable, but on my desktop it got me to running at top speed within a cursor blink on the default console, and the demos and games I threw at it didn't even appear to flinch.
If you're interested, my work in this direction is committed to https://github.com/indigodarkwolf/x16-emulator/tree/vera-fast-v1.
Hi Stephen,

I'm not overly bothered about the startup time; it's not that noticeable. But do these changes enable multithreading in the hot paths at all?

Thoughts on direction

This line shows up at 3.4% in my profile:

```c
uint16_t x_add = (xx * props->bits_per_pixel) >> 3;
```

I don't know enough about the x16 VERA stuff - when can these props values change? They surely can't change during a render_layer_line_* call, so I think, while we would have to duplicate code, we should attempt to break these out and hoist these tests and calculations higher up the call tree?
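To illustrate that hoisting, a sketch (the struct and fetch helper are invented for the example; the real code differs):

```c
#include <stdint.h>

typedef struct {
    uint8_t bits_per_pixel; // 1, 2, 4, or 8; fixed for a whole line
    // ...
} layer_properties;

// Hypothetical helper that extracts one color index from VRAM.
uint8_t fetch_color_index(const layer_properties *props,
                          uint16_t byte_offset, uint8_t bit_offset);

void render_layer_line(const layer_properties *props,
                       uint8_t *layer_line, uint16_t width)
{
    // Hoisted: bits_per_pixel cannot change mid-line, so keep a running
    // bit position instead of recomputing (xx * bpp) >> 3 per pixel.
    const uint8_t bpp = props->bits_per_pixel;
    uint32_t bit_pos = 0;
    for (uint16_t x = 0; x < width; ++x) {
        const uint16_t x_add = (uint16_t)(bit_pos >> 3);
        layer_line[x] = fetch_color_index(props, x_add, bit_pos & 7);
        bit_pos += bpp;
    }
}
```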
Looks like you focused on the performance of bitmap layers. Bitmap rendering was almost untouched by my changes, and bitmap layers aren't being prerendered onto a backbuffer like text and tile ones, so I'm not surprised that their performance was not substantially changed. I knew tile and text would be more involved (in particular, if someone replaces a single tile/character on the map then it should only redraw that individual tile/character and not the entire backbuffer), so I focused on those.

Timing of property changes

The other thing to consider, as well, is that this work shifts some of the work out of the previous hot path and into the CPU emulation. For instance, when someone executes an instruction to write a byte into VRAM, if that byte touches an active or cached tilemap, we go ahead and immediately update the appropriate backbuffers on those tilemaps (assuming those backbuffers have not been invalidated for other reasons).

Multithreading potential

When there's a backbuffer to rely on, drawing the backbuffer itself is trivial to multithread, and I'm sure that work on the hot path will be just about as easily threaded. The main difference is that the blit on the hot path looks like this:

```c
const uint32_t hscale = reg_composer[1];
uint32_t xaccum = props->hscroll << 7;
for (uint16_t x = 0; x < hsize; ++x) {
    const uint16_t eff_x = xaccum >> 7;
    layer_line[layer][x] = prerender_line[eff_x & max_buffer_x];
    xaccum += hscale;
}
```

While each backbuffer render looks like this:

```c
for (int y = 0; y < buffer_height; ++y) {
    prerender_layer_line_text(layer, y, props->prerendered_layer_color_index_buffer + (buffer_width * y));
}
```

It's trivial, then, to get large benefits from inserting an OMP pragma:

```c
#pragma omp parallel for
for (int y = 0; y < buffer_height; ++y) {
    prerender_layer_line_text(layer, y, props->prerendered_layer_color_index_buffer + (buffer_width * y));
}
```

I'm just new enough to OMP to have not sussed out the correct way to thread the trivial copy case, with its incrementing xaccum parameter, or whether OMP is smart enough to figure out for itself the appropriate blocks of work so threads are not being dispatched for a single trivial iteration.

Now that I'm thinking about it, if we needed to speed up the execution of keeping backbuffers up-to-date from CPU writes to VRAM, that would be pretty easily parallelized, as that's a fairly trivial for-loop as well:

```c
for (int i = 0; i < num_layer_properties_allocd; ++i) {
    if (is_layer_map_addr(i, address)) {
        poke_layer_map(i, address, value);
    }
    if (is_layer_tile_addr(i, address)) {
        poke_layer_tile(i, address, value);
    }
}
```

Other thoughts

My concern was initially centered around animations requiring a much larger cache of sprite settings (which would require a proper hash table or map structure to store them in, unlike the layers, where there were few enough that I could get away with a linked list). Sprites don't look like tile layers, in the sense of having a base address and asset size followed by a reference to an asset index. Sprites just have a base address, so a 6-frame animation is, immediately, 6 cache entries. And if it can be h-flipped or v-flipped, that work either goes back into the hot path, or we potentially multiply the number of cache entries to account for flips. It's something I'll continue to think about.
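For what it's worth, the copy loop can be made thread-friendly by removing the loop-carried xaccum dependency, at the cost of a multiply per pixel. A sketch, untested against this branch:

```c
#include <stdint.h>

// Same quantities as the hot-path snippet above, passed as parameters.
// Each iteration derives its own accumulator from x, so iterations are
// independent and OMP (or the auto-vectorizer) can split the range.
void blit_line_scaled(uint8_t *dst, const uint8_t *prerender_line,
                      uint16_t hsize, uint16_t hscroll, uint32_t hscale,
                      uint16_t max_buffer_x)
{
    const uint32_t xbase = (uint32_t)hscroll << 7;
#pragma omp parallel for
    for (int x = 0; x < hsize; ++x) {
        const uint16_t eff_x = (uint16_t)((xbase + (uint32_t)x * hscale) >> 7);
        dst[x] = prerender_line[eff_x & max_buffer_x];
    }
}
```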
I still need to think about sprites, but I've gone ahead and sped up bitmaps in the same way that I did tile and text layers. The render_line for them is now essentially a blit with horizontal scaling.

I also realized that the compositing step before final color selection could be sped up by only bothering with the arrays that were needed - a single switch with 7 valid options that are all one-liners seems like a no-brainer optimization over copying potentially 4x the data into local stack variables before juggling one into the final array that matters. A little less helpful in the worst case, when sprites and both layers are enabled, but still better to replace branching three times per pixel with a single branch per line.

I also wonder if SDL2 could speed up the final step where we replace color indices with actual palette colors. Something to experiment with. And then there's PS/2. Something else for me to think about this week.
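To sketch the shape of that switch (the enable bits and per-combination compositors are invented names; the real cases would come from the compositor):

```c
#include <stdint.h>

enum { EN_L0 = 1, EN_L1 = 2, EN_SPR = 4 };

// One specialized one-liner per combination of enabled sources.
void composite_l0(uint16_t y);
void composite_l1(uint16_t y);
void composite_spr(uint16_t y);
void composite_l0_l1(uint16_t y);
void composite_l0_spr(uint16_t y);
void composite_l1_spr(uint16_t y);
void composite_l0_l1_spr(uint16_t y);

// One branch per line instead of three branches per pixel.
void composite_line(uint8_t enabled, uint16_t y)
{
    switch (enabled) { // the 7 valid non-empty combinations
    case EN_L0:                  composite_l0(y);        break;
    case EN_L1:                  composite_l1(y);        break;
    case EN_SPR:                 composite_spr(y);       break;
    case EN_L0 | EN_L1:          composite_l0_l1(y);     break;
    case EN_L0 | EN_SPR:         composite_l0_spr(y);    break;
    case EN_L1 | EN_SPR:         composite_l1_spr(y);    break;
    case EN_L0 | EN_L1 | EN_SPR: composite_l0_l1_spr(y); break;
    default: /* nothing enabled: border color only */    break;
    }
}
```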
What is the status of this?
So far, using an RPi4, the runtime difference between r38 and my current branch is 37% for r38 versus 44% for my branch when idling after a boot. Not bad. But if I have it loop on 10PRINT"HELLO COMMANDERX16!":GOTO10, r38 maintains 37% while my branch drops to 31%. Not ideal, and unless/until this is addressed, I'm not sure I want to submit a PR. Chase Vault runs at 23% for r38, 33% for my branch. So, various pros and cons.

Update of what I've been doing and thinking about, specifically:

The majority of the changes in my branch are to maintain backbuffers for the layers and sprites. By pre-drawing all the layer data onto a backbuffer, the actual write is a trivial for-loop. Also, the way that the backbuffer draw occurs is trivial to multithread, as previously discussed. That said, I don't have the multithreading checked in anywhere yet, as I got distracted by chasing other butterflies.

One of the thoughts I hit on was that writes to VRAM are "relatively rare", so in order to make layer and sprite backbuffer operations as fast as possible, I've changed all VRAM writes to essentially perform 4 writes: one each assuming the data is 8bpp, 4bpp, 2bpp, and 1bpp, to parallel VRAM arrays. That way, we don't have to expand windows of the data later for various layers and tile sets. It makes the act of changing layer settings somewhat quicker, and means we don't have to otherwise update or refresh the expanded data behind a layer when a write occurs. I think this was a worthwhile trade-off.

And that's about where I've stalled out in making the VERA faster, with rapidly updating tiles on a layer being the new expensive problem. Potentially a problem for OMP to sort out, if we officially want to go that route.

In my own profiling, PS/2 is the next biggest single hotspot, but I'm not sure it can be easily addressed. I made a small tweak to the ringbuffer implementation in the PS/2 code, since that was a mild hotspot with the for-loop approach of checking whether the ring buffer had room for new data. That optimization is included in the runtime figures up top. Beyond that, I'm seeing time lost to branching, and I just don't know if there's anything to be done about that: I took a stab at refactoring the ps2_port_t struct in the hopes of replacing some branching with bit ops, but it's only a small win; it gets me up to 46% on my RPi4 while idling.

So then the next place to go digging, according to my profiling, is the 65C02 implementation itself. One easy-ish change to reduce needless branching is to pull the BCD logic out of adc and sbc, and instead create bespoke operations which are substituted into the instruction table whenever we execute sed, and then replaced with the non-BCD flavors on cld. But the branching issues in the 65C02 code run a lot deeper than that: each potential status flag change is a branch based on an appropriate conditional statement, and each time the processor reads or writes a value is a branch to check whether it should take the value from the accumulator instead. But I'm inclined to believe Michael Brown's analyzer when it says that we'd perhaps only get a 10% benefit from substantially reworking all that anyways, especially since any hit to reading or writing an actual memory address is an unavoidable set of branches to deal with I/O, banked RAM, and ROM.

And that's where I'm at right now.
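A sketch of those parallel VRAM writes (the array names and the 128KB size are illustrative):

```c
#include <stdint.h>

#define VRAM_SIZE (128 * 1024)

static uint8_t vram[VRAM_SIZE];          // raw bytes; serves 8bpp directly
static uint8_t vram_4bpp[VRAM_SIZE * 2]; // one color index per pixel
static uint8_t vram_2bpp[VRAM_SIZE * 4];
static uint8_t vram_1bpp[VRAM_SIZE * 8];

// Every CPU write to VRAM also stores pre-expanded color indices for each
// bit depth, so line renders and backbuffer refreshes can read per-pixel
// indices directly instead of unpacking bytes on the fly.
void vram_write(uint32_t addr, uint8_t value)
{
    vram[addr] = value;

    vram_4bpp[addr * 2 + 0] = value >> 4;   // two 4bpp pixels per byte
    vram_4bpp[addr * 2 + 1] = value & 0x0F;

    for (int i = 0; i < 4; ++i)             // four 2bpp pixels per byte
        vram_2bpp[addr * 4 + i] = (value >> (6 - 2 * i)) & 0x03;

    for (int i = 0; i < 8; ++i)             // eight 1bpp pixels per byte
        vram_1bpp[addr * 8 + i] = (value >> (7 - i)) & 0x01;
}
```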
@indigodarkwolf If I may ... before slicing and dicing through the video.c implementation, I would first try to mimic more closely the separation of hardware that exists on the real machine. In real life, the main proc is not bothered at all by any of the work done by the video... it writes to some registers and things happen.. that's all ... it is neither blocked nor slowed down ... no interactions whatsoever ... the emulator should attempt to replicate that model, by all possible means!!! In the end... maybe SDL should be running the main loop, and the emulator loop should be running in its own background thread ... because that is how the x16 works. It becomes very clear that it works this way when you carve out the SDL code from the VERA code in video.c (I have separate video.c and video_vera.c with a clean, complete software VERA separate from everything else). I also cleaned up the memory-related code, and RAM is no longer patchable by whoever and their dog... having done that made it very clear what the real contact points are between the 6502's running ROM code and the rest of the world.
> anything else would just be going into a SIMULATOR rather than trying to build an EMULATOR

My general thoughts on that -

I believe the original goal was indeed a simulator - emulate as accurately as possible. Changing that would require general consensus or a fork/new project.

Multithreading is not possible cleanly with the current architecture. And with a clock-accurate emulator, I'm not sure it would buy us much. Video rendering is the lion's share of the work; even the CPU stuff is minor. The shifting of data is also minor, so there isn't much work to offload. The unit of work in one tick is very small.

Also got to consider the WASM target, which doesn't support multithreading.

Imho, changing the video pipeline is still the better option if more speed is still needed? I thought things were in a good place with the emulator now?

Kind regards
Mike
@mikey-b .. well ... my cleaned-up code is a lot closer to making it possible .. and it only took a couple of days for a complete newb to do the clean-up. The code works just fine as it is, but the real final test will be running 2 video cards.... will try that next. Yes, one tick is very small ... and it should have NOTHING to do whatsoever with the video behavior :) .... my point exactly... proof that they are conceptually separate? .. IRQs .. VERA can trigger an IRQ on some special conditions (the v/h syncs, the sprite collisions) to completely disrupt the normal flow of the 6502, meaning that they are indeed not in sync at all by nature. ...... ... hmmm ... WASM ... ... it does have access to worker threads.. no? (I will have to check again .. made a poc at work for a cad thing targeting wasm over a year ago)
Could you expand? What changes have you made, and what are you proposing? And what would the two video cards be for?
@mikey-b the 2 cards thing is not directly related .. I will just use that step as a validation that I did isolate things properly. If I did... then a simple BASIC app can directly poke the second card with zero impact on the main one (the little BALLS example is a good one to start from). With the contact surface so greatly reduced, I can then try to run all the rendering for card2 inside a separate thread (the debugger will have to be cut off for the moment). The rendering code can do what it wants. hmmm .... I just realized that I have to look at the ROM code to see what it does (if anything) with the v/h syncs...
I would stay away from this for now ... .. I am not sure if I am the madman myself, but I see video.c as a CRAZZZZY GEM of an opportunity as it exists today. I know that I make no sense at all if I do not explain what I mean .. but better than explaining, I want to show it.. and that will take me a little bit more time to get there. And in the meantime, I am hoping that nobody destroys this opportunity. Note: I had the same kind of discussion a few years ago with the Arduino team ... trying to make them see that there was sooo much potential in the IDE source code, which they did not seem to be able to see.
Seems I missed a fair amount of conversation. Please forgive the length and breadth of a reply as I catch up.

@imihalkovic It may make a certain amount of sense to run the SDL loop on a separate thread from the rest of the machine, but by that I mean divorcing SDL from the VERA behavior as well, so that SDL is snapshotting completed framebuffers and drawing those. However, I suspect that will buy us very little in the way of performance. The emulator simply does not perform a lot of work through the SDL API - it is essentially just grabbing a framebuffer generated by the VERA and blitting it to the window. Easy-peasy; this is not our bottleneck.

De-synchronizing the VERA from the CPU by placing them on separate threads will be a non-starter for the emulator, for the simple reason that the official emulator concerns itself deeply with cycle-accurate behavior, from the CPU ops to the PS2 interface to the VERA, and everything in-between. I suspect we've already broken that in a minor way with other VERA optimizations, but we can't really know unless or until we have official hardware. Allowing each to spin freely, regardless of the other, will break cycle-accuracy in a major fashion.

And perhaps, instead of spending so much time programming for the emulator, I should be building test applications and asking Frank to run some comparisons between the emulator and the real hardware, so we can improve or restore the accuracy of the emulator.

Or by all means, fork the emulator. In truth, I'm working on my own fork of the emulator that makes a number of rather deep changes, in particular swapping to C++17 so that I can make use of the STL, instead of hand-rolling a bunch of data structures and algorithms I wish to have in trying to tackle further VERA optimizations.

But if you want my opinion of how best to speed up the VERA, the answer is very simple:

* Multithread the for-loops inside each render_layer_line_* function. These iterative steps are trivial to perform in arbitrary order, and are where the bulk of the VERA spends its time.
* Multithread the for-loop below "// Calculate color without border." within static void render_line(uint16_t y). Again, trivial to perform in arbitrary order, and the VERA spends substantial time here.
* Multithread the final framebuffer color assignment and NTSC overscan for-loops in static void render_line(uint16_t y). Again, it is obviously trivial to perform each iteration in arbitrary order.

The only reason I haven't already submitted PRs with those, many months ago, is because I was uncomfortable with adding a dependency on OpenMP without being able to build the project on Mac platforms.
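On the OpenMP-on-Mac concern: the dependency can at least be made soft, since the standard _OPENMP macro is only defined when OpenMP is enabled, and compilers without it simply ignore the pragma. A minimal sketch (the per-line worker is a hypothetical name):

```c
// OpenMP as an optional dependency: builds without -fopenmp (or the
// platform equivalent) fall back to the serial loop.
#ifdef _OPENMP
#include <omp.h>
#endif

void render_one_layer_line(int y); // hypothetical per-line worker

void render_layer_lines(int line_count)
{
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int y = 0; y < line_count; ++y) {
        render_one_layer_line(y);
    }
}
```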
Have you seen any improvements from multithreading those functions? I found VERA to be heavy on the reads, but it only produced about 200 bytes. Multithreading didn't yield much when I attempted it.
It wasn't order-of-magnitude differences in performance, but yes, I did see improvements from multithreading the various render_layer_line() functions. I believe I also took the liberty of spinning the calls to those functions, themselves, into separate threads, so everything was run in parallel.

Note that my test platform was a Raspberry Pi 4, because I feel that PC-based performance is likely "good enough", and RPi4s were the trending subject on the Facebook group at the time. So optimizations against PC performance are not interesting to me.

Also note that even the CPU emulation is slow on the RPi4, so the VERA is not the only corner of the emulator that needs help if the emulator is to run at full speed on one of those. And that may be asking too much.

The changes I'd experimented with also rolled back the LAYER_PIXELS_PER_ITERATION arrays that... well, I'm going to be honest, I have no idea how they would help, in any context. They don't meaningfully improve cachability, the compiler won't automatically cast them to 32-bits to try and batch operations together with larger type sizes, the arrays aren't vector types and the loops aren't using vector ops... I could go on, but at this point I don't even think I would believe any amount of theory-crafting and I simply need someone to whap me on the nose with data from an instrumented build to be convinced. And then there's the calls to memset when initializing these arrays, and I can't immediately recall whether the compiler is optimizing those away, but if it isn't then that's a complete waste of time for arrays of only 4 elements.
> The changes I'd experimented with also rolled back the LAYER_PIXELS_PER_ITERATION

I can only assume the following thinking: a fixed loop iteration count is required for the auto-vectorizer to work. But it has other restrictions too, e.g. no globals. Adding "-fopt-info-vec-missed" to the compiler flags will show you the places it failed to vectorize. IMHO, LAYER_PIXELS_PER_ITERATION needs to be removed for better clarity anyway.

> Also note that even the CPU emulation is slow on the RPi4, so the VERA is not the only corner

OK, well, some suggestions in fake6502:

* Remove callexternal, hookexternal() and loopexternal; it's a feature that isn't used. We can then remove the branch "if (callexternal) (*loopexternal)();" from the hot path.
* Remove "instructions++" from step6502; it's duplicated work in main.c. Or (my preference) extern instructions and remove instruction_counter, which is in main.c.
* clockgoal6502 is unused, as exec6502 is not being used.
* There are a few places we can go branchless in here (unsure of any benefit):
  * In step6502, the cycle count can become: clockticks6502 += 1; clockticks6502 += (penaltyop && penaltyaddr);
  * Page boundaries - e.g. if ((oldpc & 0xFF00) != (pc & 0xFF00)) clockticks6502 += 2; else clockticks6502++;

And then the tricky bit:

```c
uint32_t old_clockticks6502 = clockticks6502;
step6502();
uint8_t clocks = clockticks6502 - old_clockticks6502;
bool new_frame = false;
for (uint8_t i = 0; i < clocks; i++) {
    ps2_step(0);
    ps2_step(1);
    joystick_step();
    vera_spi_step();
    new_frame |= video_step(MHZ);
}
audio_render(clocks);
```

That MHZ is a macro constant, and is being passed in as a stack variable. We 100% need to get rid of that and keep it a compile-time constant.

Because we're only doing a single step (step6502), I think we should alter it to return the clock ticks, and remove that calculation on globals:

```c
uint8_t clocks = step6502();
```

Unless anyone can suggest a way to use exec6502? Or must the CPU be in such tight lockstep with the VERA (no wiggle room here)?

We also know that clocks <= 7 (the most expensive single instruction on a 6502). IMHO, this is why video_step is struggling: it can't get enough data processed in one pass. Can we not move this out, similar to audio_render(clocks)? Does video_step rely on ps2_step, joystick_step, or vera_spi_step?
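To make two of those suggestions concrete, a sketch in fake6502's style (illustrative, not a tested patch):

```c
#include <stdbool.h>
#include <stdint.h>

extern uint32_t clockticks6502; // fake6502 global

// Branchless penalty: the && expression already evaluates to 0 or 1,
// so there is no conditional jump for the predictor to miss.
static void add_cycles(bool penaltyop, bool penaltyaddr)
{
    clockticks6502 += 1;
    clockticks6502 += (penaltyop && penaltyaddr);
}

// step6502 returning the ticks it consumed, so main.c needn't diff the
// global before and after each call.
uint8_t step6502(void)
{
    const uint32_t before = clockticks6502;
    // ... fetch, decode, and execute one instruction as today ...
    return (uint8_t)(clockticks6502 - before);
}
```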
I broadly agree with all of these.

Additionally, with the current VERA behavior, we could probably exec6502 in increments of 239 cycles and then pump individual instructions until the VERA signals that a new line has started. If there are per-pixel behaviors lurking in the VERA hardware, though, we don't know about them yet, and in the worst case it means the VERA would have to render smaller segments to the framebuffer based on accesses to certain I/O registers. There would also be open questions about other subsystems like PS/2, which also depend on being pumped every CPU instruction.

As part of the work I've done in my own emulator, I've actually largely rewritten the PS/2, just to understand it, and in the process ended up making it substantially faster by making it do fewer checks based on I/O state, and making those checks much faster. My implementation could be back-ported to C, itself a win, but it could also probably be modified to update only when the CPU reads or writes the appropriate VIA ports, or when the KB/M devices post updates into it. This would represent another win in performance. (And I look at it as an example of how my fork, though it's making some brazen changes, can still create opportunities for me to contribute back to the original emulator once Michael Steil is available again.)

I haven't looked at the audio subsystems yet, to determine to what extent they may benefit from a similar treatment of only updating state when the CPU accesses their I/O, in the hopes of enabling the CPU to tight-loop as much as possible. And I haven't looked at SPI yet, either, for SD card access.
my turn to play catch-up ..

@indigodarkwolf "divorcing VERA from SDL" is what I did. I carved out all the VERA code into vera_video.c and added a public API (not yet good) in places (wink wink .. debugger) where some of the rest of the emulator was getting too close to that code. As you said, video.c is now remarkably simple, and update just gets a snapshot of the framebuffer, which it blits out. When I saw that the code was so cleanly written, I just couldn't believe the doors that it did open .... Because they are so completely disjointed (in HW too), it does make all the more sense to me to run it completely OUTSIDE the simulator mainloop (years ago I was working on military sims, and one of the lessons I remember from working with sims of 1M people in a city during a catastrophic event was that the more independent things are, the more degrees of freedom the system is keeping).

So one of the things I can do now is run 2 VERAs ... which means the possibility of having the debugger on one and the app on the other.... personally, I like it .. but it will also be possible to have 2 VERAs for the apps themselves. The caveat here is that the second card is not known at all by the ROM (which is ok) and (if I am not mistaken) the x16 does not have the ROM extension mechanism that the Apple II had, which would allow loading a relocatable stub for card selection (basically using a reg in page 0 to tell the base address of the 32-byte IO window where VERA is located, coupled with making the ROM VERA code use that to compute the regs to use instead of hardcoding a ref to the 2 or 3 IO region .. forgot which one proto 2 is using). But that does not prevent an app from directly using the regs of a second VERA to drive the second screen.

This SIM is a beautifully written piece of clean code... there are still features missing before going into an aggressive optimization phase, which invariably results in damaging the layer boundaries... e.g. there is no easy way now for someone to simulate their own extension cards... which will be a key feature of the x16... so before cutting corners, the APIs should be finished and the missing abstractions created/completed. Knowing what the missing abstractions should look like should be a higher priority than optimizing the ones that are currently in the system, because they may have to change in light of what is currently missing. When all the abstractions are there, then and only then, it will be time to slash through them to try to aggressively speed up what can be. You know what Knuth said, right!?
Hi Mihalkovic, Stephen,

Are we setting the target goal to be the RPi 4?

Also, can I pick both of your brains: what is the lockstep you have chosen for the CPU and VERA? And is there a spec decision on this? Does the VERA have separate RAM, or is it shared with the main RAM?

Kind regards,
Mike Brown
Well, I can at least answer that the VERA has separate RAM from the main RAM; it is not shared.

As for batch-executing the 65C02, I'd just point out that if we have a signal from the VERA module that a new scanline has been produced, then we have at minimum 239 cycles we can execute before the current implementation will care about any changes, within or without the VERA. This is just taking 8,000,000 cycles per second, divided by 60 frames per second, divided by 525 scanlines per frame, minus 14 cycles to account for "slop" from a previous instruction taking us as far as 7 cycles into the scanline, and some "final" instruction taking up to 7 cycles. This is actually probably conservative by 7 cycles, but even so we get at minimum 34 instructions before needing to go back to stepping the CPU by a single instruction and then processing the VERA.

But you're right, it would be ideal to finish working on the core features before doing anything drastic that might blur the boundaries between systems.

For a target, our known platforms are Windows, Mac, Linux, and WebAssembly. I'm not familiar with building WebAssembly, and can't test a Mac build. I don't know how to instrument and profile a WebAssembly build either; haven't even looked into it. The RPi4 runs Linux and happens to be a platform that has some optimization peculiarities which I believe it has in common with WebAssembly, and since it runs at around 40% speed at present, it seems like a platform where it'll be easy to observe concrete gains, even without instrumentation.
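A sketch of how that batching might look (exec6502 and step6502 exist in fake6502; the scanline signal and the VERA catch-up call are made-up names, and the constant follows the arithmetic above):

```c
#include <stdbool.h>
#include <stdint.h>

void exec6502(uint32_t tickcount);   // fake6502's batch executor
uint8_t step6502(void);              // assuming it returns ticks consumed
bool vera_new_scanline(void);        // hypothetical line-done signal
void vera_catch_up(uint32_t cycles); // hypothetical VERA advance

#define SAFE_CYCLES 239 // (8000000 / 60 / 525) - 14, rounded down

void run_one_scanline(void)
{
    // Batch: nothing VERA-visible can change for at least 239 cycles,
    // so run the CPU without per-instruction VERA checks...
    exec6502(SAFE_CYCLES);
    vera_catch_up(SAFE_CYCLES);

    // ...then single-step until the VERA reports the line boundary.
    while (!vera_new_scanline()) {
        vera_catch_up(step6502());
    }
}
```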
Hi Stephen,
Thanks for the information. Is the VERA connected and communicating via
SPI? And does this communication processing happen after one scan line?
Kind regards,
Mike
The VERA's I/O is mapped to a variety of memory addresses as registers, using parallel communication. See also: https://github.com/commanderx16/x16-docs/blob/master/VERA%20Programmer's%20Reference.md

Communication is processed immediately. We're not sure whether the VERA latches certain values when starting to draw each scanline, or whether that was a misunderstanding of something Frank once said, which may instead have been referring to a behavior where it draws lines to a backbuffer. The original VERA implementation processed between each CPU instruction and drew individual pixels as the scanline would have swept across the display.
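For flavor, a much-simplified sketch of the data-port behavior described in that reference: reads and writes go through an address register that auto-increments by a selectable stride (register decoding and the exact increment encoding are omitted here):

```c
#include <stdint.h>

#define VRAM_MASK 0x1FFFF // 128KB of VERA-local VRAM

static uint8_t  vram[VRAM_MASK + 1];
static uint32_t vram_addr; // set via the ADDRx_L/M/H registers
static uint32_t vram_incr; // stride selected in ADDRx_H's upper bits

uint8_t vera_data_read(void)
{
    const uint8_t value = vram[vram_addr & VRAM_MASK];
    vram_addr += vram_incr; // auto-increment after each access
    return value;
}

void vera_data_write(uint8_t value)
{
    vram[vram_addr & VRAM_MASK] = value;
    vram_addr += vram_incr;
}
```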
If we go with the assumption that the VERA is drawing per-pixel and latches nothing, then we do at least know that the VERA draws to a back buffer. For instance, if you're counting lines from VSYNC, it draws line 33 (the first visible line) during line 32 (the last line of the back porch), then swaps buffers so that the result of line 33 is presented while line 34 is drawn to the back buffer.

This means we'd have to roll back the optimization that draws entire lines at a time, and this would also make the VERA much less likely to see wins from multithreading, unless we did one of those "blurring the boundaries"-type optimizations where the VERA would only process based on I/O, and every so many CPU cycles.

The biggest performance hit from per-pixel processing is the sprite logic. That loops through 128 sprites, and the order of processing matters because of how work units are calculated, and doing that task 18 million times per second is... unappealing. But there were obviously other wins to be had from drawing whole lines, due to improved cache behavior. It would be sad to roll that back, as it would seriously hurt the emulator's performance.
@indigodarkwolf I know I already said this... but in the real hardware, VERA is independent from the clock cycle of the 6502. Considering that the software needs to be in lock-step with VERA is creating an artificial constraint.

@mikey-b like many of the computers of that era, VERA is a memory-mapped device, where a region of memory (32 bytes) corresponds to 32 registers in the FPGA. The code I cleaned up in memory.c made it possible for me to have a second VERA allocated to a second block of 32 I/O ports. There are roughly 2 families of ways in which the FPGA code can be implemented, and they are veeeery different. Having no access to the hardware or the people working on it, I cannot comment on which of the 2 approaches has been chosen (just know what I would have done to 'go fast'). As for what it 'can or cannot do', the VGA spec is ultimately the guide.

Yes, to prepare the frame buffer, VERA manages its own local RAM, outside of the main CPU RAM (all GPUs do). That RAM is effectively split into 2: the VRAM and the frame buffer. As the frame buffer is actually small, it is possible that it is implemented using the FPGA's block RAM (otherwise it is also in the RAM chip). That RAM contains different types of data: there are the layers, the sprites, the tiles, the active color palette... The frame buffer is what gets bit-banged to the monitor at the pixel clock rate (the value depends on the VGA mode). VGA defined the 'dance' .. i.e. the time for sending data and the time for sending nothing, so that old monitors could sync the ray guns. This was (and is) all analog signaling (.7v), hence the need for a DAC (you can either use an existing IC or make your own, which is what I think happened on the VERA board).

Performance of such a memory-mapped GPU could be abysmal.. but VERA uses auto-increment (with a user-selectable stride) to improve the situation. So there is no SPI involved... even though in my own little proto I am using a PCF8574 to transfer the 32x1byte onto an SPI bus (a simple non-FPGA way of doing mem-mapped IO ports). Well .. technically there is likely some SPI: if VERA does not have all the memory on the FPGA, then depending on the selected SRAM chip, it is either a parallel or a serial SRAM chip. Can't tell from the proto 1 video if the SRAM is the 2 SMDs or one of the 2x dip-8... dip-8 SRAM would mean SPI to the FPGA.

@indigodarkwolf actually, I was reading the vera doc again....
@lmihalkovic I think the subtlety being missed is that just because the VERA runs on its own clock doesn't mean the VERA will spin for, say, 10ms without the CPU being allowed to have any influence on it. Or vice versa: it won't just go to sleep for 10ms while the CPU does stuff and then wake up. But these circumstances are real hazards of multithreading - you don't get to control when the thread executes, only the OS does. Any time you ask the thread to sleep for a while, you're putting your faith in the OS to wake it up in a timely manner, except there's no guarantee of that. Even if you spin-wait instead, there's no guarantee the CPU won't interrupt you for however long it wants.

So you can't just spin the VERA off onto its own thread and call it done. You have to synchronize it with the CPU on a periodic basis, or the behavior will, in fact, be different from the real hardware. And that periodic basis will basically be "every single CPU instruction, with the VERA being told how many clock cycles elapsed and thus how many ticks it is allowed to process until it must wait for the CPU again". But at that point, you've got all the same overhead as our single-threaded approach now, only you've added the extra weight and complexity of threading.

You might get a performance win with that approach anyways, but only if the VERA implementation is brought back to a state of calculating output on a per-pixel basis instead of per-line. But that's still a world in which the VERA is essentially in lock-step with the CPU.
Hi all,

I have opted to not get too deep into the details of going multithreaded yet. I am hopeful for any feedback from you both; experiences, snags, and performance numbers would be wonderful. I have, however, looked into a Pi (3 Model B), perf and stats on the current code base, unit testing, and the upcoming hardware changes.

Unit testing

Using dlang BetterC, I have attempted unit testing. The current design and the nature of being an emulator have made unit testing difficult, and it yielded very little benefit. Looking at other emulators, acceptance tests look to be more suitable. These are tests written in 6502 that provide feedback out of the real hardware or emulator, e.g. drawing a checkerboard or gradient on the screen using the IRQ signals, or using sprite collision to generate a sine wave. This is currently beyond my knowledge, but I see no reason why we can't create a list of tests to get the ball rolling.

Upcoming changes & VERA

As it stands, the Pi needs a 3x speed-up to be full speed. The Coz profiles suggest a possible 50% speed-up, but this still falls short. I feel we are all in agreement that we need an architectural change.

The VERA chip has some upcoming changes, and I feel the code for it is just too messy. And for performance, it has too many indirections. There is a lot of information that is calculated every cycle; e.g. it's extremely unlikely the mode will change so frequently. I feel the points above are agreed on by everyone? Stephen has already looked at caching etc.

I don't believe we can effectively do this kind of change in C. We need some kind of metaprogramming. I've tried both Dlang and C++. It's possible to change the entire code base to one of these with only a few changes. We could also just move the VERA into a separate folder as an external dep, and keep the current code in C.

So, I think we need to get some consensus, and to make clear what people are willing to contribute to the main repo, forking, or none, etc. Is everyone in agreement to:

* Moving the VERA to a separate compilation unit, based on C++
* Keeping the rest of the code C-based, but conformant to C++
* Isolating and cleaning up the VERA code, ready for upcoming changes and the possibility of multithreading
* Looking to start planning acceptance tests

Kind regards
Mike
Mike, I think you and I are on the same page here. I like that plan.
Hi all,

Bit of a curve ball. In an attempt to find bottlenecks, I cut down the emulator to doing nothing but render, with no instructions run. It's basically a counting loop and doing a frame update. The emulator runs at 99-100%, basically maxed out getting 60fps.

I've been looking into this and how other emulators are working. Research says SDL is very poor on the Pi, and OpenGL ES is required to get hardware acceleration: https://www.raspberrypi.org/forums/viewtopic.php?f=78&t=25146

Other emulators have also gone with assembly versions of the CPU. I'm thinking we need to consider these directions.

Kind regards
Mike Brown
I would like to know exactly what advantages moving the VERA implementation to C++ would provide. (I have never used C++, so the answer might be obvious.)
Hi Elektron72,

Of course. My interest is in the metaprogramming features, such as templates, constexpr, etc., to move some of the logic to compile time. It's technically possible to get the same result in C, but it takes longer and is harder to maintain.

A good example is the 6502 code. We need to combine address modes and instructions. Currently we have two indirections to combine these, so it is done at runtime. We could use metaprogramming to combine those into a single indirection (the call).

VERA has output modes, bit depths, etc., with indirections and decisions everywhere to handle them. We could template these to specialise the code to those modes, and pull that as high as possible; see the sketch below.
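For comparison, the closest C gets without templates is macro stamping, which works but is clumsier to maintain; a sketch with invented names (fake6502's actual tables differ):

```c
// Stamp out one fused handler per opcode, so dispatch is a single
// indirect call instead of chaining an address-mode call and an
// instruction call at runtime.
void imm(void); // address modes (defined elsewhere)
void zp(void);
void lda(void); // instructions (defined elsewhere)
void adc(void);

#define DEFINE_OPCODE(name, addrmode, op) \
    static void opcode_##name(void) { addrmode(); op(); }

DEFINE_OPCODE(lda_imm, imm, lda)
DEFINE_OPCODE(lda_zp,  zp,  lda)
DEFINE_OPCODE(adc_imm, imm, adc)

// The dispatch table then holds the fused handlers directly.
static void (*const optable_fused[])(void) = {
    opcode_lda_imm, opcode_lda_zp, opcode_adc_imm, /* ... */
};
```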
Kind regards
Mike Brown
@mikey-b @Elektron72 ... same here ... don't get me wrong ... C++ is the 2nd language I ever learned (Zortech C++, for the history buffs) ... after 6502 assembly .. so I do love C++. I am thinking about moving my version of the sim to C++, but for other reasons: the C definition of mapped hardware registers as plain structs has its own limitations, which C++ overcomes (Dan Saks).
Hot damn. I bloody well suspected there was gold to mine in that PS2 and VIA emulation code, tools notwithstanding. I'm seeing substantial wins from PR #336; I would appreciate it if other eyeballs could verify.
A kind soul from the unofficial Discord is saying that their Pi4 is seeing a gain from 61% to 77%. Not full speed like my test, but they're running headless, so they're accessing it through VNC. Still a good win. I wonder what the differences are between their Pi4 environment and mine. And their Pi3 is apparently running the official repo version at 16%, and my PR at 19%-21%. And probably still through VNC. :)
Hi Stephen,

I haven't reviewed yet, but planned on doing so over the Easter weekend. Great work either way; some good wins.

I continued with the dispmanx drivers last weekend. I added the ability to select the driver on the SDL side, but none of the installed drivers in a standard Raspbian helped. Moving to EGL to use this driver gave a ~40% boost. Considering how to implement this - a separate Pi VERA? As this will most likely break x86 and WASM.

Kind regards
Mike Brown
@indigodarkwolf yeah ... surely that code is 'gold' ... no worries, I never managed to get the Arduino team to see how they were about to trash their IDE (and they did it) ... so I don't expect I will be more successful this time around. :-)
Mihalkovic,

This group has no interest in that kind of atmosphere. We have real engineering challenges, with only the goal of solving them.

Stephen is in the top percentile of impactful contributors, and it looks like he's managed to give us another 10% frame rate improvement. And that's all we really care for: good engineering delivering results.

Regards
Mike Brown
@mikey-b Thanks. Although there's ambiguity in the intended target, I'm choosing to assume that @lmihalkovic's apparent sarcasm about golden code was directed towards the code powering Raspbian. If it was directed towards my own contributions, I would welcome constructive feedback or, failing that, specific criticisms of what could use improvement. And of course, I would be happy to answer questions about my code.

Obviously my contributions aren't perfect: I've personally introduced some bugs into the VERA emulation, specifically, and of course I've opened some questions about how the VERA truly behaves versus optimizations that I've made to try and make the VERA performant. But also, anyone can write code and submit a PR. Even if we can't see eye-to-eye on certain technical issues, I would still welcome more contributors working to improve the emulator, and it's ultimately up to Michael Steil to decide whose PRs are accepted.
@indigodarkwolf I apologize for the confusion I obviously created: there was, and still is, ZERO sarcasm in what I am saying... life is too short and precious for that. I am as serious as I was when writing something similar to M. Banzi. This codebase is very well written .. and I don't mind admitting that I did not even see its full potential when I proposed #325. It is only after I was done implementing it that my blindness dawned on me. I explained in #331 that I was not going to explain everything in this forum, but that I have no reservations explaining things in a more private setting. My apologies again if my comment had the appearance of sarcasm.
@lmihalkovic All good mate. Let's make everything fast together. :D
@indigodarkwolf as I said elsewhere, given a choice, my first goal would be to preserve that unique opportunity which I see for the x16 team within this code, and only then to make it faster... which may be at odds with your own. Saying this, I am keenly aware that I have done nothing to build any credibility within this group, while the work you have done and shared speaks for you (again, just plain honesty). ... Maybe one day I should share my private Arduino IDE repo to show what the Arduino team missed by refactoring their codebase as they did: for years they had said that certain features were not doable at a reasonable cost because there was no file manager in the code.... what they never saw was that there was one ... or more so, how easy it was to make one emerge from the existing code. Once I was done doing that, many things became trivial to implement, which they gave up on doing (basically feature parity with Arduino Create ... I still wonder to this day if they didn't just want to keep a difference with Create in order to push new users towards the web). At the time I did not want to share the code for free, considering how many problems it was solving for their professional development team. In one single 'improvement' refactoring they completely closed the door on many of these long-asked-for improvements.
Hi all,

I understand; text is hard and can always be misunderstood. As long as we're all on the same page.

For general chat, could I recommend the Discord server? I'm not there, but I think Stephen is there regularly. Any help on improvements is always greatly appreciated.

I have attempted several times to use the dispmanx drivers, but without success. Reviewing Stephen's later this weekend.

Happy Easter everyone
Yes, I am regularly on the unofficial Discord server; my username there is "Indigo".
I have run a profiler over the whole codebase which aims to predict possible performance-gain areas.

Each chart says: if you make this section of code x% faster, the whole program will speed up by y%. Unfortunately, all recommendations are very flat, suggesting that optimisation has come to a head with the current architecture.

What I did find interesting is all the instruction.h suggestions. It has highlighted a branch prediction issue. The chart says that if we can speed up these functions by 10%, the program will also speed up by 10%. I don't believe this is possible - but we can gain 1-2%.

oldpc is currently global, which could be aliased; making it local removes the unnecessary memory store. Is oldpc safe to remove?

We can also go branchless - example of optimised code: https://godbolt.org/z/q6Ma4e
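For reference, a sketch of what a local-oldpc, branchless branch instruction could look like in fake6502's style (an illustrative reconstruction, not necessarily what the godbolt link shows):

```c
#include <stdint.h>

// Illustrative fake6502-style globals.
extern uint16_t pc;
extern uint16_t reladdr;
extern uint32_t clockticks6502;

// bra() with oldpc as a local (no global load/store) and the page-cross
// penalty computed arithmetically, leaving no branch to mispredict.
static void bra(void)
{
    const uint16_t oldpc = pc;
    pc += reladdr;
    // +1 cycle normally, +2 when the branch crosses a page boundary:
    // XOR exposes a changed high byte; the comparison yields 0 or 1.
    clockticks6502 += 1 + (((oldpc ^ pc) & 0xFF00) != 0);
}
```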