
Performance Stats + Instructions #284

Open

mikey-b wants to merge 15 commits into master

Conversation

@mikey-b commented Jul 11, 2020

I have run a profiler over the whole codebase which aims to predict where performance gains may be possible.

[screenshot: profiler charts for the hottest functions]

Each chart says: if you make this section of code x% faster, the whole program will speed up by y%. Unfortunately, all the recommendations are very flat, suggesting that optimisation has come to a head with the current architecture.

What I did find interesting is all the instruction.h suggestions. It has highlighted a branch prediction issue. The chart says that if we can speed up these functions by 10%, the program will also speed up by 10%. I don't believe that is possible, but we can gain 1-2%.

oldpc is currently global, which means it could be aliased; making it local removes the unnecessary memory store. Is the global oldpc safe to remove?

We can also go branchless. Example of optimised code: https://godbolt.org/z/q6Ma4e
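
A rough sketch of the shape I mean, assuming a fake6502-style core where pc, reladdr, status and clockticks6502 are the usual globals (the helper name is mine, not the emulator's):

extern uint16_t pc, reladdr;
extern uint8_t  status;
extern uint32_t clockticks6502;

static inline void branch_if(uint16_t taken) {  /* taken is 0 or 1 */
	const uint16_t oldpc = pc;                  /* local now, not global */
	const uint16_t mask  = (uint16_t)(0 - taken);
	pc += reladdr & mask;                       /* adds 0 or the offset, no branch */
	const uint16_t crossed = ((oldpc ^ pc) & 0xFF00) != 0; /* page cross, compiles to setcc */
	clockticks6502 += taken + (taken & crossed); /* +1 if taken, +1 more on page cross */
}

/* e.g. bcc() becomes: branch_if((status & FLAG_CARRY) == 0); */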

mikey-b added 15 commits May 14, 2020 15:43
Extra compiler flags for better output.

Added max(), min() macros

Heavy Video.c refactoring.
Removed tile palette code, re-added
rebase to indigodarkwolf
Changes to render_line_bitmap and render_line_tile
Remove lambda in render_line_tile as the Mac and WebAsm builds weren't happy.
render_line_tile - Now reads props->tilew for "tile update skipping" step width.
use MHZ constant
The branches cannot be predicted very well; oldpc is also currently globally scoped, requiring a memory read and write, and making it local removes this. (oldpc appears not to be used globally.)
oldpc is now a local stack variable where needed, removing the globally scoped variable
Make bra() instruction branchless
@indigodarkwolf (Collaborator)

Personally, the avenue I'm looking into next for optimization is to move VERA work into smaller batches and closer to the point where 65C02 instructions would cause changes, leaning heavily into added buffers and prerendering assets so that we don't have to expand or calculate a variety of things; we can just blit.

Some of this is very straightforward - every time we make a change to any of the layer properties, we refresh all of that layer's properties in that 65C02 instruction. Obviously we could set a dirty flag instead and only update the layer's properties once, when we need to draw a line. This especially makes sense in the common case of someone changing a bunch of layer properties at once during VBLANK.
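
A minimal sketch of that shape (all names hypothetical, not the emulator's actual API):

#include <stdbool.h>
#include <stdint.h>

static uint8_t layer_regs[2][16];
static bool    layer_dirty[2];

extern void refresh_layer_properties(int layer);  /* the expensive recompute */

/* On a 65C02 write to a layer register: record the byte, defer the refresh. */
void layer_reg_write(int layer, int reg, uint8_t value) {
	layer_regs[layer][reg] = value;
	layer_dirty[layer] = true;
}

/* At the start of a line draw: pay the refresh cost at most once. */
void begin_line(void) {
	for (int layer = 0; layer < 2; ++layer) {
		if (layer_dirty[layer]) {
			refresh_layer_properties(layer);
			layer_dirty[layer] = false;
		}
	}
}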

Some of this seems like it has obvious wins - if we keep a buffer of uncompressed color indices for a layer's tiles, we don't have to expand it on-the-fly for each line, and it's fairly trivial to keep this data in sync and it should require no extra work. This seems like pure savings after the initial expansion of VRAM into a tile image backbuffer (and that initial expensive step would be done only in response to refreshing dirty layer properties).

If the cost of that refresh proves excessive, such as in cases where someone is using line IRQs to swap around layer properties a bunch (though this seems like it would be rare), we could optimize it by keeping a pool of previous settings: paying a little bit of cost on 65C02 instructions that write to VRAM in order to speed up swaps between multiple layer settings, by keeping old settings cached that we can choose to use. Then we just need to find a decent hash for the layer properties, to try and ensure we don't needlessly clobber the cache; and how hard could it be to find a decent scattering of 3 bytes' worth of data?
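
For the hash itself, even a simple multiplicative hash over the three packed bytes would probably scatter well enough (a sketch, assuming a power-of-two cache size):

#include <stdint.h>

#define SETTINGS_CACHE_SIZE 16  /* assumed, power of two */

static inline unsigned layer_settings_hash(uint8_t b0, uint8_t b1, uint8_t b2) {
	uint32_t h = (uint32_t)b0 | ((uint32_t)b1 << 8) | ((uint32_t)b2 << 16);
	h *= 2654435761u;                             /* Knuth's multiplicative constant */
	return (h >> 16) & (SETTINGS_CACHE_SIZE - 1); /* high bits mix best */
}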

The part that's harder to anticipate whether it'll be a win or a loss is prerendering whole layers onto backbuffers, so that line draws only have to deal with scaling and translation. This is something I'd look into last, but potentially gives us the biggest wins on the hot path. It's somewhere I'd be more comfortable working in DirectX rather than SDL2, but hey, opportunity to learn. Though actually, to accomplish this with maximum speed in a cross-platform fashion, it looks like OpenGL would really be the way to go, so that a tilemap could be described as a series of UV coordinates onto a mesh, and we could render it to a backbuffer with a single draw call. Still, SDL2 provides what it calls an accelerated bit-blitting function that can perform horizontal and vertical flips (SDL_RenderCopyEx), so that may prove to be a win.

The way that prerendering a whole layer could easily spiral out of control, though, is if someone is updating the tile data after the layer has been rendered. Perhaps a better approach would be to use SDL2 calls to render individual rows of tiles into a backbuffer, so we're paying for the draw once every 8 or 16 unscaled lines, then can trivially blit individual pixels. Again, unless or until the layer's properties or VRAM data is changed to invalidate the backbuffer.

Does this make sense, or am I sounding like a madman?

@indigodarkwolf (Collaborator)

In related news, my RPi4 arrived recently; the emulator runs at... 36%. Very "oof", much room for improvement. But now I have a platform that isn't a $3500 gaming PC that I can use for testing performance tweaks. And there seems to be no shortage of interest in running the emulator on RPis...

@indigodarkwolf (Collaborator)

Well, that avenue is not a bust, but even once you get the rendering down to just "copy lines from backbuffers and then replace with final palette colors", an RPi4 only goes from 36% to 40%, and there's a substantial "warm-up" time, naturally, from building up the cached data. Still, it's about half of the way to -warp mode's 45% speed (which, in retrospect, I should have tried sooner, but... eh, it was fun to try, anyways).

One thing I really like from my efforts, though, was that the refactor to drawing backbuffers creates trivial opportunities to insert multithreading, and I did some experimenting with OMP. I don't have measured numbers because I knew I wouldn't be comfortable committing any multithreaded code at this point, but OMP substantially reduced the cost of warming up the video cache, not quite to the point where the warm-up cost at boot time was unnoticeable, but on my desktop it got me to running at top speed within a cursor blink on the default console, and the demos and games I threw at it didn't even appear to flinch.

@indigodarkwolf (Collaborator)

If you're interested, my work in this direction is committed to https://github.com/indigodarkwolf/x16-emulator/tree/vera-fast-v1.

@mikey-b (Author) commented Jul 26, 2020

Hi Steven,

This direction
Good effort, but I'm not measuring much benefit on my x86 machine; no worse, either.

I'm not overly bothered about the startup time; it's not that noticeable. But do these changes enable multithreading in the hot paths at all?

Thoughts on direction
The bottlenecks still appear to be the render_layer_line_* functions, and within those it's normally the operations on props->, e.g.:

3.4% uint16_t x_add = (xx * props->bits_per_pixel) >> 3;
5.4% uint8_t col_index = (s >> (props->first_color_pos - ((xx & props->color_fields_max) << props->color_depth))) & props->color_mask;

I don't know enough about the x16 VERA internals: when can these props values change? They surely can't change during a render_layer_line_* call, so I think, even though we would have to duplicate code, we should attempt to break these out and hoist these tests and calculations higher up the call tree.
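
E.g. something like this (just a sketch; the loop bounds and s are assumed from context, and the point is that the compiler usually can't hoist these loads itself because the stores inside the loop may alias props):

/* Hoisted once per line, outside the per-pixel loop: */
const uint8_t bits_per_pixel   = props->bits_per_pixel;
const uint8_t first_color_pos  = props->first_color_pos;
const uint8_t color_fields_max = props->color_fields_max;
const uint8_t color_depth      = props->color_depth;
const uint8_t color_mask       = props->color_mask;

for (uint16_t xx = 0; xx < width; ++xx) {
	const uint16_t x_add = (xx * bits_per_pixel) >> 3;
	const uint8_t  col_index =
	    (s >> (first_color_pos - ((xx & color_fields_max) << color_depth))) & color_mask;
	/* ... */
}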

@indigodarkwolf (Collaborator)

Looks like you focused on the performance of bitmap layers.

Bitmap rendering was almost untouched by my changes, and bitmap layers aren't being prerendered onto a backbuffer like text and tile ones. So I'm not surprised that their performance was not substantially changed. I knew tile and text would be more involved (in particular, if someone replaces a single tile/character on the map then it should only redraw that individual tile/character and not the entire backbuffer), so I focused on those.

Timing of property changes
You're right that those props can't change while rendering a line, or at least that's my understanding of how the hardware behaves. With my branch, a change to those particular properties would invalidate the backbuffer, causing the current buffer and its settings to be cached when we are about to draw the next line, and we'd either retrieve a valid cache entry or expire an old one and redraw the layer from a fresh backbuffer.

The other thing to consider, as well, is that this work is shifting some of the work out of the previous hot path and into the CPU emulation. For instance, when someone executes an instruction to write a byte into VRAM, if that byte touches an active or cached tilemap, we go ahead and immediately update the appropriate backbuffers on those tilemaps (assuming those backbuffers have not been invalidated for other reasons).

Multithreading potential
It would be pretty straightforward to prerender bitmap layers to a backbuffer at this point, though, so I can add that.

When there's a backbuffer to rely on, drawing the backbuffer itself is trivial to multithread, and I'm sure that work on the hot path will be just about as easily threaded.

The main difference is that the blit on the hot path looks like this:

const uint32_t hscale = reg_composer[1];     /* horizontal scale factor        */
uint32_t       xaccum = props->hscroll << 7; /* fixed-point x, 7 fraction bits */
for (uint16_t x = 0; x < hsize; ++x) {
	const uint16_t eff_x = xaccum >> 7;      /* back to whole source pixels */
	layer_line[layer][x] = prerender_line[eff_x & max_buffer_x];
	xaccum += hscale;                        /* step by the scale each pixel */
}

While each backbuffer render looks like this:

for (int y = 0; y < buffer_height; ++y) {
	prerender_layer_line_text(layer, y, props->prerendered_layer_color_index_buffer + (buffer_width * y));
}

It's trivial, then, to get large benefits from inserting an OMP #pragma into the latter:

#pragma omp parallel for
for (int y = 0; y < buffer_height; ++y) {
	prerender_layer_line_text(layer, y, props->prerendered_layer_color_index_buffer + (buffer_width * y));
}

I'm just new enough to OMP to have not yet sussed out the correct way to thread the trivial copy case, with its incrementing xaccum parameter, or whether OMP is smart enough to figure out for itself the appropriate blocks of work, so that threads are not being dispatched for a single trivial a[x] = b[y].
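
One way to remove the loop-carried dependency would be to derive xaccum from x, so that every iteration is independent and OMP can split the range however it likes (a rough, untested sketch against the loop above):

const uint32_t hscale = reg_composer[1];
const uint32_t xbase  = props->hscroll << 7;

#pragma omp parallel for
for (int x = 0; x < hsize; ++x) {  /* int index: friendliest type for OMP */
	const uint16_t eff_x = (uint16_t)((xbase + (uint32_t)x * hscale) >> 7);
	layer_line[layer][x] = prerender_line[eff_x & max_buffer_x];
}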

Now that I'm thinking about it, if we needed to speed up the execution of keeping backbuffers up-to-date from CPU writes to VRAM, that would be pretty easily parallelized, as that's a fairly trivial for-loop as well:

for (int i = 0; i < num_layer_properties_allocd; ++i) {
	if (is_layer_map_addr(i, address)) {
		poke_layer_map(i, address, value);
	}

	if (is_layer_tile_addr(i, address)) {
		poke_layer_tile(i, address, value);
	}
}

Other thoughts
Also, I have not substantially changed sprites, either, and they showed up in my own profiling as a pain point.

My concern was initially centered around animations requiring a much larger cache of sprite settings (which would require a proper hash table or map structure to store them in, unlike the layers where there were few enough I could get away with a linked list). Sprites don't look like tile layers, in the sense of having a base address and asset size, followed by a reference to an asset index. Sprites just have a base address, so a 6-frame animation is, immediately, 6 cache entries. And if it can be h-flipped or v-flipped, that work either goes back into the hot path, or we potentially multiply the number of cache entries to account for flips.

It's something I'll continue to think about.

@indigodarkwolf (Collaborator)

I still need to think about sprites, but I've gone ahead and sped up bitmaps in the same way that I did tile and text layers. The render_line for them is now essentially a blit with horizontal scaling.

I also realized that the compositing step before final color selection could be sped up by only bothering with the arrays that are needed: a single switch with 7 valid options that are all one-liners seems like a no-brainer optimization over copying potentially 4x the data into local stack variables before juggling one into the final array that matters. A little less helpful in the worst case, when sprites and both layers are enabled, but still better to replace branching three times per pixel with a single branch per line.
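
Roughly this shape, for illustration (the flag packing and the composite helpers are hypothetical, not the emulator's actual names):

/* One branch per line: dispatch on which sources are enabled. */
const unsigned enabled = (sprites_on << 2) | (layer1_on << 1) | (layer0_on << 0);
switch (enabled) {
	case 0x1: memcpy(dst, layer0_line, hsize); break;
	case 0x2: memcpy(dst, layer1_line, hsize); break;
	case 0x3: composite2(dst, layer1_line, layer0_line, hsize); break; /* layer1 over layer0 */
	case 0x4: memcpy(dst, sprite_line, hsize); break;
	case 0x5: composite2(dst, sprite_line, layer0_line, hsize); break;
	case 0x6: composite2(dst, sprite_line, layer1_line, hsize); break;
	case 0x7: composite3(dst, sprite_line, layer1_line, layer0_line, hsize); break;
}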

I also wonder if SDL2 could speed up the final step where we replace color indices with actual palette colors. Something to experiment with.

And then there's PS/2
Another thing that's starting to show up in my profiling, depending on my test cases, is the PS/2 emulation. If I run a demo program that's hands-off, the PS/2 emulation basically isn't present in my results (a good thing), but if I run, say, Chase Vault and play a few levels, PS/2 emulation is almost as heavy as the VERA emulation (which makes it seem like a target for optimization).

Something else for me to think about this week.

@mist64 (Collaborator) commented Aug 26, 2020

What is the status of this?

@indigodarkwolf (Collaborator)

So far, using an RPi4, the runtime comparison between r38 and my current branch is 37% for r38 versus 44% for my branch when idling after a boot. Not bad.

But if I have it loop on 10PRINT"HELLO COMMANDERX16!":GOTO10, r38 maintains 37% while my branch drops to 31%. Not ideal, and unless/until this is addressed, I'm not sure I want to submit a PR.

Chase Vault runs at 23% for r38, 33% for my branch.

So, various pros and cons.

Update of what I've been doing and thinking about, specifically.

The majority of the changes in my branch are to maintain backbuffers for the layers and sprites. By pre-drawing all the layer data onto a backbuffer, the actual write is a trivial for-loop. Also, the way that the backbuffer draw occurs is trivial to multithread, as previously discussed. That said, I don't have the multithreading checked in anywhere yet, as I got distracted by chasing other butterflies.

One of the thoughts I hit on was that writes to VRAM are "relatively rare", so in order to make layer and sprite backbuffer operations as fast as possible, I've changed all VRAM writes to essentially perform 4 writes: one each assuming the data is 8bpp, 4bpp, 2bpp, and 1bpp, to parallel VRAM arrays. That way, we don't have to expand windows of the data later for various layer and tile sets. It makes the act of changing layer settings somewhat quicker, and means we don't have to otherwise update or refresh the expanded data behind a layer when a write occurs. I think this was a worthwhile trade-off.
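
Sketched, the write path looks something like this (array names hypothetical; assumes the leftmost pixel sits in the high bits of each byte, per the VERA docs):

#include <stdint.h>

#define VRAM_SIZE 0x20000  /* 128 KB */

static uint8_t vram[VRAM_SIZE];
static uint8_t vram_8bpp[VRAM_SIZE];      /* 1 color index per byte   */
static uint8_t vram_4bpp[VRAM_SIZE * 2];  /* 2 color indices per byte */
static uint8_t vram_2bpp[VRAM_SIZE * 4];  /* 4 color indices per byte */
static uint8_t vram_1bpp[VRAM_SIZE * 8];  /* 8 color indices per byte */

void vram_write(uint32_t addr, uint8_t value) {
	vram[addr] = value;
	vram_8bpp[addr] = value;
	for (int i = 0; i < 2; ++i)
		vram_4bpp[addr * 2 + i] = (value >> (4 - i * 4)) & 0x0F;
	for (int i = 0; i < 4; ++i)
		vram_2bpp[addr * 4 + i] = (value >> (6 - i * 2)) & 0x03;
	for (int i = 0; i < 8; ++i)
		vram_1bpp[addr * 8 + i] = (value >> (7 - i)) & 0x01;
}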

And that's about where I've stalled out in making the VERA faster, with rapidly updating tiles on a layer being the new expensive problem. Potentially a problem for OMP to sort out, if we officially want to go that route.

In my own profiling, PS/2 is the next biggest single hotspot, but I'm not sure it can be easily addressed. I made a small tweak to the ringbuffer implementation in the PS/2 code, since that was a mild hotspot with the for-loop approach of checking whether the ring buffer had room for new data. That optimization is included in the runtime figures up top. Beyond that, I'm seeing time lost to branching, and I just don't know if there's anything to be done about that: I took a stab at refactoring the ps2_port_t struct in the hopes of replacing some branching with bit ops, but it's only a small win: it gets me up to 46% on my RPi4 while idling.

So then the next place to go digging, according to my profiling, is the 65C02 implementation itself. One easy-ish change to reduce needless branching is to pull the BCD logic out of adc and sbc, and instead create bespoke operations which are substituted into the instruction table whenever we execute sed, and then replaced with the non-BCD flavors on cld. But the branching issues in the 65C02 code run a lot deeper than that: each potential status flag change is a branch based on an appropriate conditional statement, and each time the processor reads or writes a value there is a branch to check whether it should take the value from the accumulator instead. But I'm inclined to believe Michael Brown's analyzer when it says that we'd perhaps only get a 10% benefit from substantially reworking all that anyways, especially since any hit to reading or writing an actual memory address is an unavoidable set of branches to deal with I/O, banked RAM, and ROM.
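
The sed/cld substitution could look something like this (a sketch; the table and handler names are hypothetical, and note that PLP and RTI would also need to re-sync the table, since they can restore the D flag behind our back):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*opcode_fn)(void);
extern opcode_fn optable[256];           /* the dispatch table, assumed */

void adc_bin(void); void adc_bcd(void);  /* bespoke handler pairs */
void sbc_bin(void); void sbc_bcd(void);

/* 65C02 opcodes for ADC and SBC across all addressing modes */
static const uint8_t adc_ops[] = { 0x61, 0x65, 0x69, 0x6D, 0x71, 0x72, 0x75, 0x79, 0x7D };
static const uint8_t sbc_ops[] = { 0xE1, 0xE5, 0xE9, 0xED, 0xF1, 0xF2, 0xF5, 0xF9, 0xFD };

/* Called by sed (true) and cld (false). */
static void select_decimal_handlers(bool decimal) {
	for (size_t i = 0; i < sizeof adc_ops; ++i)
		optable[adc_ops[i]] = decimal ? adc_bcd : adc_bin;
	for (size_t i = 0; i < sizeof sbc_ops; ++i)
		optable[sbc_ops[i]] = decimal ? sbc_bcd : sbc_bin;
}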

And that's where I'm at right now.

@lmihalkovic commented Mar 6, 2021

@indigodarkwolf If I may... before slicing and dicing through the video.c implementation, I would first try to mimic more closely the separation of hardware that exists on the real machine.

In real life, the main proc is not bothered at all by any of the work done by the video... it writes to some registers and things happen.. that's all... it is neither blocked nor slowed down... no interactions whatsoever... the emulator should attempt to replicate that model, by all possible means!

in the end... maybe SDL should be running the main loop, and the emulator loop should be running in its own background thread... because that is how the x16 works. it becomes very clear that it works this way when you carve out the SDL code from the VERA code in video.c (I have separate video.c and video_vera.c, with a clean, complete software VERA separate from everything else).

I also cleaned up the memory-related code, and RAM is no longer patchable by whoever and their dog... having done that made it very clear what the real contact points are between the 6502's running ROM code and the rest of the world.

@mikey-b (Author) commented Mar 6, 2021 via email

@lmihalkovic commented Mar 6, 2021

@mikey-b .. well... my cleaned-up code is a lot closer to making it possible.. and it only took a couple of days for a complete newb to do the clean-up. the code works just fine as it is, but the real final test will be running 2 video cards... will try that next

yes, one tick is very small... and it should have NOTHING to do whatsoever with the video behavior :) ... my point exactly... proof that they are conceptually separate? .. IRQs.. VERA can trigger an IRQ on some special conditions (the v/h syncs, the sprite collisions) to completely disrupt the normal flow of the 6502, meaning that they are indeed not synced at all by nature.

...hmmm... WASM... it does have access to worker threads.. no? (I will have to check again.. I made a PoC at work for a CAD thing targeting WASM over a year ago)

@mikey-b (Author) commented Mar 6, 2021 via email

@lmihalkovic commented Mar 6, 2021

@mikey-b the 2 cards are not directly related.. I will just use that step as validation that I did isolate things properly. if I did... then a simple BASIC app can directly poke the second card with zero impact on the main one (the little BALLS example is a good one to start from)

with the contact surface so greatly reduced, I can then try to run all the rendering for card2 inside a separate thread (the debugger will have to be cut off for the moment). the rendering code can do what it wants. hmmm... I just realized that I have to look at the ROM code to see what it does (if anything) with the v/h syncs...

@lmihalkovic

> Personally, the avenue I'm looking into next for optimization is to move VERA work into smaller batches and closer to the point where 65C02 instructions would cause changes [...] Does this make sense, or am I sounding like a madman?

I would stay away from this for now... I am not sure if I am the madman myself, but I see video.c as a CRAZZZZY GEM of an opportunity as it exists today. I know that I make no sense at all if I do not explain what I mean.. but rather than explaining, I want to show it.. and that will take me a little bit more time to get there. in the meantime, I am hoping that nobody destroys this opportunity.

note: I had the same kind of discussion a few years ago with the Arduino team ... trying to make them see that there was sooo much potential in the IDE source code, which they did not seem to be able to see.

@indigodarkwolf (Collaborator) commented Mar 6, 2021

Seems I missed a fair amount of conversation. Please forgive the length and breadth of my reply as I catch up.

@lmihalkovic It may make a certain amount of sense to run the SDL loop on a separate thread from the rest of the machine, but by that I mean divorcing SDL from the VERA behavior as well, so that SDL is snapshotting completed framebuffers and drawing those.

However, I suspect that will buy us very little in the way of performance. The emulator simply does not perform a lot of work through the SDL API - it is essentially just grabbing a framebuffer generated by the VERA, and blitting it to the window. Easy-peasy, this is not our bottleneck.

De-synchronizing the VERA from the CPU by placing them on separate threads will be a non-starter for the emulator, for the simple reason that the official emulator concerns itself deeply with cycle-accurate behavior, from the CPU ops to the PS2 interface to the VERA, and everything in between. I suspect we've already broken that in a minor way with other VERA optimizations, but we can't really know unless or until we have official hardware. Allowing each to spin freely, regardless of the other, will break cycle-accuracy in a major fashion.

And perhaps, instead of spending so much time programming for the emulator, I should be building test applications and asking Frank to run some comparisons between the emulator and the real hardware, so we can improve or restore the accuracy of the emulator.

Or by all means, fork the emulator. In truth, I'm working on my own fork of the emulator that makes a number of rather deep changes, in particular swapping to C++17 so that I can make use of the STL, instead of hand-rolling a bunch of data structures and algorithms I wish to have in trying to tackle further VERA optimizations.

But if you want my opinion of how best to speed up the VERA, the answer is very simple:

1. Multithread the for-loops inside each render_layer_line_* function. These iterative steps are trivial to perform in arbitrary order, and are where the bulk of the VERA spends its time.

2. Multithread the for-loop below "// Calculate color without border." within static void render_line(uint16_t y). Again, trivial to perform in arbitrary order, and the VERA spends substantial time here.

3. Multithread the final framebuffer color assignment and NTSC overscan for-loops in static void render_line(uint16_t y). Again, it is obviously trivial to perform each iteration in arbitrary order.

The only reason I haven't already submitted PRs with those, many months ago, is that I was uncomfortable with adding a dependency on OpenMP without being able to build the project on Mac platforms.

@mikey-b (Author) commented Mar 6, 2021 via email

@indigodarkwolf (Collaborator)

It wasn't order-of-magnitude differences in performance, but yes, I did see improvements from multithreading the various render_layer_line() functions. I believe I also took the liberty of spinning the calls to those functions, themselves, into separate threads, so everything was run in parallel.

Note that my test platform was a Raspberry Pi 4, because I feel that PC-based performance is likely "good enough", and RPi4s were the trending subject on the Facebook group at the time. So optimizations against PC performance are not interesting to me.

Also note that even the CPU emulation is slow on the RPi4, so the VERA is not the only corner of the emulator that needs help if the emulator is to run at full speed on one of those. And that may be asking too much.

The changes I'd experimented with also rolled back the LAYER_PIXELS_PER_ITERATION arrays that... well, I'm going to be honest, I have no idea how they would help in any context. They don't meaningfully improve cacheability, the compiler won't automatically cast them to 32 bits to try and batch operations together with larger type sizes, the arrays aren't vector types and the loops aren't using vector ops... I could go on, but at this point I don't think I would believe any amount of theory-crafting, and I simply need someone to whap me on the nose with data from an instrumented build to be convinced. And then there are the calls to memset when initializing these arrays; I can't immediately recall whether the compiler is optimizing those away, but if it isn't, that's a complete waste of time for arrays of only 4 elements.

@mikey-b (Author) commented Mar 7, 2021 via email

@indigodarkwolf (Collaborator) commented Mar 7, 2021

I broadly agree with all of these.

Additionally, with the current VERA behavior, we could probably exec6502 in increments of 239 cycles and then pump individual instructions until the VERA signals that a new line has started. If there are per-pixel behaviors lurking in the VERA hardware, though, we don't know about them yet, and in the worst case it means the VERA would have to render smaller segments to the framebuffer based on accesses to certain I/O registers.

There would also be open questions about other subsystems like PS/2, which also depend on being pumped every CPU instruction. As part of the work I've done in my own emulator, I've actually largely rewritten the PS/2, just to understand it, and in the process ended up making it substantially faster by making it do fewer checks based on I/O state, and making those checks much faster. My implementation could be back-ported to C, itself a win, but it could also probably be modified to update only when the CPU reads or writes to the appropriate VIA ports, or when the KB/M devices post updates into it. This would represent another win in performance. (And I look at it as an example of how my fork, though it's making some brazen changes, can still create opportunities for me to contribute back to the original emulator once Michael Steil is available again).

I haven't looked at audio subsystems yet, to determine to what extent they may benefit from a similar treatment of only updating state when the CPU accesses their I/O, in the hopes of enabling the CPU to tight-loop as much as possible. And I haven't looked at SPI yet, either, for SD Card access.

@lmihalkovic commented Mar 8, 2021

my turn to play catch-up ..

@indigodarkwolf divorcing VERA from SDL is what I did. I carved out all the VERA code into vera_video.c and added a public API (not yet good) in places (wink wink.. debugger) where some of the rest of the emulator was getting too close to that code. as you said, video.c is now remarkably simple, and update just gets a snapshot of the framebuffer, which it blits out. when I saw that the code was so cleanly written I just couldn't believe the doors that it opened... because they are so completely disjointed (in HW too), it makes all the more sense to me to run it completely OUTSIDE the simulator mainloop (years ago I was working on military sims, and one of the lessons I remember from working with sims of 1M people in a city during a catastrophic event was that the more independent things are, the more degrees of freedom the system keeps).
@indigodarkwolf "we could probably exec6502 in increments of 239 cycles"... I would not.. it should be coming from the card.. the same way interrupts signal the events in the hardware... IMHO.

this SIM is a beautifully written piece of clean code... there are still features missing before going into an aggressive optimization phase, which invariably results in damaging the layer boundaries...

e.g. there is no easy way now for someone to simulate their own extension cards... which will be a key feature of the x16... so before cutting corners, the APIs should be finished and the missing abstractions created/completed. some abstractions are clearly missing... and knowing what they should look like should be a higher priority than optimizing the ones that are currently in the system, because they may have to change in light of what is currently missing.

when all the abstractions are there, then and only then will it be time to slash through them to try to aggressively speed up what can be sped up. you know what Knuth said, right!?

@mikey-b (Author) commented Mar 10, 2021 via email

@indigodarkwolf (Collaborator)

Well, I can at least answer that the VERA has separate RAM from the main RAM; it is not shared.

As for batch-executing the 65C02, I'd just point out that if we have a signal from the VERA module that a new scanline has been produced, then we have at minimum 239 cycles we can execute before the current implementation will care about any changes, within or without the VERA. This is just taking 8,000,000 cycles per second, divided by 60 frames per second, divided by 525 scanlines per frame, minus 14 cycles to account for "slop" from a previous instruction taking us as far as 7 cycles into the scanline, and some "final" instruction taking up to 7 cycles. This is actually probably conservative by 7 cycles, but even so we get at minimum 34 instructions before needing to go back to stepping the CPU by a single instruction and then processing the VERA.

But you're right, it would be ideal to finish working on the core features before doing anything drastic that might blur the boundaries between systems.

For a target, our known platforms are Windows, Mac, Linux, and WebAssembly. I'm not familiar with building WebAssembly, and can't test a Mac build. I don't know how to instrument and profile a WebAssembly build either; haven't even looked into it. The RPi4 runs on Linux and happens to be a platform that has some optimization peculiarities which I believe it has in common with WebAssembly, and since it runs at around 40% speed at present it seems like a platform where it'll be easy to observe concrete gains, even without instrumentation.

@mikey-b (Author) commented Mar 10, 2021 via email

@indigodarkwolf (Collaborator)

The VERA's I/O is mapped to a variety of memory addresses as registers, using parallel communication. See also: https://github.com/commanderx16/x16-docs/blob/master/VERA%20Programmer's%20Reference.md

Communication is processed immediately.

We're not sure whether the VERA latches certain values when starting to draw each scanline, or whether that was a misunderstanding of something Frank once said, which may instead have been referring to a behavior where it draws lines to a backbuffer. The original VERA implementation processed between each CPU instruction and drew individual pixels as the scanline would have swept across the display.

@indigodarkwolf (Collaborator)

If we go with the assumption that the VERA is drawing per-pixel and latches nothing, then we do at least know that the VERA draws to a back buffer, so for instance if you're counting lines from VSYNC, then it draws line 33 (the first visible line) during line 32 (the last line of the back porch), then swaps buffers so that the result of line 33 is presented while line 34 is drawn to the back buffer.

This means we'd have to roll back the optimization that draws entire lines at a time, and it would also make the VERA much less likely to see wins from multithreading, unless we did one of those "blurring the boundaries"-type optimizations where the VERA would only process based on I/O and every so many CPU cycles.

The biggest performance hit from per-pixel processing is the sprite logic. That loops through 128 sprites, and order of processing matters because of how work units are calculated, and doing that task 18 million times per second is... unappealing. But there were obviously other wins to be had from drawing whole lines, due to improved cache behavior. It would be sad to roll that back, as it would seriously hurt the emulator's performance.

@lmihalkovic commented Mar 11, 2021

@indigodarkwolf I know I already said this... but in the real hardware, VERA is independent from the clock cycle of the 6502. Assuming that the software needs to be in lock-step with VERA creates an artificial constraint.

@mikey-b like many of the computers of that era, VERA is a memory-mapped device, where a region of memory (32 bytes) corresponds to 32 registers in the FPGA. the code I cleaned up in memory.c made it possible for me to have a second VERA allocated to a second block of 32 IO ports. there are roughly 2 families of ways in which the FPGA code can be implemented, and they are veeeery different. having no access to the hardware or the people working on it, I cannot comment on which of the 2 approaches has been chosen (I just know what I would have done to 'go fast').

as for what it 'can or cannot do', the VGA spec is ultimately the guide. Yes, to prepare the frame buffer, VERA manages its own local RAM, outside of the main CPU RAM (all GPUs do). that RAM is effectively split into 2: the VRAM and the frame buffer. as the frame buffer is actually small, it is possible that it is implemented using the FPGA's block RAM (otherwise it is also in the RAM chip).

that RAM contains different types of data: the layers, the sprites, the tiles, the active color palette... the frame buffer is what gets bit-banged to the monitor at the pixel clock rate (the value depends on the VGA mode). VGA defined the 'dance', i.e. the time for sending data and the time for sending nothing, so that old monitors could sync their electron guns. this was (and is) all analog signaling (.7v), hence the need for a DAC (you can either use an existing IC or make your own, which is what I think happened on the VERA board).

Performance of such a memory-mapped GPU could be abysmal.. but VERA uses auto-increment (with a user-selectable stride) to improve the situation.
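
For reference, the data-port mechanics reduce to something like this on the emulator side (a sketch; the names are mine, register layout per the VERA Programmer's Reference):

#include <stdint.h>

static uint8_t  vram[0x20000];  /* VERA-local 128 KB             */
static uint32_t io_addr;        /* 17-bit address set via ADDR_x */
static uint32_t io_inc;         /* stride decoded from ADDR_H    */

/* Every access through the data port advances the address by the selected
 * stride, so the CPU can stream data without re-writing ADDR_x each time. */
uint8_t vera_data_read(void) {
	uint8_t v = vram[io_addr & 0x1FFFF];
	io_addr += io_inc;
	return v;
}

void vera_data_write(uint8_t v) {
	vram[io_addr & 0x1FFFF] = v;
	io_addr += io_inc;
}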

so there is no SPI involved... even though in my own little proto I am using a PCF8574 to transfer the 32x1 byte onto an SPI bus (a simple non-FPGA way of doing memory-mapped IO ports). well.. technically there is likely some SPI: if VERA does not have all the memory on the FPGA, then depending on the selected SRAM chip, it is either a parallel or a serial SRAM chip. I can't tell from the proto 1 video if the SRAM is the 2 SMDs or one of the 2 DIP-8s... DIP-8 SRAM would mean SPI to the FPGA.

@indigodarkwolf actually, I was reading the VERA doc again... even though it would make a lot of sense to write the results from the composer to a FB, technically there is nothing in the doc that says it does... but if you think about the amount of operations, it makes sense to have a render phase separate from the bit-banging to the monitor: “Rendering is started one line earlier, so at line 524. Composing the outputs of the renderers (2 layer renderers and 1 sprite renderer) is performed while outputting the pixel data”... the question is.. was that describing the emu (which is line based) or the card? My interest stems from making a HW replica of VERA as a pet project.

@indigodarkwolf (Collaborator) commented Mar 19, 2021

@lmihalkovic I think the subtlety being missed is that just because the VERA runs on its own clock doesn't mean the VERA will spin for, say, 10ms without the CPU being allowed to have any influence on it. Or vice versa, it won't just go to sleep for 10ms while the CPU does stuff and then wake up. But these circumstances are real hazards of multithreading - you don't get to control when the thread executes, only the OS does. Any time you ask the thread to sleep for a while, you're putting your faith in the OS to wake it up in a timely manner, except there's no guarantee of that. Even if you spin-wait instead, there's no guarantee the CPU won't interrupt you for "however long it wants".

So you can't just spin the VERA off onto its own thread and call it done. You have to synchronize it with the CPU on a periodic basis, or the behavior will, in fact, be different from the real hardware. And that periodic basis will basically be "every single CPU instruction, with the VERA being told how many clock cycles elapsed and thus how many ticks it is allowed to process until it must wait for the CPU again". But at that point, you've got all the same overhead as our single-threaded approach now, only you've added the extra weight and complexity of threading.

You might get a performance win with that approach, anyways, but only if the VERA implementation is brought back to a state of calculating output on a per-pixel basis instead of per-line. But that's still a world in which the VERA is essentially in lock-step with the CPU.

@mikey-b (Author) commented Mar 20, 2021 via email

@indigodarkwolf (Collaborator)

Mike, I think you and I are on the same page here. I like that plan.

@mikey-b (Author) commented Mar 20, 2021 via email

@Elektron72 (Contributor)

I would like to know exactly what advantages moving the VERA implementation to C++ would provide. (I have never used C++, so the answer might be obvious.)

@mikey-b (Author) commented Mar 20, 2021 via email

@lmihalkovic

@mikey-b @Elektron72 ... same here... don't get me wrong... C++ is the 2nd language I ever learned (Zortech C++, for the history buffs)... after 6502 assembly.. so I do love C++. I am thinking about moving my version of the sim to C++, but for other reasons: the C definition of mapped hardware registers as plain structs has its own limitations, which C++ overcomes (Dan Saks). but the 'we need metaprogramming' argument flies high above my head at the moment... I think I understand what you mean... constexpr could help in a couple of places... but I am reluctant to say 'sure.. I share your view'.. because I am still not sure you are seeing the gold pot you are sitting on and seem so eager to 'blast a hole into' ;-)

@indigodarkwolf (Collaborator)

Hot damn. I bloody well suspected there was gold to mine in that PS2 and VIA emulation code, tools notwithstanding. I'm seeing substantial wins from PR #336; I would appreciate it if other eyeballs could verify.

@indigodarkwolf (Collaborator)

A kind soul from the unofficial Discord says that their Pi4 is seeing a gain from 61% to 77%. Not full speed like my test, but they're running headless, so they're accessing it through VNC. Still a good win. I wonder what the differences are between their Pi4 environment and mine.

And their Pi3 is apparently running the official repo version at 16%, and my PR at 19%-21%. And probably still through VNC. :)

@mikey-b (Author) commented Mar 30, 2021 via email

@lmihalkovic

@indigodarkwolf yeah ... surely that code is 'gold' ... no worries, I never managed to get the Arduino team to see how they were about to trash their IDE (and they did it) ... so I don't expect I will be more successful this time around. :-)

@mikey-b (Author) commented Mar 31, 2021 via email

@indigodarkwolf (Collaborator)

@mikey-b Thanks. Although there's ambiguity in the intended target, I'm choosing to assume that @lmihalkovic's apparent sarcasm about golden code was directed towards the code powering Raspbian. If it was directed towards my own contributions, I would welcome constructive feedback or, failing that, specific criticisms of what could use improvement. And of course I would be happy to answer questions about my code.

Obviously my contributions aren't perfect: I've personally introduced some bugs into the VERA emulation, specifically, and of course I've opened some questions about how the VERA truly behaves versus optimizations that I've made to try and make the VERA performant.

But also, anyone can write code and submit a PR. Even if we can't see eye-to-eye on certain technical issues, I would still welcome more contributors working to improve the emulator, and it's ultimately up to Michael Steil to decide whose PRs are accepted.

@lmihalkovic

@indigodarkwolf I apologize for the confusion I obviously created: there was, and still is, ZERO sarcasm in what I am saying... life is too short and precious for that. I am as serious as I was when writing something similar to M. Banzi. This codebase is very well written.. and I don't mind admitting that I did not even see its full potential when I proposed #325. it is only after I was done implementing it that my blindness dawned on me. I explained in #331 that I was not going to explain everything in this forum, but that I have no reservations explaining things in a more private setting. my apologies again if my comment had the appearance of sarcasm.

@indigodarkwolf (Collaborator)

@lmihalkovic All good, mate. Let's make everything fast together. :D

@lmihalkovic commented Mar 31, 2021

@indigodarkwolf as I said elsewhere, given a choice, my first goal would be to preserve the unique opportunity which I see for the x16 team within this code, and only then to make it faster... which may be at odds with your own goal. saying this, I am keenly aware that I have done nothing to build any credibility within this group, while the work you have done and shared speaks for you (again, just plain honesty).

... maybe one day I should share my private Arduino IDE repo to show what the Arduino team missed by refactoring their codebase as they did: for years they had said that certain features were not doable at a reasonable cost because there was no file manager in the code... what they never saw was that there was one... or more to the point, how easy it was to make one emerge from the existing code. once I was done doing that, many things became trivial to implement which they had given up on doing (basically feature parity with Arduino Create... I still wonder to this day if they didn't just want to keep a difference with Create in order to push new users towards the web). At the time I did not want to share the code for free, considering how many problems it was solving for their professional development team. In one single 'improvement' refactoring they completely closed the door on many of these long-asked-for improvements.

@mikey-b (Author) commented Apr 2, 2021 via email

@indigodarkwolf (Collaborator)

Yes, I am regularly on the unofficial Discord server; my username there is "Indigo".
https://discord.gg/nS2PqEC
