
Performance Stats + Instructions #284

Open

mikey-b wants to merge 15 commits into master

Conversation

@mikey-b commented Jul 11, 2020

I have run a profiler over the whole codebase which aims to predict where performance gains may be possible.

[screenshot: profiler charts for the hottest functions]

Each chart says: if you make this section of code x% faster, the whole program will speed up by y%. Unfortunately, all the recommendations are very flat, suggesting that optimisation has come to a head with the current architecture.

What I did find interesting is all the instruction.h suggestions. It has highlighted a branch prediction issue. The chart says that if we can speed up these functions by 10%, the program will also speed up by 10%. I don't believe that is possible, but we can gain 1-2%.

oldpc is currently global, which means it could be aliased; making it local removes the unnecessary memory store. Is the global oldpc safe to remove?

We can also go branchless. Example of optimised code: https://godbolt.org/z/q6Ma4e
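
A rough sketch of the shape I mean, assuming a fake6502-style core where pc, reladdr, status and clockticks6502 are the usual globals (the helper name is mine, not the emulator's):

extern uint16_t pc, reladdr;
extern uint8_t  status;
extern uint32_t clockticks6502;

static inline void branch_if(uint16_t taken) {  /* taken is 0 or 1 */
	const uint16_t oldpc = pc;                  /* local now, not global */
	const uint16_t mask  = (uint16_t)(0 - taken);
	pc += reladdr & mask;                       /* adds 0 or the offset, no branch */
	const uint16_t crossed = ((oldpc ^ pc) & 0xFF00) != 0; /* page cross, compiles to setcc */
	clockticks6502 += taken + (taken & crossed); /* +1 if taken, +1 more on page cross */
}

/* e.g. bcc() becomes: branch_if((status & FLAG_CARRY) == 0); */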

mikey-b added 15 commits May 14, 2020 15:43
Extra compiler flags for better output.

Added max(), min() macros

Heavy Video.c refactoring.
Removed tile palette code, re-added
rebase to indigodarkwolf
Changes to render_line_bitmap and render_line_tile
Remove lambda in render_line_tile as the Mac and WebAsm builds weren't happy.
render_line_tile - Now reads props->tilew for "tile update skipping" step width.
use MHZ constant
The branches cannot be predicted very well; oldpc is also currently globally scoped, requiring a memory read and write, and making it local removes this. (oldpc appears not to be used globally.)
oldpc is now a local stack variable where needed, removing the globally scoped variable
Make bra() instruction branchless
@indigodarkwolf (Collaborator)

Personally, the avenue I'm looking into next for optimization is to move VERA work into smaller batches and closer to the point where 65C02 instructions would cause changes, leaning heavily into added buffers and prerendering assets so that we don't have to expand or calculate a variety of things; we can just blit.

Some of this is very straightforward - every time we make a change to any of the layer properties, we refresh all of that layer's properties in that 65C02 instruction. Obviously we could set a dirty flag instead and only update the layer's properties once, when we need to draw a line. This especially makes sense in the common case of someone changing a bunch of layer properties at once during VBLANK.
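
A minimal sketch of that shape (all names hypothetical, not the emulator's actual API):

#include <stdbool.h>
#include <stdint.h>

static uint8_t layer_regs[2][16];
static bool    layer_dirty[2];

extern void refresh_layer_properties(int layer);  /* the expensive recompute */

/* On a 65C02 write to a layer register: record the byte, defer the refresh. */
void layer_reg_write(int layer, int reg, uint8_t value) {
	layer_regs[layer][reg] = value;
	layer_dirty[layer] = true;
}

/* At the start of a line draw: pay the refresh cost at most once. */
void begin_line(void) {
	for (int layer = 0; layer < 2; ++layer) {
		if (layer_dirty[layer]) {
			refresh_layer_properties(layer);
			layer_dirty[layer] = false;
		}
	}
}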

Some of this seems like it has obvious wins - if we keep a buffer of uncompressed color indices for a layer's tiles, we don't have to expand it on-the-fly for each line, and it's fairly trivial to keep this data in sync and it should require no extra work. This seems like pure savings after the initial expansion of VRAM into a tile image backbuffer (and that initial expensive step would be done only in response to refreshing dirty layer properties).

If the cost of that refresh proves excessive, such as in cases where someone is using line IRQs to swap around layer properties a bunch (though this seems like it would be rare), we could optimize it by keeping a pool of previous settings: paying a little bit of cost on 65C02 instructions that write to VRAM in order to speed up swaps between multiple layer settings, by keeping old settings cached that we can choose to use. Then we just need to find a decent hash for the layer properties, to try and ensure we don't needlessly clobber the cache; and how hard could it be to find a decent scattering of 3 bytes' worth of data?
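
For the hash itself, even a simple multiplicative hash over the three packed bytes would probably scatter well enough (a sketch, assuming a power-of-two cache size):

#include <stdint.h>

#define SETTINGS_CACHE_SIZE 16  /* assumed, power of two */

static inline unsigned layer_settings_hash(uint8_t b0, uint8_t b1, uint8_t b2) {
	uint32_t h = (uint32_t)b0 | ((uint32_t)b1 << 8) | ((uint32_t)b2 << 16);
	h *= 2654435761u;                             /* Knuth's multiplicative constant */
	return (h >> 16) & (SETTINGS_CACHE_SIZE - 1); /* high bits mix best */
}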

The part that's harder to anticipate whether it'll be a win or a loss is prerendering whole layers onto backbuffers, so that line draws only have to deal with scaling and translation. This is something I'd look into last, but potentially gives us the biggest wins on the hot path. It's somewhere I'd be more comfortable working in DirectX rather than SDL2, but hey, opportunity to learn. Though actually, to accomplish this with maximum speed in a cross-platform fashion, it looks like OpenGL would really be the way to go, so that a tilemap could be described as a series of UV coordinates onto a mesh, and we could render it to a backbuffer with a single draw call. Still, SDL2 provides what it calls an accelerated bit-blitting function that can perform horizontal and vertical flips (SDL_RenderCopyEx), so that may prove to be a win.

The way that prerendering a whole layer could easily spiral out of control, though, is if someone is updating the tile data after the layer has been rendered. Perhaps a better approach would be to use SDL2 calls to render individual rows of tiles into a backbuffer, so we're paying for the draw once every 8 or 16 unscaled lines, then can trivially blit individual pixels. Again, unless or until the layer's properties or VRAM data is changed to invalidate the backbuffer.

Does this make sense, or am I sounding like a madman?

@indigodarkwolf (Collaborator)

In related news, my RPi4 arrived recently; the emulator runs at... 36%. Very "oof", much room for improvement. But now I have a platform that isn't a $3500 gaming PC that I can use for testing performance tweaks. And there seems to be no shortage of interest in running the emulator on RPis...

@indigodarkwolf (Collaborator)

Well, that avenue is not a bust, but even once you get the rendering down to just "copy lines from backbuffers and then replace with final palette colors", an RPi4 only goes from 36% to 40%, and there's a substantial "warm-up" time, naturally, from building up the cached data. Still, it's about half of the way to -warp mode's 45% speed (which, in retrospect, I should have tried sooner, but... eh, it was fun to try, anyways).

One thing I really like from my efforts, though, was that the refactor to drawing backbuffers creates trivial opportunities to insert multithreading, and I did some experimenting with OMP. I don't have measured numbers because I knew I wouldn't be comfortable committing any multithreaded code at this point, but OMP substantially reduced the cost of warming up the video cache, not quite to the point where the warm-up cost at boot time was unnoticeable, but on my desktop it got me to running at top speed within a cursor blink on the default console, and the demos and games I threw at it didn't even appear to flinch.

@indigodarkwolf (Collaborator)

If you're interested, my work in this direction is committed to https://github.com/indigodarkwolf/x16-emulator/tree/vera-fast-v1.

@mikey-b (Author) commented Jul 26, 2020

Hi Steven,

This direction
Good effort, but I'm not measuring much benefit on my x86 machine; no worse, either.

I'm not overly bothered about the startup time; it's not that noticeable. But do these changes enable multithreading in the hot paths at all?

Thoughts on direction
The bottlenecks still appear to be the render_layer_line_* functions, and within those it's normally the operations on props->, e.g.:

3.4% uint16_t x_add = (xx * props->bits_per_pixel) >> 3;
5.4% uint8_t col_index = (s >> (props->first_color_pos - ((xx & props->color_fields_max) << props->color_depth))) & props->color_mask;

I don't know enough about the x16 VERA internals: when can these props values change? They surely can't change during a render_layer_line_* call, so I think, even though we would have to duplicate code, we should attempt to break these out and hoist these tests and calculations higher up the call tree.
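
E.g. something like this (just a sketch; the loop bounds and s are assumed from context, and the point is that the compiler usually can't hoist these loads itself because the stores inside the loop may alias props):

/* Hoisted once per line, outside the per-pixel loop: */
const uint8_t bits_per_pixel   = props->bits_per_pixel;
const uint8_t first_color_pos  = props->first_color_pos;
const uint8_t color_fields_max = props->color_fields_max;
const uint8_t color_depth      = props->color_depth;
const uint8_t color_mask       = props->color_mask;

for (uint16_t xx = 0; xx < width; ++xx) {
	const uint16_t x_add = (xx * bits_per_pixel) >> 3;
	const uint8_t  col_index =
	    (s >> (first_color_pos - ((xx & color_fields_max) << color_depth))) & color_mask;
	/* ... */
}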

@indigodarkwolf (Collaborator)

Looks like you focused on the performance of bitmap layers.

Bitmap rendering was almost untouched by my changes, and bitmap layers aren't being prerendered onto a backbuffer like text and tile ones. So I'm not surprised that their performance was not substantially changed. I knew tile and text would be more involved (in particular, if someone replaces a single tile/character on the map then it should only redraw that individual tile/character and not the entire backbuffer), so I focused on those.

Timing of property changes
You're right that those props can't change while rendering a line, or at least that's my understanding of how the hardware behaves. With my branch, a change to those particular properties would invalidate the backbuffer, causing the current buffer and its settings to be cached when we are about to draw the next line, and we'd either retrieve a valid cache entry or expire an old one and redraw the layer from a fresh backbuffer.

The other thing to consider, as well, is that this work is shifting some of the work out of the previous hot path and into the CPU emulation. For instance, when someone executes an instruction to write a byte into VRAM, if that byte touches an active or cached tilemap, we go ahead and immediately update the appropriate backbuffers on those tilemaps (assuming those backbuffers have not been invalidated for other reasons).

Multithreading potential
It would be pretty straightforward to prerender bitmap layers to a backbuffer at this point, though, so I can add that.

When there's a backbuffer to rely on, drawing the backbuffer itself is trivial to multithread, and I'm sure that work on the hot path will be just about as easily threaded.

The main difference is that the blit on the hot path looks like this:

const uint32_t hscale = reg_composer[1];     /* horizontal scale factor        */
uint32_t       xaccum = props->hscroll << 7; /* fixed-point x, 7 fraction bits */
for (uint16_t x = 0; x < hsize; ++x) {
	const uint16_t eff_x = xaccum >> 7;      /* back to whole source pixels */
	layer_line[layer][x] = prerender_line[eff_x & max_buffer_x];
	xaccum += hscale;                        /* step by the scale each pixel */
}

While each backbuffer render looks like this:

for (int y = 0; y < buffer_height; ++y) {
	prerender_layer_line_text(layer, y, props->prerendered_layer_color_index_buffer + (buffer_width * y));
}

It's trivial, then, to get large benefits from inserting an OMP #pragma into the latter:

#pragma omp parallel for
for (int y = 0; y < buffer_height; ++y) {
	prerender_layer_line_text(layer, y, props->prerendered_layer_color_index_buffer + (buffer_width * y));
}

I'm just new enough to OMP to have not yet sussed out the correct way to thread the trivial copy case, with its incrementing xaccum parameter, or whether OMP is smart enough to figure out for itself the appropriate blocks of work, so that threads are not being dispatched for a single trivial a[x] = b[y].
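
One way to remove the loop-carried dependency would be to derive xaccum from x, so that every iteration is independent and OMP can split the range however it likes (a rough, untested sketch against the loop above):

const uint32_t hscale = reg_composer[1];
const uint32_t xbase  = props->hscroll << 7;

#pragma omp parallel for
for (int x = 0; x < hsize; ++x) {  /* int index: friendliest type for OMP */
	const uint16_t eff_x = (uint16_t)((xbase + (uint32_t)x * hscale) >> 7);
	layer_line[layer][x] = prerender_line[eff_x & max_buffer_x];
}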

Now that I'm thinking about it, if we needed to speed up the execution of keeping backbuffers up-to-date from CPU writes to VRAM, that would be pretty easily parallelized, as that's a fairly trivial for-loop as well:

for (int i = 0; i < num_layer_properties_allocd; ++i) {
	if (is_layer_map_addr(i, address)) {
		poke_layer_map(i, address, value);
	}

	if (is_layer_tile_addr(i, address)) {
		poke_layer_tile(i, address, value);
	}
}

Other thoughts
Also, I have not substantially changed sprites, either, and they showed up in my own profiling as a pain point.

My concern was initially centered around animations requiring a much larger cache of sprite settings (which would require a proper hash table or map structure to store them in, unlike the layers where there were few enough I could get away with a linked list). Sprites don't look like tile layers, in the sense of having a base address and asset size, followed by a reference to an asset index. Sprites just have a base address, so a 6-frame animation is, immediately, 6 cache entries. And if it can be h-flipped or v-flipped, that work either goes back into the hot path, or we potentially multiply the number of cache entries to account for flips.

It's something I'll continue to think about.

@indigodarkwolf (Collaborator)

I still need to think about sprites, but I've gone ahead and sped up bitmaps in the same way that I did tile and text layers. The render_line for them is now essentially a blit with horizontal scaling.

I also realized that the compositing step before final color selection could be sped up by only bothering with the arrays that are needed: a single switch with 7 valid options that are all one-liners seems like a no-brainer optimization over copying potentially 4x the data into local stack variables before juggling one into the final array that matters. A little less helpful in the worst case, when sprites and both layers are enabled, but still better to replace branching three times per pixel with a single branch per line.
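
Roughly this shape, for illustration (the flag packing and the composite helpers are hypothetical, not the emulator's actual names):

/* One branch per line: dispatch on which sources are enabled. */
const unsigned enabled = (sprites_on << 2) | (layer1_on << 1) | (layer0_on << 0);
switch (enabled) {
	case 0x1: memcpy(dst, layer0_line, hsize); break;
	case 0x2: memcpy(dst, layer1_line, hsize); break;
	case 0x3: composite2(dst, layer1_line, layer0_line, hsize); break; /* layer1 over layer0 */
	case 0x4: memcpy(dst, sprite_line, hsize); break;
	case 0x5: composite2(dst, sprite_line, layer0_line, hsize); break;
	case 0x6: composite2(dst, sprite_line, layer1_line, hsize); break;
	case 0x7: composite3(dst, sprite_line, layer1_line, layer0_line, hsize); break;
}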

I also wonder if SDL2 could speed up the final step where we replace color indices with actual palette colors. Something to experiment with.

And then there's PS/2
Another thing that's starting to show up in my profiling, depending on my test cases, is the PS/2 emulation. If I run a demo program that's hands-off, the PS/2 emulation basically isn't present in my results (a good thing), but if I run, say, Chase Vault and play a few levels, PS/2 emulation is almost as heavy as the VERA emulation (which makes it seem like a target for optimization).

Something else for me to think about this week.

@mist64 (Collaborator) commented Aug 26, 2020

What is the status of this?

@indigodarkwolf (Collaborator)

So far, using an RPi4, the runtime comparison between r38 and my current branch is 37% for r38 versus 44% for my branch when idling after a boot. Not bad.

But if I have it loop on 10PRINT"HELLO COMMANDERX16!":GOTO10, r38 maintains 37% while my branch drops to 31%. Not ideal, and unless/until this is addressed, I'm not sure I want to submit a PR.

Chase Vault runs at 23% for r38, 33% for my branch.

So, various pros and cons.

Update of what I've been doing and thinking about, specifically.

The majority of the changes in my branch are to maintain backbuffers for the layers and sprites. By pre-drawing all the layer data onto a backbuffer, the actual write is a trivial for-loop. Also, the way that the backbuffer draw occurs is trivial to multithread, as previously discussed. That said, I don't have the multithreading checked in anywhere yet, as I got distracted by chasing other butterflies.

One of the thoughts I hit on was that writes to VRAM are "relatively rare", so in order to make layer and sprite backbuffer operations as fast as possible, I've changed all VRAM writes to essentially perform 4 writes: one each assuming the data is 8bpp, 4bpp, 2bpp, and 1bpp, to parallel VRAM arrays. That way, we don't have to expand windows of the data later for various layer and tile sets. It makes the act of changing layer settings somewhat quicker, and means we don't have to otherwise update or refresh the expanded data behind a layer when a write occurs. I think this was a worthwhile trade-off.
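
Sketched, the write path looks something like this (array names hypothetical; assumes the leftmost pixel sits in the high bits of each byte, per the VERA docs):

#include <stdint.h>

#define VRAM_SIZE 0x20000  /* 128 KB */

static uint8_t vram[VRAM_SIZE];
static uint8_t vram_8bpp[VRAM_SIZE];      /* 1 color index per byte   */
static uint8_t vram_4bpp[VRAM_SIZE * 2];  /* 2 color indices per byte */
static uint8_t vram_2bpp[VRAM_SIZE * 4];  /* 4 color indices per byte */
static uint8_t vram_1bpp[VRAM_SIZE * 8];  /* 8 color indices per byte */

void vram_write(uint32_t addr, uint8_t value) {
	vram[addr] = value;
	vram_8bpp[addr] = value;
	for (int i = 0; i < 2; ++i)
		vram_4bpp[addr * 2 + i] = (value >> (4 - i * 4)) & 0x0F;
	for (int i = 0; i < 4; ++i)
		vram_2bpp[addr * 4 + i] = (value >> (6 - i * 2)) & 0x03;
	for (int i = 0; i < 8; ++i)
		vram_1bpp[addr * 8 + i] = (value >> (7 - i)) & 0x01;
}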

And that's about where I've stalled out in making the VERA faster, with rapidly updating tiles on a layer being the new expensive problem. Potentially a problem for OMP to sort out, if we officially want to go that route.

In my own profiling, PS/2 is the next biggest single hotspot, but I'm not sure it can be easily addressed. I made a small tweak to the ringbuffer implementation in the PS/2 code, since that was a mild hotspot with the for-loop approach of checking whether the ring buffer had room for new data. That optimization is included in the runtime figures up top. Beyond that, I'm seeing time lost to branching, and I just don't know if there's anything to be done about that: I took a stab at refactoring the ps2_port_t struct in the hopes of replacing some branching with bit ops, but it's only a small win: it gets me up to 46% on my RPi4 while idling.

So then the next place to go digging, according to my profiling, is the 65C02 implementation itself. One easy-ish change to reduce needless branching is to pull the BCD logic out of adc and sbc, and instead create bespoke operations which are substituted into the instruction table whenever we execute sed, and then replaced with the non-BCD flavors on cld. But the branching issues in the 65C02 code run a lot deeper than that: each potential status flag change is a branch based on an appropriate conditional statement, and each time the processor reads or writes a value there is a branch to check whether it should take the value from the accumulator instead. But I'm inclined to believe Michael Brown's analyzer when it says that we'd perhaps only get a 10% benefit from substantially reworking all that anyways, especially since any hit to reading or writing an actual memory address is an unavoidable set of branches to deal with I/O, banked RAM, and ROM.
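
The sed/cld substitution could look something like this (a sketch; the table and handler names are hypothetical, and note that PLP and RTI would also need to re-sync the table, since they can restore the D flag behind our back):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*opcode_fn)(void);
extern opcode_fn optable[256];           /* the dispatch table, assumed */

void adc_bin(void); void adc_bcd(void);  /* bespoke handler pairs */
void sbc_bin(void); void sbc_bcd(void);

/* 65C02 opcodes for ADC and SBC across all addressing modes */
static const uint8_t adc_ops[] = { 0x61, 0x65, 0x69, 0x6D, 0x71, 0x72, 0x75, 0x79, 0x7D };
static const uint8_t sbc_ops[] = { 0xE1, 0xE5, 0xE9, 0xED, 0xF1, 0xF2, 0xF5, 0xF9, 0xFD };

/* Called by sed (true) and cld (false). */
static void select_decimal_handlers(bool decimal) {
	for (size_t i = 0; i < sizeof adc_ops; ++i)
		optable[adc_ops[i]] = decimal ? adc_bcd : adc_bin;
	for (size_t i = 0; i < sizeof sbc_ops; ++i)
		optable[sbc_ops[i]] = decimal ? sbc_bcd : sbc_bin;
}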

And that's where I'm at right now.

@lmihalkovic commented Mar 6, 2021

@indigodarkwolf If I may... before slicing and dicing through the video.c implementation, I would first try to mimic more closely the separation of hardware that exists on the real machine.

In real life, the main proc is not bothered at all by any of the work done by the video... it writes to some registers and things happen.. that's all... it is neither blocked nor slowed down... no interactions whatsoever... the emulator should attempt to replicate that model, by all possible means!

in the end... maybe SDL should be running the main loop, and the emulator loop should be running in its own background thread... because that is how the x16 works. it becomes very clear that it works this way when you carve out the SDL code from the VERA code in video.c (I have separate video.c and video_vera.c, with a clean, complete software VERA separate from everything else).

I also cleaned up the memory-related code, and RAM is no longer patchable by whoever and their dog... having done that made it very clear what the real contact points are between the 6502's running ROM code and the rest of the world.

@mikey-b (Author) commented Mar 6, 2021 via email

@lmihalkovic commented Mar 6, 2021

@mikey-b .. well... my cleaned-up code is a lot closer to making it possible.. and it only took a couple of days for a complete newb to do the clean-up. the code works just fine as it is, but the real final test will be running 2 video cards... will try that next

yes, one tick is very small... and it should have NOTHING to do whatsoever with the video behavior :) ... my point exactly... proof that they are conceptually separate? .. IRQs.. VERA can trigger an IRQ on some special conditions (the v/h syncs, the sprite collisions) to completely disrupt the normal flow of the 6502, meaning that they are indeed not synced at all by nature.

...hmmm... WASM... it does have access to worker threads.. no? (I will have to check again.. I made a PoC at work for a CAD thing targeting WASM over a year ago)

@mikey-b (Author) commented Mar 6, 2021 via email

@lmihalkovic commented Mar 6, 2021

@mikey-b the 2 cards are not directly related.. I will just use that step as validation that I did isolate things properly. if I did... then a simple BASIC app can directly poke the second card with zero impact on the main one (the little BALLS example is a good one to start from)

with the contact surface so greatly reduced, I can then try to run all the rendering for card2 inside a separate thread (the debugger will have to be cut off for the moment). the rendering code can do what it wants. hmmm... I just realized that I have to look at the ROM code to see what it does (if anything) with the v/h syncs...

@lmihalkovic

> Personally, the avenue I'm looking into next for optimization is to move VERA work into smaller batches and closer to the point where 65C02 instructions would cause changes [...] Does this make sense, or am I sounding like a madman?

I would stay away from this for now... I am not sure if I am the madman myself, but I see video.c as a CRAZZZZY GEM of an opportunity as it exists today. I know that I make no sense at all if I do not explain what I mean.. but rather than explaining, I want to show it.. and that will take me a little bit more time to get there. in the meantime, I am hoping that nobody destroys this opportunity.

note: I had the same kind of discussion a few years ago with the Arduino team ... trying to make them see that there was sooo much potential in the IDE source code, which they did not seem to be able to see.

@indigodarkwolf (Collaborator) commented Mar 6, 2021

Seems I missed a fair amount of conversation. Please forgive the length and breadth of my reply as I catch up.

@lmihalkovic It may make a certain amount of sense to run the SDL loop on a separate thread from the rest of the machine, but by that I mean divorcing SDL from the VERA behavior as well, so that SDL is snapshotting completed framebuffers and drawing those.

However, I suspect that will buy us very little in the way of performance. The emulator simply does not perform a lot of work through the SDL API - it is essentially just grabbing a framebuffer generated by the VERA, and blitting it to the window. Easy-peasy, this is not our bottleneck.

De-synchronizing the VERA from the CPU by placing them on separate threads will be a non-starter for the emulator, for the simple reason that the official emulator concerns itself deeply with cycle-accurate behavior, from the CPU ops to the PS2 interface to the VERA, and everything in between. I suspect we've already broken that in a minor way with other VERA optimizations, but we can't really know unless or until we have official hardware. Allowing each to spin freely, regardless of the other, will break cycle-accuracy in a major fashion.

And perhaps, instead of spending so much time programming for the emulator, I should be building test applications and asking Frank to run some comparisons between the emulator and the real hardware, so we can improve or restore the accuracy of the emulator.

Or by all means, fork the emulator. In truth, I'm working on my own fork of the emulator that makes a number of rather deep changes, in particular swapping to C++17 so that I can make use of the STL, instead of hand-rolling a bunch of data structures and algorithms I wish to have in trying to tackle further VERA optimizations.

But if you want my opinion of how best to speed up the VERA, the answer is very simple:

1. Multithread the for-loops inside each render_layer_line_* function. These iterative steps are trivial to perform in arbitrary order, and are where the bulk of the VERA spends its time.

2. Multithread the for-loop below "// Calculate color without border." within static void render_line(uint16_t y). Again, trivial to perform in arbitrary order, and the VERA spends substantial time here.

3. Multithread the final framebuffer color assignment and NTSC overscan for-loops in static void render_line(uint16_t y). Again, it is obviously trivial to perform each iteration in arbitrary order.

The only reason I haven't already submitted PRs with those, many months ago, is that I was uncomfortable with adding a dependency on OpenMP without being able to build the project on Mac platforms.

@mikey-b (Author) commented Mar 6, 2021 via email

@indigodarkwolf (Collaborator)

It wasn't order-of-magnitude differences in performance, but yes, I did see improvements from multithreading the various render_layer_line() functions. I believe I also took the liberty of spinning the calls to those functions, themselves, into separate threads, so everything was run in parallel.

Note that my test platform was a Raspberry Pi 4, because I feel that PC-based performance is likely "good enough", and RPi4s were the trending subject on the Facebook group at the time. So optimizations against PC performance are not interesting to me.

Also note that even the CPU emulation is slow on the RPi4, so the VERA is not the only corner of the emulator that needs help if the emulator is to run at full speed on one of those. And that may be asking too much.

The changes I'd experimented with also rolled back the LAYER_PIXELS_PER_ITERATION arrays that... well, I'm going to be honest, I have no idea how they would help in any context. They don't meaningfully improve cacheability, the compiler won't automatically cast them to 32 bits to try and batch operations together with larger type sizes, the arrays aren't vector types and the loops aren't using vector ops... I could go on, but at this point I don't think I would believe any amount of theory-crafting, and I simply need someone to whap me on the nose with data from an instrumented build to be convinced. And then there are the calls to memset when initializing these arrays; I can't immediately recall whether the compiler is optimizing those away, but if it isn't, that's a complete waste of time for arrays of only 4 elements.

@mikey-b (Author) commented Mar 7, 2021 via email

@indigodarkwolf (Collaborator) commented Mar 7, 2021

I broadly agree with all of these.

Additionally, with the current VERA behavior, we could probably exec6502 in increments of 239 cycles and then pump individual instructions until the VERA signals that a new line has started. If there are per-pixel behaviors lurking in the VERA hardware, though, we don't know about them yet, and in the worst case it means the VERA would have to render smaller segments to the framebuffer based on accesses to certain I/O registers.

There would also be open questions about other subsystems like PS/2, which also depend on being pumped every CPU instruction. As part of the work I've done in my own emulator, I've actually largely rewritten the PS/2, just to understand it, and in the process ended up making it substantially faster by making it do fewer checks based on I/O state, and making those checks much faster. My implementation could be back-ported to C, itself a win, but it could also probably be modified to update only when the CPU reads or writes to the appropriate VIA ports, or when the KB/M devices post updates into it. This would represent another win in performance. (And I look at it as an example of how my fork, though it's making some brazen changes, can still create opportunities for me to contribute back to the original emulator once Michael Steil is available again).

I haven't looked at audio subsystems yet, to determine to what extent they may benefit from a similar treatment of only updating state when the CPU accesses their I/O, in the hopes of enabling the CPU to tight-loop as much as possible. And I haven't looked at SPI yet, either, for SD Card access.

@lmihalkovic commented Mar 8, 2021

my turn to play catch-up ..

@indigodarkwolf divorcing VERA from SDL is what I did. I carved out all the VERA code into vera_video.c and added a public API (not yet good) in places (wink wink.. debugger) where some of the rest of the emulator was getting too close to that code. as you said, video.c is now remarkably simple, and update just gets a snapshot of the framebuffer, which it blits out. when I saw that the code was so cleanly written I just couldn't believe the doors that it opened... because they are so completely disjointed (in HW too), it makes all the more sense to me to run it completely OUTSIDE the simulator mainloop (years ago I was working on military sims, and one of the lessons I remember from working with sims of 1M people in a city during a catastrophic event was that the more independent things are, the more degrees of freedom the system keeps).
@indigodarkwolf "we could probably exec6502 in increments of 239 cycles"... I would not.. it should be coming from the card.. the same way interrupts signal the events in the hardware... IMHO.

this SIM is a beautifully written piece of clean code... there are still features missing before going into an aggressive optimization phase, which invariably results in damaging the layer boundaries...

e.g. there is no easy way now for someone to simulate their own extension cards... which will be a key feature of the x16... so before cutting corners, the APIs should be finished and the missing abstractions created/completed. some abstractions are clearly missing... and knowing what they should look like should be a higher priority than optimizing the ones that are currently in the system, because they may have to change in light of what is currently missing.

when all the abstractions are there, then and only then will it be time to slash through them to try to aggressively speed up what can be sped up. you know what Knuth said, right!?

@mikey-b (Author) commented Mar 10, 2021 via email

@indigodarkwolf (Collaborator)

Well, I can at least answer that the VERA has separate RAM from the main RAM; it is not shared.

As for batch-executing the 65C02, I'd just point out that if we have a signal from the VERA module that a new scanline has been produced, then we have at minimum 239 cycles we can execute before the current implementation will care about any changes, within or without the VERA. This is just taking 8,000,000 cycles per second, divided by 60 frames per second, divided by 525 scanlines per frame, minus 14 cycles to account for "slop" from a previous instruction taking us as far as 7 cycles into the scanline, and some "final" instruction taking up to 7 cycles. This is actually probably conservative by 7 cycles, but even so we get at minimum 34 instructions before needing to go back to stepping the CPU by a single instruction and then processing the VERA.

But you're right, it would be ideal to finish working on the core features before doing anything drastic that might blur the boundaries between systems.

For a target, our known platforms are Windows, Mac, Linux, and WebAssembly. I'm not familiar with building WebAssembly, and can't test a Mac build. I don't know how to instrument and profile a WebAssembly build either; haven't even looked into it. The RPi4 runs on Linux and happens to be a platform that has some optimization peculiarities which I believe it has in common with WebAssembly, and since it runs at around 40% speed at present it seems like a platform where it'll be easy to observe concrete gains, even without instrumentation.

@mikey-b (Author) commented Mar 10, 2021 via email

@indigodarkwolf (Collaborator)

The VERA's I/O is mapped to a variety of memory addresses as registers, using parallel communication. See also: https://github.com/commanderx16/x16-docs/blob/master/VERA%20Programmer's%20Reference.md

Communication is processed immediately.

We're not sure whether the VERA latches certain values when starting to draw each scanline, or whether that was a misunderstanding of something Frank once said, which may instead have been referring to a behavior where it draws lines to a backbuffer. The original VERA implementation processed between each CPU instruction and drew individual pixels as the scanline would have swept across the display.

@indigodarkwolf (Collaborator)

If we go with the assumption that the VERA is drawing per-pixel and latches nothing, then we do at least know that the VERA draws to a back buffer, so for instance if you're counting lines from VSYNC, then it draws line 33 (the first visible line) during line 32 (the last line of the back porch), then swaps buffers so that the result of line 33 is presented while line 34 is drawn to the back buffer.

This means we'd have to roll back the optimization that draws entire lines at a time, and it would also make the VERA much less likely to see wins from multithreading, unless we did one of those "blurring the boundaries"-type optimizations where the VERA would only process based on I/O and every so many CPU cycles.

The biggest performance hit from per-pixel processing is the sprite logic. That loops through 128 sprites, and order of processing matters because of how work units are calculated, and doing that task 18 million times per second is... unappealing. But there were obviously other wins to be had from drawing whole lines, due to improved cache behavior. It would be sad to roll that back, as it would seriously hurt the emulator's performance.

@lmihalkovic commented Mar 11, 2021

@indigodarkwolf I know I already said this... but in the real hardware, VERA is independent from the clock cycle of the 6502. Assuming that the software needs to be in lock-step with VERA creates an artificial constraint.

@mikey-b like many of the computers of that era, VERA is a memory-mapped device, where a region of memory (32 bytes) corresponds to 32 registers in the FPGA. the code I cleaned up in memory.c made it possible for me to have a second VERA allocated to a second block of 32 IO ports. there are roughly 2 families of ways in which the FPGA code can be implemented, and they are veeeery different. having no access to the hardware or the people working on it, I cannot comment on which of the 2 approaches has been chosen (I just know what I would have done to 'go fast').

as for what it 'can or cannot do', the VGA spec is ultimately the guide. Yes, to prepare the frame buffer, VERA manages its own local RAM, outside of the main CPU RAM (all GPUs do). that RAM is effectively split into 2: the VRAM and the frame buffer. as the frame buffer is actually small, it is possible that it is implemented using the FPGA's block RAM (otherwise it is also in the RAM chip).

that RAM contains different types of data: the layers, the sprites, the tiles, the active color palette... the frame buffer is what gets bit-banged to the monitor at the pixel clock rate (the value depends on the VGA mode). VGA defined the 'dance', i.e. the time for sending data and the time for sending nothing, so that old monitors could sync their electron guns. this was (and is) all analog signaling (.7v), hence the need for a DAC (you can either use an existing IC or make your own, which is what I think happened on the VERA board).

Performance of such a memory-mapped GPU could be abysmal.. but VERA uses auto-increment (with a user-selectable stride) to improve the situation.
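
For reference, the data-port mechanics reduce to something like this on the emulator side (a sketch; the names are mine, register layout per the VERA Programmer's Reference):

#include <stdint.h>

static uint8_t  vram[0x20000];  /* VERA-local 128 KB             */
static uint32_t io_addr;        /* 17-bit address set via ADDR_x */
static uint32_t io_inc;         /* stride decoded from ADDR_H    */

/* Every access through the data port advances the address by the selected
 * stride, so the CPU can stream data without re-writing ADDR_x each time. */
uint8_t vera_data_read(void) {
	uint8_t v = vram[io_addr & 0x1FFFF];
	io_addr += io_inc;
	return v;
}

void vera_data_write(uint8_t v) {
	vram[io_addr & 0x1FFFF] = v;
	io_addr += io_inc;
}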

so there is no SPI involved... even though in my own little proto I am using a PCF8574 to transfer the 32x1 byte onto an SPI bus (a simple non-FPGA way of doing memory-mapped IO ports). well.. technically there is likely some SPI: if VERA does not have all the memory on the FPGA, then depending on the selected SRAM chip, it is either a parallel or a serial SRAM chip. I can't tell from the proto 1 video if the SRAM is the 2 SMDs or one of the 2 DIP-8s... DIP-8 SRAM would mean SPI to the FPGA.

@indigodarkwolf actually, I was reading the VERA doc again... even though it would make a lot of sense to write the results from the composer to a FB, technically there is nothing in the doc that says it does... but if you think about the amount of operations, it makes sense to have a render phase separate from the bit-banging to the monitor: “Rendering is started one line earlier, so at line 524. Composing the outputs of the renderers (2 layer renderers and 1 sprite renderer) is performed while outputting the pixel data”... the question is.. was that describing the emu (which is line based) or the card? My interest stems from making a HW replica of VERA as a pet project.

@indigodarkwolf (Collaborator) commented Mar 19, 2021

@lmihalkovic I think the subtlety being missed is that just because the VERA runs on its own clock doesn't mean the VERA will spin for, say, 10ms without the CPU being allowed to have any influence on it. Or vice versa, it won't just go to sleep for 10ms while the CPU does stuff and then wake up. But these circumstances are real hazards of multithreading - you don't get to control when the thread executes, only the OS does. Any time you ask the thread to sleep for a while, you're putting your faith in the OS to wake it up in a timely manner, except there's no guarantee of that. Even if you spin-wait instead, there's no guarantee the CPU won't interrupt you for "however long it wants".

So you can't just spin the VERA off onto its own thread and call it done. You have to synchronize it with the CPU on a periodic basis, or the behavior will, in fact, be different from the real hardware. And that periodic basis will basically be "every single CPU instruction, with the VERA being told how many clock cycles elapsed and thus how many ticks it is allowed to process until it must wait for the CPU again". But at that point, you've got all the same overhead as our single-threaded approach now, only you've added the extra weight and complexity of threading.

You might get a performance win with that approach, anyways, but only if the VERA implementation is brought back to a state of calculating output on a per-pixel basis instead of per-line. But that's still a world in which the VERA is essentially in lock-step with the CPU.

@mikey-b (Author) commented Mar 20, 2021 via email

@indigodarkwolf (Collaborator)

Mike, I think you and I are on the same page here. I like that plan.

@mikey-b (Author) commented Mar 20, 2021 via email

@Elektron72 (Contributor)

I would like to know exactly what advantages moving the VERA implementation to C++ would provide. (I have never used C++, so the answer might be obvious.)

@mikey-b (Author) commented Mar 20, 2021 via email

@lmihalkovic

@mikey-b @Elektron72 ... same here... don't get me wrong... C++ is the 2nd language I ever learned (Zortech C++, for the history buffs)... after 6502 assembly.. so I do love C++. I am thinking about moving my version of the sim to C++, but for other reasons: the C definition of mapped hardware registers as plain structs has its own limitations, which C++ overcomes (Dan Saks). but the 'we need metaprogramming' argument flies high above my head at the moment... I think I understand what you mean... constexpr could help in a couple of places... but I am reluctant to say 'sure.. I share your view'.. because I am still not sure you are seeing the gold pot you are sitting on and seem so eager to 'blast a hole into' ;-)

@indigodarkwolf (Collaborator)

Hot damn. I bloody well suspected there was gold to mine in that PS2 and VIA emulation code, tools notwithstanding. I'm seeing substantial wins from PR #336; I would appreciate it if other eyeballs could verify.

@indigodarkwolf (Collaborator)

A kind soul from the unofficial Discord says that their Pi4 is seeing a gain from 61% to 77%. Not full speed like my test, but they're running headless, so they're accessing it through VNC. Still a good win. I wonder what the differences are between their Pi4 environment and mine.

And their Pi3 is apparently running the official repo version at 16%, and my PR at 19%-21%. And probably still through VNC. :)

@mikey-b (Author) commented Mar 30, 2021 via email

@lmihalkovic

@indigodarkwolf yeah ... surely that code is 'gold' ... no worries, I never managed to get the Arduino team to see how they were about to trash their IDE (and they did it) ... so I don't expect I will be more successful this time around. :-)

@mikey-b (Author) commented Mar 31, 2021 via email

@indigodarkwolf (Collaborator)

@mikey-b Thanks. Although there's ambiguity in the intended target, I'm choosing to assume that @lmihalkovic's apparent sarcasm about golden code was directed towards the code powering Raspbian. If it was directed towards my own contributions, I would welcome constructive feedback or, failing that, specific criticisms of what could use improvement. And of course I would be happy to answer questions about my code.

Obviously my contributions aren't perfect: I've personally introduced some bugs into the VERA emulation, specifically, and of course I've opened some questions about how the VERA truly behaves versus optimizations that I've made to try and make the VERA performant.

But also, anyone can write code and submit a PR. Even if we can't see eye-to-eye on certain technical issues, I would still welcome more contributors working to improve the emulator, and it's ultimately up to Michael Steil to decide whose PRs are accepted.

@lmihalkovic

@indigodarkwolf I apologize for the confusion I obviously created: there was, and still is, ZERO sarcasm in what I am saying... life is too short and precious for that. I am as serious as I was when writing something similar to M. Banzi. This codebase is very well written.. and I don't mind admitting that I did not even see its full potential when I proposed #325. it is only after I was done implementing it that my blindness dawned on me. I explained in #331 that I was not going to explain everything in this forum, but that I have no reservations explaining things in a more private setting. my apologies again if my comment had the appearance of sarcasm.

@indigodarkwolf (Collaborator)

@lmihalkovic All good, mate. Let's make everything fast together. :D

@lmihalkovic commented Mar 31, 2021

@indigodarkwolf as I said elsewhere, given a choice, my first goal would be to preserve the unique opportunity which I see for the x16 team within this code, and only then to make it faster... which may be at odds with your own goal. saying this, I am keenly aware that I have done nothing to build any credibility within this group, while the work you have done and shared speaks for you (again, just plain honesty).

... maybe one day I should share my private Arduino IDE repo to show what the Arduino team missed by refactoring their codebase as they did: for years they had said that certain features were not doable at a reasonable cost because there was no file manager in the code... what they never saw was that there was one... or more to the point, how easy it was to make one emerge from the existing code. once I was done doing that, many things became trivial to implement which they had given up on doing (basically feature parity with Arduino Create... I still wonder to this day if they didn't just want to keep a difference with Create in order to push new users towards the web). At the time I did not want to share the code for free, considering how many problems it was solving for their professional development team. In one single 'improvement' refactoring they completely closed the door on many of these long-asked-for improvements.

@mikey-b (Author) commented Apr 2, 2021 via email

@indigodarkwolf (Collaborator)

Yes, I am regularly on the unofficial Discord server; my username there is "Indigo".
https://discord.gg/nS2PqEC
