Feedback from my practical testrun #1

Open
hpingel opened this issue Sep 1, 2023 · 7 comments

Comments

@hpingel

hpingel commented Sep 1, 2023

Dear @onnokort,

Thank you for mentioning my livestream attempts in your README! I am planning to give you some feedback on my test run of semu-c64, but the whole livestreaming effort has taken so much time and energy that I will only get around to it in the next couple of days.

I wanted to start with a table of timestamps, but I haven't had time for that yet.

By the way: if you don't like non-issues being reported under "Issues", you could also enable the "Discussions" tab on GitHub for the more philosophical questions from people interested in this project. But this might only add to the complexity of maintaining it.

Cheers,
@hpingel aka emulaThor

@onnokort
Owner

onnokort commented Sep 1, 2023 via email

@hpingel
Author

hpingel commented Sep 1, 2023

Just four ideas/remarks that are quickly written:

  • In VICE I couldn't use the Delete key to delete a typed character at the prompt. I have to retest with your latest quickstart approach.
  • The C128 has a 2 MHz mode for the MOS 8502 CPU in which the VIC-IIe is unusable, but the VDC 80-column screen can still be used for text output. VDC operations are always painfully slow, but one could maybe have an option to turn log output off until the root login. The assembler code would have to be adapted to the C128 memory layout. I guess that sticking to 64 KB should be easier than doing trickery with the MMU on the 128, but I'm not an expert here.
  • The segfaults/crashes in my videos could simply have been due to me being too sloppy to fine-tune the timings of the RAD Expansion Unit for the machines used. But how do you fine-tune if you have to wait 14 minutes for the first log lines? The testing would take ages.
  • A 16 MB REU never existed as a consumer product in the 80s or 90s. REUs were up to 512 KB in the 80s and there were ways to get a 2 MB REU in the 90s. It might have been possible to produce a 16 MB REU in the 80s, but it wouldn't have been economically feasible: nobody would have been able to afford it, and the space it would have needed would have been outrageous. IMHO the existence of 16 MB REU solutions is due to the enthusiast community "products" that came out from 2010 onwards (icomp TurboChameleon, gideon 1541 Ultimate, Ultimate64). There is just no plain 16 MB REU in any collector's hands to test your project with. Modern rebuilds/emulations exist but are probably even rarer and harder to get than a 512 KB REU from the 80s (Orange Cart by Zeldin, REUPlus2C by Jeff Burrell). IMHO the RAD Expansion Unit by Frenetic (and I am a fanboy but not an influencer here) is the only solution that is easy to build, get and set up. Disclaimer: I am one of Frenetic's testers, so I am always in favour of his projects.
  • EDIT: The CMD RAMLink is expandable up to 16 MB, but I'm not very familiar with it.

@Chlorobyte-but-real

> A 16 MB REU never existed as a consumer product in the 80s or 90s. REUs were up to 512 KB in the 80s and there were ways to get a 2 MB REU in the 90s.

Could the system be stripped down to somehow boot on a mere 2 MB of RAM? If so, not only would we get closer to an 80s Linux experience (90s as opposed to 2010s), but a Commander X16 port might potentially be feasible (its 65C02 CPU goes up to 8 MHz).

@hpingel
Author

hpingel commented Sep 1, 2023

@hpingel
Author

hpingel commented Sep 1, 2023

Coming back to the topic of boot duration:
Going through my livestream recordings, I calculated the time from the start of the semu PRG on the C64 (PAL) to the complete appearance of the log line "buildroot login:". If I didn't miscalculate, it is

38 hours and 43 minutes.

Timestamps are:

  • August 29, 08:51pm CEST - Start booting PRG semu [1]
  • August 31, 11:34am CEST - Complete appearance of "buildroot login:" [2]

[1] In video at 66 minutes after recording start: https://www.youtube.com/watch?v=JBb90jnPGlY
[2] In video 4h10m after recording start: https://www.youtube.com/watch?v=TN9zf7wd3VI
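The arithmetic can be double-checked with a few lines of Python. This is only a sketch based on the two timestamps listed above; the variable names are made up for illustration, and CEST is taken as UTC+2.

```python
from datetime import datetime, timedelta, timezone

# CEST is UTC+2; both timestamps are taken from the list above.
CEST = timezone(timedelta(hours=2), name="CEST")

boot_start = datetime(2023, 8, 29, 20, 51, tzinfo=CEST)    # semu PRG started
login_prompt = datetime(2023, 8, 31, 11, 34, tzinfo=CEST)  # "buildroot login:" complete

elapsed = login_prompt - boot_start
total_minutes = int(elapsed.total_seconds()) // 60
hours, minutes = divmod(total_minutes, 60)

print(f"{hours} hours and {minutes} minutes")  # 38 hours and 43 minutes
```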

@onnokort
Owner

onnokort commented Sep 1, 2023

> In VICE I couldn't use the Delete key to delete a typed character at the prompt. Have to retest with your latest quickstart approach.

Yes, the UART emulation is quite lacking. All I do (if you look in uart.c) is swap upper and lower case for the PETSCII/ASCII translation. It doesn't implement color escape sequences either, which is why ls shows strange characters when listing. Also, this translation is not really 100% compatible with kernalemu either (but that might also be something to fix in kernalemu). PRs welcome :D
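For anyone who wants to pick this up, here is a rough sketch of what a slightly fuller translation could look like, including a mapping for the Delete key. It is not the code in uart.c: the function names are hypothetical, and the PETSCII ranges assume the C64 is switched to its lowercase/uppercase character set.

```python
# Hypothetical sketch, not the code in uart.c. Assumes the C64 is in its
# lowercase/uppercase character set, where keyboard input arrives as
# PETSCII 0x41-0x5A (unshifted letters) and 0xC1-0xDA (shifted letters).

PETSCII_DEL = 0x14          # INST/DEL key
PETSCII_CURSOR_LEFT = 0x9D
ASCII_BACKSPACE = 0x08
ASCII_DEL = 0x7F            # commonly the tty "erase" character


def petscii_to_ascii(code: int) -> int:
    """Translate one keyboard byte from the C64 into ASCII for the guest."""
    if 0x41 <= code <= 0x5A:        # unshifted letter -> ASCII lowercase
        return code + 0x20
    if 0xC1 <= code <= 0xDA:        # shifted letter -> ASCII uppercase
        return code - 0x80
    if code == PETSCII_DEL:         # let the guest shell erase a character
        return ASCII_DEL
    return code                     # everything else passes through unchanged


def ascii_to_petscii(code: int) -> int:
    """Translate one guest output byte into PETSCII for the screen."""
    if 0x61 <= code <= 0x7A:        # ASCII lowercase -> PETSCII 0x41-0x5A
        return code - 0x20
    if 0x41 <= code <= 0x5A:        # ASCII uppercase -> PETSCII 0xC1-0xDA
        return code + 0x80
    if code == ASCII_BACKSPACE:     # "\b" only moves the cursor left
        return PETSCII_CURSOR_LEFT
    return code                     # everything else passes through unchanged


print(bytes(petscii_to_ascii(b) for b in b"\x41\x42\xc3").decode())  # abC
```

Color escape sequences would need a small state machine on the output side and are left out here.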

> I forgot that the 2 MHz mode on the C128 isn't compatible with the REU.

OK, I see. I have no real experience with the C-128, nor with a REU, to be honest: I have ordered one (an Ultimate 64, which is on backorder) but never owned one :) I developed everything by testing on VICE, assuming that VICE would be very close to reality. Which, apparently, it is :D

> EDIT: The CMD RAMLink is expandable up to 16 MB, but I'm not very familiar with it.

The code in this respect is quite modular. Just replace the emulator RAM access functions in reu.c, reu.h with whatever you like and it should work with banked memory or other implementations. Thinking about it, it is probably a good idea then to rename them to something more generic, like "emu_memory" or so.
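To illustrate the idea of a generic memory layer, here is a small Python model. None of these names exist in reu.c or reu.h; `EmuMemory`, `LinearMemory` and `BankedMemory` are hypothetical and only show how the emulator core could stay unchanged while the backing store is swapped.

```python
# Hypothetical sketch: none of these names exist in reu.c / reu.h.
from typing import Protocol


class EmuMemory(Protocol):
    """Byte-level interface the emulator core would program against."""

    def read8(self, addr: int) -> int: ...
    def write8(self, addr: int, value: int) -> None: ...


class LinearMemory:
    """One flat block, addressed linearly like the REU."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)

    def read8(self, addr: int) -> int:
        return self.data[addr]

    def write8(self, addr: int, value: int) -> None:
        self.data[addr] = value & 0xFF


class BankedMemory:
    """The same interface on top of fixed-size banks."""

    def __init__(self, banks: int, bank_size: int) -> None:
        self.bank_size = bank_size
        self.banks = [bytearray(bank_size) for _ in range(banks)]

    def read8(self, addr: int) -> int:
        bank, offset = divmod(addr, self.bank_size)
        return self.banks[bank][offset]

    def write8(self, addr: int, value: int) -> None:
        bank, offset = divmod(addr, self.bank_size)
        self.banks[bank][offset] = value & 0xFF


def write32(mem: EmuMemory, addr: int, value: int) -> None:
    """Little-endian 32-bit store built on the byte interface."""
    for i in range(4):
        mem.write8(addr + i, (value >> (8 * i)) & 0xFF)


def read32(mem: EmuMemory, addr: int) -> int:
    """Little-endian 32-bit load built on the byte interface."""
    return sum(mem.read8(addr + i) << (8 * i) for i in range(4))


for mem in (LinearMemory(0x10000), BankedMemory(banks=4, bank_size=0x4000)):
    write32(mem, 0x3FFE, 0xDEADBEEF)     # straddles a bank boundary
    print(type(mem).__name__, hex(read32(mem, 0x3FFE)))
```

Both backends return the same value for an access that crosses a bank boundary, which is the property a banked implementation would have to preserve.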

> The segfaults/crashes in my videos could simply have been due to me being too sloppy to fine-tune the timings of the RAD Expansion Unit for the machines used. But how do you fine-tune if you have to wait 14 minutes for the first log lines? The testing would take ages.

What would you need for easier tuning?
I think a good idea would be to implement some kind of "emulator status page" showing things like PC etc., triggered by the NMI, so that it doesn't slow down the emulation but would allow for regular status checks ... again, PRs welcome ...

> Could the system be stripped down to somehow boot on a mere 2 MB of RAM?

Good question. I actually had to shrink the guest system to make it fit by tweaking the kernel configuration (which got the Linux kernel down from ~21 MiB to ~5 MiB), but I do not know whether further large optimizations are possible on that front. My goal was to get it running at all. Yes, in earlier times Linux/x86 could boot and run in 2 MB (tomsrtbt etc.), but it might well be that you'd need to port an older version of Linux to RISCV for that to happen. In any case, the REU pretty much is the emulated RISCV memory, meaning that 0x000000 to 0xffffff in REU space maps directly to the "physical" addresses 0x0000_0000 to 0x00ff_ffff in RISCV-32 space. I think it would be a great idea to keep a simple memory map, as this allows cross-checking against the PC implementation, which catches some emulator bugs before testing on the 6502. Furthermore, a simple, direct memory map makes emulated RAM accesses simpler, as no additional translation needs to happen, which will likely help performance.
A great deal of time is spent in the MMU code now. You could either run MMU-less Linux (which I consider cheating) or implement a much better cache in virtual address space, which might then also alleviate any performance penalty of a more complex RISCV <-> REU address translation. It might make sense to modularize the code so that an MMU-less semu can be compiled.
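To make the "cache in virtual address space" idea concrete, here is a small Python model of a two-level Sv32 page walk with a direct-mapped translation cache in front of it. This is a sketch and not semu's MMU code: `walk_sv32` and `TranslationCache` are hypothetical names, permission checks, accessed/dirty bits and misaligned-superpage faults are left out, and a real cache would have to be flushed whenever the guest changes its page tables.

```python
# Sketch of an Sv32 page walk plus a tiny direct-mapped translation cache.
PAGE_SHIFT = 12
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3


def walk_sv32(read32, root_ppn, vaddr):
    """Return the physical address for vaddr, or None on a page fault."""
    vpn1 = (vaddr >> 22) & 0x3FF
    vpn0 = (vaddr >> 12) & 0x3FF
    offset = vaddr & 0xFFF

    pte = read32((root_ppn << PAGE_SHIFT) + vpn1 * 4)
    if not pte & PTE_V:
        return None
    if pte & (PTE_R | PTE_X):                 # leaf at level 1: 4 MiB megapage
        return ((pte >> 20) << 22) | (vpn0 << 12) | offset

    pte = read32(((pte >> 10) << PAGE_SHIFT) + vpn0 * 4)
    if not pte & PTE_V or not pte & (PTE_R | PTE_X):
        return None
    return ((pte >> 10) << PAGE_SHIFT) | offset


class TranslationCache:
    """Direct-mapped cache from virtual page number to physical page number."""

    def __init__(self, walk, entries=64):
        self.walk = walk
        self.entries = entries
        self.tags = [None] * entries
        self.frames = [0] * entries
        self.hits = self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        slot = vpn % self.entries
        if self.tags[slot] == vpn:
            self.hits += 1
        else:
            paddr = self.walk(vaddr & ~0xFFF)  # slow path: full page walk
            if paddr is None:
                return None
            self.misses += 1
            self.tags[slot] = vpn
            self.frames[slot] = paddr >> PAGE_SHIFT
        return (self.frames[slot] << PAGE_SHIFT) | (vaddr & 0xFFF)


# Toy page table: root table in physical page 1, with one megapage entry
# mapping virtual 0xC000_0000 to physical 0x0.
page_table = {0x1000 + 0x300 * 4: PTE_V | PTE_R | PTE_W | PTE_X}
tlb = TranslationCache(lambda va: walk_sv32(lambda a: page_table.get(a, 0), 1, va))

print(hex(tlb.translate(0xC0001234)))  # 0x1234
```

Only the first access to a page takes the slow walk; later accesses to the same page are answered from the cache.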

With the persistence patch I posted yesterday, the last page (4 KiB) of the RISCV memory is used to save CPU/peripheral state, so the memory map is actually not that simple anymore. But I think it especially makes sense to keep the low memory directly mapped to the REU, as simply limiting the memory available to Linux should be enough to keep the guest from misbehaving and overwriting high memory (assuming that Linux honors all memory settings). Linux simply boots from address 0x0, which is the beginning of the REU, and which is then translated to 0xc000_0000 in VM space as soon as the MMU is set up (by Linux itself; the emulator just does what it is supposed to).
The space behind the initcramfs could then also be used for other things, such as a JIT cache or similar.

That all said, there is nothing keeping anyone from adjusting the constants to get semu to run with a smaller, real REU. That should really be all it takes (well, and fixing the persistence check, which is at REU address 0xfff000 now). Then you can load any image you think should work; the emulated machine is actually very, very simple (if you are interested, I can upload the micropython port I did first, which should run as-is and easily in 2 MiB). I initially ported semu because it all looked so simple, and as far as I understand now, the original is a project by a CS prof and his students to teach operating systems? In any case, it is a great fit for the C-64 and very understandable. But I would really like to keep everything so that it compiles on the C64 as well as on PC. Upstream might then benefit as well, and it makes testing a lot easier.
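As a sketch of which constants are involved: with the direct map, the guest physical address is the REU address, and the state page sits in the last 4 KiB. The helper names below are hypothetical and not taken from the semu source; only the 16 MiB size and the 0xfff000 address come from the description above.

```python
PAGE_SIZE = 4096  # 4 KiB, the size of the saved-state page


def state_page_addr(reu_size: int) -> int:
    """REU address of the last page, used for saved CPU/peripheral state."""
    if reu_size % PAGE_SIZE or reu_size < 2 * PAGE_SIZE:
        raise ValueError("REU size must be a multiple of 4 KiB, at least 8 KiB")
    return reu_size - PAGE_SIZE


def phys_to_reu(paddr: int, reu_size: int) -> int:
    """Direct map: guest physical address equals REU address."""
    if not 0 <= paddr < state_page_addr(reu_size):
        raise ValueError(f"guest address {paddr:#x} outside usable RAM")
    return paddr


MiB = 1 << 20
print(hex(state_page_addr(16 * MiB)))  # 0xfff000
print(hex(state_page_addr(2 * MiB)))   # 0x1ff000
```

The guest would then have to be told about correspondingly less RAM, so that it never touches the state page.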

Sure, this project could be extended in myriads of ways, such as loading compressed REU data from floppy, adding drivers for floppy disks, floppies as cache, network devices, VIC graphics etc. The ideas are endless ... :D

But I think what makes most "sense" is to try to speed it up until it might be at least remotely usable with a pre-booted Linux state. With response times of minutes or hours for a simple shell command, I think it just stays a completely unusable curiosity. As I said, maaaaaaybe it is possible to get it to the point where you could edit a text file with the system reacting quickly enough?

The very best bet for that would probably be a native, llvm-mos compiled Linux that does all far memory accesses through MMU-emulating trampoline code. Then it is not really a "100% MMU" anymore, meaning you could of course do anything with native 6502 code. But it would still catch things like NULL-ptr derefs in pure C code and still give you some sort of virtual memory.
I have added a comment to llvm-mos's issue tracker regarding this.
But I probably won't be the person doing that port (at least not all by myself), and I expect it to be a lot more involved than porting semu :D
Though I think that should be remotely and theoretically feasible.

As I wrote elsewhere, Linux RISCV32 uses about 95 megacycles to boot now. 95 megacycles on a 6502 take about one and a half minutes. HOWEVER, the 6502 is just 8 bit and no RISC, coming in at ~3 cycles for each byte that has been processed. So if you just do adds, ands, ors etc., you need at least 15 or so instructions for 32 bit (moving in and out of ZP regs), meaning the 6502 needs at least ~50 cycles for each emulated RISCV instruction, even with perfect binary JIT translation or so (which is a pipedream ...).
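As a rough illustration of where such numbers come from, here is the bare cycle count for a 32-bit add between zero-page locations, using the documented 6502 timings for zero-page addressing. It is a lower bound for the arithmetic alone; fetching and decoding the RISCV instruction and locating the registers come on top.

```python
# 6502 cycle counts for the instructions used (zero-page addressing).
CYCLES = {"CLC": 2, "LDA zp": 3, "ADC zp": 3, "STA zp": 3}


def add32_sequence():
    """Instruction sequence for a 32-bit add, one byte at a time."""
    seq = ["CLC"]                       # clear carry before the lowest byte
    for _ in range(4):                  # four bytes, carry ripples upwards
        seq += ["LDA zp", "ADC zp", "STA zp"]
    return seq


seq = add32_sequence()
instructions = len(seq)
cycles = sum(CYCLES[op] for op in seq)
print(f"{instructions} instructions, {cycles} cycles")  # 13 instructions, 38 cycles
```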

However, currently, the 6502 sits at about 1500 cycles per RISCV INSN, which suggests that massive improvements are still possible. But could you actually use even a 10x faster system? Would you? :D
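These figures can be put side by side with a quick back-of-the-envelope calculation. The sketch below assumes a PAL C64 clock of 985,248 Hz and treats the ~95 megacycles as roughly 95 million guest instructions; cycles lost to REU transfers and interrupts are ignored, so the results are only indicative.

```python
PAL_CLOCK_HZ = 985_248    # C64 PAL CPU clock in Hz
GUEST_INSNS = 95e6        # "about 95 megacycles", assumed ~1 instruction per cycle


def boot_seconds(cycles_per_insn: float) -> float:
    """Estimated boot time for a given 6502 cost per guest instruction."""
    return GUEST_INSNS * cycles_per_insn / PAL_CLOCK_HZ


def implied_cycles_per_insn(seconds: float) -> float:
    """6502 cycles per guest instruction implied by a measured boot time."""
    return seconds * PAL_CLOCK_HZ / GUEST_INSNS


for label, cpi in [("native speed", 1), ("ideal JIT", 50), ("current", 1500)]:
    print(f"{label:>12}: {boot_seconds(cpi) / 3600:6.2f} h")

measured = 38 * 3600 + 43 * 60   # livestream boot time reported above
print(f"implied by livestream: {implied_cycles_per_insn(measured):.0f} cycles/insn")
```

Under these assumptions, 1500 cycles per instruction works out to roughly 40 hours of boot time, and the measured 38 hours 43 minutes corresponds to somewhat under 1500 cycles per instruction.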

Yes, maybe on 8MHz Commander X16 or so, maybe giving you 80x. But then, again, a native LLVM port would make most sense in the long term IMO.
I don't want to discourage anyone, though. This might be a great opportunity for a competition in the "Commodore demoscene". What could be possible in terms of RISCV emulation? Maybe you could also do things like precalculate / prebuild an "acausal cache" from address hit patterns observed (see below) when running the emulator on PC or similar? If I am not mistaken, only very few addresses get most of the accesses in the guest system. Maybe it would make a lot of sense to pull them into the ZP (either manually by precalculation or automatically).
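Finding such hot addresses from a recorded trace is straightforward. The following sketch uses a made-up trace, not real semu data, and `hot_addresses` is a hypothetical helper; it only shows how to measure how much of the traffic a handful of addresses would cover.

```python
from collections import Counter


def hot_addresses(trace, top_n):
    """Return the top_n most accessed addresses and their share of all accesses."""
    counts = Counter(trace)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    coverage = sum(n for _, n in top) / total if total else 0.0
    return [addr for addr, _ in top], coverage


# Synthetic trace: two very hot addresses plus 20 addresses touched once each.
trace = [0x100] * 50 + [0x104] * 30 + list(range(0x2000, 0x2014))

addrs, coverage = hot_addresses(trace, top_n=2)
print([hex(a) for a in addrs], f"{coverage:.0%}")  # ['0x100', '0x104'] 80%
```

If a real trace shows a similarly skewed distribution, the top entries would be the natural candidates for zero-page residency.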

As others have pointed out, there is all kinds of overhead due to the LLVM compilation still. LLVM still routes quite a few accesses through indirect jumps, which seems unnecessary. (It does do an overall great job, however, to avoid that, as far as I can tell with the link time optimizations..). For that, it would probably be a good idea to modularize the code so that the RISCV emulation could run on a single instance of ZP variables (probably using some ugly kludge of lots of #include or so) instead of running on the vm struct in the code. Or to implement it in assembly right away, which I am not too interested in doing personally. I actually like to keep it all modular and if there is assembly code involved, still in a way that it can be swapped with a C implementation.

> Going through my livestream recordings, I calculated the time from the start of the semu PRG on the C64 (PAL) to the complete appearance of the log line "buildroot login:". If I didn't miscalculate, it is 38 hours and 43 minutes.

That is a bit faster than the boot times I calculated from a kernalemu/fake6502 run (~39.4h). So something might be off there, but maybe also on my side. I thought that my figures are actually optimistic, because kernalemu counts no cycles for the I/O routines nor does it emulate the IRQ overhead nor the REU access overhead.

Maybe I should emphasize it here, but you can actually still compile the code for PC by setting "C64=0" in the Makefile. You can then simply boot the absolutely identical images that go into the REU on the resulting PC-semu and they should behave just the same.
Boot time on a recent-ish Ryzen is about 3s, so you can see why that makes testing a lot easier :D

You can also test the 6502 code with the patched kernalemu, which boots in about 8.5 min on a Ryzen, so that also makes it easier than VICE with Alt-W.

But before going into the details of tweaking 6502 assembly or similar, I think it makes sense to take a high level look at general Linux memory access patterns etc. to guide that. I have instrumented the PC code with instruction and memory traces to get a high level picture where it might make sense to optimize the MMU or memory code (on PC as well). Basically, write a huge file with all virtual and physical memory accesses (fetch, load, store) and then run some hacky Python/numpy code on top to calculate hit/miss ratios for different caching schemes etc.
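As an illustration of that kind of analysis, here is a self-contained sketch that replays an address trace through two caching schemes and compares hit ratios. It is not the instrumentation code itself: the trace is synthetic, the function names are made up, and plain Python is used instead of numpy to keep it dependency-free.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # cache whole 4 KiB pages


def direct_mapped_hit_ratio(trace, entries):
    """Each page can live in exactly one slot (page number modulo entries)."""
    tags = [None] * entries
    hits = 0
    for addr in trace:
        page = addr >> PAGE_SHIFT
        slot = page % entries
        if tags[slot] == page:
            hits += 1
        else:
            tags[slot] = page           # evict whatever was in the slot
    return hits / len(trace) if trace else 0.0


def lru_hit_ratio(trace, entries):
    """Fully associative cache that evicts the least recently used page."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        page = addr >> PAGE_SHIFT
        if page in cache:
            hits += 1
            cache.move_to_end(page)     # mark as most recently used
        else:
            if len(cache) >= entries:
                cache.popitem(last=False)
            cache[page] = True
    return hits / len(trace) if trace else 0.0


# Synthetic worst case for the direct-mapped scheme: two pages that
# collide in the same slot are accessed alternately.
trace = [0x0000, 0x4000] * 50

print(f"direct-mapped: {direct_mapped_hit_ratio(trace, 4):.2f}")  # 0.00
print(f"LRU:           {lru_hit_ratio(trace, 4):.2f}")            # 0.98
```

On the 6502 the cheaper lookup of a direct-mapped scheme may still win overall, which is exactly the kind of trade-off a real trace would help to settle.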

I have ideas for a modular semu on PC which have nothing to do with the C-64 at all, going in the direction of fuzzing, and maybe symbolic execution and similar. What kept me from exploring this direction with the other code that is out there was the sheer complexity of the software that people developed to do such things on architectures like x86 etc. I really like to be able to read and understand the code myself, and I like simple, non-complex code. I write that here so that you know where I am coming from.

@hpingel
Author

hpingel commented Sep 3, 2023

Thank you for this detailed response that I can read 10 times over the next months and still gain new thoughts out of I guess. ;-)

I was reading your project README again and wanted to point out that, although it is good to have the changelog at the top of the README, I'm missing a project summary there that includes the RISC and semu explanation. Basically, you could move one or two paragraphs from "Further notes" to the top of the README if you want to make me happy. I know you want PRs, but this is all I can offer ATM. :-)

EDIT: Same applies for the llvm-mos reference in Thanks.
EDIT2: The "Thanks" section itself is the shortest summary of the project and should be at the top. :-)
