Port PCRE2 JIT to Linux on IBMz (s390x) #89
@zherczeg Do you know of someone in the PCRE2 JIT community who would like to work on this issue?
Perhaps @carenas might be interested.
The development branch s390x shows some encouraging progress (only in the cloud, though, not in qemu):
Functionality-wise, pcre2 itself fails (even without JIT) and will need work: pcre2test segfaults with anything that touches invalid UTF (e.g., not including a commented-out test that will segfault for the JIT part):
And, as zherczeg pointed out, it will likely need a significant amount of work to get close to that performance objective.
I can't access the branch created by carenas. Is that the same as the branch from linux-on-ibm-z? How much effort do you estimate to make PCRE2 functional? How much effort to optimize it for the platform? What amount of bounty would make it interesting, probably split into sub-milestones?
My bad, fixed the link. It is mostly the linux-on-ibm-z code, but with a few fixes on top to fight the bitrot, so it will build on top of the current tree for further development.
It is too early to know, but the original code didn't support an FPU and will need put_label support added before it can work with current PCRE2 (including bad UTF-8). So the sooner we can get it into a shape good enough for merging (even if it doesn't perform well), the better, IMHO, to make sure it will not bitrot further.
IBM can redirect the current bounty towards the basic enablement. We would appreciate guidance from the PCRE2 community on a plan and a sizing for the various steps: basic enablement suitable for merging, and then further optimization to achieve optimal performance.
@edelsohn: is instruction cache coherency an issue with Z? The original code went to great lengths to avoid clearing the instruction cache when updating code is required (e.g., when a jump address or constant is updated, and now when put_labels are updated). Even though it uses a somewhat independent pool to keep those values, that pool is still in the same memory segment as the rest of the code and is therefore likely to be cached together with the instructions, AFAIK.

On the positive side, while it doesn't yet fully work (because of put_labels), at least it doesn't segfault.
Thanks for the progress. IBMz has a fairly strong memory consistency model, like x86. Are you experiencing unusual behavior or just checking? I'm not aware of other self-modifying IBMz code explicitly flushing / clearing caches. |
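For illustration, here is a minimal C sketch (names hypothetical, not the actual sljit code) of the conservative pattern under discussion: flush the patched range explicitly after every in-place update. On strongly coherent targets the compiler builtin typically reduces to little or nothing, so the call is cheap insurance.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer standing in for a JIT code region. */
static char code_buf[64];

/* After patching generated code in place (e.g. updating a jump target
   or a put_label constant), a conservative port flushes the affected
   range so the CPU cannot keep executing stale instructions.
   GCC/Clang provide __builtin___clear_cache for exactly this; on
   strongly ordered targets it may compile to nothing. */
static void patch_and_flush(char *start, size_t len, const char *bytes)
{
    memcpy(start, bytes, len);                    /* patch the "code" */
    __builtin___clear_cache(start, start + len);  /* flush i-cache range */
}
```

Whether s390x actually needs the flush is exactly the open question above; keeping the call unconditional is the safe default until that is settled.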
Just checking, and hoping to have a (likely incomplete) version that could pass all tests by the end of the week.
Building and running all tests successfully in https://travis-ci.com/github/carenas/sljit/builds/182588796, as well as when applied to a trunk version of PCRE, as shown by (running in the IBM cloud):
Note that it is not enabled by default and is mainly available to coordinate further development: it is missing functionality which would trigger abort(), has no FPU (nor SIMD) support, and has not been optimized or even profiled, so it might be even slower than the interpreter.
Awesome progress, @carenas! Let us know when you and the community have analyzed the situation and have sizings for the next steps.
@zherczeg: would the following make sense for sizing and next steps?
Note that the original fork was maintained for almost 2 years but might still be only ~60% done, and there is also the need to learn the architecture (which, being proprietary, makes things slightly more difficult). But I am optimistic it can be done, thanks to the cloud availability and a good starting base, which is what I focused on last week to make sure all previous investment is not wasted.
I would like to avoid what happened with the TileGX port. It was never completed, has no maintainers, and thus it would be best to remove it. As for a new port, I would prefer to have a cross-compiler, qemu, and gdb tools which I can use to build and test the new port, and also some CPU / ABI documentation, since I need to learn it at some level. I am curious what "proprietary" exactly means here. No public documentation available? No free tools available?
Yes, s390x is proprietary, compared with, say, RISC-V, but not compared with Intel x86_64. Proprietary doesn't mean secret. GCC, LLVM, PyPy (including NumPyPy), node.js, OpenJDK, and Mono all are ported to s390x. There is plenty of documentation. You can request access to a system (VM/VPS) through the LinuxONE Community Cloud that I mentioned in the first message. The LinuxONE Community Cloud also hosts Travis-CI instances, as Carlo used. IBM s390x experts are available to answer questions. 40 hours seems a reasonable estimate. We can continue to discuss how to set up incremental milestones and associated bounties.
Both gcc and clang can cross-compile code for s390x (and indeed I have both set up in the CI on 2 versions of Ubuntu), and I was originally using the standard gcc cross-compiler package from Ubuntu 20.04 to build the code and develop it further. So, unlike TileGX, this is fairly more open (with CI / cloud VMs available and several Linux distributions to use on real hardware).

qemu-user targets s390x and does work, but it fails tests, either because the instructions we are using are not correctly emulated in the default CPU it uses, or because the original code targets z12 and the implementation differs enough. Running with an emulated z12 or z13 might help, but using a real VM is easier. That is why I mention CI, and why I was hoping in the third step to broaden CPU support, which will also help real users with older hardware, even if initially the target will be z13 (which includes vector support).

In that context, my "proprietary" label meant we will need to rely for now on the LinuxONE cloud for development, but unlike TileGX this is likely to get better later.
I have checked the documentation. At first glance the system design matches sljit well: it uses two's-complement number format, the ABI is similar to PowerPC, and it has condition codes, wide multiply, and IEEE single / double precision floating point. It also has an incredible number of instruction forms, even more than ARM-T2, probably because it is an old architecture. I have one more question: do we need to support EBCDIC? Because PCRE2 has many ASCII / UTF related optimizations which will not work with it.
Linux on IBMz is ASCII, and we are only asking to port PCRE2 JIT for Linux, not to z/OS in EBCDIC. z14 added some additional SIMD instructions, so you might want to target that instead of z13, if it's beneficial.
Is z14 what is being used by the VM provided by the LinuxONE cloud? The Travis containers are running older-than-usual Ubuntu versions as well, so I was suspecting they might be constrained by the CPU version there too.

Note the main concern here is being able to maintain the code base moving forward, which is why QEMU was ideal (since it can run locally on each developer's workstation, and we have a lot of different OSes to support there, including one folk with z/OS that is likely to benefit from this port if done in a compatible way). Of course, CI and remote access to a VM on native hardware is a good enough substitute, but it is less scalable, hence why I was hoping it would only be needed during the initial bootstrap.
I believe that the LinuxONE systems (hosted at Marist College) are now z15. Mainframe processors are not available in laptops, sorry. I understand the appeal, but I would recommend relying on remote access over QEMU. I am not certain what OS levels are available through Travis CI on s390x.
I think this issue should not be closed.
How is the enablement work proceeding?
Slower than planned, mainly due to my poor planning; but on the right track after the first phase was completed (with a lot more changes, and still a few more reservations, than were originally expected).
I have started to do some fixes, but I need to learn a lot more before I can do them the way I want, and my time for voluntary work is quite limited.
@carenas Any updates about this project?
Phase 1 is included in the RC1 for PCRE2 10.36, which means that with the right setup we could now get more people working in parallel to clean up and complete the implementation so it is ready for end users.

After merging phase 1, it was clear that my original plan was going to incur too much tech debt, so it is reasonable to expect that phase 2 (getting vector operation support) and phase 3 (cleaning up old tech debt) will benefit from wider distribution, and therefore more hands than I was originally expecting to have (only me).

This obviously doesn't qualify for the bounty terms (which, in all fairness, I have to admit I was uncomfortable with, after pulling so much volunteer work reviewing phase 1), but I am still committed to getting this out (hopefully with a little more help, even if that means I will have to "subcontract" that work). Apologies for the delays, but I am hoping that in the end the PCRE2 release that enables JIT for end users on s390x will be of better quality.

It is important to note that most of the work I'd been doing (in PCRE2, not in sljit) was to add FPU support for phase 2 in the same way it is done for the other architectures. My concern is that this might be too hacky and therefore increase tech debt unnecessarily, even if it is possible (I was hoping to get mostly vector instructions in, while leaving everything else untouched, as was planned originally for phase 1). But with a third implementation being added, it might also make sense to refactor the other 2, moving code from PCRE2 into sljit as well, which would obviously be a bigger undertaking than planned.
@carenas Let's figure out how to solve this. There are multiple options:
I'm a little concerned about the architecture redesign, because IBM would like to have some PCRE JIT available on IBMz sooner rather than later. We can collaborate on a redesign as a second phase.
I never saw any Jitsi invitation. I can join via Bluejeans. |
Jitsi doesn't do invitations, AFAIK, but I was going to create the meeting and share the link at that time; I sent you an invite to the email associated with your GitHub account so you can manage your calendar, anyway.
Maybe I can be there a bit longer. Jitsi seems like a good solution.
And share the link in Github comments? This really isn't the right medium for an interactive conversation to schedule a meeting, nor to share links to a meeting. |
I have tried to do some minor improvements in the code, so I tried to log in to the virtual machine I created. However, I got "An unknown error has occurred. Please try again later." After a few days I still got the same error, so I decided to delete the VM and create a new one (the VM quota is 1). However, when I try to create or upload an SSH key for the new VM, I get the same error. Since this is the only service where we can test the code, we probably need to wait until they fix it.
There is no known, system-wide problem with the VMs at Marist. Have you reported the problem through the support system?
Is there an easy way to insert a breakpoint instruction on s390? I tried |
I'm not a s390x expert. The GDB breakpoint instruction seems to be the bytes 0x0,0x1. I don't know to what instruction that corresponds. Have you tried examining a s390x breakpoint in GDB to see what instruction it inserts?
It looks like gdb cannot decode it as an instruction with I found another strange thing. The
I can type
I have another question, about vector registers. It seems the CPU has 32 vector registers: the first 16 are mapped to the FPU registers, and the other 16 can be accessed with the RXB field of the instruction. However, the ABI does not mention anything about the vector registers: https://refspecs.linuxbase.org/ELF/zSeries/lzsabi0_zSeries.html#AEN413 The libc memchr routine uses vector registers above 16 without saving them. Can I assume that registers < 16 have the same saving rules as the FPU registers, and that registers >= 16 are all volatile?
0x0,0x1 seems to be the byte sequence that GDB uses for a breakpoint, but I didn't suggest that it was a normal instruction. I suggested that you could examine a breakpoint set by GDB (which should be the same 0x0,0x1), not that GDB would provide a useful disassembly of the sequence. I don't know about the x/i $pc error. Is it possible that the JIT code is not setting up the registers completely? The JIT code doesn't have to follow the normal calling convention, and maybe it is skipping some steps for speed, which would complicate debugging. I'm not an expert on IBMz and don't know about the vector register ABI. I will try to find someone from IBM to join this issue and answer the questions.
This is a bug. It looks like a known problem in GDB that may occur on s390x when the code being stepped has no binary associated with it.
All vector registers are volatile, except for the bits that overlap with the nonvolatile FPRs.
@zherczeg: to work around not being able to use gdb to disassemble the generated code, I integrated sljit with capstone and added a feature to dump the assembler, which you could use from my capstone branch (not yet ready to be merged, and a little rusty, but stable enough to be useful, IMHO).
Really? From reading the comments in this issue, I only understand that GDB doesn't disassemble the breakpoint instruction "0x00 0x01". Everything else should be disassembled as usual, right? And the breakpoint instruction shouldn't normally occur in any code. GDB does insert it when the user says "break ", but then GDB hides the breakpoint instruction in the disassembly and shows whatever was there before. So this should only be an issue if the breakpoint instruction was inserted in some other way, and I wonder why that would be necessary.
Sorry for oversimplifying the problem, but you are correct: the problem is not in disassembling the code, but in disassembling the code and being able to step through it, which my code changes also don't address (as I found it more useful to avoid breakpoints in the generated code anyway).

My proposed changes add two options to the "runTest" binary that comes with sljit: one to dump the generated code's assembler (like gdb would do), and one to jump into it to validate it does the right thing after a test was created for it. It is in the test where I add logic to validate that the generated code does what is expected, so then I have no need for either gdb or breakpoints.
Carlo originally was looking for some instruction to manually set a breakpoint. When his attempts failed, I suggested the BREAK instruction sequence (0x0,0x1) inserted by GDB. I think some of the confusion stems from ambiguity about what Carlo really wants to accomplish. What does he want the instruction to do? Does he want an instruction that generates some sort of "trap" operation that non-destructively suspends the program? Does he want an instruction that is visible to a debugger (as opposed to the magic 0x0,0x1 sequence that GDB knows to ignore)? I believe that the solutions are operating correctly, just not as Carlo expected, and we need more clarification about what Carlo really is trying to accomplish.
This was work Zoltan was doing, though; but yes, we both had the same need, which I will try to explain below and which is not even specific to s390x. It is also why we were earlier mentioning QEMU support as something that would be nice to have (since with an emulator you have full control over how the CPU state changes from instruction to instruction).

sljit creates machine code dynamically, and we write the generator based on our understanding of the documentation for each instruction. More often than not (at least for me) I get the opcodes wrong, or misunderstand the side effects of the instructions, and end up running something that could trap badly (if lucky). At that point the only way to know what went wrong is to isolate the generated code and step through it, looking at the CPU and memory effects; for that, adding a breakpoint at a specific memory address (when loaded into the PC) helps to get to the interesting part of the opcodes faster.

The alternative I was using (and proposed in my dev branch) sidesteps the issue by allowing me to programmatically tell the sljit program to print the generated assembler (which capstone allows, and which will helpfully spot invalid opcodes or even invalid sequences, without having to crash), and then, once I feel confident enough, to execute the function instead and make sure its effects are what was expected (based on the documentation), which is what the test suite does for every single instruction on every supported CPU. In this last step, I might add some asserts and iterate with some throwaway variations of the test case, to confirm all documented side effects are really observed (to guard against any documentation bug) and to make sure the generator fails safely if pushed hard against the specific instruction constraints (e.g., when some possible combinations result in invalid instructions).
I would let Zoltan explain here, but from his comments I suspect he is trying to add the missing FPU instructions that are needed to generate vector operations.
The 0x0, 0x1 works nicely from my side. If you run the code without gdb, it stops with an illegal instruction, but that does not matter for me. In gdb, it stops after 0x0, 0x1, and I can disassemble the rest of the code (from $pc). And the VM is more responsive than it was last year, so I can do some meaningful development this week.
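To make the behavior described above reproducible, here is a small self-contained C sketch (helper names are hypothetical) that plants the 0x00 0x01 sequence on s390x and falls back to the compiler's trap elsewhere. As noted above, outside a debugger either variant kills the process with an illegal-instruction signal, which the demo verifies from a forked child.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Emit the 2-byte 0x00 0x01 sequence on s390x (what GDB uses as a
   software breakpoint); elsewhere use __builtin_trap() so the demo
   stays portable.  Outside a debugger either raises SIGILL. */
static void manual_breakpoint(void)
{
#if defined(__s390x__)
    __asm__ volatile(".short 0x0001");
#else
    __builtin_trap();
#endif
}

/* Run the breakpoint in a child process and report the signal that
   terminated it (0 if it exited normally, which should not happen). */
static int breakpoint_signal(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        manual_breakpoint();
        _exit(0);               /* not reached */
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Under gdb the s390x sequence should instead stop the debugger at that address, matching the observation above.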
PCRE2 has 3 SIMD-accelerated functions, and I have landed code for the simplest one: The good news is that s390x SIMD seems quite good for text processing. The bad news is that I had to add some TODOs, because the s390x port emulates CPU status flags. I don't know why, since the CPU has hardware support for flags. I also noticed that many instructions are generated for simple computations, which needs to be improved as well.
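The kind of first-byte search being vectorized here can be sketched portably. This SWAR version (a hypothetical helper, not the actual PCRE2/sljit code) shows the broadcast / XOR / zero-lane-detect pattern that maps naturally onto vector instructions like those in the z13 vector facility:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar SWAR sketch of "find first occurrence of a byte", the
   simplest of the primitives PCRE2's JIT accelerates with SIMD.
   Broadcast the target byte, XOR so a match becomes a zero byte,
   then use the classic haszero trick to detect a zero lane. */
static const char *find_byte(const char *s, size_t n, unsigned char c)
{
    const uint64_t ones  = 0x0101010101010101ull;
    const uint64_t highs = 0x8080808080808080ull;
    const uint64_t mask  = ones * c;        /* broadcast c to all lanes */
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);
        w ^= mask;                          /* matching byte -> 0x00 */
        if ((w - ones) & ~w & highs)        /* some lane became zero */
            break;
    }
    for (; i < n; i++)                      /* pinpoint (or finish tail) */
        if ((unsigned char)s[i] == c)
            return s + i;
    return NULL;
}
```

The vector version replaces the 8-byte word with a 16-byte register and the haszero trick with a dedicated element-compare instruction, but the structure is the same.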
Right, I wondered about that, too, and asked the original author Mike Monday about this. He answered:
I don't understand that explanation, though, because z/Architecture instructions usually do indicate a zero result in the condition code (CC), if applicable. Maybe SLJIT has a special way of dealing with these flags that makes it difficult to map them to the CC, but I'd be very surprised if that couldn't be handled somehow. At the time when I asked Mike about this, he was already off the project, and he considered the port to be in a prototype state, so we didn't follow up on this issue.
I identified the following problems:
Overall, it seems to me that SUBTRACT instructions cannot be used to set status flags in general, since the COMPARE instruction produces different flags. ADD / SUBTRACT LOGICAL can be used if the zero result needs to be checked, but they might set two different values. Setting some combinations (e.g. overflow + zero) might need some emulation, but those are probably never needed in practice.

As for the operation, a COMPARE is likely needed after an operation when non-zero status flags are requested. It cannot be done before the operation, since the operation may overwrite flags. Hence, if the result register is the same as one of the source registers, its original value needs to be preserved for the comparison. As for shifting, an

Whatever flag model we choose, we need to check / update manually, one by one, every instruction which sets condition codes. There are a lot of them, so this looks like a lot of work.
Right. However, in C semantics a signed integer overflow would be an undefined condition. I don't know how SLJIT expects those to be handled. Also, if necessary, or until a better way is found, you can always add a "load and test" instruction such as LTR or LTGR to compare a signed integer against zero. (Source and target registers can be the same here.)
Well, COMPARE indicates the same flags as SUBTRACT, except when SUBTRACT would overflow. Basically, COMPARE sets the flags according to the full arithmetic result. Similarly, in C semantics two signed integers can always be compared, but not always be subtracted.
All the "LOGICAL" instructions operate on unsigned integers, so a "LOGICAL" result can never be smaller than zero.
Not in general, but there are cases where certain instructions can be used in place of others to prevent the condition code from being set. An example is "rotate then and/or/xor/insert selected bits", where there are variants like RISBGN.
Not sure. It really depends on how SLJIT expects signed integer overflows to be handled.
Let me explain how status flags work in sljit. Sljit only supports two status flag bits: one is "result is zero"; the other depends on the instruction and is customizable. For example, it can be a signed greater flag, an unsigned less-than-or-equal flag, or carry for a subtraction operation.

Setting a status flag must be explicitly requested, otherwise the flag is undefined. For example, if the carry flag is requested for an addition, and the next jump depends on the zero or signed less flag, the behavior is undefined (in debug mode this triggers an assertion in the compiler). So valid sljit code can only use the carry flag after the previously mentioned addition. Unless it is explicitly stated that an instruction does not modify flags (e.g. mov, call, jump instructions), the flags are undefined. Hence it is valid to do some mov operations before the carry flag is used, but not another addition.

Sljit does not have an "undefined on overflow" behavior: if the zero flag is requested, it must be set. Obviously emulation might be needed in some cases, but currently 2+ instructions are used when the zero flag is used, and that is a lot. The ideal is 0 instructions in the commonly used cases.

I may misunderstand something, but COMPARE LOGICAL sets 0 for equal, 1 for less, and 2 for greater, while SUBTRACT LOGICAL sets 1 for less, 2 for equal, and 3 for greater. Apart from less, these do not match. We can use the latter all the time, and discard the result when a compare is needed, but compare might have instruction forms which are useful.
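As a concrete illustration of that contract, here is a portable C model (hypothetical names, not the sljit API) of what an ADD that was asked for the zero and carry flags must deliver. On s390x neither bit falls out of the condition code for free in every case, which is where the extra instructions come from.

```c
#include <stdint.h>

/* Portable model of sljit's ADD when the caller requests status
   flags: "zero" is set iff the 64-bit result is zero, and the
   variable flag (here: carry) is set iff the unsigned addition
   wrapped around.  A backend must materialize exactly these bits,
   however its native condition codes behave. */
static uint64_t add_set_flags(uint64_t a, uint64_t b, int *zero, int *carry)
{
    uint64_t r = a + b;
    *zero  = (r == 0);
    *carry = (r < a);     /* unsigned wrap-around detection */
    return r;
}
```

For instance, adding 1 to an all-ones register must report both flags set, since the result is zero and the addition carried out of bit 63.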
sljit assumes that negative numbers are represented in two's-complement format and therefore mostly ignores overflows (even the ones that could be considered undefined per C). In practical terms this is not a problem, because all supported CPUs use that format, and most modern CPUs that might be added will also support it; but it is for sure something to consider if we ever have to generate code for a CPU that uses ones' complement or something else.
Thanks, that explanation helps.
Well, the only case where the result of a signed addition can be zero and an overflow occurs is when calculating
Your understanding is correct. COMPARE LOGICAL sets the CC as usual, while SUBTRACT LOGICAL is designed to prepare the borrow for a multi-precision integer subtraction. Perhaps we could add another s390-specific field to the
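The two condition-code conventions compared above can be modeled directly. This sketch (hypothetical helpers) encodes the mappings as quoted in the thread, so the single point of agreement — the "less" case — is easy to verify:

```c
#include <stdint.h>

/* CC conventions as described in this thread:
   COMPARE LOGICAL: 0 = equal, 1 = first operand low, 2 = high.
   SUBTRACT LOGICAL: 1 = low (borrow), 2 = equal (no borrow),
   3 = high (no borrow).  Only the "low" case coincides, which is
   the mismatch the flag-mapping work has to bridge. */
static int cc_compare_logical(uint64_t a, uint64_t b)
{
    return a == b ? 0 : (a < b ? 1 : 2);
}

static int cc_subtract_logical(uint64_t a, uint64_t b)
{
    return a < b ? 1 : (a == b ? 2 : 3);
}
```

A backend that wants one instruction to serve both purposes has to translate between these encodings at jump-emission time.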
Zoltan reported the following progress:

I had some free time last week and I have added SIMD support for s390x. The code is landed, and the next step could be measuring the improvement, but virtual machines are not suitable for this. The SIMD support is currently enabled by default, although we might need to support a compile-time or runtime check for availability.

I have noticed that the code generator needs improvement. For example, adding 16 to a register (r3 = r4 + 16) is translated to four instructions:

lghi %r1,16
lgr %r0,%r3
algr %r0,%r1
lgr %r4,%r0

This is less efficient compared to other ports, which can do this with one machine instruction. Probably the code generator is less complex than other ports, which try to exploit the strengths of their corresponding instruction sets. I suspect improving this is not a trivial task (maybe we need to completely redesign it), but it should be done, since this is a serious disadvantage for the s390x port.

@aarnez Can you help measure the performance when you have a moment? And maybe you have a suggestion for Zoltan's code generation observation.
On Tue, Jan 12 2021, David Edelsohn wrote:

> Zoltan reported the following progress:
>
> I had some free time last week and I have added SIMD support for
> s390x. The code is landed and the next step could be measuring the
> improvement, but virtual machines are not suitable for this. The SIMD
> support is currently enabled by default, although we might need to support
> compile or runtime check for availability.

If anyone can give me instructions on what/how to test, I could take a shot at taking measurements.

> I have noticed that the code generator needs improvement. For example,
> adding 16 to a register (r3 = r4 + 16) is translated to four
> instructions:
>
> lghi %r1,16
> lgr %r0,%r3
> algr %r0,%r1
> lgr %r4,%r0

(Zoltan, you probably meant `r4 = r3 + 16`, since that's what these instructions really do.) Yes, this looks bad. These four instructions could be replaced by:

lgr %r4,%r3
algfi %r4,16

Or, if the condition code is not needed, "load address" could be used instead:

la %r4,16(0,%r3)

It seems that the current code generator follows a rigid scheme that happens to yield four instructions in this case.
#106 should be the next step in this work. The patch focuses only on flags; optimal instruction selection is still far away. I cannot measure the perf on a virtual machine.
Enable PCRE2 JIT on Linux on IBMz (s390x-linux) and optimize it to achieve a speedup over non-JIT code equivalent to that on x86_64. The goal is full functionality and a passing testsuite with JIT enabled.
An unfinished port of PCRE2 JIT to s390x exists in the Linux-on-IBM-z Github account and can be used as a starting point. IBM will contribute the code as necessary.
https://github.com/linux-on-ibm-z/pcre2/tree/s390x-experimental
A $5000 bounty from IBM is posted on Bountysource:
https://www.bountysource.com/issues/92353837-port-pcre2-jit-to-linux-on-ibmz-s390x
Systems access is available through the LinuxONE Community Cloud at Marist
https://linuxone.cloud.marist.edu/#/register?flag=VM
This issue is tied to a feature request to the Exim community for PCRE2 support.
https://bugs.exim.org/show_bug.cgi?id=2635