Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix crashes with invalid utf-8. #59

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kurnevsky
Copy link

According to the discussion in emacs bug#74922 it's possible that emacs passes invalid strings to dynamic libraries.

Fixes #58

@ubolonton
Copy link
Owner

ubolonton commented Dec 20, 2024

Thanks for the investigation and the fix. Great finding!

According to the discussion in emacs bug#74922 it's possible that emacs passes invalid strings to dynamic libraries.

That's inconsistent with the answer here, though it's consistent with my observation in #28 (comment).

Either way, Emacs's documentation here is too fuzzy, leaving this an undefined behavior. First, we'll need to figure out what Emacs's intended behavior is. Then, we can decide on the suitable approach. Similar to how we dealt with the Emacs's GC bug in #2.

@kurnevsky
Copy link
Author

@Eli-Zaretskii could you please check the answer above?

@Eli-Zaretskii
Copy link

@Eli-Zaretskii could you please check the answer above?

Sorry, I don't understand what you are asking me to check. Which "answer above", and what to check in it?

@kurnevsky
Copy link
Author

Can we decide that the answer here is not correct and it's expected that copy_string_contents can return invalid utf-8? And therefore we need to validate it in the external library.

@Eli-Zaretskii
Copy link

Can we decide that the answer here is not correct and it's expected that copy_string_contents can return invalid utf-8? And therefore we need to validate it in the external library.

I don't know. I didn't yet have time to step through the code with a debugger and see what happens there and why. And FWIW, it doesn't help that the original bug report doesn't provide any reproduction recipe except by using a module written in Rust, which I cannot even compile on my development system.

All I can say at this point is that the module-support code in emacs-module.c seems to behave according to the documented interface of the function from coding.c it is using, so this ought to behave as you expect. But the Emacs encoding machinery is very complex, and its design doesn't inherently guarantee that any raw bytes will be flagged; instead, it makes every effort to produce those raw bytes back as they were input, thereby leaving the handling of this to the application. It could very well be that what you see is one consequence of this basic design. I also doubt that this particular behavior was extensively tested, since the assumption was that modules deal with text, not with binary junk.

If you are in a hurry to get the answers, I suggest that you step with GDB through the code and tell us what you see, including why Emacs doesn't signal an error in this particular case.

But in general, I don't recommend modules to rely on behavior that is only documented in comments to the source code. If you need valid UTF-8 strings, I suggest to verify that in the first place.

Having said all that, I have this issue on my todo, and will get to investigating it eventually, as my time permits.

@Eli-Zaretskii
Copy link

I think I know what happened: your module called copy_string_contents with a unibyte string. In that case, copy_string_contents will return you the original string without doing anything. The code in copy_string_contents that signals an error relies on the fact that encoding the input string yields nil if the input includes non-Unicode characters. But that cannot be established with unibyte strings, because a unibyte string doesn't hold characters, it holds raw bytes.

What you should do is make sure the string passed to copy_string_contents is a multibyte string. If I do that, i.e.

    (switch-to-buffer "foo")
    (set-buffer-multibyte t)
    (insert-file-contents "/path/to/wg-private-pc.age")
    (setq str1 (buffer-string))

and then call copy_string_contents with the resulting string str1, I get the result you expected.

You need to realize that copy_string_contents is a variant of text-encoding routines: it encodes the input multibyte string in
UTF-8. The encoding routines in Emacs always return unibyte strings without doing anything, because a unibyte string is already encoded, or at least is supposed to be encoded.

And before you ask: no, copy_string_contents cannot by itself signal an error if passed a unibyte string, because a unibyte string can legitimately be a valid UTF-8 string. So in this case, copy_string_contents relies on the caller to make sure the input is valid UTF-8.

@Eli-Zaretskii
Copy link

I think I know what happened: your module called copy_string_contents with a unibyte string. In that case, copy_string_contents will return you the original string without doing anything. The code in copy_string_contents that signals an error relies on the fact that encoding the input string yields nil if the input includes non-Unicode characters. But that cannot be established with unibyte strings, because a unibyte string doesn't hold characters, it holds raw bytes.

What you should do is make sure the string passed to copy_string_contents is a multibyte string. If I do that, i.e.

    (switch-to-buffer "foo")
    (set-buffer-multibyte t)
    (insert-file-contents "/path/to/wg-private-pc.age")
    (setq str1 (buffer-string))

and then call copy_string_contents with the resulting string str1, I get the result you expected.

You need to realize that copy_string_contents is a variant of text-encoding routines: it encodes the input multibyte string in UTF-8. The encoding routines in Emacs always return unibyte strings without doing anything, because a unibyte string is already encoded, or at least is supposed to be encoded.

And before you ask: no, copy_string_contents cannot by itself signal an error if passed a unibyte string, because a unibyte string can legitimately be a valid UTF-8 string. So in this case, copy_string_contents relies on the caller to make sure the input is valid UTF-8.

Ping! Would someone please confirm or disprove my guess above?

@kurnevsky
Copy link
Author

kurnevsky commented Dec 29, 2024

Hi! I just checked - passed str1 from your snippet to the dynamic library and still got an invalid utf-8.

And FWIW, it doesn't help that the original bug report doesn't provide any reproduction recipe except by using a module written in Rust, which I cannot even compile on my development system.

You can reproduce it with any module, not necessarily written in rust. Just call copy_string_contents and get bytes that it returns. Then you can validate if it's a utf-8 string in the rust playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=776fd7b5eac3e9556ca4993a1b616a06

@kurnevsky
Copy link
Author

kurnevsky commented Dec 29, 2024

Ah, wait, I didn't check it properly - I was still calling (encode-coding-string str1 'utf-8). So do you say that if I don't call it explicitly then it should work fine?

@Eli-Zaretskii
Copy link

Ah, wait, I didn't check it properly - I was still calling (encode-coding-string str1 'utf-8). So do you say that if I don't call it explicitly then it should work fine?

You need to pass a multibyte string, not a unibyte string. That is the important part. encode-coding-string produces a unibyte string, a string of raw bytes, because that function is intended for producing encoded strings to be written to external media.

According to the discussion in emacs bug#74922 it's possible that
emacs passes invalid strings to dynamic libraries.
@kurnevsky
Copy link
Author

ok, this makes sense, thanks!

@ubolonton I updated the MR to just remove unwrap that crashes emacs with utf-8-validation. Since it depends on the caller whether string is valid utf-8, it would be nice to have a proper validation :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Emacs string is not always a valid utf-8
3 participants