Search and replace UTF-8 #14

LdBeth · 2023-06-03T12:48:40Z

I am trying to replace some UTF-8 characters in a file using the FS command (save the text below to a file and run it with EI):

0J<@FS/ᴇ/E/;>
^[^[

This works with TECO C, which I guess is 8-bit clean. However, it fails with TECO-64.
Using ^Q to quote the character does not work either:

0J<@FS/^Qᴇ/E/;>
^[^[

I also tried putting ^Q before every byte of ᴇ, but had no luck. Inserting ᴇ with 225i180i135i works, so I'm surprised that searching for UTF-8 does not.
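For reference, the three values 225, 180, and 135 are the UTF-8 bytes of ᴇ (U+1D07). The following is a minimal sketch of the standard UTF-8 encoding rules; `utf8_encode` is a hypothetical helper written for illustration and is not part of TECO-64 or TECO C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes to out and returns the number of bytes written,
 * or 0 if cp is not a valid scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx 10xxxxxx x3 */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

Calling `utf8_encode(0x1D07, buf)` fills `buf` with 225, 180, 135 and returns 3, which matches the byte values inserted by hand above. All three bytes are above 127, so any code path that is not 8-bit clean will mishandle every one of them.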

Is this a limitation of TECO-64, or is there a way to work around it?

fpjohnston · 2023-06-03T19:08:01Z

My immediate answer is that I'm embarrassed that TECO C can do something that TECO-64 can't, so I will certainly look into it. I am not sure why it is behaving this way, but it was not an intentional limitation. I am currently preparing another version for release, and I will endeavor to include a fix for you.

Thanks for bringing this to my attention.

fpjohnston · 2023-06-03T21:09:55Z

I expect to have a new version available tomorrow, once I finish some other changes. But Unicode characters are now displayed and echoed as I think they should be:

teco -n foo
Editing file: foo
*v``
abcdᴇghij
*fsᴇ`E``
*v``
abcdEghij

If you are curious, it wasn't so much that TECO C was doing anything special, nor that I had broken anything. Rather, I had provided backwards-compatibility for TECO-32's handling of 8-bit characters on VMS, and hadn't realized how that might affect users in other (and more modern) operating environments. (And to be honest, I hadn't anticipated that anyone might use TECO with UTF-8, so I never thought to test with it.)

Also, the way this will work is that there is a new bit for the E3 flag, which is enabled by default for non-VMS builds. Which reminds me that I should probably update the documentation for that.

LdBeth · 2023-06-03T22:59:49Z

Thanks for the explanation. I’m pumped up for the next release!

I don’t use TECO as my daily editor, since I need to handle many files that are not ASCII encoded. I do enjoy using TECO as a terse scripting language. Your TECO-64 has definitely made programming more convenient.

fpjohnston · 2023-06-04T13:35:39Z

Version 200.36.1 has been released. I will close out this issue once you have confirmed that it has been resolved.

Please note that the change I made does not affect anything in display mode, which uses ncurses to handle output, and therefore would require different, and quite possibly much more extensive, modifications.

LdBeth · 2023-06-04T14:00:25Z

Ah, I can confirm the visual display is now working, but the fs replace does not work.

Editing file: test
*fsᴇ`E``
?SRH   Search failure: '<^!><^?><^?>'
*e3&64=``
64
*E3=``
323
*ht``
abcdᴇghij
*fsᴇ`E``
?SRH   Search failure: '<^!><^?><^?>'
*0j``
*fsᴇ`E``
?SRH   Search failure: '<^!><^?><^?>'
*fsa`b``
*ht``
bbcdᴇghij
*
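The `e3&64=` and `E3=` lines in the session above are plain bit arithmetic: the flag word is 323, and masking it with 64 gives 64, so that bit is set. A minimal sketch of the same check in C follows; `flag_is_set` is a hypothetical helper, and nothing here is taken from the TECO-64 sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: report whether a single-bit mask is set in flags. */
bool flag_is_set(unsigned flags, unsigned bit)
{
    /* The mask must name exactly one bit (a power of two). */
    assert(bit != 0 && (bit & (bit - 1)) == 0);
    return (flags & bit) != 0;
}
```

Since 323 = 256 + 64 + 2 + 1, `flag_is_set(323, 64)` is true while `flag_is_set(323, 128)` is false. The flag itself is therefore enabled in this build, and the search failure has to come from somewhere else.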

LdBeth · 2023-06-04T14:17:15Z

By the way, I ran the test on OS X; I’m going to try on Linux later today.

fpjohnston · 2023-06-04T14:34:46Z

Strange. The FS command worked for me, as in the following macro:

@I/abcdᴇfghij
/
0J
@^A/before: / HT
< @FS/ᴇ/E/; >
@^A/after:  / HT

Which prints out:

before: abcdᴇfghij
after:  abcdEfghij

fpjohnston · 2023-06-04T14:38:38Z

By the way, I had intended for TECO-64 to work on OS X, and had done some work toward porting it when I had access to a MacBook at my last job, but then Covid happened and my company had to downsize, so I don't presently have any way to test in that environment.

LdBeth · 2023-06-04T15:25:22Z

It does work on Linux for me. It could be that I didn’t do a clean before rebuilding on OS X. I’ll retry on OS X and report back with any updates.

LdBeth · 2023-06-04T20:43:19Z

So I did some experiments and found that match_str behaves differently on Linux and OS X.

First I patched src/search.c to get a trace:

--- search.c.old	2023-06-03 07:26:45.000000000 -0500
+++ search.c	2023-06-04 15:38:30.000000000 -0500
@@ -269,6 +269,7 @@
 
     int match = *s->match_buf++;
 
+    tprint("match: %d\n", match);
     if (match == CTRL_E)
     {
         if (s->match_len-- == 0)
@@ -378,6 +379,7 @@
         {
             int c = read_edit(s->text_pos++);
 
+            tprint("c: %d\n", c);
             if (c == EOF || !match_chr(c, s))
             {
                 return false;

Then I ran this test file on both Linux and OS X:

@I/aᴇf
/
0J
@^A/before: / HT
@FS/ᴇ/E/
@^A/after:  / HT

On Linux the output is:

*eitest``
before: aᴇf
c: 97
match: -31
c: 225
match: -31
c: 180
match: -76
c: 135
match: -121
after:  aEf

On OS X:

eitest``
before: aᴇf
c: 97
match: -31
c: 225
match: -31
c: 180
match: -31
c: 135
match: -31
c: 102
match: -31
c: 10
match: -31
?SRH   Search failure: '<^!><^?><^?>'

Now, I don't know how to interpret the negative integers in match; I'm only aware that c is the buffer content. I hope this helps track down the cause of the difference.

fpjohnston · 2023-06-04T22:12:13Z

I don't have a complete answer for you, but what I can say is that the -31, -76, and -121 are the result of sign-extending the 8-bit values that make up the UTF-8 encoding of the character. In unsigned decimal, they would be 225, 180, and 135, respectively.
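The arithmetic behind those numbers can be sketched in a few lines of C. The helper names below are hypothetical and exist only for illustration; the subtraction of 256 spells out what sign extension of an 8-bit value does on a two's-complement machine.

```c
#include <assert.h>

/* Hypothetical helper: the value a byte takes when it is treated as a
 * signed 8-bit quantity and widened to int (sign extension). */
int widen_as_signed(unsigned char byte)
{
    return byte < 128 ? (int)byte : (int)byte - 256;
}

/* Hypothetical helper: the value a byte takes when widened without a sign. */
int widen_as_unsigned(unsigned char byte)
{
    return (int)byte;
}

/* Hypothetical helper: recover the original byte from either form.
 * Conversion to unsigned char is defined as reduction modulo 256. */
int recover_byte(int widened)
{
    assert(widened >= -128 && widened <= 255);
    return (int)(unsigned char)widened;
}
```

With these, 225 maps to -31, 180 to -76, and 135 to -121, and `recover_byte` turns each negative value back into the original byte. ASCII bytes such as 97 are unchanged either way, which is why only the non-ASCII search string is affected.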

What I'm guessing is that you've tripped across a difference in either processor architecture or compiler options between your Linux and Mac systems, such that a plain char isn't treated identically in both environments when it is negative.

I thought that I had specified that char was to be unsigned by default, but I'm obviously misremembering that, or perhaps there was a good reason for it being signed by default that I've forgotten. I think anything in the edit buffer should certainly be positive, as it otherwise creates confusion when trying to debug, as we have both discovered.

In any case, I will continue to investigate.

fpjohnston · 2023-06-04T22:25:32Z

Okay, I have a test I'd like you to run. Please change line 46 of Makefile to read as follows:

CFLAGS = -c -std=gnu11 -Wall -Wextra -Wno-unused-parameter -fshort-enums -funsigned-char -MMD

This will change a plain char to be unsigned.
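Whether plain char is signed in a given build can be checked with `CHAR_MIN` from `<limits.h>`. The sketch below assumes GCC or Clang, where `-funsigned-char` is a real option; the two function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

/* Hypothetical helper: true when plain char is a signed type in this build. */
bool plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Hypothetical helper: store a byte in a plain char, then widen it to int.
 * With signed char this yields a negative value for bytes above 127
 * (the out-of-range conversion is implementation-defined; GCC and Clang
 * wrap modulo 256). With -funsigned-char the value is preserved. */
int widen_plain_char(unsigned char byte)
{
    char c = (char)byte;
    return c;
}
```

Compiled without the flag on a platform where char is signed, `widen_plain_char(225)` returns -31; compiled with `-funsigned-char` it returns 225.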

Then rebuild on OS X, and let me know if it makes any difference to the result.

You may retain the tprint() statements for the time being.

Thanks.

fpjohnston · 2023-06-04T22:26:45Z

This shouldn't break any existing commands, so I will probably leave it in regardless. I re-ran my entire test suite, and nothing failed.

LdBeth · 2023-06-04T22:30:58Z

Yes it works now!
I think the issue may be closed now. Thank you very much for the help.

fpjohnston · 2023-06-04T22:38:27Z

You're welcome. For what it's worth, I noted that although the edit buffer has type uchar, the match buffer is just char. That may be relevant for your problem. In any case, I will change the match buffer in addition to adding the -funsigned-char flag.
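To illustrate why mixing the two types is risky, here is a minimal sketch of a byte comparison in which the text is unsigned and the pattern is signed. It is not the TECO-64 search code and not a diagnosis of the OS X failure (the Linux trace above succeeded with negative match values); the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Naive comparison: each side is widened to int using its own signedness,
 * so a text byte of 225 never equals a pattern byte of -31. */
bool match_naive(const unsigned char *text, const signed char *pat, size_t n)
{
    assert(text != NULL && pat != NULL);
    for (size_t i = 0; i < n; i++) {
        int c = text[i];   /* 0..255 */
        int m = pat[i];    /* -128..127 */
        if (c != m)
            return false;
    }
    return true;
}

/* Safer comparison: normalize the pattern byte to unsigned char first. */
bool match_bytes(const unsigned char *text, const signed char *pat, size_t n)
{
    assert(text != NULL && pat != NULL);
    for (size_t i = 0; i < n; i++) {
        if (text[i] != (unsigned char)pat[i])
            return false;
    }
    return true;
}
```

For the bytes of ᴇ, `match_naive` reports a mismatch while `match_bytes` succeeds; for pure ASCII both agree. Giving both buffers the same unsigned element type removes the need for the cast altogether.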

fpjohnston · 2023-06-07T14:39:12Z

I have not posted a new release yet because I had some progress in getting Unicode characters to display correctly in display mode, and I thought I'd see how far I could go with that. But since you had a workaround, I didn't think you needed anything else just yet. Feel free to let me know if that's not the case.

For what it's worth, I had originally wanted to use unsigned char (or its equivalent) throughout my code, but I ran into issues early on with lint and compiler warnings when I would use standard library functions that used char. I was reminded of this recently when I tried to use unsigned char throughout my code.

Another reason I'm holding off on a new version, though, is to make sure there isn't any issue with the use of the -funsigned-char compiler option. I only say that because I've known of that option for years, so I'm wondering whether I had a good reason for avoiding it. So far all of the various builds seem to work okay, and all of my smoke tests are passing, so I'm willing to accept that I probably just glossed over it.

In any event, I expect that I'll upload whatever I have by this weekend.

fpjohnston · 2023-06-08T10:13:42Z

Version 200.36.2 has been posted. I have decided not to try to fix the display of UTF-8 character sequences, as it would involve major changes to TECO, which historically always treated bytes and characters as synonymous. I'm sure it could be done, but I just don't see any reason to embark on such a huge effort right now.

Thanks again for your assistance with this.

fpjohnston · 2023-06-08T10:14:33Z

This issue is now closed.

fpjohnston closed this as completed Jun 8, 2023

fpjohnston mentioned this issue Jun 8, 2023

Binary files #10

Closed
