Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests to reverse-string to teach correct string handling #1661

Open
tobega opened this issue May 26, 2020 · 23 comments
Open

Add tests to reverse-string to teach correct string handling #1661

tobega opened this issue May 26, 2020 · 23 comments
Labels

Comments

@tobega
Copy link

tobega commented May 26, 2020

Currently the tests for reverse string only contain strings with 7-bit ASCII. This is teaching people the wrong way to handle strings as you can just treat it as an array of bytes. Incorrect handling of non-ASCII characters (or surrogate pairs if you use something like Java) is one of the biggest sources of bugs in the industry.

I propose adding at least a test with a multibyte UTF-8 character as is common in most European languages, e.g. "skåp" which becomes "påks" (this would make a difference in the Julia solutions, for example)

For even better benefit (for UTF-16 languages), it could be good to have a string containing a surrogate pair, e.g. "\uD834\uDD1E", a G-clef, which should come out as it was.

That is probably enough, and it might be too difficult to also handle combining marks, e.g. "as⃝df̅" should become "f̅ds⃝a", not "̅fd⃝sa" (from rosettacode.org)

@SaschaMann
Copy link
Contributor

Could you post a screenshot of the last example, please? I'm not sure if it's rendered correctly for me:

grafik

@siebenschlaefer
Copy link
Contributor

I agree that students should be aware of encodings.

But as you wrote reversing unicode strings is hard because you have to be aware of surrogate pairs, composite characters, combining marks, and things like the BOM or direction marks.
Also, not all programming languages have builtin facilities to handle Unicode, e.g. the standard strings in C and C++ are just sequences of bytes. They would require external libraries.

Currently the exercise is pretty easy, and AFAIK comes early in the core track of many languages. Changing it would make it much much harder, and IMHO make it a different exercise.

@iHiD
Copy link
Member

iHiD commented May 26, 2020

@tobega I would suggest creating a new exercise that centres around working with unicode characters. This would make a great "Concept Exercise" for Exercism version 3. If you let me know which language(s) you are proficient in I can point you in the right place to create that :)

@glennj
Copy link
Contributor

glennj commented May 26, 2020

The micro-blog exercise deals with handling unicode characters, so there's a bit of overlap with this suggestion.

@tobega
Copy link
Author

tobega commented May 26, 2020 via email

@tobega
Copy link
Author

tobega commented May 27, 2020 via email

@tobega
Copy link
Author

tobega commented May 27, 2020

After considering the objections here, a good resolution to this would be to specify that the strings are 7-bit ASCII in the problem. It makes quite a difference in Julia, for example.

@iHiD
Copy link
Member

iHiD commented May 27, 2020

This feels like it will be very different? In Ruby, for example, Strings are arrays of chars, not bytes, so the test would be a noop (so fine adding it, but won't do a huge amount either way):

"skåp".reverse
 => "påks" 

"skåp".chars 
 => ["s", "k", "å", "p"] 

For some languages, it will be make the exercise much harder, which would then (as @siebenschlaefer says) mean those tracks needed to move it around. This is the problem at the heart of #1560.

My suggestion would be to open this PR at exercism/julia and make the change locally in that repo for now, and we can readdress this in this repo once we've understood how we're going to handle situations like this.

@ErikSchierboom
Copy link
Member

We've recently discussed the fact that we have fairly little unicode exercises. The reverse-string exercise could be a good candidate for that. It would increase the difficulty for tracks that don't have nice unicode support. My suggestion would therefore to add a new test case with a unicode input string, but add the unicode scenario to it to allow tracks to exclude this test case. Thoughts?

@siebenschlaefer
Copy link
Contributor

siebenschlaefer commented Jan 19, 2022

I'd rather not do that. I'd prefer doing that in a different exercise.

Currently it's a pretty simple exercise, well suited for beginners.
Handling unicode properly would make it much more complicated (depending on language of course), would shift the focus from the reversal to unicode, and make it essentially a different exercise.

BTW: It's surprisingly hard to reverse a unicode string and cover all the cases, e.g. combining characters, ligatures, directions markers, BOM. (I forgot I already mentioned that.)

@kotp
Copy link
Member

kotp commented Feb 15, 2022

That said, we use this as the basis for the same exercise with the additional requirements. It means knowing how to reverse a string (ASCII) and applying the concerns that UTF-8 bring.

So I would vote for not changing this exercise, but using it as the base for the more complex exercise. "Reverse String Revisited".

It would mean not really losing out on the existing exercise, while being opt-in for tracks that have this as a concern.

@iHiD
Copy link
Member

iHiD commented Mar 2, 2022

@siebenschlaefer Could you expand a bit more on not adding the test/scenario pair? Doing so would mean that you could opt out for your tracks if you wanted to so as not to make the exercise harder, but for other tracks who want to ensure that UTF8 is handled properly, they could add it relatively trivially. As this is sort of the point of the "scenarios", I want to understand more about why you don't feel this would work well.

@siebenschlaefer
Copy link
Contributor

siebenschlaefer commented Mar 7, 2022

@iHiD: tobega is of course correct, strings that contain human language are more complicated than a sequence of bytes, I get the objection that we should not teach that simple view, and I agree that handling Unicode/UTF-8/UTF-16 strings incorrectly is far too common.

But "just" treating a string as a sequence of independent multi-byte UTF-8/UTF-16 code points is also not correct, and I think we should not teach that either. That would treat one "issue" and ignore too many others.

E.g. the "å" in "skåp" can be encoded as "'\u00e5" (LATIN SMALL LETTER A WITH RING ABOVE) or decomposed as "\u0061\u030a" (LATIN SMALL LETTER A followed by COMBINING RING ABOVE). Only handling the former correctly is IMHO similar to the current situation where we only handle single bytes correctly.

Yes, we could support some subset of western languages where characters are encoded as single code points, e.g. by specifying that the input will be "canonically combined". But do we really want to follow the western tradition of ignoring the rest of the world? Many of our students come from cultures with more complicated scripts. And AFAIK there are quite a few graphemes without a composed form.

Or what about bi-directional text? (AFAIK Hebrew is written from right to left but any decimal numbers are written from left to right.) Sometimes the direction is implicit, sometimes there are explicit directional embeddings.

Writing a full-fledged Unicode-aware reverse_string() function would IMHO be far too complex (although I would be thrilled to be proven wrong.) And anything less than that will have restrictions, e.g. no bidirectional text, no combining characters, or only ASCII-characters (which is basically what we have right now).

To be clear: I'm by no means an expert on Unicode. I find the complexity fascinating and each time I try to understand something I learn more about what I don't know. (E.g. country flags are encoded with two letters from the range U+1F1E6 ... U+1F1FF, reversing them can turn Spain into Sweden :-)

So IMHO these are the options:

  1. Leave this exercise like it is and add a big disclaimer that it will only handle ASCII or 1-byte characters.
  2. Expand this exercise so that it handles some additional characters, e.g. single code points possible encoded with multiply bytes. And add a big disclaimer about the limitations.
  3. Change this exercise to reverse-array as suggested by @tobega

Personally I favor (1). This exercise is very easy to understand, even for novice programmers, everybody is familar with "text" and I guess almost all of us did try to write out names in reverse, right? It is pretty easy to solve but still offers the opportunity to practice writing idiomatic and efficient code. (And I've seen quite a few students who still struggle with it, e.g. trying to include string[string_length] in the result.
My second favorite would be (3). That avoids all the complexities of natural language and encodings, but IMHO it's more abstract than text and therefore a little bit less approachable.

@ErikSchierboom I like the idea of exercises targeting Unicode. But (1) I'm not sure whether reversing a string would be a good fit, and (2) I'd rather have a separate exercise than making this exercise much more complicated (as @kotp wrote).

@iHiD
Copy link
Member

iHiD commented Mar 7, 2022

@siebenschlaefer Thanks for the detailed answer. I understand more what you're saying now. My summary of what I've taken from what you've written is that supporting a few unicode characters is probably not too hard, but supporting all of unicode is complex, and we should either do none or all.

That said, I do think the complexity is going very language dependent. For Ruby, for example, things like flags "just work". But I don't know about Hebrew etc.

str = "skåp 🇸🇪"
str.grapheme_clusters.reverse.join.to_s # "🇸🇪 påks"

But considering the bigger picture, I think I agree with you. Having an exercise that includes some of the complex examples you use (flags, hebrew, other emoji etc), would be a very interesting exercise, and better way of approaching this, but then I do think we also need to make explicit that this exercise only uses ASCII.

@siebenschlaefer
Copy link
Contributor

siebenschlaefer commented Mar 7, 2022

@iHiD I didn't know that ruby comes with grapheme clustering built into the language. That's really cool!

Oh, I forgot about stylistic ligatures: For compatibility some ligatures that are often used in books have a direct representation in Unicode, e.g. "fi" (U+FB01) or "ffi" (U+FB03). Should reverse_string("fi ffi") keep them intact ("ffi fi ") or break them up ("if iff")? Granted, that doesn't come up often but it shows how complex Unicode can be.

str = "\ufb01 \ufb03"
puts str # "fi ffi"
puts str.grapheme_clusters.reverse.join.to_s # "ffi fi"
puts str.unicode_normalize(:nfkc).grapheme_clusters.reverse.join.to_s # "iff if"

@SaschaMann
Copy link
Contributor

SaschaMann commented Mar 7, 2022

But considering the bigger picture, I think I agree with you. Having an exercise that includes some of the complex examples you use (flags, hebrew, other emoji etc), would be a very interesting exercise, and better way of approaching this, but then I do think we also need to make explicit that this exercise only uses ASCII.

I can't follow how you arrive on that conclusion. I agree that there are probably tracks that don't want to add these cases for the reasons outlined. But this is exactly what scenarios were created for and tracks that don't want it can choose to not opt into those cases without the usual need to consider them one case at a time thanks to the scenario. They don't change what the exercise is about, they don't take away any options from maintainers, but they enable maintainers to make the exercise more interesting (or in some cases viable at all) for their tracks without the need to write those cases from scratch. Why not add those cases to easily allow maintainers to add such cases to the exercise if they want to?

@iHiD
Copy link
Member

iHiD commented Mar 7, 2022

Because I want some exercises in problem specs to be simple and others to be complex. The nature of all exercises is
to become more complex; Exercism's history has consistently shown that. So sometimes it makes sense to fork exercises to keep one simple and one more complex (for example, when we forked "Hello World" and "Two Fer").

The aim of scenarios (as I see them) is to allow for some edge cases that make things a bit more interesting in some languages, but not others. My initial feeling was that this was the perfect example for that, as for Ruby it would be effectively trivial but for other languages it would be complex. However, what @siebenschlaefer has shown is that for any language handling Unicode involves learning about the Unicode domain, and so fundamentally shifts this exercise from being about reversing an ascii string (language-specific simple task) to reversing unicode (complex domain knowledge required). This shifts this exercise from something that is simple in basically all languages to something that is complex in all languages. I want to protect this exercise's simplicity but am a strong 👍 on forking this and creating a more complex version that teaches the unicode complexities as part of the exercise.

@kotp
Copy link
Member

kotp commented Mar 7, 2022

I am also all for leaving the simple exercise alone, and developing a UTF-8 or even other string related encoding exercise.

@SaschaMann
Copy link
Contributor

I'd like to point out that these tests have been part of the Julia implementation of this roughly since the issue was opened: exercism/julia#261

So this is not about introducing new test cases to prob-specs, it's about "sharing" existing test cases with other tracks, more along the original intentions of prob-specs. But it should also allow you to look into whether or not this has made the exercise too complex as that should be noticeable in the number of submissions, mentoring requests/comments etc. by comparing those numbers from before and after the tests have been added.

@ErikSchierboom
Copy link
Member

The majority seems to be of the opinion that Unicode handling of strings is preferrably not added to this exercise and instead done in a separate exercise.

Just to try and make things even more complex (note: I'm not really trying to do that), maybe there is an intermediate, compromise solution? @siebenschlaefer showed that the Unicode spec can be incredibly complex, but is it true that some Unicode characters are easier to handle than others and wouldn't warrant having to know the entire Unicode spec? If so, we could consider adding a test case with just a simple Unicode example and with a scenario added to it.

I'm not saying this is something we should be doing, but I just wanted to point this out as an option.

@iHiD
Copy link
Member

iHiD commented Mar 8, 2022

it true that some Unicode characters are easier to handle than others and wouldn't warrant having to know the entire Unicode spec

That was what I thought before, but I now think we'd end up in a more messy, complex place. What happens when someone wants to add a test for Hebrew - do we reject it because Hebrew is more complex than a European language? I think we either get into messy territory or just take the first step towards large complexity. I don't understand the value of having a test case for an emoji to teach some unicode, but where the lessons learnt will then fall down on a more complex unicode example. It's clear from @siebenschlaefer's example above that the "naive" implementation that I posted in Ruby is insufficient, and I suspect the Julia example file would not deal with that case either.

For this, I feel we either should either cover reversing all strings, or we cover reversing ASCII strings. And I think this exercise should cover ascii, and a more complex should cover unicode, and be open to as many complex examples as people can come up with.

@SaschaMann Final note of what I'm trying to explain here. If you look at the change that happened in the example file in that PR in Julia, by adding that test the exercise shifted from teaching how to manually reverse, to instead using the reverse function(?) and the complexity more being around using graphemes too. This is exactly the sort of thing I think would be a loss to all the tracks if it happened globally - the reversing bit gets lost in the increased complexity of the exercise. I appreciate you could just choose to still manually reverse, but I'm not convinced maintainers across 55 tracks won't all default to what Julia defaulted to, which I feel would be a loss to students across Exercism.

@ErikSchierboom
Copy link
Member

I think I stand correct about the "simple" Unicode approach, as it turns out to not be simple for any case.
This entire discussion is very useful, as we're basically carving out the space in which we decide we want this exercise to operate.

@SaschaMann
Copy link
Contributor

SaschaMann commented Mar 8, 2022

This is exactly the sort of thing I think would be a loss to all the tracks if it happened globally - the reversing bit gets lost in the increased complexity of the exercise. I appreciate you could just choose to still manually reverse, but I'm not convinced maintainers across 55 tracks won't all default to what Julia defaulted to, which I feel would be a loss to students across Exercism.

It's the opposite. It's a win for students.

The previous example taught them an antipattern: indexing into strings. This is not safe unless you know exactly what the string will look like which is not the case here1. We also opted against that in discussions about a strings concept exercise. Explaining why it isn't safe would require talking about unicode, which would introduce the same complexity, in particular if you keep in mind that a frequent argument of yours is that people should be able to understand every thing mentioned in an exercise even if it's not relevant for solving it. In the discussion of that concept exercise we have not found a way to teach strings while ignoring the existence of unicode without also teaching antipatterns. You just can't ignore it. The same is the case here.

(Side note: the reverse method you mention reverses arrays/collections, it has nothing to do with strings or chars)

The key takeaway from this exercise changed from:

  • You can index into strings [but shouldn't, which wasn't mentioned]

to

  • Unicode is hard, let a library deal with it.
  • [For students who wanted to write it manually or looked at the source of the functions used in the example] It's so hard that Julia itself also relies on external libraries specialised in it.
  • This is how you can iterate over strings and manipulate them in a safe way without using indices.

That aside, as I mentioned earlier this has been deployed for several months so if it truly is a loss for students, you should be able to see that by looking at the database.


but I'm not convinced maintainers across 55 tracks won't all default to what Julia defaulted to

Reminder: the default is that the case (or any case) isn't added to the track. The scenario makes this even easier for maintainers because they don't have to consider the consequences for each case individually but instead see that several are tagged and can be considered at once.

You're only taking away things from students by not allowing these test cases to exist in prob-specs, you're not adding anything valuable with such a (spec-violating, it's going back to pre-#1560 times) restriction.

Footnotes

  1. there are libraries that add ascii-strings and such for those use cases fwiw

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants