-
-
Notifications
You must be signed in to change notification settings - Fork 546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tests to reverse-string to teach correct string handling #1661
Comments
I agree that students should be aware of encodings. But as you wrote reversing unicode strings is hard because you have to be aware of surrogate pairs, composite characters, combining marks, and things like the BOM or direction marks. Currently the exercise is pretty easy, and AFAIK comes early in the core track of many languages. Changing it would make it much much harder, and IMHO make it a different exercise. |
@tobega I would suggest creating a new exercise that centres around working with unicode characters. This would make a great "Concept Exercise" for Exercism version 3. If you let me know which language(s) you are proficient in I can point you in the right place to create that :) |
The micro-blog exercise deals with handling unicode characters, so there's a bit of overlap with this suggestion. |
Maybe "reverse string" should then be remade into "reverse array"?
…On Tue, 26 May 2020 at 17:25, Glenn Jackman ***@***.***> wrote:
The micro-blog
<https://github.com/exercism/problem-specifications/blob/master/exercises/micro-blog/description.md>
exercise deals with handling unicode characters, so there's a bit of
overlap with this suggestion.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1661 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGACJ52FYRRYXKVX77YE2KTRTPNNFANCNFSM4NKJGBUA>
.
|
The rendering is correct, the ring is attached to the 's' and covers the
letter after the 's'
…On Tue, 26 May 2020 at 16:17, Sascha Mann ***@***.***> wrote:
Could you post a screenshot of the last example, please? I'm not sure if
it's rendered correctly for me:
[image: grafik]
<https://user-images.githubusercontent.com/20866761/82911543-4d46ee80-9f6c-11ea-9b85-f68020576a47.png>
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#1661 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGACJ5ZA6H2ICIQSSAJOJV3RTPFOXANCNFSM4NKJGBUA>
.
|
After considering the objections here, a good resolution to this would be to specify that the strings are 7-bit ASCII in the problem. It makes quite a difference in Julia, for example. |
This feels like it will be very different? In Ruby, for example, Strings are arrays of chars, not bytes, so the test would be a noop (so fine adding it, but won't do a huge amount either way):
For some languages, it will be make the exercise much harder, which would then (as @siebenschlaefer says) mean those tracks needed to move it around. This is the problem at the heart of #1560. My suggestion would be to open this PR at exercism/julia and make the change locally in that repo for now, and we can readdress this in this repo once we've understood how we're going to handle situations like this. |
We've recently discussed the fact that we have fairly little unicode exercises. The reverse-string exercise could be a good candidate for that. It would increase the difficulty for tracks that don't have nice unicode support. My suggestion would therefore to add a new test case with a unicode input string, but add the |
I'd rather not do that. I'd prefer doing that in a different exercise. Currently it's a pretty simple exercise, well suited for beginners.
|
That said, we use this as the basis for the same exercise with the additional requirements. It means knowing how to reverse a string (ASCII) and applying the concerns that UTF-8 bring. So I would vote for not changing this exercise, but using it as the base for the more complex exercise. "Reverse String Revisited". It would mean not really losing out on the existing exercise, while being opt-in for tracks that have this as a concern. |
@siebenschlaefer Could you expand a bit more on not adding the test/scenario pair? Doing so would mean that you could opt out for your tracks if you wanted to so as not to make the exercise harder, but for other tracks who want to ensure that UTF8 is handled properly, they could add it relatively trivially. As this is sort of the point of the "scenarios", I want to understand more about why you don't feel this would work well. |
@iHiD: tobega is of course correct, strings that contain human language are more complicated than a sequence of bytes, I get the objection that we should not teach that simple view, and I agree that handling Unicode/UTF-8/UTF-16 strings incorrectly is far too common. But "just" treating a string as a sequence of independent multi-byte UTF-8/UTF-16 code points is also not correct, and I think we should not teach that either. That would treat one "issue" and ignore too many others. E.g. the Yes, we could support some subset of western languages where characters are encoded as single code points, e.g. by specifying that the input will be "canonically combined". But do we really want to follow the western tradition of ignoring the rest of the world? Many of our students come from cultures with more complicated scripts. And AFAIK there are quite a few graphemes without a composed form. Or what about bi-directional text? (AFAIK Hebrew is written from right to left but any decimal numbers are written from left to right.) Sometimes the direction is implicit, sometimes there are explicit directional embeddings. Writing a full-fledged Unicode-aware To be clear: I'm by no means an expert on Unicode. I find the complexity fascinating and each time I try to understand something I learn more about what I don't know. (E.g. country flags are encoded with two letters from the range U+1F1E6 ... U+1F1FF, reversing them can turn Spain into Sweden :-) So IMHO these are the options:
Personally I favor (1). This exercise is very easy to understand, even for novice programmers, everybody is familar with "text" and I guess almost all of us did try to write out names in reverse, right? It is pretty easy to solve but still offers the opportunity to practice writing idiomatic and efficient code. (And I've seen quite a few students who still struggle with it, e.g. trying to include @ErikSchierboom I like the idea of exercises targeting Unicode. But (1) I'm not sure whether reversing a string would be a good fit, and (2) I'd rather have a separate exercise than making this exercise much more complicated (as @kotp wrote). |
@siebenschlaefer Thanks for the detailed answer. I understand more what you're saying now. My summary of what I've taken from what you've written is that supporting a few unicode characters is probably not too hard, but supporting all of unicode is complex, and we should either do none or all. That said, I do think the complexity is going very language dependent. For Ruby, for example, things like flags "just work". But I don't know about Hebrew etc.
But considering the bigger picture, I think I agree with you. Having an exercise that includes some of the complex examples you use (flags, hebrew, other emoji etc), would be a very interesting exercise, and better way of approaching this, but then I do think we also need to make explicit that this exercise only uses ASCII. |
@iHiD I didn't know that ruby comes with grapheme clustering built into the language. That's really cool! Oh, I forgot about stylistic ligatures: For compatibility some ligatures that are often used in books have a direct representation in Unicode, e.g. "fi" (U+FB01) or "ffi" (U+FB03). Should str = "\ufb01 \ufb03"
puts str # "fi ffi"
puts str.grapheme_clusters.reverse.join.to_s # "ffi fi"
puts str.unicode_normalize(:nfkc).grapheme_clusters.reverse.join.to_s # "iff if" |
I can't follow how you arrive on that conclusion. I agree that there are probably tracks that don't want to add these cases for the reasons outlined. But this is exactly what scenarios were created for and tracks that don't want it can choose to not opt into those cases without the usual need to consider them one case at a time thanks to the scenario. They don't change what the exercise is about, they don't take away any options from maintainers, but they enable maintainers to make the exercise more interesting (or in some cases viable at all) for their tracks without the need to write those cases from scratch. Why not add those cases to easily allow maintainers to add such cases to the exercise if they want to? |
Because I want some exercises in problem specs to be simple and others to be complex. The nature of all exercises is The aim of scenarios (as I see them) is to allow for some edge cases that make things a bit more interesting in some languages, but not others. My initial feeling was that this was the perfect example for that, as for Ruby it would be effectively trivial but for other languages it would be complex. However, what @siebenschlaefer has shown is that for any language handling Unicode involves learning about the Unicode domain, and so fundamentally shifts this exercise from being about reversing an ascii string (language-specific simple task) to reversing unicode (complex domain knowledge required). This shifts this exercise from something that is simple in basically all languages to something that is complex in all languages. I want to protect this exercise's simplicity but am a strong 👍 on forking this and creating a more complex version that teaches the unicode complexities as part of the exercise. |
I am also all for leaving the simple exercise alone, and developing a UTF-8 or even other string related encoding exercise. |
I'd like to point out that these tests have been part of the Julia implementation of this roughly since the issue was opened: exercism/julia#261 So this is not about introducing new test cases to prob-specs, it's about "sharing" existing test cases with other tracks, more along the original intentions of prob-specs. But it should also allow you to look into whether or not this has made the exercise too complex as that should be noticeable in the number of submissions, mentoring requests/comments etc. by comparing those numbers from before and after the tests have been added. |
The majority seems to be of the opinion that Unicode handling of strings is preferrably not added to this exercise and instead done in a separate exercise. Just to try and make things even more complex (note: I'm not really trying to do that), maybe there is an intermediate, compromise solution? @siebenschlaefer showed that the Unicode spec can be incredibly complex, but is it true that some Unicode characters are easier to handle than others and wouldn't warrant having to know the entire Unicode spec? If so, we could consider adding a test case with just a simple Unicode example and with a scenario added to it. I'm not saying this is something we should be doing, but I just wanted to point this out as an option. |
That was what I thought before, but I now think we'd end up in a more messy, complex place. What happens when someone wants to add a test for Hebrew - do we reject it because Hebrew is more complex than a European language? I think we either get into messy territory or just take the first step towards large complexity. I don't understand the value of having a test case for an emoji to teach some unicode, but where the lessons learnt will then fall down on a more complex unicode example. It's clear from @siebenschlaefer's example above that the "naive" implementation that I posted in Ruby is insufficient, and I suspect the Julia example file would not deal with that case either. For this, I feel we either should either cover reversing all strings, or we cover reversing ASCII strings. And I think this exercise should cover ascii, and a more complex should cover unicode, and be open to as many complex examples as people can come up with. @SaschaMann Final note of what I'm trying to explain here. If you look at the change that happened in the example file in that PR in Julia, by adding that test the exercise shifted from teaching how to manually reverse, to instead using the |
I think I stand correct about the "simple" Unicode approach, as it turns out to not be simple for any case. |
It's the opposite. It's a win for students. The previous example taught them an antipattern: indexing into strings. This is not safe unless you know exactly what the string will look like which is not the case here1. We also opted against that in discussions about a strings concept exercise. Explaining why it isn't safe would require talking about unicode, which would introduce the same complexity, in particular if you keep in mind that a frequent argument of yours is that people should be able to understand every thing mentioned in an exercise even if it's not relevant for solving it. In the discussion of that concept exercise we have not found a way to teach strings while ignoring the existence of unicode without also teaching antipatterns. You just can't ignore it. The same is the case here. (Side note: the The key takeaway from this exercise changed from:
to
That aside, as I mentioned earlier this has been deployed for several months so if it truly is a loss for students, you should be able to see that by looking at the database.
Reminder: the default is that the case (or any case) isn't added to the track. The scenario makes this even easier for maintainers because they don't have to consider the consequences for each case individually but instead see that several are tagged and can be considered at once. You're only taking away things from students by not allowing these test cases to exist in prob-specs, you're not adding anything valuable with such a (spec-violating, it's going back to pre-#1560 times) restriction. Footnotes
|
Currently the tests for reverse string only contain strings with 7-bit ASCII. This is teaching people the wrong way to handle strings as you can just treat it as an array of bytes. Incorrect handling of non-ASCII characters (or surrogate pairs if you use something like Java) is one of the biggest sources of bugs in the industry.
I propose adding at least a test with a multibyte UTF-8 character as is common in most European languages, e.g. "skåp" which becomes "påks" (this would make a difference in the Julia solutions, for example)
For even better benefit (for UTF-16 languages), it could be good to have a string containing a surrogate pair, e.g. "\uD834\uDD1E", a G-clef, which should come out as it was.
That is probably enough, and it might be too difficult to also handle combining marks, e.g. "as⃝df̅" should become "f̅ds⃝a", not "̅fd⃝sa" (from rosettacode.org)
The text was updated successfully, but these errors were encountered: