
crypto-square implementations disagree with README in many tracks #190

Closed
matthewmorgan opened this issue Mar 7, 2016 · 9 comments

@matthewmorgan
Contributor

Taken from xjava:

Chunk sizes are wrong: The README gives several examples of how to deal with strings of text that can be made into perfect ( ie n X n ) squares, and those that can be made into imperfect ( r X c ) squares. In all the README examples, the imperfect squares are r X c with the additional constraint that r >= c.

As documented by @jtigger here, most of the tracks take a 26-letter string and chunk it into 5, 5, 4, 4, 4, 4. This is not the behavior specified in the README, and it seems like an error from early on that got propagated to many tracks.

There's no json file for this in x-common, but it appears this needs to be changed in many places, not just on the Java track.

If I'm missing something please let me know, but it seems like this is a good case for common test data.

@kytrinyx
Member

kytrinyx commented Mar 8, 2016

I remember having trouble with inconsistencies on this before, but I've never taken the time to go through everything and tidy it up.

I'd love to have a canonical list of test cases (crypto-square.json would be great).

@kytrinyx
Member

kytrinyx commented Mar 8, 2016

Pinging @exercism/track-maintainers in case you're not watching the repo :)

@ErikSchierboom
Member

Just to be clear: the linked post shows several variants of how tracks implement the algorithm for the strings "Madness, and then illumination." and "Vampires are people too!". What is the correct output for those strings?

@petertseng
Member

There are two fundamental choices to be made in this problem.

  1. Should the imperfect squares (which have r rows and c columns) have r >= c or c >= r? This affects the order of letters in the output, regardless of how the output is formatted/grouped.
  2. Now that we've decided on the order of letters in the output, how should it be formatted?

I believe all tracks have reached consensus on the first question, but not on the second (though it would be good to check that). Let's examine the first question.

r >= c or c >= r?

Looking at the .yml file for the exercise, it lists a source of http://users.csc.calpoly.edu/~jdalbey/103/Projects/ProgrammingPractice.html, and the example on that page is a c >= r example, as it says:

so it is written into a rectangle with 7 rows and 8 columns.

This is supported by the following sentence in the README:

A message between 5 and 8 characters long should use a rectangle 3 characters wide.

So the message "abcde" fills in this way (c >= r):

abc
de

Rather than this way (r >= c):

ab
cd
e

The following portion also supports c >= r

Which has length: 30
And splits into 5 6-character rows:


On a related note, I found a contradictory sentence in the README:

Broken into 8-character columns, it yields 7 rows.

This sentence is contradictory because if a column has 8 characters, that implies there will be 8 rows, by definition. The sentence should read something like "Broken into 8 columns, it yields 7 rows" (a c >= r definition).

This would bring it in line with the example right below that sentence (the example from the page linked above, in fact!):

ifmanwas
meanttos
tayonthe
groundgo
dwouldha
vegivenu
sroots
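For reference, the dimension rule implied by all of the examples above can be sketched in a few lines. This is my own illustration (`rectangle_dims` is a hypothetical helper name, not from any track): pick the number of columns as the ceiling of the square root of the message length, then derive the number of rows, which guarantees c >= r.

```python
import math

def rectangle_dims(n):
    """Return (rows, cols) with cols >= rows -- the c >= r convention."""
    if n == 0:
        return (0, 0)
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    return (rows, cols)

print(rectangle_dims(5))   # "abcde" -> (2, 3): 2 rows, 3 columns
print(rectangle_dims(54))  # the "ifmanwas..." text -> (7, 8): 7 rows, 8 columns
```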

So what are the expected outputs, you ask?

"Madness, and then illumination.", normalized, has 26 characters, so we want to decide between 5x6 and 6x5.

"Vampires are people too!", normalized, has 20 characters, so we want to decide between 4x5 and 5x4.

If we believe that it should be c >= r, the output for "Madness, and then illumination." is "msemoaanindninndlaetltshui" and the output for "Vampires are people too!" is "vrelaepemsetpaooirpo"

If we believe that it should be r >= c, then the output for "Madness, and then illumination" is "mstlnnashladaeutnnnmiediio" and the output for "Vampires are people too!" is "viaeearrotmeepopsplo"

Notice the choice we make significantly changes the order of letters in the output. Also I didn't add spaces to the output, because that's a decision to be made later.
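To make the difference concrete, here is a small sketch (my own, not any track's reference implementation) that encodes under either convention and reproduces the four outputs above. The `wide` flag is a name I made up for choosing between the two conventions.

```python
import math

def encode(text, wide=True):
    """Crypto-square encode with no output spacing.

    wide=True picks c >= r (cols = ceil(sqrt(n)));
    wide=False picks r >= c (cols = floor(sqrt(n))).
    """
    s = "".join(ch.lower() for ch in text if ch.isalnum())
    if not s:
        return ""
    cols = math.ceil(math.sqrt(len(s))) if wide else math.floor(math.sqrt(len(s)))
    rows = [s[i:i + cols] for i in range(0, len(s), cols)]
    # read down each column in turn, skipping the holes in the last row
    return "".join(row[c] for c in range(cols) for row in rows if c < len(row))

print(encode("Madness, and then illumination."))              # msemoaanindninndlaetltshui
print(encode("Madness, and then illumination.", wide=False))  # mstlnnashladaeutnnnmiediio
print(encode("Vampires are people too!"))                     # vrelaepemsetpaooirpo
print(encode("Vampires are people too!", wide=False))         # viaeearrotmeepopsplo
```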

I used the code I had written to solve this problem to generate the c >= r outputs (since it was written for a track that assumes c >= r), and http://rumkin.com/tools/cipher/coltrans.php to generate the r >= c outputs: for example, you enter "vampiresarepeopletoo" and a key of "1 2 3 4 5" if you want 5 columns. Fundamentally, this crypto square is a columnar transposition, but with no rearrangement of columns and a column count based strictly on the size of the message. My code agreed with the output generated by that page for c >= r, so I trusted the page to be reliable for generating the r >= c outputs.

It seems to me that the various outputs pointed out at exercism/exercism#2432 (comment) all are consistent with c >= r. They just are spaced differently. Which brings us to the next point.

Spacing of output

As long as the order of the letters is correct, I'm much less fussed about how they are spaced. In fact, "no spacing" is a perfectly valid option to me.

However, there is an obvious disadvantage to spacing out "msemoaanindninndlaetltshui" as "msemo aanin dnin ndla etlt shui", since it trivially allows recovery of the plaintext: read the first letter of every word, then the second letter of every word, then the third...

Unfortunately, such a spacing is what the README currently prescribes ("Output the encoded text grouped by column."), and it is also what the examples at http://users.csc.calpoly.edu/~jdalbey/103/Projects/ProgrammingPractice.html use. So there is an argument for doing it this way, despite my protests. This is the chunking of 5, 5, 4, 4, 4, 4, which actually is the behavior specified in the README (since the number of chunks equals the number of columns).

Before 89d1274, the old README just said to always output in chunks of five (regardless of message size!), which makes it harder to recover the plaintext (unless the message is exactly 25 characters, in which case you're out of luck!). This is where the "msemo aanin dninn dlaet ltshu i" grouping comes from.

It appears some tracks also have chunking where the number of letters in each chunk (as opposed to the number of chunks) is the same as the number of columns ("msemoa anindn inndla etltsh ui"). A possible solution, sure, but there need be no relation between the number of columns and the number of letters per chunk.

As for "vrelaepemsetpaooirpo" (the output for "vampires are people too"), the grouping consistent with the current README is "vrel aepe mset paoo irpo" (again, plaintext easily recovered, and number of chunks equal to number of columns), and the grouping for "always five" is "vrela epems etpao oirpo" (which also happens to be the "chunk size equals number of columns" since the number of columns was five)
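The groupings discussed above differ only in chunk size, so a one-line helper illustrates them all. A quick sketch for the 20-letter example (`chunks` is a hypothetical helper name):

```python
def chunks(s, size):
    """Split s into size-letter groups, the last group possibly shorter."""
    return " ".join(s[i:i + size] for i in range(0, len(s), size))

cipher = "vrelaepemsetpaooirpo"  # c >= r letter order, 5 columns

print(chunks(cipher, 4))  # grouped by column: "vrel aepe mset paoo irpo"
print(chunks(cipher, 5))  # always five (here also chunk size = columns): "vrela epems etpao oirpo"
```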


So we should make a decision on these two things and standardize, I suppose.

My personal recommendations:

  • Definitely change that contradictory sentence in the README
  • For deciding letter order, c >= r, since it seems all tracks already implement this and it's consistent with the source of this problem.
  • For output spacing, I don't really care that much, since this is so much less important than getting the order of the letters right. If you compelled me to pick, I'd pick "no spacing at all" as my most favored option.

Footnote: I had originally written this comment assuming the tracks were in disagreement on the first question rather than the second. So that's why I went into so much detail defending c >= r for the first question... and then read exercism/exercism#2432 (comment) and realized they all had the same letter order, so in fact the disagreement was on the second question instead. OOPS. So I had to backtrack and also write about output spacing. Then I decided leaving the entire comment intact would be helpful for anyone else who was confused between the two issues, and to underscore their relative importance in my mind.

@kytrinyx
Member

@petertseng This is a fantastic summary of the problem, thank you so much for doing the analysis and exposé.

The origin of this exercise is the "square code" exercise from here: http://users.csc.calpoly.edu/~jdalbey/103/Projects/ProgrammingPractice.html

@matthewmorgan
Contributor Author

@petertseng thanks so much. @jtigger and I have been working on a PR for x-common to clarify the parameters for this exercise. Can we continue the discussion there?

@ErikSchierboom
Member

Wow, thanks a lot for the clarification @petertseng !

@verdammelt
Member

Total tangent here - sorry - but while we are talking about cryptosquares I wanted to share this phrase in Latin which reads the same even when encoded as a cryptosquare:

sator arepo tenet opera rotas

(It was found in Pompeii, amongst a few other places.)

IanWhitney pushed a commit to IanWhitney/x-common that referenced this issue May 11, 2016
Mentioned in exercism#190, but I don't see that this has its own issue

Currently 18 tracks implement this exercise. I did [a
survey](https://gist.github.com/IanWhitney/4912b100cca0b02a5c3ce21096a3e673)
of their test suites and found that they fell into two types:

Type One: Test Just Encoding
--

- [python](https://github.com/exercism/xpython/blob/master/exercises/crypto-square/crypto_square_test.py)
- [lisp](https://github.com/exercism/xlisp/blob/master/exercises/crypto-square/crypto-square-test.lisp)
- [go](https://github.com/exercism/xgo/blob/master/exercises/crypto-square/crypto_square_test.go)
- [elixir](https://github.com/exercism/xelixir/blob/master/exercises/crypto-square/crypto_square_test.exs)

These are the tracks that followed this approach: just an `encode` method
(or some alias) is implemented and tested.

Type Two: Test Intermediate Methods
--

The remaining 14 tracks followed this style. There's not a lot of
variation between these suites.
[Ruby](https://github.com/exercism/xruby/blob/master/exercises/crypto-square/crypto_square_test.rb) is a good, representative example.

In these test suites several methods are implemented and tested:

- `normalized_plaintext`
- `size`
- `plaintext_segments`
- `ciphertext`
- `normalized_ciphertext`

Again, exact method names may vary.

In implementing this json file I followed the second type. I did this
for a few reasons:

*It's already the majority*: With 14 of 18 tracks already implementing
tests like this, there is some value in following the crowd.

*I think it's the best approach*: This one is more subjective,
obviously. My problem with the Test Just Encoding approach is that
there's a huge gap between starting the exercise and getting a useful
passing test. Students have to implement the full algorithm to get
tests passing.

By breaking the steps down into smaller methods, each with its own
tests, the lag time between starting to code and getting a passing test
is smaller. And the tests are ordered so that each new method builds on
the methods already built.

The downside of this approach, I think, is that we're doing a lot of the
design up front. In the Test Just Encoding approach students can
implement the algorithm using as many methods as they want. In the Test
Intermediate Methods approach, students end up locked to the methods
defined in the test suite.

In this case I think the trade off is worth it.

But that's just my opinion. My kata group also worked through this
exercise. 3 people did it in Test Just Encoding languages (Elixir and
Python). 2 people did it in Test Intermediate Methods languages (Ruby
and Javascript).

Their opinions largely mirrored mine. Those that used Just Encoding
found it a lot of work to get the 2nd test to pass (since the first test
encodes an empty string). But once they got the 2nd test to pass, all
tests passed.

Those who used the Intermediate Methods approach found the steps between
tests easier to manage and thought that this approach was better for
learning.

Though, as an argument for Just Encoding, the Python people were
impressed at the variety of designs people used to solve the problem.
And our Elixir programmer liked that they could make up their own mind
about internal implementations.

A suggested middle ground was to have one exercise offer an Intermediate
Methods test suite, while a later exercise could cover similar ground
with a more free-form Just Encoding test suite.

Removed Tests
---

I dropped one set of tests that existed in the Test Intermediate Methods
approach: `size`. I didn't see a reason for this method. I don't see it
being used as part of a 'real' crypto library (though if your real
crypto library is using Crypto Square then you probably have other
problems). And I didn't see that testing it offered any useful tests not
already provided by the `plaintext_segments` tests.

Tweaked Method Names
---

Method naming varies between the current implementations and the README.
I've tried to use method names that follow the README.

My methods

- normalized_plaintext
- plaintext_segments
- encoded
- ciphertext

Terms used in the Readme

- 'input is normalized'
- 'plaintext in a rectangle'
- 'encoded text'
- 'cyphertext' or 'encoded text chunks'

`plaintext_segments` is the method name that deviates most from the
readme. It comes from the current implementations and I could not think
of a better name. Names = hard.
@IanWhitney
Contributor

I believe this can be closed. The r vs c question has been covered, and there is now a canonical test file.
