weird language detection behavior #31

krz · 2020-06-30T15:41:48Z

I used Languages.jl to solve cryptopals Set 1, Challenge 4: https://cryptopals.com/sets/1/challenges/4
The challenge asks you to brute-force many lines of ciphertext to find the one whose plaintext is English.
I used a for loop to check whether each result is an English sentence using Languages.jl.

Here's my code:

using Languages
detector = LanguageDetector()

# Try every single-byte XOR key on every hex-encoded line of the challenge file.
for line in eachline("4.txt")
    bytes = hex2bytes(line)
    for key in 0x01:0xff
        candidate = String(bytes .⊻ key)
        try
            # detector returns a (language, script, confidence) tuple
            if detector(candidate)[1] == Languages.English()
                println(candidate)
            end
        catch
            # ignore candidates on which the detector throws
        end
    end
end

However, this approach finds many false positives, for example:

julia> detector("TH]→XOS‼gXp◄pWi6{yC▬rDxPq")
(Languages.English(), Languages.LatinScript(), 1.0)

Not sure if this is an issue of the underlying language detection algorithm or an implementation error.


aviks · 2020-07-06T11:20:30Z

So yeah, the algorithm will detect "TH]→XOS‼gXp◄pWi6{yC▬rDxPq" as English; I would expect that.

The algorithm takes a sequence of characters and determines the closest match to sequences in known languages. Given that "encrypted gobbledygook" is not in the list of languages in the model, English is the closest in this case. Further, since the sequence of characters in "encrypted gobbledygook" is effectively random, with no consistent pattern to match against, this algorithm would not be able to detect it as a category of its own in any case.
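Since the detector always returns its closest known language, one possible workaround is to reject obvious non-text before calling it at all. The following is only a minimal sketch under that assumption: `is_plausible_text` and `plausible_candidates` are hypothetical helper names, not part of Languages.jl, and the printable-ASCII rule is a deliberately crude heuristic that fits this challenge but not text in general.

```julia
# Hypothetical helper (not part of Languages.jl): true when every byte is
# printable ASCII (0x20-0x7e) or common whitespace (tab, LF, CR).
is_plausible_text(bytes::AbstractVector{UInt8}) =
    all(b -> 0x20 <= b <= 0x7e || b in (0x09, 0x0a, 0x0d), bytes)

# XOR the ciphertext with every single-byte key and keep only the
# (key, plaintext) pairs whose bytes all look like text.
function plausible_candidates(cipher::AbstractVector{UInt8})
    results = Tuple{UInt8,String}[]
    for key in 0x00:0xff
        plain = cipher .⊻ key
        is_plausible_text(plain) && push!(results, (key, String(plain)))
    end
    return results
end
```

Only the surviving candidates would then be passed to `detector`, so strings containing control bytes or non-ASCII bytes never reach the language model.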

In general, I think proving the negative in any classification algorithm is difficult. Philosophically, it gets back to the whole "absence of evidence is not evidence of absence" issue.

aviks · 2020-12-18T19:42:34Z

So I'll close this. The current algorithm is indeed susceptible to false positives with "gobbledygook" text. So it's not good for this use case, but works very well for other uses. Until someone implements a different algorithm, this is the best we have.
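For the brute-force use case in this issue, ranking candidates by a simple score may work better than classifying each one. The sketch below is a swapped-in heuristic, not Languages.jl's algorithm: `english_score` and `best_single_byte_xor` are hypothetical names, and the score (the share of bytes that are ASCII letters or spaces) is an assumption chosen for simplicity.

```julia
# Hypothetical score: fraction of bytes that are ASCII letters or spaces.
english_score(bytes::AbstractVector{UInt8}) =
    count(b -> b == 0x20 || 0x41 <= b <= 0x5a || 0x61 <= b <= 0x7a, bytes) /
    max(length(bytes), 1)

# Try every single-byte key and return the highest-scoring candidate
# as a named tuple (score, key, text). Ties keep the lowest key.
function best_single_byte_xor(cipher::AbstractVector{UInt8})
    best = (score = -1.0, key = 0x00, text = "")
    for key in 0x00:0xff
        plain = cipher .⊻ key
        s = english_score(plain)
        if s > best.score
            best = (score = s, key = key, text = String(plain))
        end
    end
    return best
end
```

Applied to each line of the input file, the line whose best candidate has the highest score would be the most likely plaintext. Random bytes rarely score close to real sentences on this measure.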

aviks closed this as completed Dec 18, 2020
