weird language detection behavior #31

Closed
krz opened this issue Jun 30, 2020 · 2 comments

krz commented Jun 30, 2020

I used Languages.jl to solve cryptopals Set 1, Challenge 4: https://cryptopals.com/sets/1/challenges/4
The challenge asks you to brute-force many lines of ciphertext to find the one line whose plaintext is English.
I used a for loop to check whether each candidate result is an English sentence using Languages.jl.

Here's my code:

using Languages

detector = LanguageDetector()

# readlines on a path opens and closes the file itself
lines = readlines("4.txt")

for line in lines
    bytes = hex2bytes(line)
    for key in 1:255
        # XOR every ciphertext byte with the single-byte key
        candidate = String(bytes .⊻ UInt8(key))
        try
            # the detector returns a (language, script, confidence) tuple
            if detector(candidate)[1] == Languages.English()
                println(candidate)
            end
        catch
            # skip candidates on which the detector throws
        end
    end
end

However, this approach produces many false positives. For example:

julia> detector("TH]→XOS‼gXp◄pWi6{yC▬rDxPq")
(Languages.English(), Languages.LatinScript(), 1.0)

I'm not sure whether this is a problem with the underlying language detection algorithm or an implementation error.

aviks (Member) commented Jul 6, 2020

So yeah, the algorithm will detect "TH]→XOS‼gXp◄pWi6{yC▬rDxPq" as English. I would expect that.

The algorithm takes a sequence of characters, and determines the closest match to sequences in known languages. Given that "encrypted gobbledygook" is not in the list of languages in the model, English is the closest in this case. Further, since the sequence of characters in "encrypted gobbledygook" is non-deterministic, this algorithm would not be able to detect it in any case.
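
To illustrate why a closest-match classifier always answers with some language, here is a minimal sketch. It is a toy nearest-profile classifier built on character trigrams and cosine similarity, not the Languages.jl implementation; the function names, the two tiny training sentences, and the ASCII-only gibberish string are all hypothetical.

```julia
# Toy nearest-profile classifier. This is NOT how Languages.jl is implemented;
# all names and training sentences below are assumptions for illustration.

# Count overlapping character trigrams in a lowercased string.
function trigram_counts(text::AbstractString)
    chars = collect(lowercase(text))
    counts = Dict{String,Int}()
    for i in 1:length(chars)-2
        gram = String(chars[i:i+2])
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

# Sum of squared counts, used for the vector norms below.
function squared_norm(counts)
    total = 0
    for v in values(counts)
        total += v^2
    end
    return total
end

# Cosine similarity between two sparse trigram count vectors.
function cosine(a, b)
    dot = 0
    for (gram, count) in a
        dot += count * get(b, gram, 0)
    end
    na, nb = squared_norm(a), squared_norm(b)
    return (na == 0 || nb == 0) ? 0.0 : dot / sqrt(na * nb)
end

# Tiny made-up training samples; a real model is trained on far more text.
profiles = Dict(
    "English" => trigram_counts("the quick brown fox jumps over the lazy dog and then the other thing"),
    "German"  => trigram_counts("der schnelle braune fuchs springt über den faulen hund und dann das andere"),
)

# Returns the best-scoring known language. There is no "none of the above"
# outcome, so arbitrary input is still assigned to some language.
function closest_language(text, profiles)
    query = trigram_counts(text)
    best_lang, best_score = "", -Inf
    for (lang, profile) in profiles
        score = cosine(query, profile)
        if score > best_score
            best_lang, best_score = lang, score
        end
    end
    return best_lang, best_score
end

println(closest_language("the lazy dog", profiles))
println(closest_language("TH]XOSgXppWi6{yCrDxPq", profiles))
```

Both calls return a language name. The second input is gibberish, yet the classifier still picks whichever profile scores highest, because rejecting the input is not among its possible outputs.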

In general, I think proving a negative is difficult for any classification algorithm. Philosophically, it goes back to the whole "absence of evidence is not evidence of absence" issue.

aviks (Member) commented Dec 18, 2020

So I'll close this. The current algorithm is indeed susceptible to false positives with "gobbledygook" text. So it's not good for this use case, but it works very well for other uses. Until someone implements a different algorithm, this is the best we have.
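
For the brute-force use case in this issue, one possible workaround is to reject obviously non-textual candidates before they ever reach the detector. The sketch below is an assumption, not part of Languages.jl: the helper names and the 0.9 threshold are hypothetical, and it only checks that the bytes are printable ASCII and mostly letters or spaces.

```julia
# Hypothetical pre-filter for XOR brute-forcing; not part of Languages.jl.

# True when every byte is printable ASCII, a tab, or a newline.
is_printable(bytes) =
    all(b -> 0x20 <= b <= 0x7e || b == 0x09 || b == 0x0a, bytes)

# Fraction of bytes that are ASCII letters or spaces.
function letter_ratio(bytes)
    isempty(bytes) && return 0.0
    hits = count(b -> b == 0x20 || 0x41 <= b <= 0x5a || 0x61 <= b <= 0x7a, bytes)
    return hits / length(bytes)
end

# The 0.9 threshold is an arbitrary assumption; tune it for your data.
looks_like_text(bytes; min_ratio = 0.9) =
    is_printable(bytes) && letter_ratio(bytes) >= min_ratio

println(looks_like_text(Vector{UInt8}("Now that the party is jumping\n")))
println(looks_like_text(UInt8[0x54, 0x13, 0x48]))
```

Only candidates that pass `looks_like_text` would then be handed to the language detector, which removes most of the random-looking byte strings it would otherwise label as English.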

aviks closed this as completed Dec 18, 2020