Incorrect JSON/NDJSON detection #57

Open
hunter-gatherer8 opened this issue Feb 16, 2024 · 2 comments
Labels
content type to import in KB: This issue is about info on a new content type, and needs to be imported in our KB
misdetection: This issue is about a misdetection on a content type currently supported
missing content type: This issue requests support for a new content type

Comments

@hunter-gatherer8

hunter-gatherer8 commented Feb 16, 2024

These are pretty minor, but:

  1. A simple JSON example is recognized as "Generic text document (text)":
    no_whitespace.json. If you add whitespace after ":", it is recognized as "JSON document (code)".
  2. The same example with multiple newline-delimited JSON objects is recognized as JSON, which is understandable but also incorrect, as an NDJSON document is not valid JSON: ndjson.txt
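
To make the two cases concrete, here is a minimal sketch using Python's standard `json` module. The sample strings are hypothetical stand-ins for the attached files (their actual contents are not reproduced here), and `is_valid_json` is just an illustrative helper, not part of Magika.

```python
import json

# Hypothetical stand-ins for the attachments; the real files may differ.
compact = '{"key":"value"}'                      # like no_whitespace.json
spaced = '{"key": "value"}'                      # same data, space after ":"
ndjson = '{"key":"value"}\n{"key":"value"}\n'    # like ndjson.txt


def is_valid_json(text: str) -> bool:
    """Return True if the whole text parses as a single JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(compact))  # whitespace after ":" is optional in JSON
print(is_valid_json(spaced))
print(is_valid_json(ndjson))   # a second top-level value is rejected
```

A strict parser treats the compact and spaced variants identically, and rejects the multi-object file because only one top-level value is allowed.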

Magika version: 0.5.0
Default model: standard_v1

@invernizzi invernizzi added the misdetection This issue is about a misdetection on a content type currently supported label Feb 21, 2024
@reyammer reyammer added content type to import in KB This issue is about info on a new content type, and needs to be imported in our KB missing content type This issue requests for support of a new content type labels Feb 21, 2024
@reyammer
Collaborator

Thank you for the report. I have to admit I had never heard of NDJSON before. It feels very similar to JSONL (i.e., one JSON per line)?

@hunter-gatherer8
Author

@reyammer yes, it's the same thing under different names. NDJSON stands for "Newline-Delimited JSON", and apparently there are two separate community-driven specifications with very minor (completely irrelevant for Magika, IMO) differences. But obviously the format itself long predates both specifications, and is just that: a valid JSON object per line, possibly with some empty lines.
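
As a rough sketch of that description (not of either specification, and not of how Magika detects anything), a line-based check could look like this. The function name `is_json_lines` is hypothetical, and requiring every line to be a JSON object is an assumption taken from the wording above; a looser check might accept any JSON value per line.

```python
import json


def is_json_lines(text: str) -> bool:
    """Check for one JSON object per line, ignoring empty lines.

    Assumption: every non-empty line must be a JSON object, and at least
    one such line must be present.
    """
    saw_object = False
    # Split on "\n" only; a trailing "\r" is whitespace that json.loads accepts.
    for line in text.split("\n"):
        if not line.strip():
            continue  # empty lines are tolerated
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            return False
        if not isinstance(value, dict):
            return False
        saw_object = True
    return saw_object
```

Note that a single-line compact JSON object passes both this check and a plain JSON parse, so the two formats only become distinguishable once there is more than one non-empty line.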

Both communities are aware of each other:

Anecdotally, I've only seen "ndjson" in MIME types, but it appears that "jsonl" is actually the more popular name nowadays, and "ndjson" is pretty much abandoned. So, yeah, you'd probably be better off with "jsonl" as a name.
