Define a standard single-file format #15

Closed
kpreid opened this issue Mar 21, 2017 · 14 comments

@kpreid
Contributor

kpreid commented Mar 21, 2017

I was asked to review the SigMF specification by @bhilburn.

Data formats consisting of multiple files are frequently awkward to work with; for example, when downloading them from a web site. People will likely decide to distribute them in archives instead.

Therefore, I propose that the specification preemptively define a simple single-file format, which straightforwardly contains the metadata and data files. There are a lot of possibilities here, so here are some suggested constraints:

  • The format should be a ZIP archive. (Commonly used as a multi-file container; precedent in e.g. JAR and EPUB files. Can be written in streaming fashion (no seeks).)
  • The single-file format should have a unique filename extension such as .sigmf, even if it is itself a standard archive format such as ZIP. (Same argument as in #14, “Use more specific filename extensions”, and should be coordinated with that.)
  • If the format is a standard archive format, it should be required to have specific filenames (nontrivial relative, or absolute, paths prohibited).
  • No other contained files besides the SigMF metadata and data files are permitted. (This ensures that a round-trip between the two-file and single-file formats does not lose anything; thus avoiding introducing any more complexity than necessary to support single-file.)
  • Implementations SHOULD avoid attempting to compress the data file, since doing so is highly unlikely to give any advantage.
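
A minimal sketch of what a writer following these bullets might look like, using Python's standard `zipfile` module. The member names `the.meta` and `the.data` and the function name `pack_sigmf` are placeholders chosen for illustration; nothing in this proposal fixes them yet.

```python
import zipfile

# Hypothetical fixed member names (assumption, not part of any spec).
META_NAME = "the.meta"
DATA_NAME = "the.data"


def pack_sigmf(meta_path, data_path, out_path):
    """Pack one metadata file and one data file into a ZIP container."""
    # ZIP_STORED means "no compression", matching the last bullet above.
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_STORED) as zf:
        # arcname is a bare filename, so no directory path ends up in the archive.
        zf.write(meta_path, arcname=META_NAME)
        zf.write(data_path, arcname=DATA_NAME)
```

Calling `pack_sigmf("capture.meta", "capture.data", "capture.sigmf")` would produce a container holding exactly two stored members, whatever the input files were called.
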
@bhilburn bhilburn self-assigned this Mar 22, 2017
@bhilburn
Contributor

I really like this suggestion a lot, and you're right - if we don't specifically call this out, there are going to be SigMF recordings floating around as zip, rar, tar, tar.gz, tar.bz2, 7-zipped directories, and a dozen other things.

I agree with every bullet in @kpreid's top-level comment, but do want to follow up on one item:

“If the format is a standard archive format, it should be required to have specific filenames (nontrivial relative, or absolute, paths prohibited).”

Is the suggestion, here, that the filenames of the metadata and dataset files within the recording be fixed?

@kpreid
Contributor Author

kpreid commented Mar 22, 2017

The most important part is: pathnames referring to some other directory, relatively or absolutely, are prohibited; only filenames are allowed.

Whether the filenames themselves are fixed has a tradeoff:

  1. If we require a specific name (the.data, the.meta, modulo #14, “Use more specific filename extensions”), then things are simple and strict but using a regular unarchiver generates unhelpful-for-distinguishing names (though often in an automatically-created directory named after the archive, which seems like a decent result).
  2. If we allow the name prefix to vary (foo.sigmf may contain foo.data, foo.meta and so on), then this is convenient for unarchiving, but means that renaming a .sigmf container potentially leaves an old, “wrong”, name behind to surprise you later.

Since the goal here is to make a format good for “archiving” (in the scientists-and-librarians sense) data, I think that the surprise should be avoided and option 1 above should be chosen. (If one wants to have a name intrinsic to the data set, well, that's what the contents of the .meta file are for.)
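
To make option 1 concrete, here is a rough sketch of a check a reader could run on a container: only bare filenames are accepted, and only the two fixed names. The names `the.meta` and `the.data` are the placeholders used above, and `check_container` is a hypothetical helper, not an existing tool.

```python
import zipfile

# Placeholder fixed names from option 1 (assumption, pending #14).
EXPECTED_NAMES = {"the.meta", "the.data"}


def check_container(path):
    """Return a list of problems; an empty list means the archive passes."""
    problems = []
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
    for name in names:
        # Only bare filenames are allowed: no directories, no "..", no absolute paths.
        if "/" in name or "\\" in name or name in (".", ".."):
            problems.append(f"path not allowed: {name!r}")
    for name in sorted(EXPECTED_NAMES - set(names)):
        problems.append(f"missing member: {name!r}")
    for name in sorted(set(names) - EXPECTED_NAMES):
        problems.append(f"unexpected member: {name!r}")
    if len(names) != len(set(names)):
        problems.append("duplicate member names")
    return problems
```

Because the names are fixed, renaming the container never leaves a stale name inside it, and the check stays a simple set comparison.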

@smunaut
Contributor

smunaut commented Mar 22, 2017

You can't write a ZIP (or any format) in a streaming fashion when TWO parts of it are streaming (i.e. data and metadata) ...

@kpreid
Contributor Author

kpreid commented Mar 22, 2017

@smunaut Yes, but you can either write the metadata first (if it is known that the recording will have exactly one segment and no annotations), or keep it in memory and write it second after the recording ends (the metadata will most likely be very small compared to the sample data).
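
A sketch of the second approach (data first, metadata held in memory and written last), assuming the metadata serializes as JSON and using Python's `zipfile`, which can write to an output that does not support seeking. The function name, the member names, and the shape of `metadata` are illustrative assumptions.

```python
import json
import zipfile


def record_to_container(out_stream, sample_chunks, metadata):
    """Stream sample data into a ZIP, then append the in-memory metadata.

    out_stream:    writable binary file object (may be unseekable)
    sample_chunks: iterable yielding bytes objects as they are recorded
    metadata:      dict that may still be growing while chunks are consumed;
                   it is only serialized after the data member is finished
    """
    with zipfile.ZipFile(out_stream, "w", compression=zipfile.ZIP_STORED) as zf:
        # The final size is unknown up front, so ask for ZIP64 headers.
        with zf.open("the.data", "w", force_zip64=True) as data:
            for chunk in sample_chunks:
                data.write(chunk)
        # Metadata goes second, once the recording has ended.
        zf.writestr("the.meta", json.dumps(metadata))
```

The cost of this design is that all metadata must fit in memory until the recording ends, which is the assumption being made about its size.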

@smunaut
Contributor

smunaut commented Mar 22, 2017

"most likely" ...

I'm generating an annotation for every GSM burst in a real-time scanner app; that's about 2000 annotations per second (for a single GSM channel). I can assure you it grows quickly.

The whole thing has been designed to support fully streamable annotations (but not segments); breaking this now would be a shame.

@kpreid
Contributor Author

kpreid commented Mar 22, 2017

@smunaut And such applications can use the two-file format for that. I'm proposing specifying only that “if you want a single file, do it this way”.

@smunaut
Contributor

smunaut commented Mar 22, 2017

Mmm, my bad, my understanding was that you wanted to mandate the use of the single-file format.

But then what would be required from reader / writer applications to be deemed compliant?

  • Writer: write either format
  • Reader: read both, or read either?

@kpreid
Contributor Author

kpreid commented Mar 22, 2017

We should obligate readers to embed ZIP unarchiving only if we're also willing to obligate them to embed HTML parsing (see #7). It's the same situation: libraries are commonly available and it wouldn't be hard, but it's still reasonable to choose to not have implementations need library dependencies.

So, assuming we take the no-dependency choice, let's say they must support the two-file format (unless they are such that two-file doesn't make sense at all) and there's a standard unpacking tool to go from one-file to two-file (which is just a thin wrapper around a regular unzip which recognizes the .sigmf extension and renames the contents and maybe validates them).
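
As a rough illustration of how thin such an unpacking tool could be, here is a sketch in Python. It assumes the container is a ZIP with the placeholder member names `the.meta` and `the.data`; `unpack_sigmf` is a hypothetical name, and the validation shown is only a name check.

```python
import os
import shutil
import zipfile


def unpack_sigmf(container_path, dest_dir="."):
    """Extract foo.sigmf into dest_dir as foo.meta and foo.data; return the paths."""
    base, ext = os.path.splitext(os.path.basename(container_path))
    if ext.lower() != ".sigmf":
        raise ValueError("not a .sigmf container: " + container_path)
    written = []
    with zipfile.ZipFile(container_path) as zf:
        # Minimal validation: exactly the two expected members, nothing else.
        if sorted(zf.namelist()) != ["the.data", "the.meta"]:
            raise ValueError("unexpected archive contents")
        for member, suffix in (("the.meta", ".meta"), ("the.data", ".data")):
            # Rename on the way out so the files carry the container's name.
            target = os.path.join(dest_dir, base + suffix)
            with zf.open(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

A reader that only understands the two-file format could then be pointed at the returned paths without linking any archive library itself.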

I don't think this is a particularly great situation; my initial claim is not that there should be two ways to store a SigMF recording but that there inevitably will be and we can do better by standardizing it than not.

@smunaut
Contributor

smunaut commented Mar 22, 2017

Well

(1) I wouldn't mandate HTML in the first place.
(2) ZIP isn't all that simple. (I wrote a Python ZIP stream generator, so I know.) Support for things like UTF-8 filenames and, more importantly here, ZIP64 (for > 4 GB files) is inconsistent and buggy across OSes and libraries. For instance, the OS X CLI utility can uncompress and generate ZIP64 just fine, but the UI (through Finder) can't.

If you want ZIP (or really any other format) as the reference, you'll need to explicitly reference what you consider to be the canonical spec for that format, including which extensions it must support (because all those formats can have extensions of their own, revisions, etc.).

@bhilburn
Contributor

I think this is worth putting in the spec - it's just a matter of selecting the compression format.

If ZIP isn't a great option, what about the other common formats: gzip, bzip2, and xz? These are all supported on Windows through 7-Zip, and are common on *nix systems. Thoughts?

@kpreid
Contributor Author

kpreid commented Apr 11, 2017

@bhilburn We need an archive format, not a compression format. Some things, like zip, are both, but gzip, bzip2, and xz are pure compression formats, not archive formats (they "contain only one file"); they have to be combined with e.g. tar (which is only archive and no compression).

(Certainly we could consider the use of tar for this purpose, but I have no knowledge about its suitability.)
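
To illustrate the archive/compression split, here is a small sketch using Python's standard `tarfile` module. It shows only the mechanics and says nothing about whether tar is a suitable choice; the member names and `pack_tar` are placeholders.

```python
import io
import tarfile


def pack_tar(out_path, meta_bytes, data_bytes):
    """Write an uncompressed tar archive holding exactly two members."""
    # Mode "w" is plain tar (archive only). Modes "w:gz", "w:bz2" and "w:xz"
    # would layer a separate compression format on top of the same archive.
    with tarfile.open(out_path, "w") as tf:
        for name, payload in (("the.meta", meta_bytes), ("the.data", data_bytes)):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
```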

@bhilburn
Contributor

Fair point, @kpreid. I was definitely treating them as one-and-the-same in my earlier comment.

The real question, then, is whether tar is suitable. I think that it is, actually, and I can't imagine any platform that doesn't have access to tar utilities.

Input from anyone regarding the specification of tar as the archive format?

@bhilburn
Contributor

bhilburn commented May 3, 2017

Okay, #44 is up! Please review and comment.

@bhilburn
Contributor

#44 has been merged!

bhilburn added a commit that referenced this issue Aug 15, 2017
Reference implementation of archive format from issue #15