Schema vs Model Distinction #70

bhilburn · 2017-09-20T15:21:54Z

One of the comments we got at GRCon about SigMF is that it seemed to make working with datasets difficult.

Specifically, what this person wanted to be able to do was SELECT something in a database, parametrically, based on the metadata, and then have it return a chunk of samples. The obvious solution is chunking the SigMF data file by capture segment and then storing those chunks with the segments as keys - but this no longer represents a compliant recording per the standard. Possible? Yes. But not standard.

Is this something we should address? I agree that it is a useful structure and I think a lot of users will want to use something like it. Even if we don't want to make this a compliance requirement, are there things we can do in the standard to make it easier to accomplish?

kpreid · 2017-09-20T16:59:56Z

I claim that as a general principle of software engineering, one should not call an application noncompliant just because:

it internally, as an implementation detail, stores data in a format different than the standard, or
is capable of returning the data in a nonstandard but useful format.

Rather, compliance of an application should depend on conditions such as:

it can read/import/intake files in the standard format (if the application reads such files);
it can write/export files in the standard format (if the application writes such files);
it does not produce nonstandard-format files it claims are standard; and
there are no compliant files which it cannot read, other than due to size limits.

djanderson · 2017-09-20T17:18:13Z

@bhilburn, did you get any more insight into what specifically makes it difficult with databases? I think the fact that we split metadata from data, break data into capture segments, and provide unique keys in the form of sample_start to find those capture segments makes it pretty straightforward to load into a database. For the record, I'm storing SigMF data in a relational db, though I don't give each capture its own row. Though I do store in a db for more efficient searching/filtering/seeking into data, I wouldn't want the actual SigMF format to be anything other than a flat file.

I'm honestly not sure what we could do to make SigMF easier to drop into a database, and as @kpreid said, there's nothing about the spec that stops or even discourages them from creating an application that does so.
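
As a rough illustration of the point about sample_start acting as a key, here is a minimal sketch using Python's built-in sqlite3 module. It is not part of the SigMF tooling: the table layout, the `index_captures` and `find_segments` helpers, and the example metadata are all assumptions made for this sketch, and the `core:` key prefixes follow the namespaced form used in the spec.

```python
import sqlite3

def index_captures(conn, recording_id, metadata, total_samples):
    """Store one row per capture segment, keyed by (recording, sample_start)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS captures ("
        " recording TEXT, sample_start INTEGER, sample_count INTEGER,"
        " frequency REAL, PRIMARY KEY (recording, sample_start))"
    )
    captures = sorted(metadata["captures"], key=lambda c: c["core:sample_start"])
    for i, cap in enumerate(captures):
        start = cap["core:sample_start"]
        # A segment runs until the next one starts, or to the end of the data.
        if i + 1 < len(captures):
            end = captures[i + 1]["core:sample_start"]
        else:
            end = total_samples
        conn.execute(
            "INSERT INTO captures VALUES (?, ?, ?, ?)",
            (recording_id, start, end - start, cap.get("core:frequency")),
        )
    conn.commit()

def find_segments(conn, min_freq, max_freq):
    """Return (recording, sample_start, sample_count) for matching segments."""
    rows = conn.execute(
        "SELECT recording, sample_start, sample_count FROM captures"
        " WHERE frequency BETWEEN ? AND ? ORDER BY recording, sample_start",
        (min_freq, max_freq),
    )
    return rows.fetchall()

# Hypothetical metadata with two capture segments.
meta = {
    "global": {"core:datatype": "cf32_le"},
    "captures": [
        {"core:sample_start": 0, "core:frequency": 100e6},
        {"core:sample_start": 1000, "core:frequency": 915e6},
    ],
}
conn = sqlite3.connect(":memory:")
index_captures(conn, "rec1", meta, total_samples=3000)
segments = find_segments(conn, 900e6, 920e6)
```

The query returns only a start index and a sample count, so an application could seek into the flat data file for that span rather than loading the whole dataset.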

bhilburn · 2017-09-20T17:34:00Z

The biggest proponent for this, actually, was @namccart. He was explaining that one of the reasons that he really likes VITA49 for this particular application is that it provides pre/easily 'chunkable' data.

So, based on my understanding from @namccart, for example, if you load a SigMF recording into a database and search over sample_start as a key, once you had identified the one you wanted you would then still have to load the entire dataset to index to the key. As you said, @djanderson, "[you] don't give each capture its own row", which I think is what Nick's issue is?

Nick, can you comment?

mbr0wn · 2017-10-03T19:44:01Z

I'm also not convinced this is really a SigMF problem. I see how it makes writing SQL <-> SigMF converters a bit more complicated, but they also solve really, really different problems.

namccart · 2017-10-04T01:07:47Z

I think Ben captured my issue pretty well. Given what the DARPA folks want to do with SigMF, I think you really want to consider how SigMF plays nicely with the overall problem of data retrieval from big RF data archives. Hearing Tom talk about his gnuradio-SQL idea (which I also want and think is inevitable), everything starts with being able to retrieve arbitrary I/Q based on reasonable query strategies. I'm sure there are better solutions to this problem than I can imagine. I think HDF5 has an entirely different solution to searching the archive than trying to chunk the data into a database... what I don't know is whether HDF5 plays nicely with HDFS or other distributed setups. I know very little about HDF5 except that it's intriguing. In any event, if you accept that this is in fact a problem SigMF should address, I think it makes sense to move in a direction that supports one or more existing ways of archiving and searching lots of data (time-series data or otherwise). For me, those candidates are databases (CouchDB, MongoDB, PostgreSQL), database-like things (Elasticsearch), and HDF5... Cheers, Nick M.

bhilburn · 2017-10-09T22:29:46Z

Okay, so, SigMF already provides a solution to this, but we should discuss whether there are changes that would improve it:

So, what @namccart cares about, per my comment above, is the ability to load smaller "chunks" of data than the entire dataset, which makes it much easier to work with databases. SigMF allows for this using the offset field of the core namespace, which allows you to break datasets up into multiple files that represent a continuous recording. You could, for example, break a dataset into five .sigmf-data files that have five matching .sigmf-meta metadata files, with offsets that connect each one to the one that precedes it.

So, the question here, then, is "What, if anything, could we do to make this better?" Is there some change we should recommend? If we just provide a tool that cleanly splits your dataset into multiple files, of a parameterizable size, does that solve the issue?
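
To make the splitting idea concrete, here is a minimal sketch of the metadata half of such a tool. It is an illustration under stated assumptions, not an existing SigMF utility: `split_metadata` is a hypothetical helper, the offset is written as the global `core:offset` field, the first capture segment is assumed to start at sample 0, and annotations are simply dropped. Per-segment fields such as timestamps are copied unchanged, which a real tool would need to correct for segments that are restarted at a chunk boundary.

```python
def split_metadata(metadata, total_samples, chunk_samples):
    """Return (first_sample, sample_count, chunk_metadata) for each chunk."""
    base = metadata["global"].get("core:offset", 0)
    captures = sorted(metadata["captures"], key=lambda c: c["core:sample_start"])
    chunks = []
    for first in range(0, total_samples, chunk_samples):
        last = min(first + chunk_samples, total_samples)
        chunk_caps = []
        for cap in captures:
            start = cap["core:sample_start"]
            if start <= first:
                # Segment already in effect at the chunk boundary:
                # restart it at sample 0 of this chunk.
                chunk_caps = [{**cap, "core:sample_start": 0}]
            elif start < last:
                # Segment begins inside this chunk: re-base its start index.
                chunk_caps.append({**cap, "core:sample_start": start - first})
        chunk_meta = {
            "global": {**metadata["global"], "core:offset": base + first},
            "captures": chunk_caps,
            "annotations": [],  # annotation splitting is omitted in this sketch
        }
        chunks.append((first, last - first, chunk_meta))
    return chunks

# Hypothetical recording: 3000 samples, retuned at sample 1500.
meta = {
    "global": {"core:datatype": "cf32_le", "core:sample_rate": 1e6},
    "captures": [
        {"core:sample_start": 0, "core:frequency": 100e6},
        {"core:sample_start": 1500, "core:frequency": 200e6},
    ],
}
chunks = split_metadata(meta, total_samples=3000, chunk_samples=1000)
```

Each tuple also carries the first sample and sample count of the chunk, which is what a companion step would use to cut the matching span out of the .sigmf-data file.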

bhilburn · 2017-10-26T19:15:37Z

Spun on this a bit. @kpreid had a really good point, early on, that we shouldn't call something non-compliant because of anything it does "internally" or "locally". It's really all about the ingress and egress.

Per my previous comment, what @namccart wants to do is already pretty doable with SigMF. We could make it easier by providing a tool, for example, that showed you how to chunk the data based on metadata segment, but there really isn't anything difficult, here, in my opinion.

So, I think the final question that should be debated is whether or not this is a format that we want to be able to distribute SigMF Recordings in. Right now, a compliant recording cannot be distributed where the binary data has been chunked into a bunch of files and one metadata file references all of them. We specifically decided against allowing the 1-to-many case in #19. It was in the context of multiple streams of data, but the reasoning still applies here, I think.

So, before we close this issue out with either a "do nothing" or a "make an example chunker program" decision, does anyone think we should revisit 1-to-many given this use case? We do now have an archive format described, which we didn't at the time of #19, so it would (presumably) be easier to distribute multiple files in a recording.

dharasty · 2017-11-13T20:15:40Z

I'm new here, so if these comments are missing the point, I apologize.

One feeling I had as I read the spec (as an experienced spec reader and writer) is that the current draft spec conflates semantic content of the metadata with the transfer encoding/format of the data.

In plain English: it seems to me the definition of "what are the allowed tags and values in SigMF metadata" can (and should) be separate from HOW the tag value pairs are encoded.

I'm all for SigMF metadata including "datatype" and "sample_rate" and "version", and so forth; consider this the "schema" of "SigMF metadata". But I think the spec would be strengthened by separating out the fact that "it must be a JSON file".

I feel the SigMF spec SHOULD say: when SigMF metadata is written to a file, it then must be a JSON structured UTF-8 file, with a single object per file, and use the following extension.

If ALSO a standard way for "writing a SigMF object to a SQL database" is needed... then that should be specified as an alternate way to store SigMF metadata (and maybe the dataset, too).

Should one write the JSON version of the metadata as text-blob to a single VARCHAR field? Or should each field of the metadata get its own SQL field? Personally, I don't care; I find both of these reasonable in certain cases. Should the SigMF spec weigh-in on the "correct"/standard way to do this? Only if the community thinks it is helpful.

And then what if I want to store SigMF data -- both metadata and the dataset -- in a document database such as MongoDB? Do we need to define a "standard" -- that is "compliant" -- way to do that?

MY MAIN POINT is that the verbiage of the spec conflates what SigMF metadata is with the requirement that "SigMF metadata is a JSON object with this format", and I think that leads to the ambiguity that is being discussed in this thread.

My advice: separate sections for the semantic part of "what is SigMF metadata", and then requirements for how they should be serialized into a file (JSON), and -- if desired by the community -- recommendations for "best practices" when stored in relational records, or a document database, or -- as needed -- in other portable files/containers/transfer mechanisms.
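
One way to picture the separation being suggested here is a small sketch in which the semantic model is defined once and each encoding is a separate function. Everything in it is hypothetical and for illustration only: the `GlobalInfo` class covers just the three fields named above, and the `recordings` table keeps both per-field columns and the JSON text, so both of the SQL options mentioned earlier appear side by side.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass

@dataclass
class GlobalInfo:
    """Semantic model: what the metadata means, independent of any encoding."""
    datatype: str
    sample_rate: float
    version: str

def to_json(info):
    """File encoding: a JSON object with namespaced keys."""
    fields = {f"core:{key}": value for key, value in asdict(info).items()}
    return json.dumps({"global": fields}, sort_keys=True)

def to_sql(conn, name, info):
    """Relational encoding: one column per field, plus the JSON text as a blob."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recordings ("
        " name TEXT PRIMARY KEY, datatype TEXT, sample_rate REAL,"
        " version TEXT, raw_json TEXT)"
    )
    conn.execute(
        "INSERT INTO recordings VALUES (?, ?, ?, ?, ?)",
        (name, info.datatype, info.sample_rate, info.version, to_json(info)),
    )
    conn.commit()

info = GlobalInfo(datatype="cf32_le", sample_rate=1e6, version="1.0.0")
conn = sqlite3.connect(":memory:")
to_sql(conn, "rec1", info)
row = conn.execute(
    "SELECT datatype, sample_rate, raw_json FROM recordings WHERE name = ?",
    ("rec1",),
).fetchone()
```

The model class never mentions JSON or SQL, so adding another encoding (a document database, for example) would mean adding another function rather than changing the definition of the metadata.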

kpreid · 2017-11-14T03:25:48Z

I agree that distinguishing consistently between schema/model and encoding would be useful, but I think that "separate sections" is a bad idea unless those sections are interleaved: the value of making the distinction clear is less than the value of making it obvious how to implement SigMF's intended primary use — an interchange file format.

dharasty · 2017-11-14T17:11:33Z

@kpreid: It is a pretty common technique in standards documents to separate the schema from encodings. In fact, in many standards documents, ALL the encodings show up as examples/supplementary information in appendices.

For SigMF to really catch on, I think it needs to address its motivating use case of FILE interchange, but ought to give SOME consideration to logical next steps, such as storing both the dataset and metadata in either relational or document databases. (After all, a filesystem and a tarfile are simply ONE instance of a "document database" or "document datastore".)

Actual file storage might be many users' primary use case... but for me, it probably won't be. Minor adjustments to the contents and the format of the spec might ensure my use case is well covered, too. This will be a boon to the spec if we can achieve it without impeding the file use case... and I feel we can.

All that said, I have no trouble if we inline/interleave JSON-file examples in the text, provided 1) there is a clear editorial distinction between "schema requirements" and "JSON-file encoding requirements", and 2) there is some other place in the document that addresses the needs of other encodings (possibly appendices).

bhilburn · 2018-07-17T19:56:13Z

So, it's taken me far too long to address this.

@dharasty - I think you make really excellent points, and I appreciate you providing your insight, here. I would like to make the change you suggest (i.e., distinguishing between schema and model) as part of the v0.0.2 stuff I'm hacking on, now.

I'm interested to know your thoughts on the best way to go about doing this. Is there any chance you would be up for putting together a PR that demonstrates an approach you think works well?

bhilburn · 2019-07-12T15:56:39Z

Some minor changes that clearly distinguish between the schema and file encoding will be made in the v0.0.2 release per the discussion above.

jacobagilbert · 2021-05-27T02:31:57Z

I feel like this is an important conversation, but it should probably be pushed to v1.1+ so as not to delay the timely release of v1.0.0.

gmabey · 2021-06-14T21:08:35Z

@bhilburn do you agree with @jacobagilbert 's comment? I do.

gmabey · 2021-07-09T23:38:49Z

@bhilburn ping

bhilburn · 2021-07-28T18:45:11Z

I actually think the fundamental change we are talking about here is super simple and pretty light-touch. I'll get a PR together that does it once we've got the major churn done (merging #135 and #140)

gmabey · 2021-08-04T21:57:59Z

@bhilburn It is pretty exciting to think that this is the only issue still languishing in the "Not Started" bucket for the 1.0 release ... I wait with bated breath for progress :-D

777arc · 2024-12-26T05:03:11Z

When it comes to storing a lot of SigMF metadata within a database in a standard way, for the sake of software reuse within applications built on top of SigMF, what if we did something along the lines of what the STAC standard did? STAC is a JSON-based standard for storing the geospatial metadata that goes along with imagery, similar in many ways to SigMF. For interacting with STAC metadata via a database, they created a postgres schema for their standard. A postgres schema is a set of tables, views (premade queries), data types, and functions. So for example it would include the SQL to set up a table for SigMF objects, which would include mapping JSON datatypes to postgres datatypes just like @dharasty was talking about. It could also include functions for importing and exporting SigMF metadata JSON to/from the database. There could be views for common queries like seeing the breakdown of annotation labels. STAC's version is called pgstac and it's used within tons of other open source STAC-oriented libraries and applications, including ones that have scaled to 100M's of rows, so we can learn a lot from what they did.

What I'm proposing here wouldn't require any changes to the core SigMF specs, it would essentially be a separate SigMF subproject, e.g. pgsigmf, taking the form of a postgres schema (tables, views, functions) and mostly written in SQL. It would probably only be of interest to folks who want to store a ton of metadata and have performant querying and maintenance utils already worked out. As much as I like mongo, the flexibility it brings isn't great when you're talking about creating a standard and specific way to do something (store/query/import/export SigMF metadata). As far as SigMF allowing additional/arbitrary fields, postgres has an arbitrary JSON datatype that can be used for that.

Regarding the data side, raw IQ should never be stored in a database imo; there are plenty of ways to serve binary files in a way where you can grab partial chunks, e.g. HTTP range requests, FTP read flag, or S3/Azure blob API. Security can be dealt with using tokens. But related to the postgres idea: there could be a function built into the schema that returns a string with the HTTP range request corresponding to grabbing a specific sample index and count (or annotation) from a specific recording. It would do the math based on the datatype and fill in the URL of the data file, assuming you have the data stored on HTTP/FTP/cloud. Our traceability extension can be used as the standard way to include the file location of the data and meta file within the JSON and thus database.
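
The "math based on the datatype" is small enough to sketch. The version below is in Python rather than as a function inside a postgres schema, and it is only an illustration: `bytes_per_sample` and `range_header` are hypothetical names, the datatype pattern is a loose approximation of strings such as cf32_le, ri16_le, or cu8, and it assumes sample indices are relative to the start of the data file, with no header bytes before the first sample.

```python
import re

def bytes_per_sample(datatype):
    """Size in bytes of one sample for a datatype string such as 'cf32_le'."""
    match = re.fullmatch(r"([cr])([fiu])(8|16|32|64)(_le|_be)?", datatype)
    if match is None:
        raise ValueError(f"unrecognized datatype: {datatype!r}")
    # Complex samples carry two components (I and Q); real samples carry one.
    components = 2 if match.group(1) == "c" else 1
    return components * int(match.group(3)) // 8

def range_header(datatype, sample_index, sample_count, num_channels=1):
    """Build the HTTP Range header covering the requested span of samples."""
    if sample_count < 1:
        raise ValueError("sample_count must be at least 1")
    size = bytes_per_sample(datatype) * num_channels
    first = sample_index * size
    last = first + sample_count * size - 1  # HTTP byte ranges are inclusive
    return {"Range": f"bytes={first}-{last}"}

header = range_header("cf32_le", sample_index=1000, sample_count=500)
```

The returned dictionary can be passed as request headers to any HTTP client. A server that supports range requests would answer with 206 Partial Content and only the requested bytes.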

bhilburn added enhancement suggestion labels Sep 20, 2017

bhilburn self-assigned this Sep 20, 2017

bhilburn changed the title ~~Compatibility with Databases~~ Schema vs Model Distinction Jul 17, 2018

bhilburn added this to the Release v0.0.2 milestone Jul 12, 2019

bhilburn modified the milestones: Future (v2.x), Release v1.0.0 May 26, 2021
