
Impact on the model change #24

Open
lmichel opened this issue Mar 24, 2021 · 33 comments
Labels
question Further information is requested

Comments

@lmichel
Collaborator

lmichel commented Mar 24, 2021

This important issue comes as a continuation of MANGO Annotation Scope.

It continues the discussion whose content is recalled here:

have to be mapped. The rest can (must) be ignored. The mapping
block represents a subset of the model. If the model changes keep
backward compatibility, the 'old' annotations remain consistent
and the interoperability between datasets mapped with different DM
versions is preserved.

Yes -- that's a minor version. These aren't a (large) problem, and
indeed I'm claiming that our system needs to be built in a way that
clients don't even notice minor versions unless they really want to
(which, I think, so far is true for all proposals).

If you are saying that clients must be updated to take advantage of
new model features, you are right, whatever the annotation scheme
is; this is simply because new model class => new role => new processing.

No, that is not my point. My point is what happens in a major
version change. When DM includes Coord and Coord includes Meas and
you now need to change Meas incompatibly (a "major version"), going to
Meas2 with entangled DMs will require new Coord2 and DM2 models,
even if nothing changes in them, simply to update the types of the
references -- which are breaking changes.

With the simple, stand-alone models, you just add a Meas2 annotation,
and Coord and DM remain as they are. In an ideal world, once all
clients are updated, we phase out the legacy Meas annotation. The
reality is of course going to be uglier, but still feasible, in
contrast to having to re-do all DM standards when we need to re-do
Meas.

@lmichel lmichel added the question Further information is requested label Mar 24, 2021
@lmichel
Collaborator Author

lmichel commented Mar 24, 2021

I am very concerned by the question of model changes, but I cannot figure out how a class describing a measure could be upgraded in a way that breaks backward compatibility.

Do you have an example?

@msdemlei
Contributor

msdemlei commented Mar 25, 2021 via email

@lmichel
Collaborator Author

lmichel commented Mar 26, 2021

I cannot imagine an int -> real cast breaking anything. All clients are able to deal with this.
I remember an old discussion about data typing in models, and sometimes I regret the absence of a number type in VODML.

I tried to figure out different situations where model changes would be lethal.

  1. Inappropriate downcasting (e.g. string -> numeric)
  2. Splitting classes (a stupid example: using one object for RA and another for DEC instead of one for [RA,DEC])
  3. Merging classes (Coords attributes moved to the Measure class)
  4. Removing things (no more link between CoordSys and CoordFrame)
  5. Renaming things (long -> longitude)

In real life, none of these is a real threat. The analogy with caproles must be handled carefully, because our models describe physical quantities observed in the sky, while caproles concern computer protocols, which are completely abstract and flexible things.

Considering the worst case, all of these would break interoperability, which is a more serious issue than managing encompassing-model versioning:

  • Data providers have to revise their annotation procedures
  • Client code has to deal with the new version and, even worse, manage the cohabitation of both versions.

I would even say that having an encompassing model is safer, since the upgrade process guarantees that all components can still work together, both structurally and semantically.
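To make case 5 of the list above concrete, here is a tiny sketch of how a rename breaks an annotation consumer keyed on role strings (the role strings and column names are invented):

```python
# Sketch: a client keyed on dmrole strings breaks on a rename (case 5).
ANNOTATION_V1 = {"coords:LonLatPoint.long": "col_ra"}        # invented roles
ANNOTATION_V2 = {"coords:LonLatPoint.longitude": "col_ra"}   # after the rename

def find_longitude_column(annotation):
    # Written against V1; raises KeyError on a V2-annotated table.
    return annotation["coords:LonLatPoint.long"]

print(find_longitude_column(ANNOTATION_V1))   # 'col_ra'
# find_longitude_column(ANNOTATION_V2)        # KeyError: the rename is breaking
```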

@msdemlei
Contributor

msdemlei commented Mar 26, 2021 via email

@Bonnarel
Contributor

My 2 cents on this. What kind of situation can we imagine for a model change? Where does it have an impact when we are considering data transport using VOTable?

Why is one (or several) of the models changing? I can imagine two reasons:
- The model doesn't allow an optimal interpretation of the existing transported data. A change in one (or several) of the models will allow data providers to annotate their data better and clients to handle them better. The existing data tables don't change; only the "separated-from-data" annotation changes. I think it's probably possible to have two different annotation structures, so that clients which didn't make the evolution keep working as before.
- Data providers make a major change in their data release which happens to be difficult to work out with the current data model versions, and which is anyway an intrinsic issue for clients. In that case, a change in the models and in the annotation will help the client software evolve.
- The key thing is the independence of the annotation from the data.

@lmichel
Collaborator Author

lmichel commented Mar 31, 2021

going from STC1 to STC2

Good example. I guess you agree that the major concern about moving from STC1 to MCT is not the update of the host models (which do not exist anyway). This is my point.
Let's imagine that I have, e.g., a model named lm_dm_1 that embeds stc1.13, Char_1 and dataset_1.5.
Now stc1.13 has been moved to stc2:
As a consequence, I have to update my model to lm_dm_2 (stc2, Char_1, dataset_1.5).
My points are

  1. upgrading lm_dm is not a big deal
  2. using lm_dm_2 might be safer than using individual (stc2, Char_1, dataset_1.5) instances, since this guarantees that all three models are compatible with each other (e.g. no vocabulary mismatches).
  3. last but not least: lm_dm (1 or 2) keeps giving a complete description of the modeled objects (e.g. Cubes) outside the scope of any particular data container. I know you deny this requirement, but I insist that this is the key to interoperability.

@msdemlei
Contributor

msdemlei commented Apr 1, 2021 via email

@msdemlei
Contributor

msdemlei commented Apr 1, 2021 via email

@mcdittmar
Collaborator

On this topic, I'm finding myself agreeing more with Markus. (ack!)
To a certain level anyway, as I agree with Laurent's assertion 3 in this comment

Models are going to change, I think that's pretty much a given.

  • as we fold in new products (Mango, TimeSeries, SED)
  • as we broaden usage domains, e.g. Radio or Planetary need a different Target type than we have spec'd in Dataset
  • or because we just forgot something: "Spectrum doesn't support Echelle Spectra".

VODML compliant models import specific versions of their dependencies

  • making new features available under the parent models will require an update to the model (VODML/XML).

This is true for both major and minor version changes.

What is the impact?

  • At the rate we are cranking out models in the VO, I don't think this is going to be a burden
  • At the rate we SHOULD be cranking out models in the VO, this could be annoying. This annoyance may be mostly on our side, updating the VODML/XML for the model hierarchy just to change the dependency.
  • Operationally a model namespace is constant for minor version changes, so users should only be impacted when a major version change is made.
    • Impact on clients/providers should be marginal. IF they want to produce/use the newer version content, they make the appropriate changes (if any, other than recognizing the new namespace) and are done. This is needed even to handle new content in minor version changes to the models.
    • Most changes would be contained to the parser/writer level, not to the classes/objects the implementations convert them to, e.g.: the parser package I'm using (rama) will produce an AstroPy SkyCoord for any version of meas:Point, so no change to the application using it is required (see the sketch after this list).
    • If they are integrating a major version change, then they should expect to be updating their implementation.
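As an illustration of that parser-level containment, here is a minimal sketch of version-tolerant dispatch, assuming dict-like instances and invented dmtype names (rama's real API may well differ):

```python
# Sketch: one handler registered per dmtype keeps version changes at
# the reader level; the application only ever sees SkyCoord objects.
from astropy.coordinates import SkyCoord

def point_to_skycoord(instance):
    """Convert any major version of a meas:Point-like instance to SkyCoord."""
    lon = instance["lon"]   # assumes the attribute roles stay stable
    lat = instance["lat"]
    return SkyCoord(lon, lat, unit="deg", frame="icrs")

HANDLERS = {
    "meas:Point": point_to_skycoord,
    "meas2:Point": point_to_skycoord,   # new major version, same handler
}

def parse(dmtype, instance):
    return HANDLERS[dmtype](instance)

print(parse("meas2:Point", {"lon": 10.0, "lat": 20.0}))
```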

I don't think decoupling the models makes this go away:

  • If a client reads Cube/Mango instances, and there is a change to the Meas model, and they want to use/produce the new Meas content, is it any easier to test/verify
    1. Mango-vA with (Meas-vA or Meas-vB) content instead of
    2. (Mango-vA with Meas-vA) or (Mango-vB with Meas-vB)

Where it DOES have a big impact is on the annotation. This is probably a good case to mock up (since we don't have multiple versions of any models) and annotate.
With the current model/annotation relations, I believe the client would need to:

  • Annotate 2 versions of the parent model objects (Source/SparseCube) to accommodate the 2 flavors of Measurement leaves.
    • looking at Standard_Properties and TimeSeries examples (Mapping syntax and ModelInstanceInVot syntax)
      • ~100:150 lines of annotation for Source instance with 5 Properties; includes 13:25 per Measure
      • ~85:140 lines of annotation for SparseCube with tiny Dataset + 3 Observables; includes 7:10 per Measure (no errors)

If decoupled

  • Can annotate 1 version of parent model, and EITHER version of Measure at the leaves.
    • adding no additional annotation lines
  • Or in some way allow annotation of BOTH versions of Measure at leaves.
    • adding 7:25 additional lines of annotation per Measure that has multiple versions.

If we are considering the case I'm currently working on, Master(Source) with associated Detections(Source) and associated LightCurve(SparseCube), this could add up to serious real estate.

Up to now, I've considered all this "the cost of doing business", and am comfortable with that position.
But, after seeing the ModelInstance in Mango, maybe this needs more serious consideration. I had an idea this morning, inspired by this, which may be a good compromise. It could allow looser coupling of models, but still have definitive links between them for verification/interoperability (i.e. no ivoa:anyType stuff). Once I've thought that through a bit, I'll post it in a new thread.

@lmichel
Collaborator Author

lmichel commented Apr 7, 2021

[@msdemlei] And since you're mentioning vocabularies, given we're in RFC for that
I'd be particularly interested in your concerns about their
interaction with DMs and their mutual compatibility.

Message understood....
... but for the immediate time, the discussion is rather focused on the DM concept itself.

Perhaps this is a point we'll have to discuss interactively.

Sure. I'm unable to connect the existence of Cube with the disaster that you announce.

@lmichel
Collaborator Author

lmichel commented Apr 7, 2021

I'm in line with @mcdittmar's summary.

I would just remind you that we are talking about modeling physical entities.
The comparison with what happened with protocols (e.g. caproles) must be considered with a lot of care.

My expectation is that the introduction of new players (e.g. radio) won't break existing stuff but introduce new patterns.

  • e.g. Radio dish FoVs seem a bit more complex than simple cones.
  • Description of complex object shapes (clouds)
  • Description of multi-object systems

I'm pretty sure that changes to model components that break backward compatibility (I have no example to give) won't be endorsed by data providers or client developers either.

Let's imagine that it happens anyway:

  • Data providers will have to annotate with both versions until all clients support them (which will likely never occur)
  • Clients will have to support both versions as soon as one data provider uses the new version (which will likely occur).

This bad situation would take place with the @msdemlei scheme, the @mcdittmar one, or mine alike.
As I said some posts ago, generating e.g. a new CUBE VODML/XML won't be the major difficulty in sorting this case out.

@mcdittmar
Collaborator

I'm in line with @mcdittmar's summary.
Yay! I always like hearing that!

My expectation is that the introduction of new players (e.g. radio) won't break existing stuff but introduce new patterns.

  • e.g. Radio dish FoVs seem a bit more complex that simple cones.

I think the most likely breaks will come from us having concrete objects defined in a model which we later find need to be abstract in order to support branching by different domains. It is the main reason I have elements like the abstract Uncertainty type in the Measurements model: to help guard against major version changes (a sketch of this guard follows below). I grudgingly removed the abstract Point from Coords in the last iteration, and with this Mango work we're finding an interest in restoring the space-centric LonLatPoint/SphericalPoint. This would be a major version update in Coords.
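A minimal sketch of that guard in Python terms (the concrete subclass names are only illustrative of the Measurements layout):

```python
# Sketch of the "abstract guard": extending via a new concrete subclass,
# instead of changing an existing concrete class, avoids a major version bump.
from abc import ABC

class Uncertainty(ABC):          # abstract, as in the Measurements model
    pass

class Symmetrical(Uncertainty):  # existing concrete type, left untouched
    def __init__(self, radius):
        self.radius = radius

class Asymmetrical2D(Uncertainty):   # later addition: only a minor change,
    def __init__(self, plus, minus): # because Uncertainty was abstract
        self.plus, self.minus = plus, minus

print(isinstance(Asymmetrical2D(0.1, 0.2), Uncertainty))   # True
```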

  • Data providers will have to annotate with both versions until all clients support them (will likely never occur)
  • Clients will have to support both versions as soon as one data provider uses the new version (will likely occur).
    I was thinking about this, and it seems more likely (though I have no evidence) that clients would prefer to output V1 OR V2 at the user's request, rather than annotating both in the same output.

@lmichel
Collaborator Author

lmichel commented Apr 7, 2021

I agree that the condition for the risk (as pointed out by @msdemlei) of breaking models with new features to be very low is that models have abstract classes, i.e. things that can be extended without altering existing stuff.

MANGO demonstrated (too well, apparently) the ability to extend MCT without breaking anything.

@msdemlei
Contributor

msdemlei commented Apr 8, 2021 via email

@Bonnarel
Contributor

Bonnarel commented Apr 8, 2021

On Wed, Mar 31, 2021 at 02:05:18AM -0700, Bonnarel wrote: My 2 cents on this. What kind of situation can we imagine for a model change? Where does it have an impact when we are considering data transport using VOTable? Why is one (or several) of the models changing? I can imagine two reasons:
Well, the most common reason is: we simply did it wrong. As someone who did it wrong several times already, I think I'm entitled to say that. To mention my worst goof: using ParamHTTP interfaces in TAPRegExt, which blew up badly more than half a decade later ("caproles"). It happened because I hadn't understood the full problem, didn't see the long-term consequences, and generally shunned the work of defining an extra type for TAP-like interfaces. You could say other people are less lazy, think deeper, and see farther, and they probably do. But who knows, perhaps one day I'll make a DM, and then it'd be reassuring to know that if I get it as wrong as the interfaces in TAPRegExt, the VO can shed my mistake without taking the whole DM system with it.

Well, I think "doing it wrong" is close to "not allowing an optimal interpretation". Of course this can always happen to anybody. This doesn't imply we have to let the client manage the relationships between our bits and pieces alone.

  • The key thing is the independence of the annotation and of the data.
    Could you explain this a bit more? Do you mean lexical independence (e.g., the annotation sits in an element of its own rather than, say, in FIELD's utype attributes)? Or semantic independence (in which case you'd have to explain how that should work)? Or yet something else? To me, I'd say the annotation depends strongly, and ideally "injectively", on the data's structure (i.e., different structures will have different annotations) and not at all on the data (which rules out follies like having pieces of photometry metadata in column values). Conversely, data and data structure do not depend at all on the annotation (which is less obvious than it may sound, but it in particular means that you can attach as many different annotations to data structures as you like).

I clearly meant lexical independence. We can imagine two strategies: either you really map your data structure onto your model (and this requires a new schema each time you change the model -- the thing we did with distinct XML schemas for each DM 15 years ago), or you add an (evolving) mapping layer on top of more stable (VO)Tables.

@Bonnarel
Contributor

Bonnarel commented Apr 8, 2021

On Wed, Apr 07, 2021 at 05:59:07AM -0700, Mark Cresitello-Dittmar wrote: I was thinking about this, and it seems more likely (no evidence) that the clients would prefer to output as V1 OR V2 at the user's request, rather than annotating to both in the same output.
whereas with the big God model it's virtually certain the new client will not recognise anything ("what's this ivoa-timeseries1:Root thing again?"). [...] long after they were written, whereas they'll be entirely broken on the first major change with the God model. In the VO, you just can't "roll out version 2" -- you'll always have a wild mixture of modern and legacy services and modern and legacy clients, even 20 years from now. That's why it's so useful to limit the damage radius of breaking changes.

But we are not dealing with God models when speaking of TimeSeries or sparse Cubes or a Source model with Parameters. We have real situations which are meaningful across projects and across wavelengths (or even across messengers), and we want to make them interoperate. The way things are organized rapidly becomes complex. Very often we find tables with several different positions, times, magnitudes. Is there a single independent time on which the others depend, like fluxes or whatever? Several (see the ZTF and Beta Lyrae examples in VizieR)? Are all the parameters independent (event list)? All but one (e.g. flux in a regularly sampled cube)? Can we use the relationships between these parameters or axes to transform data from one data type to another? Providers may want to help users and clients do such things to compare or combine data. I imagine that with separate Cube-with-one-independent-axis-only and Coordinates annotations it will rapidly become a mess for the client to find its way.

@msdemlei
Contributor

msdemlei commented Apr 9, 2021 via email

@Bonnarel
Contributor

Bonnarel commented Apr 21, 2021

On Thu, Apr 08, 2021 at 09:41:16AM -0700, Bonnarel wrote: But we are not dealing with God models when speaking of TimeSeries or sparse Cubes or a Source model with Parameters. We have real
Well, as far as I can work out, the idea is that there is one root node and everything else is then relative to it; this "there's one big class describing the whole of a document" is what I call a God model. My skepticism towards them is not only aesthetic: having them means that if you don't understand this root node, you can't use any annotation, and that a client that knows how to find, say, a value/error in a time series will have to be taught anew how to do it in an object catalogue (plus trouble with versioning, and much more; there are plenty of good reasons why the God object is considered an antipattern). Not to mention, of course, that few programmers will appreciate that you're trying to impose your data structures on them.
Well, there is an old consensus in the IVOA that we are dealing with "datasets" or "dataproducts" and that dataproduct_type makes sense. A top-level model is just a common description of the formal internal relationships between the various parts of these data products, consistent with the definition of the dataproduct type. Data providers should succeed in agreeing on what is required and what is optional there. The constraint on application programmers will not come artificially from data modelers but from data providers' interoperability requirements.

situations which are meaningful across projects and across wavelengths (or even across messengers), and we want to make them interoperate. The way things are organized rapidly becomes complex. Very often we find tables with several different positions, times, magnitudes. Is there a single independent time on which the others depend, like fluxes or whatever? Several (see the ZTF and Beta Lyrae examples in VizieR)? Are all the parameters independent (event list)? All but one (e.g. flux in a regularly sampled cube)? Can we use the relationships between these parameters or axes to transform data from one data type to another? Providers may want to help users and clients do such things to compare or combine data. I imagine that with separate Cube-with-one-independent-axis-only and Coordinates annotations it will rapidly become a mess for the client to find its way.
I like the concrete examples and questions, because with them you can test whether stuff works. And I contend all of the questions are rather straightforwardly answerable by the simple scheme I'm proposing over in https://github.com/msdemlei/astropy. If you disagree, what sort of workflow do you think won't be covered by it?

Well, as far as I understand, this works because the raw data are rather simple. But what would happen with a catalog like this one:
Shenavrin et al., Astronomicheskii Zhurnal, 2011, Vol. 88, No. 1, pp. 34–85, available in VizieR.

Here, obviously, there is one single independent time, and the other parameters, including the other times, depend on it. In addition, there are several instances of TimeSeries in the same catalog (because there are several sources). Why should we discover all the times and then discover in another annotation which one is independent?

In the following catalog, http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=J/ApJ/790/L21&-to=3, all parameters have the same importance. It's an event list. Why should we not know that from the top?
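For illustration, here is a minimal sketch of what declaring the independent axis "from the top" could look like; the element and attribute names are invented, not the actual mapping syntax:

```python
# Hypothetical annotation fragment: the dataproduct declares its
# independent axis up front, so the client needs no per-column guessing.
import xml.etree.ElementTree as ET

ann = ET.fromstring("""
<INSTANCE dmtype="cube:SparseCube">
  <AXIS role="independent" ref="col_obs_time"/>
  <AXIS role="dependent"   ref="col_flux_J"/>
  <AXIS role="dependent"   ref="col_flux_K"/>
</INSTANCE>
""")

independent = [a.get("ref") for a in ann.findall("AXIS")
               if a.get("role") == "independent"]
print(independent)   # ['col_obs_time'] -- known "from the top"
```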

@msdemlei
Contributor

msdemlei commented Apr 22, 2021 via email

@Bonnarel
Contributor

Bonnarel commented Apr 22, 2021 via email

@msdemlei
Contributor

msdemlei commented Apr 23, 2021 via email

@lmichel
Collaborator Author

lmichel commented Apr 23, 2021

Just some thoughts on the impact of model changes.
The README of @msdemlei's client makes the assumption that the VOTable is delivered with distinct annotations for 2 different versions of the same model (Coords). I do not think that this case is the most likely, because the annotation process is a very tough job (ask @gilleslandais) and I doubt that data curators will duplicate their efforts to support multiple variants of the same model.

The most likely situation is a client trying to put together (e.g. xmatch) data sets annotated with different versions of the same model.

  • How to cross-match data annotated with CoordsV3 against data annotated with CoordsV4?
  • How to cross-match data annotated with CubeV3 against data annotated with CubeV4?

This is a critical point that cannot be worked around just by using un-entangled models.
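A minimal sketch of what such cross-version processing forces on the client, assuming hypothetical CoordsV3/CoordsV4 position types (neither exists today): both versions must be normalized to a common representation before any matching can happen.

```python
# Sketch: cross-matching tables annotated with two major versions of
# the same position model, by normalizing both to SkyCoord first.
from astropy.coordinates import SkyCoord
import astropy.units as u

def to_skycoord(dmtype, row):
    if dmtype == "coords:LonLatPoint":        # invented "CoordsV3"-era type
        return SkyCoord(row["lon"], row["lat"], unit="deg")
    if dmtype == "coords2:SphericalPoint":    # invented "CoordsV4"-era type
        return SkyCoord(row["azimuth"], row["elevation"], unit="deg")
    raise ValueError(f"unknown position type: {dmtype}")

pos_a = to_skycoord("coords:LonLatPoint", {"lon": 10.0, "lat": 20.0})
pos_b = to_skycoord("coords2:SphericalPoint",
                    {"azimuth": 10.001, "elevation": 20.001})
sep = pos_a.separation(pos_b)
print(sep < 10 * u.arcsec)   # the version difference disappears only
                             # after this per-version normalization step
```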

@Bonnarel
Contributor

Bonnarel commented Apr 26, 2021 via email

@msdemlei
Contributor

msdemlei commented Apr 26, 2021 via email

@Bonnarel
Contributor

Bonnarel commented Apr 26, 2021 via email

@lmichel
Collaborator Author

lmichel commented Apr 26, 2021

Oh, but the operations you need to re-structure the tables are
relational algebra, and once you start re-structuring tables in your
annotation, you will re-discover all of Codd (1970):

We have just re-discovered use cases:

  • Data mixed in one table: let's FILTER or GROUP them.
  • Data spread over multiple tables: let's JOIN them.

These statements correspond to three specific use cases (the Gaia and ZTF time series, and the combined data).
I assume there were good reasons to design these datasets as they are, and it looks fair to propose a solution making them interoperable. This solution is rather light by the way (a couple of XML elements that do not break the mapping structure).
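For concreteness, the two operations above in astropy terms (the tables and column names are invented for the example):

```python
# The relational operations mentioned above, on the client side.
from astropy.table import Table, join

ts_gaia = Table({"source_id": [1, 2], "mag_g": [12.3, 13.1]})
ts_ztf  = Table({"source_id": [1, 2], "mag_zr": [12.8, 13.5]})

# Data spread over multiple tables: JOIN them on the shared key.
combined = join(ts_gaia, ts_ztf, keys="source_id")

# Data mixed in one table: FILTER (or group) on an annotated column.
bright = combined[combined["mag_g"] < 13.0]
print(len(combined), len(bright))
```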

@msdemlei
Contributor

msdemlei commented Apr 27, 2021 via email

@lmichel
Collaborator Author

lmichel commented Apr 28, 2021

I understand your point of view and I do not underestimate your arguments in favor of using individual model elements.
You ask the right question: what is the advantage of using an integrated model (sorry, but "entangled" is a bit pejorative)?

Let me recap my points, already exposed some time ago:

  1. A model integrating components from other models ensures that all of those components are consistent with each other (vocabulary, roles at least) and that there is no risk of confusion (RFC validation).
  2. Integrated models can describe complex datasets. I would be very sorry to have to tell some data provider: sorry, but your dataset is too complex, I cannot annotate it. Examples:
     • detections attached to sources
     • the same properties but with different frames
     • multi-object data tables
  3. Instances of integrated data models can be shared among different peers, e.g. sending MANGO serializations of individual catalogue rows by SAMP.
  4. Integrating components from other models does not mean dissolving them in the host model. They remain usable as such, since they keep both their integrity and their dmtypes even within an integrated model. Your client strategy can be applied to integrated data models as well (a matter of XQuery). I have somewhere in my code a search_element_by_type(dmtype) method able to retrieve your coords2 class in any arbitrarily nested model (a minimal sketch of such a lookup follows below).
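A minimal sketch of such a lookup, assuming model instances are represented as nested dicts carrying a "dmtype" key (the actual implementation referred to above may differ):

```python
# Sketch of a search_element_by_type-style lookup over nested instances.
def search_element_by_type(instance, dmtype):
    """Depth-first search for the first sub-instance of the given dmtype."""
    if isinstance(instance, dict):
        if instance.get("dmtype") == dmtype:
            return instance
        for value in instance.values():
            found = search_element_by_type(value, dmtype)
            if found is not None:
                return found
    elif isinstance(instance, list):
        for item in instance:
            found = search_element_by_type(item, dmtype)
            if found is not None:
                return found
    return None

# A coords2 point remains reachable even when deeply nested in MANGO:
mango = {"dmtype": "mango:Source",
         "property": {"dmtype": "meas:Position",
                      "coord": {"dmtype": "coords2:SphericalPoint"}}}
print(search_element_by_type(mango, "coords2:SphericalPoint"))
```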

I'm not sure what you mean by cross matching cubev3 and cubev4; what
functionality do you have in mind for that?

Cross-processing (doing anything with both datasets together) would be more appropriate.

@msdemlei
Contributor

msdemlei commented Apr 30, 2021 via email

@lmichel
Collaborator Author

lmichel commented May 3, 2021

This is probably a good pl

sure

interdependencies between 10 models

A little exaggeration?

Our positions are not converging at all; there is no need to run a new discussion loop.

I would just like to repeat what I wrote 2 weeks ago.
Technically, the current annotation scheme works for any model granularity; hence your proposal is actually to ask the VO not to RECommend integrated models. This is what I do not agree with.

@lmichel
Collaborator Author

lmichel commented May 5, 2021

Hence, for the workshop participants' benefit, let me briefly recap a
few of the advantages of having small-ish, isolated DMs:

My answers

(1) Lesson from STC-1...

Not really applicable here.

  • As DM chair I've always said that both Meas and Coords must be adopted together.
  • I've nothing against using them as independent components, as long as the annotation remains faithful to the model structure.
  • I'm against the demonization of integrated models.

(2) Lesson from TAP validation:

The comparison with TAP is unfair. TAP is a complete database infrastructure encompassing all VO fields: nothing to do with a simple measure container like MANGO.

(3) Separate evolvability

That's discussed in some depth elsewhere

(4) Flexibility and Consistency:

I agree with you, our 2 approaches do work... for the simplest cases.
Have you tried to figure out (on paper at least) whether your code design could be applied to a MANGO mapping?

(5) Small, isolated models

As the impact of model changes has already been discussed many times here, I prefer to have a little fun with your Lego metaphor: when I was young, I spent a lot of time playing with Lego bricks. At that time Lego was mostly sold as boxes of bricks, but year after year the company has marketed more and more complex (entangled) objects (Star Wars ships, robots...) with growing success. Just to warn you against this sort of comparison :-)

@msdemlei
Contributor

msdemlei commented May 6, 2021 via email

@lmichel
Collaborator Author

lmichel commented May 7, 2021

I believe most of MANGO isn't so far from

This is true as long as you do not have associated data.
This is also why I keep repeating that our difference on this topic is a matter of XPath.
For the workshop I'll insist especially on complex data. I'm even more motivated for this since yesterday, when we had a long meeting with exoplanet people who are asking for the mapping of highly connected data, and even for JSON serializations.

Well, you got me there on the marketing...

:thumbsup:
