GSIM Structure Group definition / explanatory text update #28

Closed
InKyungChoi opened this issue Oct 13, 2022 · 17 comments

Comments

@InKyungChoi
Collaborator

Please see this Google Doc for the feedback from the Metadata Glossary team. I would like to draw attention to:

  1. Data Resource: I propose to delete
    • "organized" (because there is nothing to show that it is "organized"; compare with the definition of Data Set, "organized collection of data", which makes sense there because a Data Set has a Data Structure)
    • "Data Resources are.." (because they don't have to be "used by stat activities" nor "for production of information". How about just adding examples like "Example: collection of labour survey data from 2020 to 2021"?)

    Object: Data Resource
    Definition: organized collection of stored information made of one or more Data Sets.
    Explanatory Text: Data Resources are collections of data that are used by a statistical activity to produce information. Data Resource is a specialization of an Information Resource.

  2. Dimensional Data Point: see this feedback from metadata glossary team

  3. Dimensional Data Set and Unit Data Set: Need examples

  4. Information Resource: I propose to delete
    • "organized" (for the same reason as 1)
    • "statistical content" (does it have to be "statistical"?)

    Object: Information Set
    Definition: organized collections of statistical content / information
    Explanatory Text: Statistical organizations collect, process, analyze and disseminate Information Sets, which contain data (Data Sets), referential metadata (Referential Metadata Sets), or potentially other types of statistical content, which could be included in additional types of Information Set.

  5. Logical Record: I am very confused by the definition...

    Object: Logical Record
    Definition: Describes a type of Unit Data Record for one Unit Type within a Unit Data Set.
    Explanatory Text: Examples: household, person, or dwelling record.

  6. New definitions and explanatory texts are proposed for all referential metadata-related classes here: Missing GSIM class – quality indicators #6

  7. Unit Data Point: see this metadata glossary team feedback

@FrancineK
Collaborator

Just a remark concerning "organized": I would not delete it altogether. This page (https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210000101&request_locale=en) is an information set which has a data set and a referential metadata set. It is organized.

@InKyungChoi
Collaborator Author

(from Oct 26 meeting notes #30)

Dimensional Data Point and Unit Data Point (see this google doc)

  • Action: @FlavioRizzolo @dgillman4909 to review and update the explanatory text (for Dimensional Data Point, the second sentence, “there may be multiple values…”, is not really needed here; the third sentence, “the different values represent…”, is not necessary either)

@InKyungChoi
Collaborator Author

InKyungChoi commented Oct 27, 2022

Regarding examples of Dimensional Data Set and UnitData Set:

I suggest

  • Example of Unit Data Set**: a collection of Unit Records (1212123, 48, American, United Kingdom), (1212111, 38, Hungarian, United Kingdom), (1212317, 51, Canadian, Mexico) for three people, where each record has the social security number, age, citizenship and country of birth.
  • Example of Dimensional Data Set: a collection of dimensional data (Mexico, 130.3), (United Kingdom, 331.9), (Italy, 59.1), where the first item specifies the name of the country and the second item specifies the population in millions.

** I followed the example of Unit Data Record, which is "For example (1212123, 48, American, United Kingdom) specifies the age (48) in years, the current citizenship (American), and the country of birth (United Kingdom) for a person with social security number 1212123". But actually, a Data Set is an aggregation of Data Points, which are just placeholders (cells), not the actual data in them. Should we then add in the explanatory text that the examples are instantiated versions of a Data Set?
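
The distinction raised in this footnote, a Data Set as a collection of placeholder cells versus the same Data Set once values are filled in, can be sketched roughly as below. This is only an illustration under assumptions: GSIM does not prescribe an implementation, and the names `COLUMNS`, `DataPoint`, `empty_record` and `instantiate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical column names, taken from the wording of the example above.
COLUMNS = ("social_security_number", "age", "citizenship", "country_of_birth")


@dataclass
class DataPoint:
    """A cell: a placeholder that may or may not hold a value (datum)."""
    variable: str
    value: Optional[object] = None


def empty_record() -> list[DataPoint]:
    """One record's worth of placeholders, with no data in them yet."""
    return [DataPoint(variable=name) for name in COLUMNS]


def instantiate(values: tuple) -> list[DataPoint]:
    """Fill the placeholders with the values of one Unit Record."""
    return [DataPoint(variable=name, value=v) for name, v in zip(COLUMNS, values)]


# The three Unit Records from the example, as an instantiated Data Set.
unit_data_set = [
    instantiate((1212123, 48, "American", "United Kingdom")),
    instantiate((1212111, 38, "Hungarian", "United Kingdom")),
    instantiate((1212317, 51, "Canadian", "Mexico")),
]
```

Here `empty_record()` corresponds to the structure alone, while `unit_data_set` is what the proposed examples actually show.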

(@FrancineK, somehow I don't see the examples in the Specification document)

@FlavioRizzolo
Collaborator

A couple of proposals here with slightly updated definitions and new explanatory texts (to be completed):

Object: Data Point
Definition: Container for a single value of an Instance Variable
Explanatory Text: A Data Point is a cell or a placeholder for a value (datum) it may contain (note that a data point could be empty).

Object: Dimensional Data Point
Definition: Container for a single value of an Instance Variable partially identified by a set of dimensions
Explanatory Text: A Dimensional Data Point is uniquely identified by the combination of exactly one value for each of the dimensions (represented as Identifier Components) and exactly one measure (represented as Measure Component) or descriptive attribute (represented as Attribute Component). A Dimensional Data Point could contain a value about a Unit or a Population. The Unit might be de-identified, in which case no link to the Unit itself can be directly established.

Object: Unit Data Point
Definition: Container for a single value of an Instance Variable about a Unit
Explanatory Text: A Unit Data Point is uniquely identified by the combination of exactly one value from each Identifier Component. The Unit might be de-identified, in which case no link to the Unit itself can be directly established.

Now, this note about de-identified information means that the Unit associated with the Data Point is not required, so we need to change the cardinality to 0..1.
The constraint associated with the Dimensional Data Point stating that the relationship to either Unit or Population must exist needs to be deleted, since neither is required in the de-identified scenario. (BTW, that constraint can only be found in the full Structures Group diagram; we need to move those types of constraints into the text, perhaps the explanatory notes.)
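
As a rough sketch of what the 0..1 cardinality would mean in practice, assuming hypothetical class and field names (`Unit`, `DataPoint`, `is_linkable`) that are not part of GSIM:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Unit:
    identifier: str


@dataclass
class DataPoint:
    instance_variable: str
    value: Optional[object] = None  # a Data Point could be empty
    unit: Optional[Unit] = None     # 0..1: absent in the de-identified scenario


def is_linkable(data_point: DataPoint) -> bool:
    """True only when a link back to the Unit can be directly established."""
    return data_point.unit is not None


identified = DataPoint("age", 48, Unit("1212123"))
de_identified = DataPoint("age", 48)  # same value, no Unit attached
```

Both objects are valid under the relaxed cardinality; only `identified` can be traced back to a Unit.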

@FlavioRizzolo
Collaborator

We need to revise the associations between Data Point and Instance Variable:

image

I propose to replace them with just one association called "is described by", since the current ones are redundant at best and wrong at worst. There is no need to qualify the association with "identifier", "attribute" or "measure", since that's taken care of elsewhere: the Instance Variable is associated with some Represented Variable, which in turn plays the role of either an Identifier, Attribute or Measure Component in a particular Data Structure. By having a qualified association directly to the Instance Variable, we bypass the Data Structure and therefore fix a given Instance Variable into a specific role (identifier, attribute or measure). That defeats the purpose of having a separate Data Structure that captures those semantics and provides the flexibility to change them when necessary.
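
The argument can be illustrated with a small sketch in which the role is looked up through the Data Structure rather than stored on the Instance Variable. All class names, the `role_of` helper and the role assignments below are assumptions made for illustration, not GSIM definitions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    IDENTIFIER = "identifier"
    MEASURE = "measure"
    ATTRIBUTE = "attribute"


@dataclass(frozen=True)
class RepresentedVariable:
    name: str


@dataclass(frozen=True)
class InstanceVariable:
    name: str
    represented_variable: RepresentedVariable


@dataclass
class DataStructure:
    name: str
    # The role belongs to the structure, not to the variable itself.
    components: dict[RepresentedVariable, Role] = field(default_factory=dict)

    def role_of(self, instance_variable: InstanceVariable) -> Role:
        return self.components[instance_variable.represented_variable]


age = RepresentedVariable("age")
age_in_survey = InstanceVariable("age_in_survey", age)

# The same Instance Variable plays different roles in different structures.
person_file = DataStructure("person_file", {age: Role.MEASURE})
population_by_age = DataStructure("population_by_age", {age: Role.IDENTIFIER})
```

With a qualified association attached directly to `age_in_survey`, one of these two structures could not be expressed without duplicating the variable.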

@dgillman4909

dgillman4909 commented Nov 15, 2022 via email

@FlavioRizzolo
Collaborator

I think you are right, Dan, "populates" is better.

@dgillman4909

dgillman4909 commented Nov 15, 2022 via email

@FlavioRizzolo
Collaborator

I couldn't agree more, Dan. I've struggled with this since the notion was introduced almost 10 years ago, and couldn't come up with a good definition in all this time.

I think you hinted at what the problem is: this is an artificial distinction. I don't see any reason to have a distinction between Unit and Dimensional Data Point. Why not just remove those subclasses entirely and keep only Data Point? Let's think about that option.

@InKyungChoi
Collaborator Author

(from Nov 16 meeting notes #33)
Here is the updated version
image

@FlavioRizzolo - would you want to update Data Point explanatory text using explanatory texts from Unit Data Point and Dimensional Data Point?

If we indeed remove Unit Data Set and Dimensional Data Set:

  1. I would like to propose the definition and explanatory text below for Data Set
  • definition: organised collection of data (no change)
  • explanatory text: Examples of Data Sets could be observation registers, time series, longitudinal data, survey data, rectangular data sets, event-history data, tables, data tables, cubes, registers, hypercubes, and matrixes. A broader term for Data Set could be data. A narrower term for Data Set could be data element, data record, cell, field. Data Set can be Unit Data Set or Dimensional Data Set (and perhaps use examples here?)
  2. Add a new attribute "Type" with controlled vocabulary (Unit Data Set, Dimensional Data Set)

  3. There are two attributes under Dimensional Data Set (Reporting Begin/End); I don't understand why these attributes are here. If no one disagrees, we can drop these.
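
One way to picture the proposed "Type" attribute with its controlled vocabulary is an enumeration on a single Data Set class. This is a minimal sketch, assuming hypothetical names (`DataSetType`, `DataSet`, `parse_type`); only the two vocabulary labels come from the proposal.

```python
from dataclasses import dataclass
from enum import Enum


class DataSetType(Enum):
    """Controlled vocabulary for the proposed Type attribute."""
    UNIT = "Unit Data Set"
    DIMENSIONAL = "Dimensional Data Set"


@dataclass
class DataSet:
    name: str
    type: DataSetType


def parse_type(label: str) -> DataSetType:
    """Map a label to the vocabulary; raises ValueError for unknown labels."""
    return DataSetType(label)


labour_survey = DataSet("labour survey microdata", parse_type("Unit Data Set"))
```

A single class with a typed attribute replaces the two subclasses, while labels outside the vocabulary are still rejected.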

@FlavioRizzolo
Collaborator

FlavioRizzolo commented Dec 6, 2022

I agree with 2, 3 and the definition in 1. In the explanatory text, I don't understand the second and third sentences around "broader" and "narrower"; I think they are confusing and might be a legacy from a long time ago. For narrower, we could say that data sets could be further partitioned/organized into data elements, data records, cells, fields, etc., but I'm not even sure that's necessary given that those artifacts are not in the model.

I propose the following:

Explanatory text: Data Sets could be used to organize a wide variety of content, including observation registers, time series, longitudinal data, survey data, rectangular data sets, event-history data, tables, data tables, registers, data cubes, data warehouses/marts and matrixes. An example of a population unit Data Set (microdata) could be a collection of three Data Records (1212123, 48, American, United Kingdom), (1212111, 38, Hungarian, United Kingdom), and (1212317, 51, Canadian, Mexico), each containing the social security number, age, citizenship and country of birth of an individual. An example of a population dimensional Data Set (aggregate) could be a collection of three entries (Mexico, 2021, 130.3), (United Kingdom, 2021, 67.33), and (Italy, 2022, 60.24), each containing the name of the country, year of interest and population of the country in millions.

@FlavioRizzolo
Collaborator

FlavioRizzolo commented Dec 6, 2022

The way Logical Record is defined and linked in the model, it applies only to unit data. So either (i) Data Record is only about units, and therefore we need another association directly from Data Set to Data Point for dimensional data, or (ii) Data Record is about both unit and dimensional data, and therefore "isStructuredBy" Logical Record is optional (since it only applies to unit data).
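
Option (ii) could look roughly like the sketch below, where the link to a Logical Record is optional. The class names, the `structured_by` field and the `as_dict` helper are assumptions for illustration; the record values are the ones used in the examples earlier in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogicalRecord:
    unit_type: str
    variables: tuple[str, ...]


@dataclass
class DataRecord:
    values: tuple
    # Optional: only unit data is structured by a Logical Record.
    structured_by: Optional[LogicalRecord] = None

    def as_dict(self) -> dict:
        """Name the values using the Logical Record, when there is one."""
        if self.structured_by is None:
            raise ValueError("this Data Record has no Logical Record")
        return dict(zip(self.structured_by.variables, self.values))


person = LogicalRecord(
    "person",
    ("social_security_number", "age", "citizenship", "country_of_birth"),
)
unit_record = DataRecord((1212123, 48, "American", "United Kingdom"), person)
dimensional_record = DataRecord(("Mexico", 2021, 130.3))  # no Logical Record
```

Under option (i), `dimensional_record` would not be a Data Record at all and would need its own path from Data Set to Data Point.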

@InKyungChoi
Collaborator Author

(from Nov 16 meeting notes #34)
Here is the updated version

image

@FlavioRizzolo
Collaborator

To be modelled in EA

@InKyungChoi
Collaborator Author

InKyungChoi commented Mar 21, 2023

Final question before finalisation: I think we are still missing the definition and explanatory text of "Data Record" (previously "Unit Data Record") - how about this (adapted from the original Unit Data Record)?

Unit Data Point (in GSIM v1.2)

Object: Unit Data Record
Group: Structures
Definition: Contains the specific values (as a collection of Unit Data Points) related to a given Unit as defined in a Logical Record.
Explanatory Text: For example (1212123, 48, American, United Kingdom) specifies the age (48) in years on the 1st of January 2012, the current citizenship (American), and the country of birth (United Kingdom) for a person with social security number 1212123.

Data Point (in GSIM v2.0)

Object: Unit Data Record
Group: Structures
Definition: container for the specific values (as a collection of Unit Data Points) related to a given Unit or Population as defined in a Logical Record.
Explanatory Text: For example (1212123, 48, American, United Kingdom) specifies the age (48) in years on the 1st of January 2012, the current citizenship (American), and the country of birth (United Kingdom) for a person with social security number 1212123. For the case of unit data, it is structured by Logical Record.

image

@InKyungChoi InKyungChoi reopened this Mar 21, 2023
@dgillman4909

Flavio -

Going back to DataPoint versus UnitDP and DimensionalDP, I agree that having just the one class, DataPoint, is the right way to go. In a comment above, I said that NCubes only take DimensionalDataPoints, and I'd like to retract that. I am working with a longitudinal survey at US BLS, and there's an application for NCubes to account for variables that repeat within each wave, where the set repeats from wave to wave. The "measure" in this case is the variable, and the dimensions are time (the waves) and the number of repetitions that variable can take. In the National Longitudinal Survey at BLS, multiple variables exist to account for each job a person can hold at once. NLS allows for 9 jobs and the 1997 cohort is now up to 19 waves. This is 181 variables of each kind, a monstrosity to say the least. An NCube (defined as we do in DDI-CDI, which can take more than one measure for each cell) allows us to account for all of them.
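
The contrast between one flat variable per wave and job and a single cube with two dimensions can be sketched as below. This is a simplified illustration, not the DDI-CDI NCube model: the `NCube` class, its methods, the measure names and the cell values are all made up, and only the 9 jobs and 19 waves come from the comment above.

```python
from itertools import product

N_WAVES = 19  # waves mentioned in the comment above
N_JOBS = 9    # concurrent jobs allowed, as mentioned in the comment above


def flat_variable_names(measure: str) -> list[str]:
    """Flat layout: one separately named variable per (wave, job) pair."""
    return [
        f"{measure}_w{wave:02d}_j{job}"
        for wave, job in product(range(1, N_WAVES + 1), range(1, N_JOBS + 1))
    ]


class NCube:
    """Cells addressed by (wave, job); each cell may hold several measures."""

    def __init__(self) -> None:
        self._cells: dict[tuple[int, int], dict[str, float]] = {}

    def put(self, wave: int, job: int, **measures: float) -> None:
        if not (1 <= wave <= N_WAVES and 1 <= job <= N_JOBS):
            raise KeyError(f"outside the cube: wave={wave}, job={job}")
        self._cells.setdefault((wave, job), {}).update(measures)

    def get(self, wave: int, job: int, measure: str) -> float:
        return self._cells[(wave, job)][measure]


cube = NCube()
cube.put(wave=1, job=1, hours_worked=40.0, hourly_wage=15.5)  # made-up values
cube.put(wave=1, job=2, hours_worked=10.0)
```

The flat layout needs a new variable name for every combination and every measure, whereas the cube keeps one addressing scheme and lets each cell carry more than one measure.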

@InKyungChoi
Collaborator Author

Decision made regarding the Data Record (see #43):

Action: to use “collection of Data Points related to a given Unit or Population” for the definition, to use “….(as proposed in #28 (comment)).. For the case of unit data, it can be structured by Logical Record” for the explanatory text


4 participants