Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Irregular tables #107

Open
2 tasks
MatthewCaseres opened this issue Sep 27, 2021 · 6 comments · May be fixed by #115
Open
2 tasks

Irregular tables #107

MatthewCaseres opened this issue Sep 27, 2021 · 6 comments · May be fixed by #115

Comments

@MatthewCaseres
Copy link
Contributor

MatthewCaseres commented Sep 27, 2021

Errors on some tables

using MortalityTables
MortalityTables.table("2012 IDEC Select Termination Rates - Male, Occ Cl 1, Acc and Sick, 14 day EP")
MortalityTables.table("Guatemala Abridged Life Tables 1980-85 Males")
  • Distinguish between tables that will have a custom mortality-table specific API and those that should not. Tables that cannot be parsed into the existing API will inform the user that they are trying to load a table that is not a supported table.
  • Provide a general API for all tables. When parsing the XTbML the only assumption we are making is that it is a valid XTbML file.
@alecloudenback
Copy link
Member

Yea, I was worried about tables like this. Have you seen anything in the XML metadata that would suggest standard ways of handling irregular tables?

Do you have a sense of how many different structures there are? more on the order of 10 or 100? If the former, I think defining the structure in advance could work; if the latter then would have to address in a more dynamic way I think.

This was referenced Sep 28, 2021
@alecloudenback
Copy link
Member

The SOA website indicates maybe there's about 10 table types:
image

@MatthewCaseres
Copy link
Contributor Author

MatthewCaseres commented Sep 28, 2021

Some details

  • For Guatemala Abridged (xml)
    • There are two tables but they don't represent select and ultimate tables.
    • Both tables have the same AxisName in the table metadata and seem to represent the same thing.
      • The reason they are split is that the Increment is different, it goes by 5-year increments after age 1.
    • The table layout (of which there are about 10) doesn't seem to fully determine the structure of the tables.
      • 1941 CSO Basic is also aggregate but has a single table and a constant increment.
  • For 2012 IDEC Select Termination Rates - Male, Occ Cl M, Acc and Sick, 7 day EP (xml)
    • There are three tables, whose row labels represent week, then months, then years.

Unsure you would want to include either of these tables. For tables with increments other than 1 I don't know how that will work with the current API.

IDEC tables aren't really mortality tables so I don't know removing them would detract from the package and I think they are just trouble.

I think the next step is to parse the MetaData child nodes of each Table node and do some exploratory data analysis. I'll probably do this in Python just because I am familiar with the ecosystem. Hopefully have a reasonable .ipynb file that enumerates the types of potentially problematic tables by Monday.

Edit: I think maybe a pivot table that uses Google Sheets will be best. I think I can get it done tomorrow.

I assume problematic tables are just any table that doesn't fit the following pattern:

  • It has a single table and it is ultimate.
  • It has two tables, a select and an ultimate table.

@MatthewCaseres
Copy link
Contributor Author

MatthewCaseres commented Sep 28, 2021

I normalized the data, each row corresponds to an AxisDef tag. For Select and Ultimate tables there are two AxisDef tags, one for each dimension. You can find the CSV and the parsing in this repo -

https://github.com/actuarialopensource/NormalizeMetadata/blob/main/SOA_Tables_20210915.csv

I did a quick pivot table to find XTbML files with many Table tags, and tables with large Increment values in one of their AxisDef tags.

https://docs.google.com/spreadsheets/d/1dQx8nz7uOdbCd8MLPjlNISS35IguG25ztljOOwNBYVk/edit?usp=sharing

Remember that each row represents an AxisDef tag, so there can be many rows per unique TableIdentity tag.

Edit: I can do the analysis and submit a PR. Will likely end up removing a good number of tables.

@alecloudenback
Copy link
Member

Thank you for progressing this. Some thoughts/ideas:

  • I'm not sure what it would be, but if there were a better data format than XTBML (JSON, HDF5, other?) then I think we might be able to programmatically convert and propose to SOA
  • For tables with increments >1, could use the interpolation of rates given one of the Balducci(), Uniform(), or Constant() assumpitons. If the user asks for a rate that's not given in the table, just return an interpolated rate.
  • Any idea what the tcs, e.g. tc="2", represent in the xml?
    image
  • I looked a a few of the Continuance tables (e.g. 2627 and similar) and they followed the same pattern of weeks/months/years. The consistency suggests that maybe there's an underlying data structure that would work, and it's just a matter of getting the API right?
  • Maybe for the irregular tables there's a fallback API that mirrors the XML structure? e.g. the user has to know the indices/axis in advance but then they still have it available even if its a less convinient API.

How to handle irregular tables in the meantime:

  • My initial preference would be to not remove any tables from the artifact (ie set of tables contained in the package)
    • this is less work for us to predetermine compatibility
  • Document that some tables are known to not work and if user encounters an issue they can file an issue and extension/special cases can be processed with some indication of demand from users
  • If the data is good then I think it's a matter of looking a different situations and figuring out the right representation; if the data is bad then we could potentially figure out a way to make it better via replacement for XTBML

@MatthewCaseres
Copy link
Contributor Author

MatthewCaseres commented Sep 29, 2021

Maybe for the irregular tables there's a fallback API that mirrors the XML structure? e.g. the user has to know the indices/axis in advance but then they still have it available even if its a less convinient API.

I like this idea because it doesn't remove tables and it doesn't increase the complexity of the codebase too much by trying to handle edge cases.

if the data is bad then we could potentially figure out a way to make it better via replacement for XTBML

I think XTBML is okay, it is just XML. It definitely could be JSON, but the two can be used interchangeably here I believe. XMLDict.jl can turn XML into dictionary objects, which I see is a dependency of this package. I am leaning towards providing these dicts as the backup to the MortalityTable type when the table doesn't work well with the package.

The data itself being bad would be a separate issue. I think XML is not an issue. The people at ACORD are producing XML standards for the insurance industry, they have several XML standards that they are managing. XML is also used for Learning Management System interoperability (driven by IMS Global Learning Consortium), and EPUB (W3C), it is pretty common for data exchange standards.

Any idea what the tcs, e.g. tc="2", represent in the XML?

I ran a pivot table and tc=2 for ordinal dates. Might be worth letting them know about the typo ordinali date. I guess the data isn't perfect? A bit concerning. The ordinali date typo is given tc=0 when it should be tc=2 as an ordinal date. So it looks like tc is a function of the value in the ScaleType tag, with 0 being a sort of catch-all. Some tables have ScaleType=Unknown, which I haven't looked into too much but in concept is sort of strange.

The info is officially in the Type Code Definition Document provided by ACORD, so says the SOA under the second header. I registered on the ACORD website but their standards are only available to members, which seem to usually be organizations that pay a fee.

For tables with increments >1, could use the interpolation of rates

This will work but first maybe just do something simple and it can be a separate issue that is opened after this one is closed?

Right now the codebase is pretty straightforward and there is a lot of value in that, at least for me as a new contributor. I edited the issue to include a task list with two tasks that I think will allow us to close the issue.

I think the less one-off stuff is done the better. Instead of just giving users the dict that XMLdict gives from the XML on irregular tables, I can imagine giving them some sort of struct though. Maybe it will have a list of objects that represent tables.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants