Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV Format Change: prefLabel Field Now Contains List Instead of String Value #218

Closed
alexskr opened this issue Oct 9, 2024 · 9 comments · Fixed by #227
Closed

CSV Format Change: prefLabel Field Now Contains List Instead of String Value #218

alexskr opened this issue Oct 9, 2024 · 9 comments · Fixed by #227
Assignees
Labels

Comments

@alexskr
Copy link
Member

alexskr commented Oct 9, 2024

The format of the CSV file generated by the application has changed unexpectedly. Previously, the Preferred Label column in the CSV file contained a single string value for each row. However, the current output shows that the perfLabel field now contains a list of values enclosed in square brackets e.g., ["value1"] or multiple values for multilingual ontologies: ["value1", "value2", "value3"]

@syphax-bouazzouni
Copy link

it happens because we put in the CSV the same data as the ones that we index and put in SOLR; and as we index the prefLabel as an array, it is also an array in the CSV.
the code that write to the CSV is here https://github.com/ncbo/ontologies_linked_data/blob/master/lib/ontologies_linked_data/services/submission_process/operations/submission_indexer.rb#L148

@alexskr
Copy link
Member Author

alexskr commented Oct 9, 2024

Are we missing a unit test in the CSV writer that would have caught this new behavior?
What would be the appropriate solution, considering that a list of perfLabels without language tags in a random order is not ideal?
Perhaps a new column for each language should be added, instead of a single column containing all languages?

@jonquet
Copy link

jonquet commented Oct 11, 2024

Discussed today in meeting. Indeed with the multilingual implementation, the CSV being produced aside the Solr indexing, the CSV produced is not "backward compatible"

AgroPortal (INRAETHES)
Capture d’écran 2024-10-11 à 16 06 01

BioPortal (AGRO)
Capture d’écran 2024-10-11 à 16 09 09

@jonquet
Copy link

jonquet commented Oct 11, 2024

I would suggest an non-backward compatible improvement of the CSV format generated.... :

  • accepting the multiple values but with language tags "@en"
  • doing the same for all the cells e.g. Synonyms should also be represented as an array

How many users are concerned?
If this is only a matter of a full hands maybe a general improvement is better than a "backward compatible" fix that will hide the multiligual values of the CSV

Any opinions?

@jvendetti jvendetti changed the title CSV Format Change: perfLabel Field Now Contains List Instead of String Value CSV Format Change: prefLabel Field Now Contains List Instead of String Value Oct 11, 2024
@jvendetti
Copy link
Member

Summary of CSV generation discussion
During a BioPortal meeting yesterday, we discussed addressing the issue with CSV generation. The agreed short-term fix is to restore the original format, specifically reverting the “Preferred Label” column to contain a single string value, rather than an array. We’ve received multiple support inquiries about this on the BioPortal list, so it’s important to revert this change before proceeding with any further work.

Future enhancements
There was general consensus that the CSV output should eventually support the new multilingual capabilities. As you mentioned, Clement, columns like “Preferred Label,” “Definitions,” and “Synonyms” should accommodate multiple values, each tagged with the appropriate language. I suggest creating a separate GitHub issue for this enhancement, with examples clearly showing the desired content format.

Long-term refactor
Additionally, we should refactor the code so that CSV generation is decoupled from the indexing process. The current design causes CSV generation to fail if indexing fails, which complicates ontology processing. This can also be tracked in a separate enhancement issue for future work.

@mdorf
Copy link
Member

mdorf commented Nov 15, 2024

Screenshot 2024-11-15 at 08 49 08 The `obsolete` field is also broken on the value of TRUE. It appears as [true]

@piehld
Copy link

piehld commented Nov 26, 2024

Thank you @mdorf and @alexskr for the help on this issue!

I just took a look at the ATC page (https://bioportal.bioontology.org/ontologies/ATC/?p=summary), but it looks like the older mal-formatted file is still the latest (2024A.A). I'm wondering if this file will be automatically updated according to some update cycle, to reflect the fixes you've applied?

@alexskr
Copy link
Member Author

alexskr commented Nov 26, 2024

@piehld, We haven't had the chance to reprocess all ontologies yet, but we plan to do so in the next couple of weeks. In the meantime, I've reprocessed ACT for you if this is urgent.

@piehld
Copy link

piehld commented Nov 26, 2024

Thank you for the information @alexskr. No rush at all, we can wait for the full update.

Thank you again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants