Feedback on RDF to CSV #20

Closed
tobiasschweizer opened this issue Jul 20, 2023 · 2 comments

@tobiasschweizer

Hi there,

rdfpandas is an interesting approach. So far I have only used the RDF to CSV functionality.
I tried it on a smaller dataset and it worked. When I tried it with a bigger Turtle file (> 600 MB), it failed. I have had a similar experience with rdflib when dealing with bigger data sources (it is mainly slow).

However, here are some questions / observations:

  • Is there a way to configure the prefixes for the properties (independent of those used in the source)?
  • Sometimes I got more than a hundred columns for the same property, ranging from index [0] to [n]. I think this is the same problem one has in a relational DB design when dealing with multiple atomic values for a field without normalisation. Would it make sense to create more than one table for a given RDF source and use foreign keys to relate them to each other? Or would this overcomplicate things, as one would have to deal with m:n tables?
  • I think keeping different types like literals and IRIs separate for the same property makes sense

I hope this feedback is useful.

@tobiasschweizer changed the title from "Feedback" to "Feedback on RDF to CSV" on Jul 20, 2023
@cadmiumkitty
Owner

Hi Tobias,

Thanks for the feedback, it is very much appreciated.

Re: questions:

  • I am assuming you want to change the prefix while keeping the prefix URI the same - it should be possible by getting access to NamespaceManager before passing the Graph to to_dataframe.
  • That's right, it is about dealing with multiple values for the same property. Rdfpandas is meant for fairly simple transforms, so adding tables and foreign keys might be a step too far (I use rdfpandas mostly for CSV to RDF, as I find working with tabular data the most efficient way of building SKOS taxonomies and RDFS-based schemas). Would you be able to share your use case and a sample dataset so I can think about it a bit more?
  • Thanks, that was a conscious decision; glad it makes sense with real-world data.

@tobiasschweizer
Author

tobiasschweizer commented Jul 24, 2023

Hi Eugene,

Thanks for the feedback. My use case is actually nothing special. We transform some non-RDF data sources to RDF conforming to schema.org. Wherever we have a property that can occur more than once, there could be several columns for it.

Here is a TTL file with some test data:
expected_opendata_preprocessed.ttl.txt

And here is the CSV:
test.csv

Here is the title row, see for example schema1:keywords:
@id,schema1:author{URIRef},schema1:dateCreated{Literal}(xsd:date),schema1:dateModified{Literal}(xsd:date),schema1:datePublished{Literal}(xsd:date),schema1:description{Literal},schema1:distribution{URIRef}[0],schema1:distribution{URIRef}[1],schema1:distribution{URIRef}[2],schema1:distribution{URIRef}[3],schema1:distribution{URIRef}[4],schema1:distribution{URIRef}[5],schema1:distribution{URIRef}[6],schema1:identifier{Literal},schema1:inLanguage{Literal},schema1:keywords{Literal}[0],schema1:keywords{Literal}[1],schema1:keywords{Literal}[2],schema1:keywords{Literal}[3],schema1:keywords{Literal}[4],schema1:keywords{Literal}[5],schema1:keywords{Literal}[6],schema1:keywords{Literal}[7],schema1:keywords{Literal}[8],schema1:keywords{Literal}[9],schema1:keywords{Literal}[10],schema1:keywords{Literal}[11],schema1:name{Literal},rdf:type{URIRef},schema1:conditionsOfAccess{Literal},schema1:contentUrl{URIRef},schema1:encodingFormat{Literal},schema1:isAccessibleForFree{Literal}(xsd:boolean)

Note that schema1 comes from rdflib since the default prefix schema refers to https://schema.org. I am not particularly happy (RDFLib/rdflib#2312 (comment)) with this change but what can I do ;-)
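
To make the multi-table idea from earlier in the thread concrete, here is a rough sketch that post-processes a wide CSV like the one above. It is not part of rdfpandas: the function name split_tables and the sample rows are made up for illustration, and it only assumes that repeated values carry an [n] index in the column header, as in the title row shown here.

```python
import csv
import io
import re

# Finds the position marker in headers such as "schema1:keywords{Literal}[3]".
INDEX = re.compile(r"\[(\d+)\]")


def split_tables(csv_text):
    """Split a wide CSV into single-valued rows and a long table of repeated values."""
    main_rows, value_rows = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        main = {}
        for header, value in row.items():
            match = INDEX.search(header)
            if match is None:
                # Single-valued column: stays in the main table.
                main[header] = value
            elif value:
                # Indexed column with a value: becomes one row keyed by @id.
                base = INDEX.sub("", header, count=1)
                value_rows.append((row["@id"], base, int(match.group(1)), value))
            # Empty indexed cells are padding and are dropped.
        main_rows.append(main)
    return main_rows, value_rows


# Invented sample data in the same header style as the title row above.
sample = (
    "@id,schema1:name{Literal},schema1:keywords{Literal}[0],schema1:keywords{Literal}[1]\n"
    "https://example.org/d1,First,air,water\n"
    "https://example.org/d2,Second,soil,\n"
)

main_rows, value_rows = split_tables(sample)
for entry in value_rows:
    print(entry)
```

The second table uses @id as a foreign key back to the main table, so the number of columns no longer grows with the largest number of values any one resource has for a property.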
