Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate using sqlite3 backend for itsdb #204

Closed
goodmami opened this issue Feb 25, 2019 · 12 comments
Closed

Investigate using sqlite3 backend for itsdb #204

goodmami opened this issue Feb 25, 2019 · 12 comments

Comments

@goodmami
Copy link
Member

In the interest of maintainability and, secondly, speed, I'm curious if we could use Python's sqlite3 module as the backend for itsdb.TestSuite. Doing so may remove the need to implement selects, joins, etc. on my own custom classes, and the task then turns into transforming the [incr tsdb()] database format into SQL commands (for reading and writing). I wonder what is the minimal amount of changes to the Python API that would be needed to accommodate this, as I don't want to break things for users who may have recently switched to the newer TestSuite API.

Some questions are how to store the sqlite3 database. All in :memory: could get too big for giant profiles, unless I could flush rows to the [incr tsdb()] files mid-way. SQLite has a notion of temporary databases (connect to "" instead of ":memory:"), but there's no indication that Python's sqlite3 supports that (but maybe it doesn't prevent it? ...still, I'd like it documented). Or I could make a temporary file to store the results?

@goodmami
Copy link
Member Author

Another question is how to define the keys in SQLite terms. [incr tsdb()] just specifies columns as :key (and sometimes also :partial), but there's no explicit indication which is the "owner" of that key. Take for instance item and analysis:

item:
  i-id :integer :key
  i-origin :string
  ...

analysis:
  i-id :integer :key
  a-position :string
  ...

They both specify i-id but intuitively we know that item is the primary table defining those keys and analysis depends on those. That is, in analysis it should be a foreign key, but how do we know that? The fact that other analysis fields start with a- and not i-? Because item appears first in the relations file?

Also consider item-phenomenon:

item-phenomenon:
  ip-id :integer :key
  i-id :integer :key
  p-id :integer :key
  ip-author :string
  ip-date :date

It has 3 :key fields. They can't all be primary keys (probably i-id and p-id should be foreign keys). Also, it is not the first table in the relations file defining ip-id (parameter is), so the first-one-wins criterion fails here.

Another oddity (but not necessarily problematic) is that score-id on the score table is not marked as a key at all.

There doesn't seem to be a good way of determining which keys are primary and which are foreign in a general way. We could just mark them as regular keys, but then SQLite will not automatically maintain database integrity the way we'd want, and thus using sqlite3 as the backend loses some of its benefits.

@goodmami
Copy link
Member Author

goodmami commented Feb 27, 2019

If we were to propose an addition to the relations file format, we would need to be able to say which keys are foreign and which table (and column?) they reference. For example:

analysis:
  i-id :integer :key :foreign(item:i-id)
  ...

Since foreign keys reference keys of the same name in other tables, explicitly spelling out the table and name could be optional (e.g., just :foreign(item)), or simply forbidden if we require that shared keys have the same name.

If all non-primary keys are marked as foreign, then it might not be necessary to put something like :primary on, for instance, i-id in item. Partial keys (which I take to mean "composite" keys in sqlite) are still a problem though. For instance, item-phenomenon has 3 keys, but none are marked as :partial, so it's not the case that all 3 are required to uniquely select a row. In set, there is a s-id (primary?) key and a p-id partial foreign key, and in item-set there are two foreign keys of which one is marked :partial. I think I need more information to determine what to do here (maybe we should mark all keys in the composite as :partial, whether or not they are primary or foreign).

@fcbond
Copy link
Member

fcbond commented Feb 27, 2019 via email

@arademaker
Copy link
Member

arademaker commented Feb 27, 2019

If all non-primary keys are marked as foreign, then it might not be necessary to put something like :primary on, for instance, i-id in item.

I didn't understand this sentence above. In the item table we have i-id as :key. Anyway, I agree with @fcbond that making things explicit is always safer.

If I understood your point, you were expecting the item-phenomenon table to have something like:

item-phenomenon:
  ip-id :integer :key :partial
  i-id :integer :key :partial
  p-id :integer :key :partial
  ip-author :string
  ip-date :date

Right? Or maybe I didn't get your point about this table too:

For instance, item-phenomenon has 3 keys, but none are marked as :partial, so it's not the case that all 3 are required to uniquely select a row.

Another detail that could be made explicit is the abbreviation used for the tables names (i.e. item-phenomenon equals ip)

@arademaker
Copy link
Member

Maybe you all know that, but one more thing to take into consideration. In the logon tree I found 639 relations file. Only 16 differ. The majority of them are all equal to lingo/lkb/src/tsdb/skeletons/english/esd/relations. That is, the capability of adapting the profile schema is rarely used, maybe because the tools are not ready for that.

$ find . -name 'relations' | wc -l
639
$ find . -name 'relations' -exec diff -q lingo/lkb/src/tsdb/skeletons/english/esd/relations {} \; | wc -l
 16

@fcbond
Copy link
Member

fcbond commented Feb 27, 2019 via email

@arademaker
Copy link
Member

Do you have any pointer for the documentation about those changes?

@fcbond
Copy link
Member

fcbond commented Feb 27, 2019 via email

@goodmami
Copy link
Member Author

@arademaker the point was that if all instances of :key have :foreign except for the primary instance, then marking it :primary seems redundant, but yes maybe explicit is better than implicit.

@goodmami
Copy link
Member Author

goodmami commented Mar 1, 2019

If there isn't support upstream for changing relations files to accommodate modern relational DBs (or even if there is) we could also just not use [incr tsdb()] profiles for storing our results, and instead use our own schema for sqlite3, or something else entirely, like pandas dataframes. Most of the tables and columns in [incr tsdb()] are irrelevant for many use cases anyway, including treebanking. Doing so, however, would mean those "profiles" would not be compatible with other tools, like ACE.

@goodmami
Copy link
Member Author

Another variation to the proposal:

  1. define the additions to relations files (:foreign, etc.)
  2. define what constitutes a valid relations and how data must conform to it (e.g., every row must have a unique set of :key values)
  3. describe (e.g., on the wiki) the set of transformations to the standard relations schema to make it valid according to (1) and (2)

This would encode the way to fix up-convert the most common schema while alternative schemas would have to already conform to the more stringent specs. Only the necessary changes are described rather than an entirely new schema because I expect it might be robust to future additions to the standard schema that don't alter keys or add new tables.

Also (2) is in the list above because I found that I could not enforce the unique-keys constraint on result rows because the only key on a result row was parse-id (result-id is not marked as a key). Also, some tables don't have keys.

Given the above, all test suites (either up-converted or defined with the improved schema) would be mapable to traditional SQL databases, and we could then use one as the backend.

@goodmami
Copy link
Member Author

Closing this as there are no immediate plans to move to a sqlite3 backend. My thoughts regarding extensions to TSDB schema have been recorded at http://moin.delph-in.net/TsdbSchemaRfc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants