Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Initial work on DBMSProcessor batch entry insertion into ENTRY table #5814

Merged
merged 31 commits into from
Feb 19, 2020

Conversation

abepolk
Copy link
Contributor

@abepolk abepolk commented Jan 2, 2020

I am working on making DBMSProcessor insert multiple entries with only a small constant number of queries. It currently does not compile. But the basic ideas are there.

@abepolk
Copy link
Contributor Author

abepolk commented Jan 3, 2020

@koppor @tobiasdiez I'm curious what you think of the underlying algorithm I have (tried to) create. The problem I encountered was that when each entry was inserted into the ENTRY table one at a time, the generated key was taken out of each INSERT statement and used for BibEntry.setSharedID(). When I tried to insert the entries together in a single INSERT statement, I realized that PreparedStatement.getGeneratedKeys() returns a ResultSet that unordered. Therefore it is not straightforward to assign them to a stream of the List<BibEntry> because the wrong key would be assigned to each entry.

My solution, still not completely implemented, is to get the generated keys from that batch INSERT statement. The generated keys represent entries in the SHARED_ID column. Then execute a SELECT query looking for the rows with those generated SHARED_IDs. These are the rows just inserted. Ignoring the version (which I may have to consider to do this right), queries with the same TYPE will only vary by SHARED_ID. Therefore, I create a map of each type to the set of IDs containing that type, and populate the map from the ResultSet. Then for each entry, take an arbitrary key that is associated with that entry's type in the database, and assign it as the entry's shared ID, and remove it from the keys available to be assigned. The reasoning is that it doesn't matter (again ignoring the version) which row the entry actually corresponds to because they are identical apart from the difference in shared ID. That finalizes the setting of shared IDs in the local entries' SharedBibEntryData. I check to make sure no IDs were left unassigned at the end.

Is this a valid way of setting the shared ID to local BibEntrys in a batch INSERT statement? Is there a simpler way?

@tobiasdiez
Copy link
Member

tobiasdiez commented Jan 4, 2020

Your algorithm looks good to me. Another option would be to use the DuplicationChecker, which should be able to identify the entries and thus let you assign the id (at least if the entries are sufficiently different).
EDIT: after thinking a bit more about it, I'm a little bit confused by "it doesn't matter (again ignoring the version) which row the entry actually corresponds to because they are identical apart from the difference in shared ID". What happens if I insert two different articles? If the BibEntries get the wrong SharedID, then subsequent updates will update the wrong entry, or not?

However, I'm wondering why getGenereratedKeys doesn't respect the order. According to https://stackoverflow.com/questions/50943216/does-jdbc-getgeneratedkeys-method-always-same-order-of-inserted-element it should work. Maybe it's worth looking into the OUTPUT statement, which you can use to return values after an insert.

@koppor
Copy link
Member

koppor commented Jan 4, 2020

We are correct in reading the generated Ids from the database - The reason for generating the unique Ids on the side of the database is because of multi-user-access. Even though one could generate UUIDs on client-side, this feels wrong (unnecessary size of key, ...).

I would not use the duplication checker. This also feels hacky. Maybe using the BibTeX keys - and if there isn't a key, the remaining entries could be compared. But still, this feels hacky, since there should be a reliable (and quick) way to read the generated IDs.

@NorwayMaple Did add tests for the ResultSet? I would also rely on https://stackoverflow.com/a/50944078/873282 that the driver returns the generated keys inorder. Nevertheless, test cases should exist.

@abepolk
Copy link
Contributor Author

abepolk commented Jan 4, 2020

No tests yet. Would I be better off writing the tests before all this code?

On batches, I wasn't using executeBatch(), I was using executeUpdate() with an insert statement containing many rows. In this case INSERT INTO ENTRY(TYPE) VALUES(?), (?), (?), (?) for four entries. But I could use batches of single row INSERT statements instead if that it at least just as efficient. Recommendations?

@tobiasdiez I think you are right because multiple users might then have different shared IDs for the same BibEntry. So if one person updates an entry, the wrong one will be synchronized. Also, they have to be connected via foreign key to the FIELD table, so my algorithm won't work.

@tobiasdiez
Copy link
Member

If you write a test now, it should fail. Then you can experiment with the different options until the test pass. I would suggest that the test generates a bunch of random entries (say 100) and then inserts them. Afterwards you can check that the shared id is the correct one for all entries.

@abepolk
Copy link
Contributor Author

abepolk commented Jan 18, 2020

Okay. So far I've been hard-coding the example entries for the tests. I could do, say, 5, or even 100, but are you thinking I might use some other means to generate the random entries? Also, are we satisfied making sure the tests pass, given it is technically possible for the JDBC implementations to change at any time and there is no guarantee that they will maintain orderedness?

@tobiasdiez
Copy link
Member

I guess a for loop from i = 1 to 100 which generates entries ala new BibEntry().withKey(i).withField(Journal, "journal" + i) should be sufficient. Don't have to be complex entries.

I would be satisfied if this tests passes. If, in the future, the implementation changes then we get notified by the test, which then fails.

@abepolk
Copy link
Contributor Author

abepolk commented Feb 4, 2020

I am very confused about a bug that appears to be in OracleProcessor in the method I just wrote, insertIntoEntryTable. When I run the database tests with Oracle, I always get java.sql.SQLSyntaxErrorException: ORA-00933: SQL command not properly ended. If I just execute the SQL as an ordinary statement rather than a prepared statement and hard-code a parameter, the query executes. So it must be a problem with the prepared statement. For now these are lines 121-127. Ideas?

@@ -118,7 +118,8 @@ protected void insertIntoEntryTable(List<BibEntry> entries) {
}
insertEntryQuery.append(" SELECT * FROM DUAL");
LOGGER.info(insertEntryQuery.toString());
try (PreparedStatement preparedEntryStatement = connection.prepareStatement(insertEntryQuery.toString())) {
try (PreparedStatement preparedEntryStatement = connection.prepareStatement(insertEntryQuery.toString(),
new String[]{"SHARED_ID"})) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but this looks odd to me with the String[]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just indicates the column where we want to put the auto-generated keys. See https://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html#prepareStatement(java.lang.String,%20int[])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into Stack Overflow and the Oracle JDBC driver source code and it looks like the Oracle JDBC doesn't support getGeneratedKeys on INSERT ALL statements. So I'll have to find another way of getting the shared IDs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMG. This sounds like we should really drop Oracle support. There is no need for Oracle in 2020, is it? - Postgres is the way to go, isn't it?

We should IMHO also drop MySQL support as it does not offer automatic updates on changes on the server.

@koppor
Copy link
Member

koppor commented Feb 19, 2020

@NorwayMaple Is it something about a ; required in some settings and sometimes not? I had this when playing around.

We are planning to release the 5.0 version this weekend. Do you think, you can finish this PR by Friday? If not, I will drop Oracle support. Causes too much trouble. I think, we should go for https://www.jooq.org/. WDYT?

@abepolk
Copy link
Contributor Author

abepolk commented Feb 19, 2020

If you look at the Oracle JDBC source code, the reason Oracle says the SQL is not properly ended is because the JDBC secretly inserts a RETURNING INTO clause to get the generated key. Problem is that INSERT ALL statements require a SELECT statement at the end for arbitrary reasons, so the RETURNING clause after SELECT confuses the database. There may be a way using BULK COLLECT to do it, but there are no examples online and the documentation is surprisingly poor. If you say Oracle is out-of-style, then by all means drop the support. I have no opinion on MySQL.

Finally, as a note on Oracle, I should be able to make Oracle work iteratively with a SQL statement for each entry (but not one per field). This would mean it wasn't optimized like Postgres and MySQL, but this would be an easy and fast fix on my end, and we wouldn't even have to completely drop Oracle.

I'm not sure about Friday, I may have full-time work starting tomorrow on a temp job starting tomorrow, and an interview on Wednesday. I can try and do Saturday afternoon, but I'm in the US East time zone so that might be Saturday evening and night for you, and I still have to prepare for my interview. However, if you ignore Oracle, I think I can complete the PR much faster, even without www.jooq.org, so there may not be a lot of work left.

@abepolk
Copy link
Contributor Author

abepolk commented Feb 19, 2020

I did the main fixes I wanted to do so they work on Postgres and MySQL. Now I just have to test the Oracle, manually test it in the GUI, fix checkStyle, and take a final look at the PR diff.

@abepolk
Copy link
Contributor Author

abepolk commented Feb 19, 2020

@koppor I've done the manual testing and checkStyle, and looked at the PR diff. The only thing left to do is test it in Oracle. The code is simple and it is a hassle for me to work with Docker. I think if we merge it in, it might be easiest for GitHub's CI to tell us if it passes the Oracle tests. I know Oracle's not important, but we are so close! But for that, you have to merge it into the upstream because I can't do it from my fork.

Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick update! The changes look good, and the tests are passing so I'll merge now.

@tobiasdiez tobiasdiez merged commit 93196ee into JabRef:master Feb 19, 2020
@abepolk
Copy link
Contributor Author

abepolk commented Feb 20, 2020

Looks like we are now getting Error processing tar file(exit status 1): write /opt/oracle/product/18c/dbhomeXE/md/property_graph/lib/lucene-analyzers-common-4.10.4.jar: no space left on device in the run. See https://github.com/JabRef/jabref/runs/456577265?check_suite_focus=true. So we still don't know if Oracle works. It might be worth figuring out.

@koppor koppor mentioned this pull request Feb 25, 2020
Siedlerchr added a commit that referenced this pull request Mar 6, 2020
* upstream/master:
  Try to fix linux pdf opening again (#5945)
  [WIP] Initial work on DBMSProcessor batch entry insertion into ENTRY table (#5814)
  followup fix
  Fixfetcher (#5948)
  Bump byte-buddy-parent from 1.10.7 to 1.10.8 (#5952)
  Added MenuButtons to IntegrityCheckDialog (#5955)
  Reimplement custom entry types dialog (#5799)
  Bump unirest-java from 3.4.03 to 3.5.00 (#5953)
  MySQL: Allow public key retrieval (#5909)
  Restructure and improve docs for setting up IntelliJ (#5960)
  Change syntax for Oracle multi-row insert SQL statement (#5837)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants