[WIP] Initial work on DBMSProcessor batch entry insertion into ENTRY table #5814

abepolk · 2020-01-02T20:25:30Z

I am working on making DBMSProcessor insert multiple entries with only a small constant number of queries. It currently does not compile. But the basic ideas are there.

Change in CHANGELOG.md described (if applicable)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
Screenshots added in PR description (for bigger UI changes)
Checked documentation: Is the information available and up to date? If not: Issue created at https://github.com/JabRef/user-documentation/issues.

abepolk · 2020-01-03T23:17:15Z

@koppor @tobiasdiez I'm curious what you think of the underlying algorithm I have tried to create. The problem I encountered was that when each entry was inserted into the ENTRY table one at a time, the generated key was taken out of each INSERT statement and used for BibEntry.setSharedID(). When I tried to insert the entries together in a single INSERT statement, I realized that PreparedStatement.getGeneratedKeys() returns a ResultSet that is unordered. Therefore it is not straightforward to assign the keys to a stream of the List<BibEntry>, because the wrong key could be assigned to an entry.

My solution, still not completely implemented, is to get the generated keys from that batch INSERT statement. The generated keys represent entries in the SHARED_ID column. Then execute a SELECT query looking for the rows with those generated SHARED_IDs. These are the rows just inserted. Ignoring the version (which I may have to consider to do this right), rows with the same TYPE will only vary by SHARED_ID. Therefore, I create a map of each type to the set of IDs having that type, and populate the map from the ResultSet. Then for each entry, I take an arbitrary key that is associated with that entry's type in the database, assign it as the entry's shared ID, and remove it from the keys available to be assigned. The reasoning is that it doesn't matter (again ignoring the version) which row the entry actually corresponds to, because the rows are identical apart from the difference in shared ID. That finalizes the setting of shared IDs in the local entries' SharedBibEntryData. I check to make sure no IDs were left unassigned at the end.

Is this a valid way of setting the shared ID to local BibEntrys in a batch INSERT statement? Is there a simpler way?
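The type-based assignment described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the class name `TypeBasedIdAssignment`, the method `assign`, and the representation of the SELECT result as a `Map` from SHARED_ID to TYPE are all hypothetical. Like the description, it looks only at TYPE and ignores the version.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of JabRef.
final class TypeBasedIdAssignment {

    /**
     * @param localTypes     TYPE of each local entry, in list order
     * @param typeBySharedId rows read back by the SELECT: SHARED_ID -> TYPE
     * @return one shared ID per local entry, in the order of localTypes
     */
    static List<Integer> assign(List<String> localTypes, Map<Integer, String> typeBySharedId) {
        // Group the freshly generated IDs by entry type.
        Map<String, Deque<Integer>> idsByType = new HashMap<>();
        for (Map.Entry<Integer, String> row : typeBySharedId.entrySet()) {
            idsByType.computeIfAbsent(row.getValue(), type -> new ArrayDeque<>()).add(row.getKey());
        }

        // Hand each local entry some still-unassigned ID of its type.
        List<Integer> assigned = new ArrayList<>();
        for (String type : localTypes) {
            Deque<Integer> available = idsByType.get(type);
            if (available == null || available.isEmpty()) {
                throw new IllegalStateException("No unassigned shared ID left for type " + type);
            }
            assigned.add(available.poll());
        }

        // Final check: every generated ID must have been handed out.
        for (Deque<Integer> leftover : idsByType.values()) {
            if (!leftover.isEmpty()) {
                throw new IllegalStateException("Unassigned shared IDs remain: " + leftover);
            }
        }
        return assigned;
    }
}
```

Which ID an entry receives among those of its type depends only on the iteration order of the map passed in, which is the "arbitrary key" part of the idea.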

tobiasdiez · 2020-01-04T08:44:49Z

Your algorithm looks good to me. Another option would be to use the DuplicationChecker, which should be able to identify the entries and thus let you assign the id (at least if the entries are sufficiently different).
EDIT: after thinking a bit more about it, I'm a little bit confused by "it doesn't matter (again ignoring the version) which row the entry actually corresponds to because they are identical apart from the difference in shared ID". What happens if I insert two different articles? If the BibEntries get the wrong SharedID, then subsequent updates will update the wrong entry, or not?

However, I'm wondering why getGeneratedKeys doesn't respect the order. According to https://stackoverflow.com/questions/50943216/does-jdbc-getgeneratedkeys-method-always-same-order-of-inserted-element it should work. Maybe it's worth looking into the OUTPUT clause, which you can use to return values after an insert.

koppor · 2020-01-04T10:57:46Z

We are correct in reading the generated Ids from the database - The reason for generating the unique Ids on the side of the database is because of multi-user-access. Even though one could generate UUIDs on client-side, this feels wrong (unnecessary size of key, ...).

I would not use the duplication checker. This also feels hacky. Maybe using the BibTeX keys - and if there isn't a key, the remaining entries could be compared. But still, this feels hacky, since there should be a reliable (and quick) way to read the generated IDs.

@NorwayMaple Did you add tests for the ResultSet? I would also rely on https://stackoverflow.com/a/50944078/873282 that the driver returns the generated keys in order. Nevertheless, test cases should exist.

abepolk · 2020-01-04T15:35:57Z

No tests yet. Would I be better off writing the tests before all this code?

On batches: I wasn't using executeBatch(), I was using executeUpdate() with an INSERT statement containing many rows, in this case INSERT INTO ENTRY(TYPE) VALUES(?), (?), (?), (?) for four entries. But I could use batches of single-row INSERT statements instead if that is at least as efficient. Recommendations?
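For illustration, the multi-row variant could look roughly like the sketch below. It is not the PR's code: the class and method names are made up, and it assumes that the first column of the generated-keys ResultSet is SHARED_ID and that the driver reports the keys in insertion order, which is exactly the open question in this thread.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of JabRef.
final class MultiRowEntryInsert {

    // Builds "INSERT INTO ENTRY(TYPE) VALUES(?), (?), ..." with one "(?)" per entry.
    static String buildSql(int entryCount) {
        if (entryCount < 1) {
            throw new IllegalArgumentException("At least one entry is required");
        }
        return "INSERT INTO ENTRY(TYPE) VALUES"
                + String.join(", ", Collections.nCopies(entryCount, "(?)"));
    }

    // Inserts all types with a single executeUpdate() and returns the generated keys
    // in whatever order the driver reports them.
    static List<Integer> insert(Connection connection, List<String> types) throws SQLException {
        try (PreparedStatement statement =
                     connection.prepareStatement(buildSql(types.size()), Statement.RETURN_GENERATED_KEYS)) {
            for (int i = 0; i < types.size(); i++) {
                statement.setString(i + 1, types.get(i)); // JDBC parameters are 1-based
            }
            statement.executeUpdate();

            List<Integer> generatedKeys = new ArrayList<>();
            try (ResultSet keys = statement.getGeneratedKeys()) {
                while (keys.next()) {
                    // Assumption: column 1 of the key ResultSet is SHARED_ID.
                    generatedKeys.add(keys.getInt(1));
                }
            }
            return generatedKeys;
        }
    }

    public static void main(String[] args) {
        System.out.println(buildSql(4)); // prints: INSERT INTO ENTRY(TYPE) VALUES(?), (?), (?), (?)
    }
}
```

The executeBatch() alternative would instead prepare the single-row statement once and call addBatch() per entry; whether generated keys are returned for batches is also driver-dependent.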

@tobiasdiez I think you are right because multiple users might then have different shared IDs for the same BibEntry. So if one person updates an entry, the wrong one will be synchronized. Also, they have to be connected via foreign key to the FIELD table, so my algorithm won't work.

tobiasdiez · 2020-01-04T17:51:29Z

If you write a test now, it should fail. Then you can experiment with the different options until the test passes. I would suggest that the test generates a bunch of random entries (say 100) and then inserts them. Afterwards you can check that the shared id is the correct one for all entries.

abepolk · 2020-01-18T19:23:23Z

Okay. So far I've been hard-coding the example entries for the tests. I could do, say, 5, or even 100, but are you thinking I might use some other means to generate the random entries? Also, are we satisfied making sure the tests pass, given it is technically possible for the JDBC implementations to change at any time and there is no guarantee that they will maintain orderedness?

tobiasdiez · 2020-01-19T12:04:49Z

I guess a for loop from i = 1 to 100 which generates entries à la new BibEntry().withKey(i).withField(Journal, "journal" + i) should be sufficient. They don't have to be complex entries.

I would be satisfied if this test passes. If the implementation changes in the future, we get notified by the test, which then fails.
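The shape of such a test could be sketched as below. Everything here is a stand-in so that the example runs without a database: `Entry` is a hypothetical simplification of BibEntry, and `FakeDatabase` is an in-memory placeholder for the real DBMS that hands out IDs in insertion order. A real test would insert through DBMSProcessor and read the rows back from the database instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not JabRef's actual test code.
final class SharedIdOrderCheck {

    // Stand-in for BibEntry: only the parts the check needs.
    static final class Entry {
        final String key;
        final String journal;
        int sharedId = -1;

        Entry(String key, String journal) {
            this.key = key;
            this.journal = journal;
        }
    }

    // In-memory stand-in for the shared database: SHARED_ID -> stored citation key.
    static final class FakeDatabase {
        private final Map<Integer, String> keyBySharedId = new HashMap<>();
        private int nextId = 1;

        // Stores all entries and returns the generated IDs in insertion order.
        List<Integer> insertAll(List<Entry> entries) {
            List<Integer> generated = new ArrayList<>();
            for (Entry entry : entries) {
                keyBySharedId.put(nextId, entry.key);
                generated.add(nextId++);
            }
            return generated;
        }

        String storedKey(int sharedId) {
            return keyBySharedId.get(sharedId);
        }
    }

    // Generates simple, pairwise different entries: key1/journal1 ... keyN/journalN.
    static List<Entry> generateEntries(int count) {
        List<Entry> entries = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            entries.add(new Entry("key" + i, "journal" + i));
        }
        return entries;
    }

    // True if every entry's shared ID points at the row that holds that entry's key.
    static boolean allSharedIdsMatch(List<Entry> entries, FakeDatabase database) {
        for (Entry entry : entries) {
            if (!entry.key.equals(database.storedKey(entry.sharedId))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Entry> entries = generateEntries(100);
        FakeDatabase database = new FakeDatabase();
        List<Integer> generated = database.insertAll(entries);
        for (int i = 0; i < entries.size(); i++) {
            entries.get(i).sharedId = generated.get(i); // assign keys assuming they come back in order
        }
        System.out.println(allSharedIdsMatch(entries, database));
    }
}
```

Because every generated entry has a distinct key, a driver that returned the keys in a different order would make the check fail for at least two entries.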

abepolk · 2020-02-04T01:40:10Z

I am very confused about a bug that appears to be in OracleProcessor in the method I just wrote, insertIntoEntryTable. When I run the database tests with Oracle, I always get java.sql.SQLSyntaxErrorException: ORA-00933: SQL command not properly ended. If I just execute the SQL as an ordinary statement rather than a prepared statement and hard-code a parameter, the query executes. So it must be a problem with the prepared statement. For now these are lines 121-127. Ideas?

Siedlerchr · 2020-02-04T18:30:09Z

src/main/java/org/jabref/logic/shared/OracleProcessor.java

@@ -118,7 +118,8 @@ protected void insertIntoEntryTable(List<BibEntry> entries) {
            }
            insertEntryQuery.append(" SELECT * FROM DUAL");
            LOGGER.info(insertEntryQuery.toString());
-            try (PreparedStatement preparedEntryStatement = connection.prepareStatement(insertEntryQuery.toString())) {
+            try (PreparedStatement preparedEntryStatement = connection.prepareStatement(insertEntryQuery.toString(),
+                    new String[]{"SHARED_ID"})) {


I am not sure, but this looks odd to me with the String[]

abepolk · 2020-02-05

This just names the column whose auto-generated keys we want returned. See https://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html#prepareStatement(java.lang.String,%20int[])

abepolk · 2020-02-13

I looked into Stack Overflow and the Oracle JDBC driver source code, and it looks like the Oracle JDBC driver doesn't support getGeneratedKeys on INSERT ALL statements. So I'll have to find another way of getting the shared IDs.

koppor · 2020-02-19

OMG. This sounds like we should really drop Oracle support. There is no need for Oracle in 2020, is there? - Postgres is the way to go, isn't it?

We should IMHO also drop MySQL support as it does not offer automatic updates on changes on the server.

koppor · 2020-02-19T08:39:27Z

@NorwayMaple Is it something about a ; required in some settings and sometimes not? I had this when playing around.

We are planning to release the 5.0 version this weekend. Do you think you can finish this PR by Friday? If not, I will drop Oracle support. It causes too much trouble. I think we should go for https://www.jooq.org/. WDYT?

abepolk · 2020-02-19T13:53:19Z

If you look at the Oracle JDBC source code, the reason Oracle says the SQL is not properly ended is that the JDBC driver secretly appends a RETURNING INTO clause to get the generated key. The problem is that INSERT ALL statements require a SELECT statement at the end for arbitrary reasons, so the RETURNING clause after the SELECT confuses the database. There may be a way to do it using BULK COLLECT, but there are no examples online and the documentation is surprisingly poor. If you say Oracle is out of style, then by all means drop the support. I have no opinion on MySQL.
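To make the statement shape concrete, here is a small sketch that builds an Oracle INSERT ALL statement of the kind discussed. The class name and the column list `(TYPE)` are assumptions for illustration only; the trailing `SELECT * FROM DUAL` matches the line visible in the reviewed diff.

```java
// Hypothetical helper, not part of JabRef.
final class OracleInsertAllSql {

    // Builds "INSERT ALL INTO ENTRY (TYPE) VALUES (?) ... SELECT * FROM DUAL"
    // with one INTO clause per entry.
    static String build(int entryCount) {
        if (entryCount < 1) {
            throw new IllegalArgumentException("At least one entry is required");
        }
        StringBuilder sql = new StringBuilder("INSERT ALL");
        for (int i = 0; i < entryCount; i++) {
            sql.append(" INTO ENTRY (TYPE) VALUES (?)");
        }
        // INSERT ALL must end with a subquery; anything the driver appends
        // after this SELECT lands behind it.
        sql.append(" SELECT * FROM DUAL");
        return sql.toString();
    }

    public static void main(String[] args) {
        // prints: INSERT ALL INTO ENTRY (TYPE) VALUES (?) INTO ENTRY (TYPE) VALUES (?) SELECT * FROM DUAL
        System.out.println(build(2));
    }
}
```

Since the statement has to end with the SELECT, a clause appended by the driver for key retrieval would follow `SELECT * FROM DUAL`, which fits the ORA-00933 error reported earlier.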

Finally, as a note on Oracle, I should be able to make Oracle work iteratively with a SQL statement for each entry (but not one per field). This would mean it wasn't optimized like Postgres and MySQL, but this would be an easy and fast fix on my end, and we wouldn't even have to completely drop Oracle.

I'm not sure about Friday. I may be starting full-time work tomorrow on a temp job, and I have an interview on Wednesday. I can try to do Saturday afternoon, but I'm in the US East time zone, so that might be Saturday evening or night for you, and I still have to prepare for my interview. However, if you ignore Oracle, I think I can complete the PR much faster, even without www.jooq.org, so there may not be a lot of work left.

abepolk · 2020-02-19T19:27:25Z

I did the main fixes I wanted to do, so they work on Postgres and MySQL. Now I just have to test on Oracle, manually test in the GUI, fix checkStyle, and take a final look at the PR diff.

abepolk · 2020-02-19T21:20:49Z

@koppor I've done the manual testing and checkStyle, and looked at the PR diff. The only thing left to do is test it in Oracle. The code is simple and it is a hassle for me to work with Docker. I think if we merge it in, it might be easiest for GitHub's CI to tell us if it passes the Oracle tests. I know Oracle's not important, but we are so close! But for that, you have to merge it into the upstream because I can't do it from my fork.

tobiasdiez · 2020-02-19

Thanks for the quick update! The changes look good, and the tests are passing so I'll merge now.

abepolk · 2020-02-20T00:27:43Z

Looks like we are now getting Error processing tar file(exit status 1): write /opt/oracle/product/18c/dbhomeXE/md/property_graph/lib/lucene-analyzers-common-4.10.4.jar: no space left on device in the run. See https://github.com/JabRef/jabref/runs/456577265?check_suite_focus=true. So we still don't know if Oracle works. It might be worth figuring out.

* upstream/master:
  Try to fix linux pdf opening again (#5945)
  [WIP] Initial work on DBMSProcessor batch entry insertion into ENTRY table (#5814)
  followup fix Fixfetcher (#5948)
  Bump byte-buddy-parent from 1.10.7 to 1.10.8 (#5952)
  Added MenuButtons to IntegrityCheckDialog (#5955)
  Reimplement custom entry types dialog (#5799)
  Bump unirest-java from 3.4.03 to 3.5.00 (#5953)
  MySQL: Allow public key retrieval (#5909)
  Restructure and improve docs for setting up IntelliJ (#5960)
  Change syntax for Oracle multi-row insert SQL statement (#5837)

Initial work on DBMSProcessor entry insertion into ENTRY table

e701c62

abepolk and others added 3 commits January 17, 2020 10:36

Change syntax for Oracle multi-row insert SQL statement

a5bfc87

Merge branch 'master' into batch_DBMSProcessor_entries

3027c54

Run tests also when source files changed

ae9d53c

abepolk added 19 commits January 21, 2020 18:47

Add to comment about Oracle

ed675a2

Assume ResultSet is in order for setting shared IDs

4e80030

Add insertEntry for DBMSProcessor tests and fix PostgresSQLProcessor

67fdf27

Fix SQL typo

8aaf774

Separate table drops in Oracle tests

bbc81ab

Merge remote-tracking branch 'fork/fix_fields_sql' into fix_fields_sql

74ab816

Remove CI tests that were added in branch

17de560

Work on unit test for DBMSProcessor insertEntries

5f23e03

Fix bug in DBMSProcessorTest and simplify DBMSProcessor.FilterForBibEntryExistence

402a9cc

Remove Oracle connection bug with wrong port

a37129a

Add Oracle insertIntoEntryTable

94a75cd

Oracle connection fix - taken from fix_fields_sql branch

fbf1d33

Fix typo bug

dbbfe00

Clean up code

a5ab4e5

Remove commented blocks

2279464

Remove comment about needing a test that probably isn't necessary

926473a

Manually merge fix_fields_sql OracleProcessor (just add method)

9d5960f

Merge fix_fields_sql

a2fcb77

Emphasize todo

3bdf2d8

abepolk added 2 commits February 1, 2020 12:50

setSharedID into OracleProcessor entry table method

85f9196

Add shared id to preparedEntryStatement

245c48e

Siedlerchr reviewed Feb 4, 2020

abepolk added 3 commits February 19, 2020 10:39

Make Oracle insertIntoEntryTable iterative - pasted from master - not yet tested

79d8354

Add fields to fields table in parallel

c315838

Merge master

c073b8f

abepolk added 2 commits February 19, 2020 14:29

Reset test trace length

934acb2

Fix checkStyle

bd32ecd

Revert port setting

99191ff

tobiasdiez approved these changes Feb 19, 2020

tobiasdiez merged commit 93196ee into JabRef:master Feb 19, 2020

koppor mentioned this pull request Feb 25, 2020

Remove Oracle support #6013

Closed
