
Fixed first/last_indexed, improved import perf, support for mariadb connector #4108

Open · wants to merge 4 commits into base: dev

Conversation

@damien-git (Contributor) commented Nov 25, 2024

VuFind 10 includes a PR that modifies UpdateDateTracker.java to avoid a deprecated method. After we started using that in production, our Galera cluster crashed. Disabling first/last_indexed fixed it. We are guessing Galera could not keep up with the frequent statement creation during import.

This PR reverts the UpdateDateTracker PR. The finalize() code could simply have been removed, because DatabaseManager closes the database connection, and hence the statements, on shutdown.

It also improves performance by sending database updates in batches. This required changing DatabaseManager to call UpdateDateTracker's shutdown before the connection is closed; otherwise the statements could not have been used reliably to flush the final batch of data. In a test where the change_tracker table started empty (so only database inserts were used), import performance improved by 30%.
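
For readers unfamiliar with the pattern, here is a minimal sketch of batched JDBC updates as described above. The class name, batch size, and column list are illustrative assumptions, not VuFind's actual code.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

// Minimal sketch: the statement is prepared once and reused, rows are queued
// with addBatch(), and executeBatch() flushes them in a single round trip.
class BatchedTrackerSketch implements AutoCloseable {
    private static final int BATCH_SIZE = 1000; // assumed flush threshold
    private final PreparedStatement insert;
    private int pending = 0;

    BatchedTrackerSketch(Connection conn) throws SQLException {
        // Column list assumed from the change_tracker table discussed in this thread.
        insert = conn.prepareStatement(
            "INSERT INTO change_tracker (core, id, first_indexed, last_indexed) "
            + "VALUES (?, ?, ?, ?)");
    }

    void track(String core, String id, Timestamp now) throws SQLException {
        insert.setString(1, core);
        insert.setString(2, id);
        insert.setTimestamp(3, now);
        insert.setTimestamp(4, now);
        insert.addBatch();
        if (++pending >= BATCH_SIZE) {
            flush();
        }
    }

    void flush() throws SQLException {
        insert.executeBatch(); // one round trip for all queued rows
        pending = 0;
    }

    // This must run before the Connection is closed, which is the ordering
    // issue described above for DatabaseManager's shutdown handling.
    @Override
    public void close() throws SQLException {
        flush();
        insert.close();
    }
}
```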

Batches are sent a bit differently in MySQL and MariaDB. We are using MariaDB, so I had to add MariaDB-specific support and include the MariaDB JDBC connector. On the PHP side the mysql driver can still be used, but support for the mariadb connection string is needed.

NOTE 1: this change will require MariaDB users to change their configuration to use mariadb as the database driver (or in the connection string). If they don't, the first/last_indexed import might fail with a syntax error.
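
For illustration, this is the kind of URL change the note refers to; host, port, and database name are placeholders, and the exact setting lives in VuFind's configuration:

```java
// Placeholder JDBC URLs: the URL scheme determines which driver handles the connection.
class ConnectionStringExample {
    static final String MYSQL_URL   = "jdbc:mysql://localhost:3306/vufind";   // MySQL Connector/J
    static final String MARIADB_URL = "jdbc:mariadb://localhost:3306/vufind"; // MariaDB Connector/J
}
```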

NOTE 2: this code has only been tested with MariaDB so far, but it also changes execution paths for MySQL and PostgreSQL.

TODO

@demiankatz (Member)

Thanks, @damien-git! Since I'm out of office this week, I haven't had time to give this an especially close look yet, but I did run the full integration test suite on this branch in my local test environment using MySQL and everything passed. That's hardly an exhaustive test, but at least there's no catastrophic issue there.

I would like to do the same thing for PostgreSQL, but upgrading to Ubuntu 24 in my VM seems to have broken my PostgreSQL install; I'll need to figure that out when time permits so I can do further testing. I'll try to find time next week when I'm back in office, though with things piling up, it's possible it will take me a little longer!

@demiankatz (Member)

Just a quick note to say that I'm making some progress on getting my PostgreSQL test environment working. There have been quite a few distractions this week, but I hope to get back to this PR next week, once #4149 is merged.

@damien-git (Contributor, Author)

@demiankatz Thanks. I will be on vacation from Dec 13 to Jan 3 (inclusive), and then I will be working on FOLIO development again for 6 months. If we can't finish this before I go on vacation, maybe someone else on my team can pick it up in January, or maybe I will find some time to fix issues then.

@demiankatz (Member)

Enjoy your vacation, @damien-git! If I can make more progress on this early next week, I will try! I've just opened another PR (#4151) that will also be necessary to move this forward with PostgreSQL, but hopefully that won't take long to get approved.

@demiankatz (Member)

@maccabeelevine and @mtrojan-ub, since we've had conversations related to Java optimization and database logic, I thought you might have interest in this PR. No obligation to look if you're busy or not interested, but this seems like something that would benefit from more eyes! (I'm still waiting to get some PostgreSQL bug fix PRs merged so that I can move forward with testing more scenarios on my end).

@demiankatz added this to the 11.0 milestone Dec 10, 2024
@demiankatz (Member) left a comment

Thanks again, @damien-git! With all of my PostgreSQL fixes now in place, I was able to run the full test suite here using PostgreSQL, and it seems to be passing, just as MySQL did previously.

It would probably be wise to test this with something a bit more robust than the standard test suite, though I'm not sure exactly what scale would be necessary to constitute an adequate test. Open to suggestions!

In any case, I'm going to approve this but leave it open for a few more days in case others have comments or report different results.

@mtrojan-ub (Contributor)

This sounds like a great improvement, thank you very much for implementing it!

I can't really judge whether the implementation is entirely correct, but I can confirm that the first/last_indexed fields are usually the worst-performing fields during the import process, so there is a lot of potential in optimizing this. I also can't see any obvious bugs on a first look at the implementation.

@demiankatz (Member)

Thanks for taking a look, @mtrojan-ub. Do you want to try this out in your local environment before we merge it, or is that not feasible? If nothing else, I can try doing a larger import in my test environment to see if any problems emerge before merging this.

@mtrojan-ub (Contributor)

@demiankatz: Time until Christmas is very short, so it might take until January for me to test this, sorry. If you have the time, it would be great if you could test it.

@demiankatz (Member)

Thanks, @mtrojan-ub. If I have time to do more testing of my own, I'll share the results here. In any case, if you think it's worth doing testing on your end in January, I'll hold off on merging until you have time to do that.

@demiankatz (Member) left a comment

Unfortunately, there does seem to be a problem here.

I decided to test what happens if I turn on change tracking for all records in our standard test environment. Here's what I did:

1.) I started up the standard test environment with MySQL.

2.) I copied and uncommented the first_indexed and last_indexed lines from import/marc_local.properties to import/marc.properties (to ensure that they would get picked up by the Phing build process).

3.) I ran vendor/bin/phing import_biblios to reimport all of the test records.

What I expected: 328 entries in the change_tracker table (326 for the real records, plus the 2 fake deleted records we set up during the build process).

What I got: only the entries from the initial startup; the larger-scale import failed with these errors:

```
ERROR [Thread-3] (UpdateDateTracker.java:82) - SQLException in possiblyExecuteBatch(): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ',('biblio', 'dot.dash-underscore__3.colon:18', '2024-12-16 15:11:03.741921', '20' at line 1
ERROR [Thread-3] (UpdateDateTracker.java:183) - SQLException in shutdown hook: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ',('biblio', 'dot.dash-underscore__3.colon:18', '2024-12-16 15:11:03.741921', '20' at line 1
```

My guess would be that there's some kind of escaping problem where the records with unusual IDs are breaking the SQL. I'm not sure why that would be the case, though! It seems like the abstraction layer is being used appropriately. However, when I repeat my test procedure on the dev branch, I do get the expected result and no errors... so the problem is somehow introduced by the changes here.
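
For context on the shape of that error: JDBC drivers can rewrite a batch of single-row inserts into one multi-row statement, and the quoted fragment starting at ,('biblio', ... looks like the tail of such a rewrite. The following is a rough reconstruction of what the generated SQL might look like, an assumption rather than the actual statement:

```java
// Assumed shape of a driver-side batch rewrite: several addBatch() calls on a
// single-row INSERT are combined into one multi-row VALUES list.
class RewrittenBatchIllustration {
    static final String REWRITTEN =
        "INSERT INTO change_tracker (core, id, first_indexed, last_indexed) VALUES "
        + "('biblio', 'some-id', '2024-12-16 15:11:03.741921', '2024-12-16 15:11:03.741921')"
        + ",('biblio', 'dot.dash-underscore__3.colon:18', '2024-12-16 15:11:03.741921', "
        + "'2024-12-16 15:11:03.741921')";
    // A bug in assembling this VALUES list would surface as a server-side syntax
    // error beginning at a comma, matching the log above.
}
```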

@demiankatz (Member)

Addendum:

I decided to try this with PostgreSQL as well to see if I would get different results.

Following the previous experiment, I simply ran vendor/bin/phing reset_setup -Ddbtype=pgsql and then vendor/bin/phing import_biblios a second time.

In this scenario, I get the expected result. So it seems the problem may somehow be MySQL-specific.

@demiankatz (Member)

Update: the problem seems to be related to the very old MySQL JDBC .jar included in the project. Things work correctly with a newer version. I will open a separate PR to upgrade this.

@demiankatz (Member)

Okay, #4162 should fix the MySQL problems I encountered.

@demiankatz (Member)

Okay, #4162 has been merged to dev, and I've merged dev into this branch. This should now be in reasonable shape for @mtrojan-ub to test in the new year, if time permits.

In the meantime, one question for @damien-git: is there any possibility that using a more up-to-date MySQL jar eliminates the need for a separate MariaDB jar? (I suspect the answer is no, but just double-checking in case there's an opportunity to simplify).

@damien-git (Contributor, Author)

Thanks, @demiankatz. I was thinking it would be a good idea to update the other drivers, but I didn't want to mix that into this ticket and I didn't have time to test it. We do need a separate MariaDB jar for Java: the two drivers have diverged noticeably, especially with regard to bulk edits.
