Fixed first/last_indexed, improved import perf, support for mariadb connector #4108
base: dev
Conversation
Thanks, @damien-git! Since I'm out of office this week, I haven't had time to give this an especially close look yet, but I did run the full integration test suite on this branch in my local test environment using MySQL and everything passed. That's hardly an exhaustive test, but at least there's no catastrophic issue there. I would like to do the same thing for PostgreSQL, but upgrading to Ubuntu 24 in my VM seems to have broken my PostgreSQL install; I'll need to figure that out when time permits so I can do further testing. I'll try to find time next week when I'm back in office, though with things piling up, it's possible it will take me a little longer! |
Just a quick note to say that I'm making some progress on getting my PostgreSQL test environment working. There have been quite a few distractions this week, but I hope to get back to this PR next week, once #4149 is merged. |
@demiankatz Thanks. I will be on vacation from Dec 13 to Jan 3 (inclusive), and then I will be working on FOLIO development again for 6 months. If we can't finish it before I go on vacation, maybe someone else on my team can pick this up in January, or maybe I will find some time to fix issues in January. |
Enjoy your vacation, @damien-git! If I can make more progress on this early next week, I will try! I've just opened another PR (#4151) that will also be necessary to move this forward with PostgreSQL, but hopefully that won't take long to get approved. |
@maccabeelevine and @mtrojan-ub, since we've had conversations related to Java optimization and database logic, I thought you might have interest in this PR. No obligation to look if you're busy or not interested, but this seems like something that would benefit from more eyes! (I'm still waiting to get some PostgreSQL bug fix PRs merged so that I can move forward with testing more scenarios on my end). |
Thanks again, @damien-git! With all of my PostgreSQL fixes now in place, I was able to run the full test suite here using PostgreSQL, and it seems to be passing, just as MySQL did previously.
It would probably be wise to test this with something a bit more robust than the standard test suite, though I'm not sure exactly what scale would be necessary to constitute an adequate test. Open to suggestions!
In any case, I'm going to approve this but leave it open for a few more days in case others have comments or report different results.
This sounds like a great improvement, thank you very much for implementing it! I can't really judge whether the implementation is entirely correct, but I can confirm that first/last_indexed fields are usually the fields with the worst performance during the import process, so there is a lot of potential in optimizing this. I also cannot see any obvious bugs on a first look at the implementation. |
Thanks for taking a look, @mtrojan-ub. Do you want to try this out in your local environment before we merge it, or is that not feasible? If nothing else, I can try doing a larger import in my test environment to see if any problems emerge before merging this. |
@demiankatz: Time until Christmas is very short, so it might take until January for me to test this, sorry. If you have the time, it would be great if you could test this. |
Thanks, @mtrojan-ub. If I have time to do more testing of my own, I'll share the results here. In any case, if you think it's worth doing testing on your end in January, I'll hold off on merging until you have time to do that. |
Unfortunately, there does seem to be a problem here.
I decided to test what happens if I turn on change tracking for all records in our standard test environment. Here's what I did:
1. I started up the standard test environment with MySQL.
2. I copied and uncommented the first_indexed and last_indexed lines from import/marc_local.properties to import/marc.properties (to ensure that they would get picked up by the Phing build process).
3. I ran vendor/bin/phing import_biblios to reimport all of the test records.
What I expected: 328 entries in the change_tracker table (326 for the real records, plus the 2 fake deleted records we set up during the build process).
What I got: only the entries from the initial startup; the larger-scale import failed with these errors:
ERROR [Thread-3] (UpdateDateTracker.java:82) - SQLException in possiblyExecuteBatch(): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ',('biblio', 'dot.dash-underscore__3.colon:18', '2024-12-16 15:11:03.741921', '20' at line 1
ERROR [Thread-3] (UpdateDateTracker.java:183) - SQLException in shutdown hook: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ',('biblio', 'dot.dash-underscore__3.colon:18', '2024-12-16 15:11:03.741921', '20' at line 1
My guess would be that there's some kind of escaping problem where the records with unusual IDs are breaking the SQL. I'm not sure why that would be the case, though! It seems like the abstraction layer is being used appropriately. However, when I repeat my test procedure on the dev branch, I do get the expected result and no errors... so the problem is somehow introduced by the changes here.
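To illustrate the escaping concern discussed above: when a batched multi-row INSERT is built with JDBC placeholders, record IDs never appear in the SQL text itself, so unusual characters cannot break the syntax. The sketch below is hypothetical (it is not VuFind's actual code, and the change_tracker column list is assumed), but it shows the placeholder-based shape a safe multi-row statement would take before values are bound via PreparedStatement.setString().

```java
// Hypothetical sketch: build a multi-row INSERT for the change_tracker table
// using placeholder groups. IDs like "dot.dash-underscore__3.colon:18" are
// bound later via PreparedStatement, so they never touch the SQL text.
public class MultiRowInsertSketch {
    static String buildInsert(int rows) {
        StringBuilder sql = new StringBuilder(
            "INSERT INTO change_tracker (core, id, first_indexed, last_indexed) VALUES ");
        for (int i = 0; i < rows; i++) {
            if (i > 0) {
                sql.append(",");
            }
            sql.append("(?, ?, ?, ?)"); // one placeholder group per queued record
        }
        return sql.toString();
    }
}
```

If the driver instead rewrites a batch into a concatenated multi-row statement with literal values, quoting bugs in an old connector could produce exactly the kind of syntax error shown in the logs above.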
Addendum: I decided to try this with PostgreSQL as well to see if I would get different results. Following the previous experiment, I simply ran the import again. In this scenario, I get the expected result. So it seems the problem may somehow be MySQL-specific. |
Update: the problem seems to be related to the very old MySQL JDBC .jar included in the project. Things work correctly with a newer version. I will open a separate PR to upgrade this. |
Okay, #4162 should fix the MySQL problems I encountered. |
Okay, #4162 has been merged to dev, and I've merged dev into this branch. This should now be in reasonable shape for @mtrojan-ub to test in the new year, if time permits. In the meantime, one question for @damien-git: is there any possibility that using a more up-to-date MySQL jar eliminates the need for a separate MariaDB jar? (I suspect the answer is no, but just double-checking in case there's an opportunity to simplify). |
Thanks, @demiankatz. I was thinking it would be a good idea to update the other drivers, but I didn't want to mix that up with this ticket and I didn't have the time to test it. We do need a separate MariaDB jar for Java; the two drivers have diverged noticeably, especially in regard to bulk edits. |
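One concrete consequence of the two drivers diverging is that each connector registers its own JDBC URL scheme (jdbc:mysql:// for MySQL Connector/J, jdbc:mariadb:// for MariaDB Connector/J). The tiny helper below is purely illustrative (jdbcUrl is a hypothetical function, not part of VuFind); it just shows how a driver name would map to the corresponding connection URL.

```java
// Hypothetical helper: map a configured driver name to its JDBC URL scheme.
// MySQL Connector/J answers to jdbc:mysql://, MariaDB Connector/J to
// jdbc:mariadb://, which is why the config change in this PR matters.
public class JdbcUrlSketch {
    static String jdbcUrl(String driver, String host, String database) {
        if (!driver.equals("mysql") && !driver.equals("mariadb")) {
            throw new IllegalArgumentException("unsupported driver: " + driver);
        }
        return "jdbc:" + driver + "://" + host + "/" + database;
    }
}
```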
VuFind 10 includes a PR that modifies UpdateDateTracker.java to avoid a deprecated method. After we started using that in prod, our Galera cluster crashed. Disabling first/last_indexed fixed it. We are guessing Galera could not keep up with the frequent statement creations during import. This PR reverts the UpdateDateTracker PR. The finalize() code could have simply been removed, because DatabaseManager closes the database connection, and hence the statements, on shutdown.

It also improves performance by sending database updates in batches. This required changing DatabaseManager to call the shutdown in UpdateDateTracker before the connection is closed; otherwise the statements could not have been used reliably to finish sending database data in batches. When the change_tracker table is empty and only database inserts are used, import performance improved by 30% in a test.

Sending batches is done a bit differently in MySQL and MariaDB. We are using MariaDB, so I had to add specific support for MariaDB and include the MariaDB JDBC connector. In PHP code the MySQL driver can be used, but support for the mariadb connection string is needed.

NOTE 1: this change will require MariaDB users to change their config to use mariadb as the database driver (or in the connection string). If they don't, first/last_indexed import might fail with a syntax error.

NOTE 2: this code has only been tested with MariaDB so far, but it also changes execution for MySQL and PostgreSQL.
TODO