
pkp/pkp#8933 Restore original file after cancelling file upload wizard #8941

Merged: 22 commits into pkp:main from i8933_restore_original, Jun 2, 2023

Conversation

Vitaliy-1 (Collaborator)

No description provided.

Review threads (outdated, resolved):
  • controllers/api/file/PKPManageFileApiHandler.php (×4)
  • controllers/wizard/fileUpload/FileUploadWizardHandler.php
@Vitaliy-1 force-pushed the i8933_restore_original branch 4 times, most recently from 5383bf9 to 5ff9333 (May 20, 2023 09:13)
@NateWr (Contributor) left a comment:

Just one longer comment about modifying the schema at run-time like this. I don't know if you'll have time to rethink that approach, but something to think about anyway. Otherwise, it's all small comments. 👍

Review threads:
  • api/v1/submissions/PKPSubmissionHandler.php (outdated, resolved)
  • classes/security/Validation.php (two threads: one outdated, both resolved)
  • classes/facades/Repo.php (resolved)
  • classes/log/event/Collector.php (resolved)
  • controllers/wizard/fileUpload/FileUploadWizardHandler.php (outdated, resolved)
  • schemas/eventLog.json (×2, outdated, resolved)
  • classes/log/event/DAO.php (outdated, resolved)
@Vitaliy-1 force-pushed the i8933_restore_original branch from 379e695 to 9b2f174 (May 28, 2023 20:29)
// Drop the existing foreign key and index so the column can be modified
$table->dropForeign('event_log_user_id_foreign');
$table->dropIndex('event_log_user_id');
// Make user_id nullable (e.g. for system-generated events)
$table->bigInteger('user_id')->nullable()->change();
// Restore the foreign key, removing log entries when the user is deleted
$table->foreign('user_id')->references('user_id')->on('users')->onDelete('cascade');
Member:

Are you sure you want an ON DELETE CASCADE here? I would think ON DELETE SET NULL would be more appropriate -- the point is to be able to continue to log events when a user account doesn't exist, right?
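For contrast, a minimal sketch of the suggested alternative using the same schema-builder call as in the snippet above (not code from this PR):

// Hypothetical: keep the log entry and null out the user reference instead
$table->foreign('user_id')->references('user_id')->on('users')->onDelete('set null');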

Collaborator (author):

Correct me if I'm wrong: the only appropriate way to remove a user account is to merge it with another one, in which case we also combine their log entries. Thus, ON DELETE CASCADE here would prevent data corruption in unexpected cases. If not, I can use ON DELETE SET NULL.

Forgot to mention: the idea behind making user_id nullable is to allow logging events from the system, automated events, or other cases where a specific user isn't involved.

Member:

That makes sense to me, thanks, please go ahead. But please do add a ->comment to the user_id column in LogMigration to explain what a null value means (we'll add more of these in future releases)!
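A minimal sketch of what that could look like in LogMigration, assuming Laravel's schema builder as used above (the comment wording is illustrative, not the PR's actual code):

use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

Schema::table('event_log', function (Blueprint $table) {
    // Document what a NULL value means for future readers of the schema
    $table->bigInteger('user_id')->nullable()
        ->comment('NULL if the event was triggered by the system, not a user')
        ->change();
});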

$mapLocaleWithSettingIds = [];
foreach ($logChunks as $row) {
// Get locale based on a submission file ID log entry
$locale = $this->getContextPrimaryLocale($row, $sitePrimaryLocale);
@asmecher (Member), May 30, 2023:

I think you can do this all in SQL for a significant performance boost. Do a LEFT JOIN to the appropriate table for each `assoc_type`, then join on journals (contexts) using COALESCE() around the candidate IDs; that'll give you a single `journals` join regardless of what kind of `assoc_type` was present.
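A rough sketch of that shape (the 1048585 constant appears later in this thread; the submission-file assoc_type value and the exact join targets are assumptions):

-- Hypothetical: LEFT JOIN one candidate table per assoc_type, then resolve
-- the journal via COALESCE() over the candidate context IDs.
SELECT e.log_id, j.journal_id, j.primary_locale
FROM event_log AS e
LEFT JOIN submissions AS s
       ON e.assoc_type = 1048585 AND s.submission_id = e.assoc_id   -- ASSOC_TYPE_SUBMISSION
LEFT JOIN submission_files AS sf
       ON e.assoc_type = 515 AND sf.submission_file_id = e.assoc_id -- ASSOC_TYPE_SUBMISSION_FILE (assumed)
LEFT JOIN submissions AS sfs
       ON sfs.submission_id = sf.submission_id
JOIN journals AS j
       ON j.journal_id = COALESCE(s.context_id, sfs.context_id)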

Member:

Thinking further about performance: would we be able to turn this around to optimize it? Currently it's...

  • for each file upload, edit, or revise event,
    • get its context ID's primary locale, and
    • update the setting's locale column to match.

If I understand it right, it would probably be a lot faster to go the other way around (a sketch follows the list below):

  • For each context,
    • Get its primary locale, and
    • Update any file upload, edit, or revise event log entry settings to match.
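A hedged sketch of that inversion as a single MySQL UPDATE ... JOIN, with both placeholders bound per context (table and column names follow the snippets elsewhere in this thread):

-- Hypothetical: one bulk UPDATE per context instead of one lookup per log entry
UPDATE event_log_settings AS es
JOIN event_log AS e ON e.log_id = es.log_id
JOIN submissions AS s ON s.submission_id = e.assoc_id AND e.assoc_type = 1048585
SET es.locale = ?            -- this context's primary locale
WHERE es.setting_name = 'filename'
  AND s.context_id = ?       -- this context's ID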

Collaborator (author):

Thanks, that was also on my mind as a belated idea, but I wasn't sure the performance boost would be worth re-implementing it.

Member:

Just for reference, there are 3M entries in SciELO's event log table; 567k are in (SUBMISSION_LOG_FILE_UPLOAD, SUBMISSION_LOG_FILE_EDIT, SUBMISSION_LOG_FILE_REVISION_UPLOAD). So the cost of going through these individually would be very high.

Collaborator (author):

I'm deriving the context primary locale from only a submission or a submission file. The event log doesn't have a context_id column, so I need to iterate over all those entries. I meant to move the statement that selects the context locale out of the loop.

I've optimised the migration where possible. The major change is setting the chunk size to 10000, which means the update statement runs once per 10k entries. On my laptop, 500k records are updated within 15 minutes. If that's not enough, I can try to create a wider join statement, a series of something like:

SELECT * FROM event_log AS e
JOIN event_log_settings AS es ON e.log_id = es.log_id
JOIN submissions AS s ON e.assoc_id = s.submission_id
WHERE e.assoc_type = 1048585
AND es.setting_name = 'filename'
AND s.context_id = ?

Then I'd update the log entries all together, knowing the associated context and its primary locale. It might be faster, but more prone to corrupted data.
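A hedged sketch of how that could look with Laravel's query builder, combining the wider join with the 10k chunking described above (every name here is illustrative, not the migration's actual code):

use Illuminate\Support\Facades\DB;

// Hypothetical: cache each context's primary locale once, then update
// event_log_settings in 10k-row chunks, one UPDATE per context per chunk.
$localeByContext = DB::table('journals')->pluck('primary_locale', 'journal_id');

DB::table('event_log as e')
    ->join('submissions as s', 's.submission_id', '=', 'e.assoc_id')
    ->where('e.assoc_type', 1048585)
    ->select('e.log_id', 's.context_id')
    ->chunkById(10000, function ($rows) use ($localeByContext) {
        foreach ($rows->groupBy('context_id') as $contextId => $group) {
            DB::table('event_log_settings')
                ->whereIn('log_id', $group->pluck('log_id'))
                ->where('setting_name', 'filename')
                ->update(['locale' => $localeByContext[$contextId]]);
        }
    }, 'e.log_id', 'log_id');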

Member:

That sounds good, thanks -- please go ahead, and I'll do a sanity check with a large dataset once all upgrade-related merges are complete in order to make sure we don't regress on performance.

@Vitaliy-1 force-pushed the i8933_restore_original branch 3 times, most recently from 87f2613 to f18aae2 (June 2, 2023 08:24)
@Vitaliy-1 force-pushed the i8933_restore_original branch from f18aae2 to 25123ae (June 2, 2023 14:19)
@Vitaliy-1 merged commit a8af7b3 into pkp:main on Jun 2, 2023