
Increase timeout and improve DB table indexes for jobs ProcessUsageStatsLogFile and RemoveDoubleClicks #9822

Closed
bozana opened this issue Mar 22, 2024 · 28 comments

@bozana
Collaborator

bozana commented Mar 22, 2024

The name and topic of this issue changed:
It seems the ProcessUsageStatsLogFile job (which reads and validates the log file, removes the bot entries and inserts the entries into the usage stats temporary tables) is not the problem any more, so it does not need to be separated into smaller processes (each processing a chunk of the log file).
The job RemoveDoubleClicks, however, seems to be the problem -- it takes too long for bigger log files. The job executes this function: https://github.com/pkp/pkp-lib/blob/main/classes/statistics/PKPTemporaryTotalsDAO.php#L90.
To solve this we will change the index on the table usage_stats_total_temporary_records to also consider the user_agent and canonical_url columns, so that it becomes something like:
$table->index(['load_id', 'context_id', 'ip', 'user_agent', 'canonical_url'], 'ust_load_id_context_id_ip_ua_url');
Also, we will increase the $timeout (and $retry_after) for this job.
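
For illustration, a rough sketch (not the actual pkp-lib migration) of how such a widened index could be declared with Laravel's schema builder; the table and column names follow the snippet above, the surrounding migration is assumed:

use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

Schema::table('usage_stats_total_temporary_records', function (Blueprint $table) {
    // Widened index covering all columns used by the double-click removal join:
    // load_id, context_id, ip, user_agent and canonical_url.
    $table->index(
        ['load_id', 'context_id', 'ip', 'user_agent', 'canonical_url'],
        'ust_load_id_context_id_ip_ua_url'
    );
});

Note that the combined lengths of the indexed columns (in particular user_agent and canonical_url) have to stay within the database's index key size limit.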

PRs:
stable-3_4_0:

main:


Original issue:
Big installations that have huge log files and that use the AcronPlugin for job processing can experience timeouts. Thus, try to separate the ProcessUsageStatsLogFile job into several smaller jobs, so that each job processes only e.g. 100 lines of the log file.
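
For illustration, a rough sketch of what that chunked processing could look like with Laravel's job batching; ProcessUsageStatsLogFileChunk is a hypothetical job class and the file handling is simplified:

use Illuminate\Support\Facades\Bus;

// Read the log and dispatch one job per chunk of 100 lines (hypothetical sketch).
$lines = file($logFilePath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$jobs = [];
foreach (array_chunk($lines, 100) as $chunk) {
    $jobs[] = new ProcessUsageStatsLogFileChunk($loadId, $chunk); // hypothetical Batchable job
}
Bus::batch($jobs)->dispatch();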

@bozana bozana self-assigned this Mar 22, 2024
@bozana bozana added the Enhancement:1:Minor A new feature or improvement that can be implemented in less than 3 days. label Mar 22, 2024
@github-project-automation github-project-automation bot moved this to Backlog in Statistics Mar 22, 2024
@bozana bozana added this to the 3.4.0-6 milestone Mar 22, 2024
@bozana
Collaborator Author

bozana commented May 3, 2024

Because the current solution also works for the huge installation, I will close this issue for now...

@bozana bozana closed this as not planned (won't fix, can't repro, duplicate, stale) May 3, 2024
@github-project-automation github-project-automation bot moved this from Backlog to Done in Statistics May 3, 2024
@shantanu198713

shantanu198713 commented Jul 9, 2024

Hi @bozana
I am still facing the "has been attempted too many times or run too long. The job may have previously timed out." issue. This has been happening for a long time. I work around it by manually clicking the Try Again button multiple times.

[Screenshot of the failed job error, 2024-07-09]

{
    "message": "APP\\jobs\\statistics\\ProcessUsageStatsLogFile has been attempted too many times or run too long. The job may have previously timed out.",
    "code": 0,
    "file": "\/ojs\/lib\/pkp\/lib\/vendor\/laravel\/framework\/src\/Illuminate\/Queue\/Worker.php",
    "line": 746,
    "trace": [
        {
            "file": "\/ojs\/lib\/pkp\/lib\/vendor\/laravel\/framework\/src\/Illuminate\/Queue\/Worker.php",
            "line": 505,
            "function": "maxAttemptsExceededException",
            "class": "Illuminate\\Queue\\Worker",
            "type": "->"
        },
        {
            "file": "\/ojs\/lib\/pkp\/lib\/vendor\/laravel\/framework\/src\/Illuminate\/Queue\/Worker.php",
            "line": 415,
            "function": "markJobAsFailedIfAlreadyExceedsMaxAttempts",
            "class": "Illuminate\\Queue\\Worker",
            "type": "->"
        },
        {
            "file": "\/ojs\/lib\/pkp\/lib\/vendor\/laravel\/framework\/src\/Illuminate\/Queue\/Worker.php",
            "line": 375,
            "function": "process",
            "class": "Illuminate\\Queue\\Worker",
            "type": "->"
        },
        {
            "file": "\/ojs\/lib\/pkp\/lib\/vendor\/laravel\/framework\/src\/Illuminate\/Queue\/Worker.php",
            "line": 326,
            "function": "runJob",
            "class": "Illuminate\\Queue\\Worker",
            "type": "->"
        },
        {
            "file": "\/ojs\/lib\/pkp\/classes\/core\/PKPQueueProvider.php",
            "line": 106,
            "function": "runNextJob",
            "class": "Illuminate\\Queue\\Worker",
            "type": "->"
        },
        {
            "file": "\/ojs\/lib\/pkp\/classes\/queue\/JobRunner.php",
            "line": 220,
            "function": "runJobInQueue",
            "class": "PKP\\core\\PKPQueueProvider",
            "type": "->"
        },
        {
            "file": "\/ojs\/lib\/pkp\/classes\/core\/PKPQueueProvider.php",
            "line": 135,
            "function": "processJobs",
            "class": "PKP\\queue\\JobRunner",
            "type": "->"
        },
        {
            "function": "PKP\\core\\{closure}",
            "class": "PKP\\core\\PKPQueueProvider",
            "type": "->"
        }
    ]
}

@bozana bozana reopened this Jul 10, 2024
@github-project-automation github-project-automation bot moved this from Done to Todo in Statistics Jul 10, 2024
@asmecher asmecher modified the milestones: 3.4.0-6, 3.4.0-7 Jul 12, 2024
@mpbraendle
Contributor

Same here -- one usage events log file is just not processed, with the same errors as OP reported.
Log files are between 3 MB and 21 MB.

@asmecher asmecher modified the milestones: 3.4.0-7, 3.4.0-8 Aug 23, 2024
@shantanu198713

Working fine now after transferring the database to a Nutanix server.

@mpbraendle
Contributor

mpbraendle commented Nov 4, 2024

@bozana Still no luck. OJS 3.4.0-7. The MariaDB database is on a central database server (no local installation; we have to use the central primary/replica server).

{ "uuid": "a50aa537-1e13-42ac-a80a-cd11258896bf", "displayName": "APP\\jobs\\statistics\\CompileUniqueInvestigations", "job": "Illuminate\\Queue\\CallQueuedHandler@call", "maxTries": 3, "maxExceptions": 3, "failOnTimeout": true, "backoff": "5", "timeout": 60, "retryUntil": null, "data": { "commandName": "APP\\jobs\\statistics\\CompileUniqueInvestigations", "command": "O:47:\"APP\\jobs\\statistics\\CompileUniqueInvestigations\":5:{s:9:\"\u0000*\u0000loadId\";s:25:\"usage_events_20241020.log\";s:10:\"connection\";s:8:\"database\";s:5:\"queue\";s:5:\"queue\";s:19:\"chainCatchCallbacks\";a:1:{i:0;O:47:\"Laravel\\SerializableClosure\\SerializableClosure\":1:{s:12:\"serializable\";O:46:\"Laravel\\SerializableClosure\\Serializers\\Native\":5:{s:3:\"use\";a:0:{}s:8:\"function\";s:44:\"function (\\Throwable $e) {\n }\";s:5:\"scope\";s:28:\"PKP\\task\\PKPUsageStatsLoader\";s:4:\"this\";N;s:4:\"self\";s:32:\"000000000000027c0000000000000000\";}}}s:7:\"chained\";a:71:{i:0;s:159:\"O:41:\"APP\\jobs\\statistics\\CompileUniqueRequests\":3:{s:9:\"\u0000*\u0000loadId\";s:25:\"usage_events_20241020.log\";s:10:\"connection\";s:8:\"database\";s:5:\"queue\";s:5:\"queue\";}\";i:1;s:159:\"O:41:\"PKP\\jobs\\statistics\\CompileContextMetrics\":3:{s:9:\"\u0000*\u0000loadId\";s:25:\"usage_events_20241020.log\";s:10:\"connection\";s:8:\"database\";s:5:\"queue\";s:5:\"queue\";}\";i:2;s:157:\"O:39:\"APP\\jobs\\statistics\\CompileIssueMetrics\":3:{s:9:\"\u0000*\u0000loadId\";s:25:\"usage_events_20241020.log\";s:10:\"connection\";s:8:\"database\";s:5:\"queue\";s:5:\"queue\";}\";i:3;s:162:\"O:44:\"PKP\\jobs\\statistics\\CompileSubmissionMetrics\":3:{s:9:\"\u0000*\u0000loadId\";s:25:\"usage_events_20241020.log\";s:10:\"connection\";s:8:\"database\";s:5:\"queue\";s:5:\"queue\";}\";i:4;s:170:\"O:52:\"APP\\jobs\\statistics\\CompileSubmissionGeoDailyMetrics\":3:{s:9:\"\u0000*\u0000loadId\";s:25:\"usage_events_20241020.log\";s:10:\"connection\";s:8:\"database\";s:5:\"queue\";s:5:\"queue\";}\";i:5;s:174:\"O:56:\"APP\\jobs\\statistics\\CompileCounterSubmissionDailyMetrics\":3:{s:9:\"\u0000*\u0000loadId\";s:25:\"usage_events_20241020.log\";s:10:\"connection\";s:8:\"database\";s:5:\"queue\";s:5:\"queue\";}\";i:6;s:185:\"O:67:\"APP\\jobs\\statistics\\CompileCounterSubmissionInstitutionDailyMetrics\":3:{s:9:\"\u0000*\u0000loadId\";s:25:\"usage_events_20241020.log\";s:10:\"connection\";s:8:\"database\";s:5:\"queue\";s:5:\"queue\";}\";i:7;s:170:\"O:52:\"APP\\jobs\\statistics\\DeleteUsageStatsTemporaryRecords\":3:{s:9:\"\u0000*\u0000loadId\";s:25:\"usage_events_20241020.log\";s:10:\"connection\";s:8:\"database\";s:5:\"queue\";s:5:\"queue\";}\";i:8;s:8354:\"O:44:\"PKP\\jobs\\statistics\\ArchiveUsageStatsLogFile\":4:{s:9:\"\u0000*\u0000loadId\";s:25:\"usage_events_20241020.log\";s:7:\"\u0000*\u0000site\";O:13:\"PKP\\site\\Site\":6: ... } }

@bozana
Collaborator Author

bozana commented Nov 7, 2024

@mpbraendle, hmmm... it seems the job batch fails on CompileUniqueInvestigations, so the solution I had in mind for this issue would not help. For this issue I assumed that the reading of the log file and the insertion of its entries into the DB tables, i.e. the job ProcessUsageStatsLogFile, was failing.
It seems that for you the processing of the unique investigations in the DB fails. This processing is done entirely in the DB, on the temporary usage stats tables, and for it the whole log file (all entries) has to be inserted into the temporary tables first.
Hmmm...
Do your temporary usage stats DB tables contain the entries for only one log file (when the processing fails)?
I will check with a colleague, who probably knows this better, how to investigate why exactly it is failing...

@asmecher asmecher modified the milestones: 3.4.0-8, 3.4.0-9 Nov 29, 2024
@mpbraendle
Contributor

@bozana - the timeout of 60 s is inherited from lib/pkp/jobs/BaseJob.php.

If you add a higher timeout value, e.g.

public int $timeout = 600;

or even public int $timeout = 1200;

to the

PKPProcessUsageStatsLogFile and RemoveDoubleClicks classes, the jobs run through.

Especially the RemoveDoubleClicks task may take a while (even on a locally installed MariaDB database).

Usually, we have 1000-2000 views and 1000-1500 downloads per day, sometimes up to 10000. This will increase over time as we add more articles retrospectively.
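
For illustration, a rough sketch of overriding the inherited timeout in one of these job classes (the namespace and the job body are assumptions, not the actual pkp-lib code):

namespace PKP\jobs\statistics;

use PKP\jobs\BaseJob;

class RemoveDoubleClicks extends BaseJob
{
    // Override the 60-second default inherited from BaseJob so the
    // double-click removal on large temporary tables does not time out.
    public int $timeout = 600;

    public function handle(): void
    {
        // ... existing double-click removal logic ...
    }
}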

@bozana
Collaborator Author

bozana commented Jan 6, 2025

Hi @mpbraendle, thanks a lot!
You are running jobs within a request, correct? What are your server settings for the max execution time of a PHP script?
I will check with @touhidurabir whether that timeout change would make sense...

@mpbraendle
Contributor

mpbraendle commented Jan 7, 2025

No, I run them via workers (using Supervisor), as recommended in https://docs.pkp.sfu.ca/admin-guide/en/deploy-jobs

@touhidurabir
Member

Hi all, the problem we have here is a bit more complex, considering the current infrastructural limitations of our queue system.

So, first of all, for such long-running queue jobs the best and right approach is to use a worker with Supervisor, as @mpbraendle has mentioned. The job runner is simply not going to cut it here.

However, we have a limitation on the timeout, which is tied to the retry_after setting of the global queue configuration; retry_after cannot be applied at the job level, only at the queue level. It makes sure that no job gets picked up again while it is still being processed, in the case of MULTIPLE workers.

Now, setting the timeout to a higher value means we also have to set retry_after to an even higher value, per the mechanics of Laravel's queue system. That, however, can present another problem for other jobs. Say a simple job that only takes 30 seconds to complete fails for some reason. Because retry_after needs to be set higher than the highest timeout of ANY job, that failed job will not be picked up again until that time has passed (so with retry_after at 1200 seconds, a worker will not attempt it again before 1200 seconds have passed since its last attempt, given that the job has retries left). This is according to Laravel's docs:

The --timeout value should always be at least several seconds shorter than your retry_after configuration value. This will ensure that a worker processing a frozen job is always terminated before the job is retried. If your --timeout option is longer than your retry_after configuration value, your jobs may be processed twice.

and

The pcntl PHP extension must be installed in order to specify job timeouts. In addition, a job's "timeout" value should always be less than its "retry after" value. Otherwise, the job may be re-attempted before it has actually finished executing or timed out.

A similar situation is explained in more detail at laravel/framework#35633.

NOTE that this case arises when jobs run via a worker and MULTIPLE workers are set up, which is currently possible using Supervisor.

I can think of a few possible solutions:

  1. Have multiple queues and queue connections, and put long-running jobs on such a separate connection. However, it is not possible to add those dynamically, and it also requires setting up multiple workers via Supervisor. Not a good option for us currently.
  2. Break these long-running jobs into batches using job batching.
  3. Keep retry_after as it is, increase the timeout, and mark such jobs as unique, https://laravel.com/docs/11.x/queues#unique-jobs (see the sketch below). However, this requires some investigation before implementing.

Another option is to increase only the timeout, as most of the time there is only one worker. However, that always leaves the possibility of a job being picked up twice if multiple workers are set up, which is possible now. At the same time, this is also the easiest option.
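
For illustration, a rough sketch of option 3 -- marking a long-running stats job as unique; the class name, $uniqueFor value and uniqueId scope here are hypothetical, not the actual pkp-lib implementation:

use Illuminate\Contracts\Queue\ShouldBeUnique;
use PKP\jobs\BaseJob;

class LongRunningStatsJob extends BaseJob implements ShouldBeUnique
{
    public int $timeout = 600;

    // While the uniqueness lock is held (at most $uniqueFor seconds),
    // dispatching another instance with the same uniqueId is ignored.
    public int $uniqueFor = 1200;

    // Hypothetical: identifies the usage stats log file being processed.
    protected string $loadId;

    public function uniqueId(): string
    {
        // Scope uniqueness to one job per log file.
        return $this->loadId;
    }

    public function handle(): void
    {
        // ... long-running statistics processing ...
    }
}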

@mpbraendle
Contributor

That shouldn't be much of a problem, because the statistics are usually calculated once per night.

@touhidurabir
Member

touhidurabir commented Jan 8, 2025

That shouldn't be much of a problem, because the statistics are usually calculated once per night.

It's not about when or how many times the job is supposed to run, but whether the same job can be run multiple times at the same time by MULTIPLE workers -- which can happen if multiple workers are set up and retry_after is not higher than the highest job timeout.

@bozana what would be the impact if the same stats job ran more than once at the same time?

@bozana
Collaborator Author

bozana commented Jan 8, 2025

Hi @touhidurabir, this should not happen -- the stats calculation could then be wrong.

@bozana
Collaborator Author

bozana commented Jan 8, 2025

Also, some jobs, e.g. RemoveDoubleClicks, which is causing the problem here, cannot be split further into smaller jobs -- it runs this SQL on one DB table, which has to remove the user double clicks appropriately:

DELETE FROM ust USING {$this->table} ust
               INNER JOIN {$this->table} ustt ON (
                    ustt.load_id = ust.load_id AND
                    ustt.context_id = ust.context_id AND
                    ustt.ip = ust.ip AND
                    ustt.user_agent = ust.user_agent AND
                    ustt.canonical_url = ust.canonical_url
                )
                WHERE ust.load_id = ? AND
                    TIMESTAMPDIFF(SECOND, ust.date, ustt.date) < ? AND
                    TIMESTAMPDIFF(SECOND, ust.date, ustt.date) > 0 AND
                    ust.line_number < ustt.line_number 

@bozana
Collaborator Author

bozana commented Jan 8, 2025

Maybe also a note that all stats jobs are chained, see https://github.com/pkp/pkp-lib/blob/main/classes/task/PKPUsageStatsLoader.php#L122, because the order is important and they must not run in parallel.
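
For illustration, a rough sketch of such a chain with Laravel's bus; the job names come from this thread, while the constructor arguments and the exact list are assumptions (the real chain is built in PKPUsageStatsLoader):

use Illuminate\Support\Facades\Bus;

// Each job starts only after the previous one has finished, so the order is
// preserved and the stats jobs for one log file never run in parallel.
Bus::chain([
    new ProcessUsageStatsLogFile($loadId),
    new RemoveDoubleClicks($loadId),
    new CompileUniqueInvestigations($loadId),
    new CompileUniqueRequests($loadId),
    // ... further Compile* metrics jobs, then ArchiveUsageStatsLogFile ...
])->dispatch();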

bozana added a commit to pkp/ops that referenced this issue Feb 6, 2025
pkp/pkp-lib#9822 Increase timeout and improve DB table indexes for usage stats jobs
bozana added a commit to bozana/pkp-lib that referenced this issue Feb 6, 2025
bozana added a commit to bozana/ojs that referenced this issue Feb 6, 2025
bozana added a commit to bozana/omp that referenced this issue Feb 6, 2025
bozana added a commit to bozana/ops that referenced this issue Feb 6, 2025
bozana added a commit to bozana/pkp-lib that referenced this issue Feb 6, 2025
bozana added a commit to bozana/ojs that referenced this issue Feb 6, 2025
bozana added a commit to bozana/omp that referenced this issue Feb 6, 2025
bozana added a commit to bozana/omp that referenced this issue Feb 6, 2025
bozana added a commit to bozana/ops that referenced this issue Feb 6, 2025
bozana added a commit to bozana/pkp-lib that referenced this issue Feb 6, 2025
bozana added a commit to bozana/ojs that referenced this issue Feb 6, 2025
bozana added a commit to bozana/ojs that referenced this issue Feb 6, 2025
bozana added a commit to bozana/omp that referenced this issue Feb 6, 2025
bozana added a commit to bozana/omp that referenced this issue Feb 6, 2025
bozana added a commit to bozana/ops that referenced this issue Feb 6, 2025
bozana added a commit to bozana/ops that referenced this issue Feb 6, 2025
bozana added a commit to bozana/ops that referenced this issue Feb 6, 2025
bozana added a commit to bozana/ops that referenced this issue Feb 6, 2025
bozana added a commit to bozana/omp that referenced this issue Feb 6, 2025
bozana added a commit to bozana/omp that referenced this issue Feb 6, 2025
bozana added a commit to bozana/ojs that referenced this issue Feb 6, 2025
bozana added a commit to bozana/ojs that referenced this issue Feb 6, 2025
bozana added a commit that referenced this issue Feb 7, 2025
#9822 Increase timeout and improve DB table indexes for us…
bozana added a commit to pkp/ojs that referenced this issue Feb 7, 2025
pkp/pkp-lib#9822 Increase timeout and improve DB table indexes for usage stats jobs
bozana added a commit to pkp/omp that referenced this issue Feb 7, 2025
pkp/pkp-lib#9822 Increase timeout and improve DB table indexes for usage stats jobs
bozana added a commit to pkp/ops that referenced this issue Feb 7, 2025
pkp/pkp-lib#9822 Increase timeout and improve DB table indexes for usage stats jobs
@bozana
Collaborator Author

bozana commented Feb 7, 2025

@mpbraendle, I added the indexes and increased the timeout to 600. When you are able to test it, please let us know if that all helped...
Thanks a lot!
Closing the issue for now...

@bozana bozana closed this as completed Feb 7, 2025
@github-project-automation github-project-automation bot moved this from Todo to Done in Statistics Feb 7, 2025