
Avoid inserting duplicates in invalidation table to reduce IO. #16843

Merged
7 commits merged into 4.x-dev on Dec 8, 2020

Conversation

diosmosis
Member

Description:

As the title says. Race conditions could still cause duplicates, but they should be much rarer. In this case I don't think we need to delete after finalizing an archive.
FYI @tsteur

Review

  • Functional review done
  • Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • Security review done see checklist
  • Code review done
  • Tests were added if useful/possible
  • Reviewed for breaking changes
  • Developer changelog updated if needed
  • Documentation added if needed
  • Existing documentation updated if needed

diosmosis added the "Needs Review" label (PRs that need a code review) on Dec 1, 2020
diosmosis added this to the 4.0.3 milestone on Dec 1, 2020
@sgiehl
Member

sgiehl commented Dec 1, 2020

@diosmosis the code looks fine, but some tests are failing.
Also, I'm not sure how to test this properly. Are there any specific test cases that need to be fixed now?

@diosmosis
Member Author

diosmosis commented Dec 1, 2020

@sgiehl there are changes to tests for this. If an invalidation already exists in the table, we don't insert another.
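
Roughly, the logic looks like this (a simplified sketch only, not the exact code in the PR; the key format and the insertInvalidations() name are illustrative):

// Simplified sketch of the duplicate check, not the exact PR code.
// getExistingInvalidations() is the GROUP BY query discussed below.
$existing = $this->getExistingInvalidations($idSites, $dates, $name);

$toInsert = [];
foreach ($invalidations as $invalidation) {
    // illustrative key: one entry per site / period / date range / name
    $key = implode('.', [$invalidation['idsite'], $invalidation['period'],
        $invalidation['date1'], $invalidation['date2'], $invalidation['name']]);

    if (empty($existing[$key])) {
        $toInsert[] = $invalidation; // only queue it if it isn't already queued
    }
}

$this->insertInvalidations($toInsert); // illustrative name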

@tsteur
Member

tsteur commented Dec 2, 2020

@diosmosis is this purely to reduce IO? I'm thinking the query in getExistingInvalidations might in the end cause more IO than just inserting the entry and later deleting it (not 100% sure though). We could hold off on this PR and see in production whether IO is actually an issue.

@diosmosis
Member Author

@tsteur

is this purely to reduce IO?

Yes, mostly. Alternatively, instead of doing the GROUP BY, we could check each invalidation we want to insert individually (after determining whether it is a problem or not). The most common case will be finding at most one row per invalidation to insert.

It's also there to make sure we don't end up with duplicates in the table (so we don't have to worry about them at all). Either this PR or deleting after writing archives is needed; I think we have to implement at least one of them to avoid too many dupes.

@tsteur
Member

tsteur commented Dec 2, 2020

It's just that it basically fetches all invalidations for one site, and if I see this correctly, it does that for each date that will be invalidated. We could merge it though and try to monitor the impact.

I wonder whether it will cause an issue when we have these invalidations:

  1. month Feb 2020 already exists
  2. Then we invalidate the day 20-02-20 and Feb 2020
  3. It would not add "Feb 2020" again because it already exists
  4. An archive run that is already in progress would start archiving "Feb 2020" (because there is no day archive left for it, due to the ts_invalidated < queueConsumerIdSiteTime check). This finishes.
  5. Then another archiver would start the day, but there is no Feb 2020 invalidation left that would need to be done after the day invalidation?

I'm thinking there could be issues.

A similar issue could happen when removing duplicates on archive finalize, I think.
And so far, the only reliable way I found to avoid these kinds of things was isInvalidationsScheduledForSite, i.e. basically invalidating less often, indirectly :(

$table = Common::prefixTable('archive_invalidations');

$sql = "SELECT idsite, date1, date2, period, name, COUNT(*) as count FROM `$table`
WHERE idsite IN (" . implode(',', $idSites) . ")
Member

btw, we would need to ignore any archive currently in progress? status=1?

Member Author

Yes, thought of that too 👍

Member

@diosmosis it'd be great to add the status check. I guess it should be status=0 or status!=1.

@diosmosis
Member Author

@tsteur

It's just that it basically fetches all invalidations for 1 site and if I see this correctly it does that for each date that will be invalidated etc.

It does a GROUP BY so it shouldn't select duplicates, and we could also restrict it to the periods requested.

Then another archiver would start the day but there is no Feb 2020 invalidation that would need to be done after the day invalidation?

This seems like an issue too... but I think it would be solved if we kept the latest duplicate?

@diosmosis
Member Author

diosmosis commented Dec 2, 2020

@tsteur wait, is the referenced issue actually an issue? (trying to understand)

  1. month Feb 2020 already exists (because archiver was terminated or something) [ts_invalidated = 2020-12-01 03:04:00]
  2. Then we invalidate day for 20-02-20 and Feb 2020; now we have 20-02-20 [ts_invalidated = 2020-12-01 03:05:00] and Feb 2020 [ts_invalidated = 2020-12-01 03:04:00]
  3. An archiver that is already in progress, or a new one, starts on this idsite (with now = 2020-12-01 03:05:30, e.g., which will be after the invalidation in step 2 in this scenario)
  4. The archiver processes day + month just fine

In this case it looks like it's fine, but if we're in the middle of an archive, that might be an issue, perhaps? Like:

  1. month Feb 2020 already exists (because archiver was terminated or something) [ts_invalidated = 2020-12-01 03:04:00]
  2. An archiver starts processing at 2020-12-01 03:04:30; Feb 2020 is worked on but does not finish
  3. Then we invalidate day for 20-02-20 and Feb 2020; now we have Feb 2020 (original) [status = 1, ts_invalidated = 2020-12-01 03:04:00], 20-02-20 [ts_invalidated = 2020-12-01 03:05:00] and Feb 2020 [ts_invalidated = 2020-12-01 03:05:00]
  4. The archiver finishes and we just have 20-02-20 [ts_invalidated = 2020-12-01 03:05:00] and Feb 2020 [ts_invalidated = 2020-12-01 03:05:00]
  5. The archiver runs again and everything is fine

ok another one:

  1. 02-20-2020 + Feb 2020 already exist (because the archiver was terminated or something): 02-20-2020 [ts_invalidated = 2020-12-01 03:04:00], Feb 2020 [ts_invalidated = 2020-12-01 03:04:00]
  2. The archiver starts and picks up the day, so 02-20-2020 [ts_invalidated = 2020-12-01 03:04:00, status = 1], Feb 2020 [ts_invalidated = 2020-12-01 03:04:00]
  3. For some reason, Feb 2020 by itself is invalidated (it's possible to invalidate just a month, so this can happen), but nothing is inserted
  4. The archive for the day finishes, then Feb 2020 starts and finishes. This is all fine since the month was processed after the invalidation was requested. Still no issue.

Seems like the main possible issues are tiny race conditions (something getting checked and inserted before status is set to 1), or the archive_invalidations table being in an inconsistent state (some periods invalidated without the related periods also being invalidated). I don't think it's an issue for larger periods without smaller periods: in that case we just use the smaller periods as they are, and if they are invalidated later, we end up processing the larger periods with them anyway. It's only an issue if there is a smaller period and we don't handle the larger periods (like a day being put down, but not the week/month/year), which causes inconsistent data. Am I right in thinking this?

If this is right, we could maybe attempt to fix the table as we go along. I.e., when we see a day, we make sure there's a week/month/year added (hopefully a quick SELECT for the periods, then an insert if we have to, though it shouldn't happen that often). Or we see a week and check for month/year. The only thing that wouldn't necessarily work is ranges, but I think that's normal unless they are totally blocked from being archived through the browser.
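
Roughly what I have in mind, as a sketch (invalidationExists() and insertInvalidation() are made-up helper names, not existing methods):

// Sketch of "fixing the table as we go": when we pick up a day invalidation,
// make sure the enclosing week/month/year invalidations are queued too.
foreach (['week', 'month', 'year'] as $label) {
    $parent = \Piwik\Period\Factory::build($label, $day);
    if (!$this->invalidationExists($idSite, $parent)) { // hopefully a quick SELECT
        $this->insertInvalidation($idSite, $parent);    // rarely needed in practice
    }
}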

Does any of this seem legit?

@diosmosis
Member Author

One other issue is users using FixedSiteIds while SharedSiteIds is in progress. It would be good to not allow multiple sites to be archived together, at least through core:archive, I think...

@tsteur
Member

tsteur commented Dec 2, 2020

If this is right, we could maybe attempt to fix the table as we go along. Ie, we see a day, then we make sure there's a week/month/year added (hopefully quick SELECT for the periods, then insert if we have to, though it shouldn't happen that often). Or we see a week, and check for month/year. Only thing that wouldn't necessarily work are ranges, but I think that's normal unless they are totally blocked from being archived through the browser.

Not sure. It seems things are getting more and more complicated, and it's slowly getting hard to tell what side effects each change causes.

Generally I think we can proceed with this duplicate check for now, but we should maybe only skip the insertion if there's more than one existing entry. We'd basically want to avoid having three or more duplicates. There might still be issues and race conditions, but it might lower the risk. Low risk might not be good enough, but I think it would be fine as a temporary workaround until we find something better.
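
Roughly, that relaxed rule would just change the threshold (sketched against the count the existing GROUP BY query returns; names illustrative):

// Sketch: only skip the insert when two or more queued copies already exist,
// so at most a couple of duplicates can accumulate per invalidation.
$existingCount = isset($existing[$key]) ? (int) $existing[$key]['count'] : 0;
if ($existingCount > 1) {
    continue; // already duplicated, don't add a third copy
}
$toInsert[] = $invalidation;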

Generally I'm mostly mentioning this because of this recent change https://github.com/matomo-org/matomo/pull/16844/files#diff-732c8b0c81b3c11bec111dfb48dbb289e3b2232428d012dd94cad00634c35d61R718, which could potentially cause such race conditions quite easily.

I'm not sure if this is what you meant in the previous comment, but maybe we simply need to create the week/month/year invalidations on demand. Meaning, when a day archive finishes, we insert the week invalidation; when the week finishes, we insert the month invalidation, etc. Not sure how feasible this is.
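
As a very rough sketch of that on-demand idea (queueInvalidation() is a made-up helper, and the hook point would be wherever an archive gets finalized):

// Sketch: when an archive for a period finishes, queue the next larger
// period that contains it, instead of inserting all of them up front.
$parentPeriod = ['day' => 'week', 'week' => 'month', 'month' => 'year'];

if (isset($parentPeriod[$periodLabel])) {
    $parent = \Piwik\Period\Factory::build($parentPeriod[$periodLabel], $date);
    $this->queueInvalidation($idSite, $parent); // hypothetical helper
}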

@diosmosis
Member Author

Generally I think we can proceed for now with this duplicate check but we should maybe only skip the insertion if there's more than 1 existing entry.

This is what I was thinking as well.

I'm not sure if you meant this in the previous comment but maybe we simply need to create the week/month/year invalidations on demand. Meaning when a day archive finishes, then we insert the week invalidation.

Yes this is what I meant in my last comment.

@tsteur
Member

tsteur commented Dec 2, 2020

Yes this is what I meant in my last comment.

Great, yes, I think that would be a good approach.

@diosmosis
Member Author

@tsteur made some changes (applied status = 0 + added more conditions to select less data)

$table = Common::prefixTable('archive_invalidations');

$sql = "SELECT idsite, date1, date2, period, name, COUNT(*) as `count` FROM `$table`
WHERE idsite IN (" . implode(',', $idSites) . ") AND status = " . ArchiveInvalidator::INVALIDATION_STATUS_QUEUED . "
Member

Just seeing $idSites here. They are most likely already cast to int, but it'd still be great to add the cast again just to be safe: array_map('intval', $idSites). I think further up, around line 130, it's the same.
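
For instance (just the cast added in front of the existing query; the rest of the WHERE clause stays as in the PR):

// Defensive cast before interpolating into SQL, even if the IDs should
// already be integers at this point.
$idSites = array_map('intval', $idSites);

$sql = "SELECT idsite, date1, date2, period, name, COUNT(*) as `count` FROM `$table`
    WHERE idsite IN (" . implode(',', $idSites) . ")
    AND status = " . ArchiveInvalidator::INVALIDATION_STATUS_QUEUED;
    // ... remaining period/date/name conditions as in the PR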

Member Author

👍

@tsteur
Member

tsteur commented Dec 8, 2020

OK, that should work re the periods and name in the query. Wondering though if the WHERE clause could get quite large? I wonder if we should skip the period condition if it gets longer than, say, 1000 characters or so. We could ignore this for now and do something when it actually causes an issue. Might be good to have some upper limit though. But then we might fetch a lot of data, so I guess either way is fine.

@diosmosis
Member Author

@tsteur we do one query per archive table in ArchiveInvalidator, so there should already be an effective upper limit on the query length, based on the maximum number of periods in a month (31 + 4 + 1 + 1, I think).

@tsteur
Member

tsteur commented Dec 8, 2020

👍 great, feel free to merge once tests pass

diosmosis merged commit 5377425 into 4.x-dev on Dec 8, 2020
diosmosis deleted the archive-invalidation-dupes branch on December 8, 2020 05:28