
Avoid inserting duplicates in invalidation table to reduce IO. #16843

Merged
7 commits merged into 4.x-dev on Dec 8, 2020

Conversation

diosmosis
Member

Description:

As the title says. Race conditions could still cause duplicates, but they should be much rarer. In this case I don't think we need to delete after finalizing an archive.
FYI @tsteur

Review

  • Functional review done
  • Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • Security review done see checklist
  • Code review done
  • Tests were added if useful/possible
  • Reviewed for breaking changes
  • Developer changelog updated if needed
  • Documentation added if needed
  • Existing documentation updated if needed

diosmosis added the "Needs Review" label (PRs that need a code review) on Dec 1, 2020
diosmosis added this to the 4.0.3 milestone on Dec 1, 2020
@sgiehl
Member

sgiehl commented Dec 1, 2020

@diosmosis the code looks fine, but some tests are failing.
Also, I'm not sure how to test this properly. Are there any specific test cases that need to be fixed now?

@diosmosis
Member Author

diosmosis commented Dec 1, 2020

@sgiehl there are changes to tests for this. If an invalidation already exists in the table, we don't insert another.
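
Roughly, the logic looks like this (a simplified sketch only, not the exact code in the PR; the key format and the insertInvalidations() name are illustrative):

// Simplified sketch of the duplicate check, not the exact PR code.
// getExistingInvalidations() is the GROUP BY query discussed below.
$existing = $this->getExistingInvalidations($idSites, $dates, $name);

$toInsert = [];
foreach ($invalidations as $invalidation) {
    // illustrative key: one entry per site / period / date range / name
    $key = implode('.', [$invalidation['idsite'], $invalidation['period'],
        $invalidation['date1'], $invalidation['date2'], $invalidation['name']]);

    if (empty($existing[$key])) {
        $toInsert[] = $invalidation; // only queue it if it isn't already queued
    }
}

$this->insertInvalidations($toInsert); // illustrative name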

@tsteur
Member

tsteur commented Dec 2, 2020

@diosmosis is this purely to reduce IO? I'm thinking the query in getExistingInvalidations might in the end cause more IO than just inserting the entry and later deleting it (not 100% sure though). We could hold off on this PR and see in production whether IO is actually an issue.

@diosmosis
Member Author

@tsteur

is this purely to reduce IO?

Yes, mostly. Alternatively, instead of doing the GROUP BY, we could check each invalidation we want to insert individually (after determining whether it is a problem or not). The most common case will be finding at most one row per invalidation to insert.

It's also there to make sure we don't end up with duplicates in the table (so we don't have to worry about them at all). Either this PR or deleting after writing archives is needed; I think we have to implement at least one of them to avoid too many dupes.

@tsteur
Member

tsteur commented Dec 2, 2020

It's just that it basically fetches all invalidations for one site, and if I see this correctly, it does that for each date that will be invalidated. We could merge it though and try to monitor the impact.

I wonder whether it will cause an issue when we have these invalidations:

  1. month Feb 2020 already exists
  2. Then we invalidate the day 20-02-20 and Feb 2020
  3. It would not add "Feb 2020" again because it already exists
  4. An archive run that is already in progress would start archiving "Feb 2020" (because there is no day archive left for it, due to the ts_invalidated < queueConsumerIdSiteTime check). This finishes.
  5. Then another archiver would start the day, but there is no Feb 2020 invalidation left that would need to be done after the day invalidation?

I'm thinking there could be issues.

A similar issue could happen when removing duplicates on archive finalize, I think.
And so far, the only reliable way I found to avoid these kinds of things was isInvalidationsScheduledForSite, i.e. basically invalidating less often, indirectly :(

$table = Common::prefixTable('archive_invalidations');

$sql = "SELECT idsite, date1, date2, period, name, COUNT(*) as count FROM `$table`
WHERE idsite IN (" . implode(',', $idSites) . ")
Member

btw, we would need to ignore any archive currently in progress? status=1?

Member Author

Yes, thought of that too 👍

Member

@diosmosis it'd be great to add the status check. I guess it should be status=0 or status!=1.

@diosmosis
Member Author

@tsteur

It's just that it basically fetches all invalidations for 1 site and if I see this correctly it does that for each date that will be invalidated etc.

It does a GROUP BY so it shouldn't select duplicates, and we could also restrict it to the periods requested.

Then another archiver would start the day but there is no Feb 2020 invalidation that would need to be done after the day invalidation?

This seems like an issue too... but I think it would be solved if we kept the latest duplicate?

@diosmosis
Member Author

diosmosis commented Dec 2, 2020

@tsteur wait, is the referenced issue actually an issue? (trying to understand)

  1. month Feb 2020 already exists (because archiver was terminated or something) [ts_invalidated = 2020-12-01 03:04:00]
  2. Then we invalidate day for 20-02-20 and Feb 2020; now we have 20-02-20 [ts_invalidated = 2020-12-01 03:05:00] and Feb 2020 [ts_invalidated = 2020-12-01 03:04:00]
  3. An archiver that is already in progress, or a new one, starts on this idsite (with now = 2020-12-01 03:05:30, e.g., which will be after the invalidation in step 2 in this scenario)
  4. The archiver processes day + month just fine

In this case it looks like it's fine, but if we're in the middle of an archive, that might be an issue, perhaps? Like:

  1. month Feb 2020 already exists (because archiver was terminated or something) [ts_invalidated = 2020-12-01 03:04:00]
  2. An archiver starts processing at 2020-12-01 03:04:30; Feb 2020 is worked on but does not finish
  3. Then we invalidate day for 20-02-20 and Feb 2020; now we have Feb 2020 (original) [status = 1, ts_invalidated = 2020-12-01 03:04:00], 20-02-20 [ts_invalidated = 2020-12-01 03:05:00] and Feb 2020 [ts_invalidated = 2020-12-01 03:05:00]
  4. The archiver finishes and we just have 20-02-20 [ts_invalidated = 2020-12-01 03:05:00] and Feb 2020 [ts_invalidated = 2020-12-01 03:05:00]
  5. The archiver runs again and everything is fine

ok another one:

  1. 02-20-2020 + Feb 2020 already exist (because the archiver was terminated or something): 02-20-2020 [ts_invalidated = 2020-12-01 03:04:00], Feb 2020 [ts_invalidated = 2020-12-01 03:04:00]
  2. The archiver starts and picks up the day, so 02-20-2020 [ts_invalidated = 2020-12-01 03:04:00, status = 1], Feb 2020 [ts_invalidated = 2020-12-01 03:04:00]
  3. For some reason, Feb 2020 by itself is invalidated (it's possible to invalidate just a month, so this can happen), but nothing is inserted
  4. The archive for the day finishes, then Feb 2020 starts and finishes. This is all fine since the month was processed after the invalidation was requested. Still no issue.

Seems like the main possible issues are tiny race conditions (something getting checked and inserted before status is set to 1), or the archive_invalidations table being in an inconsistent state (some periods invalidated without the related periods also being invalidated). I don't think it's an issue for larger periods without smaller periods: in that case we just use the smaller periods as they are, and if they are invalidated later, we end up processing the larger periods with them anyway. It's only an issue if there is a smaller period and we don't handle the larger periods (like a day being put down, but not the week/month/year), which causes inconsistent data. Am I right in thinking this?

If this is right, we could maybe attempt to fix the table as we go along. I.e., when we see a day, we make sure there's a week/month/year added (hopefully a quick SELECT for the periods, then an insert if we have to, though it shouldn't happen that often). Or we see a week and check for month/year. The only thing that wouldn't necessarily work is ranges, but I think that's normal unless they are totally blocked from being archived through the browser.
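
Roughly what I have in mind, as a sketch (invalidationExists() and insertInvalidation() are made-up helper names, not existing methods):

// Sketch of "fixing the table as we go": when we pick up a day invalidation,
// make sure the enclosing week/month/year invalidations are queued too.
foreach (['week', 'month', 'year'] as $label) {
    $parent = \Piwik\Period\Factory::build($label, $day);
    if (!$this->invalidationExists($idSite, $parent)) { // hopefully a quick SELECT
        $this->insertInvalidation($idSite, $parent);    // rarely needed in practice
    }
}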

Does any of this seem legit?

@diosmosis
Member Author

One other issue is users using FixedSiteIds while SharedSiteIds is in progress. It would be good to not allow multiple sites to be archived together, at least through core:archive, I think...

@tsteur
Member

tsteur commented Dec 2, 2020

If this is right, we could maybe attempt to fix the table as we go along. Ie, we see a day, then we make sure there's a week/month/year added (hopefully quick SELECT for the periods, then insert if we have to, though it shouldn't happen that often). Or we see a week, and check for month/year. Only thing that wouldn't necessarily work are ranges, but I think that's normal unless they are totally blocked from being archived through the browser.

Not sure. It seems things are getting more and more complicated, and it's slowly getting hard to tell what side effects each change causes.

Generally I think we can proceed with this duplicate check for now, but we should maybe only skip the insertion if there's more than one existing entry. We'd basically want to avoid having three or more duplicates. There might still be issues and race conditions, but it might lower the risk. Low risk might not be good enough, but I think it would be fine as a temporary workaround until we find something better.
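
Roughly, that relaxed rule would just change the threshold (sketched against the count the existing GROUP BY query returns; names illustrative):

// Sketch: only skip the insert when two or more queued copies already exist,
// so at most a couple of duplicates can accumulate per invalidation.
$existingCount = isset($existing[$key]) ? (int) $existing[$key]['count'] : 0;
if ($existingCount > 1) {
    continue; // already duplicated, don't add a third copy
}
$toInsert[] = $invalidation;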

Generally I'm mostly mentioning this because of this recent change https://github.com/matomo-org/matomo/pull/16844/files#diff-732c8b0c81b3c11bec111dfb48dbb289e3b2232428d012dd94cad00634c35d61R718, which could potentially cause such race conditions quite easily.

I'm not sure if this is what you meant in the previous comment, but maybe we simply need to create the week/month/year invalidations on demand. Meaning, when a day archive finishes, we insert the week invalidation; when the week finishes, we insert the month invalidation, etc. Not sure how feasible this is.
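
As a very rough sketch of that on-demand idea (queueInvalidation() is a made-up helper, and the hook point would be wherever an archive gets finalized):

// Sketch: when an archive for a period finishes, queue the next larger
// period that contains it, instead of inserting all of them up front.
$parentPeriod = ['day' => 'week', 'week' => 'month', 'month' => 'year'];

if (isset($parentPeriod[$periodLabel])) {
    $parent = \Piwik\Period\Factory::build($parentPeriod[$periodLabel], $date);
    $this->queueInvalidation($idSite, $parent); // hypothetical helper
}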

@diosmosis
Member Author

Generally I think we can proceed for now with this duplicate check but we should maybe only skip the insertion if there's more than 1 existing entry.

This is what I was thinking as well.

I'm not sure if you meant this in the previous comment but maybe we simply need to create the week/month/year invalidations on demand. Meaning when a day archive finishes, then we insert the week invalidation.

Yes this is what I meant in my last comment.

@tsteur
Member

tsteur commented Dec 2, 2020

Yes this is what I meant in my last comment.

Great, yes, I think that would be a good approach.

@diosmosis
Member Author

@tsteur made some changes (applied status = 0 + added more conditions to select less data)

$table = Common::prefixTable('archive_invalidations');

$sql = "SELECT idsite, date1, date2, period, name, COUNT(*) as `count` FROM `$table`
WHERE idsite IN (" . implode(',', $idSites) . ") AND status = " . ArchiveInvalidator::INVALIDATION_STATUS_QUEUED . "
Member

Just seeing $idSites here. They are most likely already cast to int, but it'd still be great to add the cast again just to be safe: array_map('intval', $idSites). I think further up, around line 130, it's the same.
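
For instance (just the cast added in front of the existing query; the rest of the WHERE clause stays as in the PR):

// Defensive cast before interpolating into SQL, even if the IDs should
// already be integers at this point.
$idSites = array_map('intval', $idSites);

$sql = "SELECT idsite, date1, date2, period, name, COUNT(*) as `count` FROM `$table`
    WHERE idsite IN (" . implode(',', $idSites) . ")
    AND status = " . ArchiveInvalidator::INVALIDATION_STATUS_QUEUED;
    // ... remaining period/date/name conditions as in the PR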

Member Author

👍

@tsteur
Member

tsteur commented Dec 8, 2020

OK, that should work re the periods and name in the query. Wondering though if the WHERE clause could get quite large? I wonder if we should skip the period condition if it gets longer than, say, 1000 characters or so. We could ignore this for now and do something when it actually causes an issue. Might be good to have some upper limit though. But then we might fetch a lot of data, so I guess either way is fine.

@diosmosis
Member Author

@tsteur we do one query per archive table in ArchiveInvalidator, so there should already be an effective upper limit on the query length, based on the maximum number of periods in a month (31 + 4 + 1 + 1, I think).

@tsteur
Member

tsteur commented Dec 8, 2020

👍 great, feel free to merge once tests pass

diosmosis merged commit 5377425 into 4.x-dev on Dec 8, 2020
diosmosis deleted the archive-invalidation-dupes branch on December 8, 2020 05:28