Avoid inserting duplicates in invalidation table to reduce IO. #16843
Conversation
@diosmosis the code looks fine, but some tests are failing.
@sgiehl there are changes to tests for this. If an invalidation already exists in the table, we don't insert another.
@diosmosis is this purely to reduce IO? I'm thinking the query in …
Yes, mostly. Alternatively, instead of doing the GROUP BY, we could check for each invalidation we want to insert (after determining whether it is a problem or not). The most common case will be finding at most one existing row per invalidation to insert. It's also there to make sure we don't have duplicates in the table at all (so we don't have to worry about them). Either this PR or deleting invalidations after writing archives is needed; I think we have to implement at least one of them to avoid too many dupes.
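For illustration, a minimal sketch of the per-invalidation check mentioned above, before queuing a new invalidation. The column names are the ones from the SELECT further down in this PR; `invalidationAlreadyQueued()` is a made-up helper, not the actual implementation in the diff.

```php
use Piwik\Common;
use Piwik\Db;

// Hypothetical helper: skip the INSERT if an identical invalidation already exists.
function invalidationAlreadyQueued($idSite, $date1, $date2, $period, $name)
{
    $table = Common::prefixTable('archive_invalidations');

    $sql = "SELECT COUNT(*) FROM `$table`
             WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = ? AND name = ?";

    return Db::fetchOne($sql, [$idSite, $date1, $date2, $period, $name]) > 0;
}
```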
It's just that it basically fetches all invalidations for one site, and if I see this correctly it does that for each date that will be invalidated, etc. We could merge it though and try to monitor the impact. I wonder whether it will cause an issue if we have these invalidations:
Then:
- we invalidate the day 20-02-20 and Feb 2020
- it would not add "Feb 2020" because it already exists
- an existing archive run would start archiving "Feb 2020" (because there is no more day archive because …)
- then another archiver would start the day, but there is no Feb 2020 invalidation that would need to be done after the day invalidation?

I'm thinking there could be issues. A similar issue could happen when removing duplicates on archive finalize, I think.
core/DataAccess/Model.php
Outdated
$table = Common::prefixTable('archive_invalidations');

$sql = "SELECT idsite, date1, date2, period, name, COUNT(*) as count FROM `$table`
        WHERE idsite IN (" . implode(',', $idSites) . ")
btw we would need to ignore any archive currently in progress? status=1?
Yes, thought of that too 👍
@diosmosis be great to add the status check. I guess it should be status=0 or status!=1.
It does a GROUP BY so it shouldn't select duplicates, and we could also restrict it to the periods requested.
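For reference, a minimal sketch of the status condition being discussed here, assuming status=0 means "queued" (the constant shows up in the updated diff further down); the actual change in the PR applies this to the existing GROUP BY query.

```php
use Piwik\Archive\ArchiveInvalidator;

// Only consider invalidations that are still queued (status = 0); rows an archiver has
// already picked up (status = 1 per the discussion above) should not block a new insert.
$sql .= " AND status = " . ArchiveInvalidator::INVALIDATION_STATUS_QUEUED;
```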
This seems like an issue too... but I think it would be solved if we kept the latest duplicate?
@tsteur wait, is the referenced issue actually an issue? (trying to understand)
in this case it looks like it's fine, but if we're in the middle of an archive that might be an issue, perhaps? like:
ok another one:
Seems like the main possible issues are tiny race conditions (something getting checked and inserted before status is set to 1), or the archive_invalidations table being in an inconsistent state (with some periods being invalidated while related periods are not). I don't think it's an issue for larger periods without smaller periods: in that case we just use the smaller periods as they are, and if they are invalidated later, we end up processing the larger periods with them no matter what. It's only an issue if there is a smaller period and we don't handle the larger periods (like a day being put down, but week/month/year are not), causing inconsistent data. Am I right in thinking this?

If this is right, we could maybe attempt to fix the table as we go along, i.e. we see a day, then we make sure there's a week/month/year added (hopefully a quick SELECT for the periods, then insert if we have to, though it shouldn't happen that often). Or we see a week, and check for month/year. The only thing that wouldn't necessarily work are ranges, but I think that's normal unless they are totally blocked from being archived through the browser. Does any of this seem legit?
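To make the "fix the table as we go along" idea concrete, here's a rough sketch (not part of the PR): when a day invalidation is picked up, check that the surrounding week/month/year are queued too and insert them if not. `insertMissingInvalidation()` is a made-up stand-in for whatever insert path would actually be used, it reuses the `invalidationAlreadyQueued()` sketch from earlier in the thread, and the string period label plus `'done'` as the archive name are simplifications (the real period column is numeric).

```php
use Piwik\Date;
use Piwik\Period\Factory as PeriodFactory;

// Sketch only: when the archiver sees a queued day invalidation, make sure the
// enclosing week/month/year invalidations exist as well, so larger periods are
// never silently skipped.
function ensureHigherPeriodsQueued($idSite, Date $day)
{
    foreach (['week', 'month', 'year'] as $periodLabel) {
        $period = PeriodFactory::build($periodLabel, $day);

        $date1 = $period->getDateStart()->toString();
        $date2 = $period->getDateEnd()->toString();

        // quick existence check (see sketch above), then insert only if missing
        if (!invalidationAlreadyQueued($idSite, $date1, $date2, $periodLabel, 'done')) {
            insertMissingInvalidation($idSite, $period, 'done'); // hypothetical helper
        }
    }
}
```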
One other issue is users using FixedSiteIds while SharedSiteIds is in progress. It would be good to not allow multiple sites to be archived together, at least through core:archive, I think...
Not sure. It seems things are getting more and more complicated, and it's slowly getting hard to tell what side effects this causes. Generally I think we can proceed for now with this duplicate check, but we should maybe only skip the insertion if there's more than 1 existing entry. We'd basically want to avoid having 3 or more duplicates. There might then still be issues and race conditions, but it might lower the risk. Low risk might not be good enough though; I think it'd be fine as a temporary workaround until we find something better. Generally I'm mostly mentioning this because of this recent change https://github.com/matomo-org/matomo/pull/16844/files#diff-732c8b0c81b3c11bec111dfb48dbb289e3b2232428d012dd94cad00634c35d61R718 which could potentially cause such race conditions quite easily. I'm not sure if you meant this in the previous comment, but maybe we simply need to create the week/month/year invalidations on demand. Meaning when a …
This is what I was thinking as well.
Yes, this is what I meant in my last comment.
great, yes I think that would be great.
@tsteur made some changes (applied status = 0 + added more conditions to select less data)
$table = Common::prefixTable('archive_invalidations');

$sql = "SELECT idsite, date1, date2, period, name, COUNT(*) as `count` FROM `$table`
        WHERE idsite IN (" . implode(',', $idSites) . ") AND status = " . ArchiveInvalidator::INVALIDATION_STATUS_QUEUED . "
just seeing $idSites here. They are most likely already cast to int, but it'd still be great to add it again just to be safe, e.g. array_map('intval', $idSites). I think further up around line 130 it's also the same.
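Something like the following one-liner is presumably what's being suggested here (quotes around 'intval' added, since array_map() expects the callable name as a string):

```php
// Defensive cast before interpolating the IDs into the SQL string; harmless if they
// are already integers.
$idSites = array_map('intval', $idSites);
$idSitesSql = implode(',', $idSites); // safe to embed in "idsite IN (...)"
```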
👍
OK, that should work re the periods and name in the query. Wondering though if the WHERE query could get quite large? Wonder if we should ignore the period conditions if they get longer than, say, 1000 characters or so. We could ignore this for now and do something when it actually causes an issue. Might be good to have some upper limit though. But then we might fetch a lot of data, so I guess either way is fine.
@tsteur we do one query per archive table in ArchiveInvalidator, so there should already be an effective upper limit on the query length based on the max number of periods in a month (31 + 4 + 1 + 1, I think).
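As a back-of-envelope check of that bound (using the numbers from the comment above, not verified against the code):

```php
// Per monthly archive table, the query can reference at most roughly this many periods,
// so the extra period/name conditions in the WHERE clause stay small.
$maxPeriodsPerMonth = 31 /* days */ + 4 /* weeks */ + 1 /* month */ + 1 /* year */; // = 37
```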
👍 great, feel free to merge once tests pass
Description:
As title. Race conditions could still exist causing duplicates, but they should be much rarer. In this case I don't think we need to delete after finalizing an archive.
FYI @tsteur