Source Google Search Console: add slicing by date range #9073
Conversation
/test connector=connectors/source-google-search-console
/test connector=connectors/source-google-search-console
/test connector=connectors/source-google-search-console
What about rate limits? If we sync the stream day by day, will the rate limits be exceeded?
end_date = self._get_end_date()

if start_date > end_date:
    yield from [
You can use `yield` instead of `yield from` with a list.
I removed this yield from.
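The reviewer's suggestion can be illustrated with a minimal sketch (hypothetical function names, not the connector's actual code):

```python
def slices_with_yield_from():
    # builds a throwaway one-element list just to delegate to it
    yield from [{"start_date": "2021-09-01", "end_date": "2021-09-02"}]

def slices_with_yield():
    # equivalent, without constructing the intermediate list
    yield {"start_date": "2021-09-01", "end_date": "2021-09-02"}

# both generators produce the same single slice
assert list(slices_with_yield_from()) == list(slices_with_yield())
```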
start_date = self._get_start_date(stream_state, site_url, search_type)
end_date = self._get_end_date()

if start_date > end_date:
If the start date is greater than the end date, you can set the start date instead of duplicating the yielded dict.
@vitaliizazmic Done.
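The change agreed on above can be sketched roughly as follows (a hypothetical illustration with a made-up helper name, not the connector's exact code):

```python
from datetime import date

def normalize_dates(start_date, end_date):
    # Sketch of the reviewer's suggestion: when the saved state is ahead
    # of the end date, clamp the start date to the end date so the normal
    # slicing loop emits a single slice, instead of handling a duplicated
    # yield dict in a separate branch.
    if start_date > end_date:
        start_date = end_date
    return start_date, end_date

# start ahead of end gets clamped to a single-day range
assert normalize_dates(date(2021, 9, 5), date(2021, 9, 3)) == (date(2021, 9, 3), date(2021, 9, 3))
```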
"end_date": next_end.to_date_string(),
}
# add 1 day so the next slice's start date does not duplicate data from the previous slice's end date.
next_start = next_end + pendulum.Duration(days=1)
Why do you add 1 day instead of the period?
@vitaliizazmic
The period is added here. Without this line:

next_start = next_end + pendulum.Duration(days=1)

the two slices would intersect, and the user would get duplicated records:

{"start_date": "2021-09-01", "end_date": "2021-09-02"}
{"start_date": "2021-09-02", "end_date": "2021-09-03"}
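The off-by-one point can be checked with stdlib dates (the connector itself uses `pendulum`; this is just an equivalent sketch):

```python
from datetime import date, timedelta

next_end = date(2021, 9, 2)              # end of the previous slice
next_start = next_end + timedelta(days=1)  # start of the next slice

# with the +1 day, the next slice starts after the previous one ends,
# so the two slices do not share a date
assert next_start == date(2021, 9, 3)
assert next_start > next_end
```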
/test connector=connectors/source-google-search-console
@vitaliizazmic I did not hit the rate limit. In this PR, the range of days is set to 2, so one query fetches records for 2 days. We can increase it to 3 days.
LGTM, but just in case, check syncing over a long period.
…date-range # Conflicts: # docs/integrations/sources/google-search-console.md
/test connector=connectors/source-google-search-console
/publish connector=connectors/source-google-search-console
What
Resolves #8572.
If we make an API call for a long date range, there is a high chance of a timeout.
How
The solution is to slice streams by N days: for each date-based slice, send a separate request. The smaller the range, the lower the chance of a timeout. By default, the range is 2 days.
For example:
Then our slices will be:
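The slicing described above can be sketched as follows (a hypothetical sketch using stdlib `datetime`; the actual connector uses `pendulum`, and `date_slices` is a made-up name):

```python
from datetime import date, timedelta

def date_slices(start_date, end_date, range_in_days=2):
    # Split [start_date, end_date] into non-overlapping slices of
    # `range_in_days` days each (both slice boundaries are inclusive).
    next_start = start_date
    period = timedelta(days=range_in_days - 1)  # inclusive end date
    while next_start <= end_date:
        next_end = min(next_start + period, end_date)
        yield {"start_date": next_start.isoformat(),
               "end_date": next_end.isoformat()}
        # add 1 day so the next slice does not repeat this slice's end date
        next_start = next_end + timedelta(days=1)

slices = list(date_slices(date(2021, 9, 1), date(2021, 9, 6)))
# → [{'start_date': '2021-09-01', 'end_date': '2021-09-02'},
#    {'start_date': '2021-09-03', 'end_date': '2021-09-04'},
#    {'start_date': '2021-09-05', 'end_date': '2021-09-06'}]
```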
Recommended reading order