Get changes from a specific date only for pages with geo location #70

Open
HarelM opened this issue Jun 13, 2020 · 12 comments

Comments

@HarelM

HarelM commented Jun 13, 2020

My use case:
I'm using geosearch to get all the points in a certain area.
For each point I get the extended data I need and store it in the database (a mirror of sorts, holding only the data I truly need: pages with a geo location).
Later on I would like to know which items were updated or added since a specific point in time.
I'm not sure whether there's an easy API to find out what was added or updated since a given date, after which I'd need to test which pages have a geo location, or whether I should get the revisions list of the geosearch results instead.
In any case, I need to do an incremental database update given a specific date.
Any advice would be welcome :-)
I haven't found an option to add more props to the geosearch generator.
Here's an example of a query:
/w/api.php?action=query&format=json&prop=coordinates%7Cpageimages%7Crevisions&generator=geosearch&ggscoord=37.7891838%7C-122.4033522
This page too:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=coordinates%7Cpageimages%7Crevisions&generator=geosearch&ggscoord=37.7891838%7C-122.4033522
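
For reference, the same query string can be assembled programmatically. This is only a minimal sketch in Python using the standard library; the parameter names are taken from the query above, while the helper name `build_geosearch_query` is hypothetical and not part of any library.

```python
from urllib.parse import urlencode


def build_geosearch_query(lat, lon, props=("coordinates", "pageimages", "revisions")):
    """Build the query string for a geosearch-generator request.

    Hypothetical helper: only the parameter names come from the example query.
    """
    params = {
        "action": "query",
        "format": "json",
        "prop": "|".join(props),      # urlencode escapes "|" as %7C
        "generator": "geosearch",
        "ggscoord": f"{lat}|{lon}",   # latitude|longitude of the search centre
    }
    return "/w/api.php?" + urlencode(params)


query = build_geosearch_query(37.7891838, -122.4033522)
print(query)
```

Adding another prop is then just a matter of extending the `props` tuple.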

I'm not sure this is the right solution though...

@CXuesong
Owner

I'm not sure if there's an easy API to know what was added and what was updated given a specific date and then I'll need to …

For updated pages, you can try the recentchanges generator, i.e. RecentChangesGenerator. You can use RecentChangesGenerator.TypeFilters to only include created pages, and StartTime and EndTime to specify a time range of interest.

Then you can use EnumPagesAsync or its overload to retrieve a list of WikiPages. Since you want to fetch the geolocation, make sure to pass in a customized WikiPageQueryProvider with GeoCoordinatesPropertyProvider (see wiki). Then you should have the geolocation for the enumerated pages.

However, I'm not sure what will happen with EnumPagesAsync if a page has been updated multiple times during your specified time range. Alternatively, you may try RecentChangesGenerator.EnumItemsAsync, remove the duplicates, then fetch the WikiPages manually (see "batch fetching" on the wiki).
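
The "remove the duplicates" step could look roughly like the sketch below. It is written in plain Python to show the logic only and is not the library's API; the record keys `pageid` and `timestamp` and the sample values are assumptions made up for illustration.

```python
def latest_change_per_page(changes):
    """Collapse recent-changes entries to one entry per page id.

    `changes` is an iterable of dicts with the assumed keys "pageid" and
    "timestamp" (ISO 8601 UTC strings, which sort chronologically as text).
    """
    latest = {}
    for change in changes:
        page_id = change["pageid"]
        current = latest.get(page_id)
        # Keep only the newest entry seen so far for this page.
        if current is None or change["timestamp"] > current["timestamp"]:
            latest[page_id] = change
    return latest


# Made-up sample data: page 1225 was edited twice in the time range.
changes = [
    {"pageid": 1225, "timestamp": "2020-06-13T08:00:00Z"},
    {"pageid": 8670, "timestamp": "2020-06-13T09:30:00Z"},
    {"pageid": 1225, "timestamp": "2020-06-13T11:15:00Z"},
]
unique = latest_change_per_page(changes)
print(sorted(unique))
```

The resulting ids can then be fetched in batches, so each page is requested once no matter how often it changed.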

@HarelM
Author

HarelM commented Jun 13, 2020

Thanks, I'm not worried about multiple updates since I only need the pages' IDs. From there I send a request to get each page with all the fields I need in order to mirror it.
I guess I'll start with the "dumb" approach: get all page IDs that changed since a specific date and go over each ID to get the full page. If the page doesn't have a geolocation I'll skip it.
If this doesn't give good results I'll see if I can optimize it using what you wrote above...

@HarelM
Author

HarelM commented Jun 13, 2020

OK, so basically I need to choose which one to run first, i.e.:

  1. Run changes query and add coordinates to know that I'm within a bounding box.
  2. Run a BBox query and check when was the last modification.

As far as I understand, both options require two steps: getting the data and then filtering it.
Since the BBox I need to query is relatively small, I think the second option is faster.
I just tried to get all the changes in the Israel BBox in the last day on en Wikipedia, and it took 6 minutes just to get the list of pages, which is a long time, I think...
On he Wikipedia it takes around 30 seconds to get the list of pages.
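
The filtering step of the second option can be sketched as follows. This is plain Python with made-up sample data; the key `last_revision` is an assumed name for wherever the latest revision timestamp ends up after the BBox query.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse an ISO 8601 UTC timestamp such as 2020-06-13T08:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def modified_since(pages, since):
    """Return the ids of pages whose last revision is newer than `since`."""
    return [
        page["pageid"]
        for page in pages
        if parse_ts(page["last_revision"]) > since
    ]


# Made-up BBox result: only the second page changed after the last sync.
pages = [
    {"pageid": 1225, "last_revision": "2020-06-12T23:59:59Z"},
    {"pageid": 8670, "last_revision": "2020-06-13T09:30:00Z"},
]
since = datetime(2020, 6, 13, tzinfo=timezone.utc)
changed = modified_since(pages, since)
print(changed)
```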

@CXuesong
Owner

Run changes query and add coordinates to know that I'm within a bounding box.

This surely would be slow, as there are a lot of changes taking place on WP every minute (I didn't check the actual number).

Run a BBox query and check when was the last modification.

I think this approach is better, too. You just need to keep track of the coordinates of the pages since your last visit, so you can discover whether any pages have been moved out of your BBox.
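
A rough sketch of that bookkeeping, assuming the mirror stores a mapping from page id to coordinates. It is plain Python with made-up values and only shows the comparison, not how the two mappings are obtained.

```python
def diff_against_mirror(stored, current):
    """Compare the stored {pageid: (lat, lon)} mapping with the current BBox result."""
    stored_ids, current_ids = set(stored), set(current)
    return {
        # Pages that appeared in the BBox since the last visit.
        "added": sorted(current_ids - stored_ids),
        # Pages no longer returned: either deleted or moved out of the BBox.
        "gone": sorted(stored_ids - current_ids),
        # Pages still inside the BBox whose coordinates changed.
        "moved": sorted(pid for pid in stored_ids & current_ids if stored[pid] != current[pid]),
    }


# Made-up sample data.
stored = {1225: (32.05, 34.80), 8670: (32.10, 34.85), 13334822: (32.12, 34.88)}
current = {1225: (32.05, 34.80), 8670: (32.11, 34.85), 4242: (32.07, 34.79)}
diff = diff_against_mirror(stored, current)
print(diff)
```

Telling a deleted page apart from one that moved out of the BBox would still need a separate lookup by page id.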

@HarelM
Author

HarelM commented Jun 14, 2020

True, I'll need to track deleted pages and pages that moved out of the BBox.
Is there a way to add properties to the main query of GeoSearchGenerator, e.g. prop=revisions?
Do I need to use EnumPagesAsync for this? Will it then create another query for each page, or will it use the main query and just add properties?
Thanks again for all the help and the super quick response!

@CXuesong
Owner

The intention of EnumPagesAsync is to let you leverage MW "generators", i.e. to retrieve page objects (like action=query&titles=... responses) from MW lists in a single API request, instead of retrieving page titles/ids from MW lists and then sending another request (with action=query&titles=...) to retrieve the page objects.

It's up to you to decide whether to use this method. Sometimes it may be worthwhile to use EnumItemsAsync to fetch page titles and ids only (and maybe some other list-specific properties), do some pre-processing on the list items (e.g. remove duplicates), then use a separate call (the RefreshAsync extension method) to fetch the pages.

@CXuesong
Owner

Is there a way to add properties to the main query of GeoSearchGenerator, e.g. prop=revisions?

If you use EnumPagesAsync, I suppose that by default you should already have basic information about the latest revision, excluding the revision content. Do you need other information?

Generally, you can pass a WikiPageQueryProvider instance to the EnumPagesAsync method. The wiki page has an example of how to construct a WikiPageQueryProvider.

@HarelM
Author

HarelM commented Jun 14, 2020

The plot thickens :-)
When using a GeoSearchGenerator and EnumItemsAsync I'm getting only 500 items for a specific area (related to #64) in the "he" wikipedia:

var delta = 0.15;
var results = _gateway.GetByBoundingBox(new Coordinate(34.75, 32), new Coordinate(34.75 + delta, 32 + delta), "he").Result;

When using almost the same code but with EnumPagesAsync I'm getting 25000 results, most of which do not have coordinates. This means I can't really use EnumPagesAsync with GeoSearchGenerator :-(

See here:

var geoSearchGenerator = new GeoSearchGenerator(_wikiSites[language])
{
    BoundingRectangle = GeoCoordinateRectangle.FromBoundingCoordinates(southWest.X, southWest.Y, northEast.X, northEast.Y),
    PaginationSize = 500,
};
var results = await geoSearchGenerator//.EnumItemsAsync().ToListAsync();
    .EnumPagesAsync(new WikiPageQueryProvider
    {
        Properties =
        {
            new ExtractsPropertyProvider { AsPlainText = true, IntroductionOnly = true, MaxSentences = 1 },
            new PageImagesPropertyProvider { QueryOriginalImage = true },
            new GeoCoordinatesPropertyProvider { QueryPrimaryCoordinate = true },
            new RevisionsPropertyProvider { FetchContent = false }
        }
    }).ToListAsync();

@HarelM
Author

HarelM commented Jun 14, 2020

Seems like coordinates is added only to 10 pages out of the query. When the query page size is 10 (the default) it works as expected, which is how the API sandbox shows the results, but when setting it to 500 it doesn't :-(
This is not surprising, as it seems that geosearch is not well maintained...
https://stackoverflow.com/questions/35826469/how-to-combine-two-wikipedia-api-calls-into-one/35830161
https://stackoverflow.com/questions/24529853/how-to-get-more-info-within-only-one-geosearch-call-via-wikipedia-api/32916451

@CXuesong
Owner

CXuesong commented Jun 15, 2020

Seems like coordinates is added only to 10 pages out of the query

prop=coordinates also has a pagination setting, colimit, with 10 as the default value. This default applies when you are using WikiPageGenerator, as colimit is not specified at all. This means you will have at most 10 coordinates per request, and there will be a continuation (for the coordinates list) in the MW API response. Though you may have more than 10 pages in the page results (generator=geosearch&ggslimit=500), the pages beyond the first 10 will have an empty coordinates property, pending continuation of the query. Actually, there are 2 sets of continuation tokens (dual continuation) in the MW API response, and frankly the WikiPageGenerator in my library cannot handle this case very well. However, you can avoid this case by using EnumItemsAsync and RefreshPagesAsync.

{
  "continue": {
    "excontinue": 20,
    "picontinue": 128475,
    "cocontinue": "8670|13334822",
    "continue": "||revisions"
  },
  "query": {
    "pages": {
      "1225": {

As I've mentioned in #69, there is some basic logic in RefreshPagesAsync to merge prop lists when some props need pagination (such as prop=coordinates). Thus I think the best you can do for now is to use GeoSearchGenerator.EnumItemsAsync to get a list of page ids / titles, then construct an IEnumerable<WikiPage> sequence and call RefreshPagesAsync on it.
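
To illustrate the general idea of following continuation and merging paginated props, here is a minimal sketch in plain Python. It is not the actual RefreshPagesAsync implementation; `fetch` is a hypothetical callable that sends the parameters to api.php and returns the decoded JSON, and `fake_fetch` simulates two responses with made-up coordinates.

```python
def query_all(fetch, params):
    """Follow MediaWiki continuation and merge the per-page results."""
    merged = {}
    request = dict(params)
    while True:
        response = fetch(request)
        for page_id, page in response.get("query", {}).get("pages", {}).items():
            target = merged.setdefault(page_id, {})
            for key, value in page.items():
                if isinstance(value, list):
                    # Paginated props (e.g. coordinates) arrive in pieces.
                    target.setdefault(key, []).extend(value)
                else:
                    target[key] = value
        if "continue" not in response:
            return merged
        # Copy every continuation value into the next request.
        request = {**params, **response["continue"]}


def fake_fetch(request):
    """Simulated API: coordinates for page 8670 only arrive after continuing."""
    if "cocontinue" not in request:
        return {
            "continue": {"cocontinue": "8670|13334822", "continue": "||revisions"},
            "query": {"pages": {
                "1225": {"pageid": 1225, "coordinates": [{"lat": 32.0, "lon": 34.8}]},
                "8670": {"pageid": 8670},
            }},
        }
    return {"query": {"pages": {
        "1225": {"pageid": 1225},
        "8670": {"pageid": 8670, "coordinates": [{"lat": 32.1, "lon": 34.9}]},
    }}}


pages = query_all(fake_fetch, {"action": "query", "prop": "coordinates"})
print(pages["8670"]["coordinates"])
```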

@HarelM
Author

HarelM commented Jun 16, 2020

Yeah, I figured it out yesterday after digging into your code and seeing in Fiddler that the number of pages fetched per request is about 5 when doing a RefreshPagesAsync for the properties I needed.
I have managed to reduce the time the mirroring process takes to around 2 minutes, which is very good from my point of view.
Code can be seen here:
https://github.com/IsraelHikingMap/Site/blob/master/IsraelHiking.API/Services/Poi/WikipediaPointsOfInterestAdapter.cs#L81L96
I basically wrapped the page refresh in a parallel loop, since the refresh does its job sequentially, i.e. when sending a lot of pages to be fetched with a batch size of 10 or 5, it takes a long time to fetch all the pages (around 12K in my case).
It might be worth adding an option to parallelize the refresh process for cases with a small batch size and a large number of pages. I'm not sure how it fits in the architecture of this project, but I found myself doing just that in the above code.
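
The pattern is roughly the one sketched below. This is a plain Python illustration, not the linked C# code; `refresh_batch` is a hypothetical callable standing in for one sequential refresh call on a small batch, and the batch size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def refresh_in_parallel(page_ids, refresh_batch, batch_size=50, workers=4):
    """Refresh pages in batches, running several batches concurrently."""
    batches = list(chunked(page_ids, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(refresh_batch, batches)  # map preserves batch order
        return [page for batch in results for page in batch]


def fake_refresh(batch):
    """Stand-in for a real request: marks each page id as refreshed."""
    return [{"pageid": pid, "refreshed": True} for pid in batch]


pages = refresh_in_parallel(list(range(1, 121)), fake_refresh)
print(len(pages))
```

The worker count should stay small so the target wiki is not hit with too many concurrent requests.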

Thanks again for all the explanations and great library!
Feel free to close this issue if you feel there's nothing to be done in this case.
