
For devs, have a way to only import some of the data set to their local machine #285

Open
ghost opened this issue Sep 8, 2021 · 3 comments


ghost commented Sep 8, 2021

Is your feature request related to a problem? Please describe.

Importing the full data set takes a long time and uses a lot of disk space.

For devs, it would be great if there were a flag that let them import just a sample set.

Describe the solution you’d like

Some kind of easy flag added to the usual procedure, e.g.

iati crawler download-and-update --sample

I hacked this: in iati_datastore/iatilib/crawler.py, in the update_dataset function, add the parameter at_front=True to the call to queue.enqueue( update_activities ... ). Then run the worker as normal, and activities will start to appear in your database straight away. Just stop the worker after a bit and you have your sample data set of only some activities. (Without this hack it tries to process all datasets before any activities come in, so you sit at activities=0 for a long time.)
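For anyone puzzling over why at_front=True changes the behaviour: assuming queue is an rq-style job queue (where enqueue appends by default and at_front=True prepends), the activity jobs jump ahead of the remaining dataset-metadata jobs. A minimal, self-contained sketch of the ordering effect, using collections.deque to stand in for the job queue (the job names are illustrative, not the crawler's real ones):

```python
from collections import deque

# Jobs already waiting: the remaining dataset-metadata updates.
pending = deque(["update_dataset_2", "update_dataset_3", "update_dataset_4"])

# Default enqueue semantics: append, so the activities job only runs
# after every pending dataset job has finished.
appended = deque(pending)
appended.append("update_activities_1")

# at_front=True semantics: prepend, so activities start flowing into
# the local database immediately.
prepended = deque(pending)
prepended.appendleft("update_activities_1")

print(list(appended))   # activities job is last in line
print(list(prepended))  # activities job is first in line
```

This is why the hack gets you non-zero activity counts quickly: the worker pops from the front of the queue, so prepended jobs run first.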

Describe alternatives you’ve considered

n/a

Additional context

n/a

@andylolz (Member) commented Sep 8, 2021

I think I usually do the following:

iati crawler download

…followed by:

iati crawler update --dataset=[dataset name]

I think that will just update a single named dataset?

It does download everything, though – it would be nice to have a workaround for that.


> in the call to queue.enqueue( update_activities ... add the parameter at_front=True.

Ooh – that’s really nice! I wonder if there’s any reason we shouldn’t make that modification?! It would be much nicer to prepend queue items than to append them.

@ghost (Author) commented Sep 9, 2021

iati crawler download
iati crawler update --dataset="aidenvironment-2020-01-activities"
iati queue background

Doesn't work for me on a fresh box:

  File "/vagrant/iati_datastore/iatilib/crawler.py", line 284, in update_dataset
    fetch_dataset_metadata(dataset)
  File "/vagrant/iati_datastore/iatilib/crawler.py", line 80, in fetch_dataset_metadata
    d = iatikit.data().datasets.get(dataset.name)
AttributeError: 'NoneType' object has no attribute 'name'

Indeed, the datasets table in the database is empty; is there another stage that needs to be done?


> Ooh – that’s really nice! I wonder if there’s any reason we shouldn’t make that modification?! It would be much nicer to prepend queue items than to append them.

Probably fine to make; it would improve "time till new activity data appears in the site" (but do nothing for "total time of data load"), if that's a priority?

@ghost (Author) commented Jun 15, 2022

> is there another stage that needs to be done?

It might be crawler fetch-dataset-list (run after download and before update); maybe it only has to be run once.
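Putting the thread together, the presumed working sequence on a fresh box would be the following. Note that the fetch-dataset-list step is the commenter's guess, not confirmed anywhere in this thread, and the dataset name is just the example used above:

```shell
iati crawler download                  # fetch raw data from the registry
iati crawler fetch-dataset-list        # populate the datasets table (guess; may only need running once)
iati crawler update --dataset="aidenvironment-2020-01-activities"
iati queue background                  # start a worker to process the queued jobs
```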
