
Delegate reimbursement import to Celery #240

Merged
anaschwendler merged 9 commits into master from cuducos-celery on Oct 19, 2017

Conversation

cuducos
Collaborator

@cuducos commented Aug 26, 2017

This is a work in progress; I created it early to get feedback. The aim, when complete, is to address part of #217.

What is the purpose of this Pull Request?
The purpose of this PR is to make the process of importing reimbursements quicker and free of manual steps, as described in #217.

What was done to achieve this purpose?
I introduced Celery to perform some tasks asynchronously: that way the heavy work of the import is handled in the background.
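
A minimal sketch of the idea, with hypothetical module and task names (not necessarily the PR's actual code): the management command only reads the compressed CSV and schedules one task per row, while a Celery worker does the database work in the background.

# hypothetical sketch — the task module and its name are assumptions
import csv
import lzma

from jarbas.core.tasks import create_or_update_reimbursement  # assumed task location


def schedule_import(path):
    with lzma.open(path, 'rt') as file_handler:
        for row in csv.DictReader(file_handler):
            # .delay() returns immediately; the worker persists the row later
            create_or_update_reimbursement.delay(row)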

How to test if it really works?

  1. Install RabbitMQ if not using Docker — brew install rabbitmq or apt-get install rabbitmq-server should do the job for macOS and any Debian-based Linux such as Ubuntu (the Docker environment is ready, so no need to worry about it)
  2. Install new packages: $ python -m pip install -U -r requirements.txt
  3. Run tests $ python manage.py test
  4. Get Celery started: $ celery worker --app jarbas (see the sketch after this list for the app module this command expects)
  5. Run reimbursement import: $ python manage.py reimbursements <path to reimbursements.xz>
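
Step 4 assumes a standard Django/Celery bootstrap module that the --app jarbas flag can find. A hedged sketch of what such a jarbas/celery.py usually looks like — the settings path and the CELERY_ settings namespace are assumptions, not necessarily what this PR ships:

# jarbas/celery.py — assumed layout
import os

from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'jarbas.settings')  # assumed settings module

app = Celery('jarbas')
app.config_from_object('django.conf:settings', namespace='CELERY')  # reads CELERY_* settings
app.autodiscover_tasks()  # finds tasks.py modules in installed apps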

Who can help reviewing it?
@jtemporal @lipemorais @davinirjr @augustogoulart @adrianomargarin

What is out of scope
In #217 I mentioned creating an API endpoint so Rosie can send POSTs with reimbursement data. I think this is an issue for a new PR once this one is finished, reviewed, stable, and merged.

Next steps in this PR

  • Find out why the iteration is taking so long for each CSV (.xz) line (either I screwed up when wiring up Celery, or the serialize step is another bottleneck; if so, a callback in Celery should be able to handle serializing and then the create/update step — see the sketch after this list)
  • Improve documentation on the RabbitMQ and Celery parts (I added a bit about the environment variable, but instructions to run Celery are still missing, as is listing RabbitMQ as a requirement)
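
To illustrate the callback idea above, a hedged sketch using a Celery chain, with hypothetical task names and an assumed model/field layout: serialization runs in one task and its result is handed to a second task that persists it.

# hypothetical sketch of the chained approach
from celery import chain, shared_task


@shared_task
def serialize_row(row):
    # stand-in for the real serialization: coerce CSV strings into model-ready types
    return {**row, 'total_net_value': float(row['total_net_value'])}


@shared_task
def create_or_update(data):
    from jarbas.core.models import Reimbursement  # assumed app layout
    Reimbursement.objects.update_or_create(
        document_id=data['document_id'],
        defaults={k: v for k, v in data.items() if k != 'document_id'},
    )


# usage: the result of serialize_row is passed as the first argument of create_or_update
# chain(serialize_row.s(row), create_or_update.s()).delay()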

Next steps

  • Bring back Docker Hub images to docker-compose.yml after this PR is merged

@decko

decko commented Oct 18, 2017

Testing with sample data seems Ok. Ready to Merge.

@cuducos
Collaborator Author

cuducos commented Oct 18, 2017

> Testing with sample data seems Ok. Ready to Merge.

In private, @decko told me that it actually wasn't working with the Celery worker running in a container. So we still need to fix that before considering merging it. Is that right?

@cuducos
Collaborator Author

cuducos commented Oct 18, 2017

Ok… I'm back to this PR.

I merged the latest changes from the master branch and studied this PR a bit more to push it further.

The master branch imported the sample in 34s (counting the time to start the Docker containers) and this branch did it in 24s (same conditions). We were already saving roughly 30% of the time, which is great.

Then I moved the serialize step into the async task (42e512a). The import command took 15s — we're saving more than 50% of the time now, yay!
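
For reference, a rough sketch of that shape, with assumed names (the serialization helper and model path are illustrative, not the commit's exact code) — both serialization and the create/update now run inside the worker:

# hypothetical sketch of the single-task approach described above
from celery import shared_task


def serialize(row):
    # stand-in for the command's serialization step: coerce CSV strings into model-ready types
    return {**row, 'total_net_value': float(row['total_net_value'])}


@shared_task
def create_or_update_reimbursement(row):
    from jarbas.core.models import Reimbursement  # assumed app layout
    data = serialize(row)  # heavy coercion now happens in the worker, not in the command
    Reimbursement.objects.update_or_create(
        document_id=data['document_id'],
        defaults={k: v for k, v in data.items() if k != 'document_id'},
    )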

From that I think the Celery part is indeed working and this PR is good for code review and, if that's the case, merging… cc @anaschwendler

Many thanks @decko for the tests and the ideas ; )


UPDATE

How to test this PR:

  1. First I'd run python manage.py reimbursements /mnt/data/reimbursements_sample.xz in the master branch and time it (basically you can run time python manage.py reimbursements /mnt/data/reimbursements_sample.xz)
  2. Then I'd move to the cuducos-celery branch, delete all reimbursements (python manage.py shell_plus and then Reimbursement.objects.all().delete() might do the job), and time the import again.

If at the end of the process a) the import was quicker in this branch than in master, and b) reimbursements were actually saved in the database (python manage.py shell_plus and then Reimbursement.objects.count() might do the job), then it's working as predicted ; )
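
For convenience, the shell_plus checks mentioned above in one place (shell_plus usually auto-imports the models; the explicit import is just to be safe and assumes the core app layout):

# inside `python manage.py shell_plus`
from jarbas.core.models import Reimbursement  # assumed app layout
Reimbursement.objects.count()         # should match the number of imported lines
Reimbursement.objects.all().delete()  # resets the table between timing runs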

@anaschwendler
Collaborator

anaschwendler commented Oct 18, 2017

Hello @cuducos and @decko!

I'll start testing this PR little by little and describing how (so this comment will be edited along the way):

  1. Clone the repository:
$ git clone git@github.com:datasciencebr/jarbas.git
  2. Open the repo folder:
$ cd jarbas

I'll try to run it with Docker, even though the instructions were shown without it, I'll try :)

  3. Copy the .env file:
$ cp contrib/.env.sample .env
  4. Build and start the services:
$ docker-compose up -d
  5. Create the database and apply the migrations:
$ docker-compose run --rm django python manage.py migrate

I'll start testing as @cuducos suggested in the last comment:

  1. Run python manage.py reimbursements /mnt/data/reimbursements_sample.xz in the master branch, and time it:
$ time docker-compose run --rm django python manage.py reimbursements /mnt/data/reimbursements_sample.xz

And the result:

Starting jarbas_postgres_1 ... 
Starting jarbas_postgres_1 ... done
Starting with 0 reimbursements
Processed: 1000 lines
Updated: 0 reimbursements
Created: 1000 reimbursements
Skip: 0 reimbursements
docker-compose run --rm django python manage.py reimbursements   1.13s user 0.34s system 11% cpu 13.221 total

Ok, now let's test with @cuducos PR:

  1. Check out @cuducos' branch:
$ git checkout -b cuducos-celery origin/cuducos-celery
  2. Update the branch:
$ git merge master
  3. Delete all reimbursements:
$ docker-compose run --rm django python manage.py shell_plus
> Reimbursement.objects.all().delete()

The result:

>>> Reimbursement.objects.all().delete()
(1000, {'core.Tweet': 0, 'core.Reimbursement': 1000})
  4. Import again:
$ time docker-compose run --rm django python manage.py reimbursements /mnt/data/reimbursements_sample.xz

The result:

Starting jarbas_queue_1 ... 
Starting jarbas_queue_1 ... done
Starting jarbas_postgres_1 ... done
Starting jarbas_tasks_1 ... done
1000 reimbursements scheduled to creation/update                       
docker-compose run --rm django python manage.py reimbursements   1.11s user 0.26s system 13% cpu 9.829 total

It gave me a difference of about 2 seconds, testing 3 times and taking the average time 🎉

Is it ok @cuducos ? Can I merge?

@anaschwendler merged commit 75874d9 into master on Oct 19, 2017
@anaschwendler deleted the cuducos-celery branch on October 19, 2017 at 18:43
@davinirjr

Nice! Good work, people, congrats!

@giovanisleite mentioned this pull request Oct 21, 2017
@cuducos mentioned this pull request Oct 28, 2017