A distributed web crawling scheduler component that distributes URLs from the Frontier to multiple Fetchers.
Distributed Fetchers request new URL lists via REST API calls.
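The hand-out of URL lists to Fetchers can be illustrated with a small in-memory sketch. It is not the project's implementation: the class `UrlScheduler`, the method `next_batch`, the batch size, and the example URLs are all hypothetical, and the real service exposes this behaviour through its REST API rather than a direct method call.

```python
from collections import deque
from typing import Deque, Dict, List


class UrlScheduler:
    """Hypothetical in-memory stand-in for the scheduler (illustration only)."""

    def __init__(self, frontier_urls: List[str]) -> None:
        # URLs waiting to be assigned, in first-in-first-out order.
        self._pending: Deque[str] = deque(frontier_urls)
        # Record of which fetcher received which URLs.
        self.assigned: Dict[str, List[str]] = {}

    def next_batch(self, fetcher_id: str, size: int = 2) -> List[str]:
        """Return up to `size` URLs for one fetcher; each URL is handed out once."""
        count = min(size, len(self._pending))
        batch = [self._pending.popleft() for _ in range(count)]
        self.assigned.setdefault(fetcher_id, []).extend(batch)
        return batch


# Assumed example data: three frontier URLs shared between two fetchers.
scheduler = UrlScheduler([
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
])
first = scheduler.next_batch("fetcher-1")
second = scheduler.next_batch("fetcher-2")
```

In this sketch a Fetcher that asks again after the queue is drained simply receives an empty list, which keeps the client side free of special cases.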
The project is built on the Python package FastAPI (MIT-licensed) (https://fastapi.tiangolo.com/). FastAPI itself is built on top of the following packages:
- Starlette (MIT-licensed) (https://www.starlette.io/)
- pydantic (MIT-licensed) (https://pydantic-docs.helpmanual.io/)
The project also imports parts of the following libraries and frameworks:
- SQLAlchemy (MIT-licensed) (https://www.sqlalchemy.org/)
- Psycopg (GNU Lesser General Public License) (https://www.psycopg.org/)
- pytest (MIT-licensed) (https://www.pytest.org)
- xxhash (BSD-licensed) (https://pypi.org/project/xxhash/)
The Docker image provided by FastAPI is used as well:
- tiangolo/uvicorn-gunicorn-fastapi:latest
The project is deployed on an AWS EC2 Ubuntu machine.
[Link to Online Docs] http://ec2-18-185-96-23.eu-central-1.compute.amazonaws.com/docs
Re-run the local Docker image (Windows PowerShell)
docker ps -q | % { docker stop $_ }
docker pull dockerjens23/websch
docker build -t websch .
docker run -d -p 80:80 websch
Re-run the remote Docker image (Ubuntu)
sudo docker stop $(sudo docker ps -q)
sudo docker pull dockerjens23/websch
sudo docker run -d -p 80:80 dockerjens23/websch
Get the log output of the running container
sudo docker logs --follow $(sudo docker ps -q)
# disk free (human-readable)
df -h
# list all Docker containers (including stopped ones)
sudo docker ps -a
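# run the container with environment variables loaded from ./env.list
sudo docker run --env-file ./env.list -p 80:80 dockerjens23/websch
# env.list contains: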
POSTGRES_ENV_USER=...
POSTGRES_ENV_PW=...
POSTGRES_ENV_URI=...
POSTGRES_ENV_DB=...
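As a rough illustration of how these variables might be consumed inside the container, the sketch below assembles a PostgreSQL connection URL of the form accepted by SQLAlchemy. How the project actually reads them is not shown in this README: the function name `build_database_url` is hypothetical, the example values are made up, and it is an assumption that `POSTGRES_ENV_URI` holds the database host (optionally with a port).

```python
import os
from typing import Mapping
from urllib.parse import quote_plus


def build_database_url(env: Mapping[str, str] = os.environ) -> str:
    """Assemble a PostgreSQL URL from the POSTGRES_ENV_* variables (hypothetical helper)."""
    user = env["POSTGRES_ENV_USER"]
    # Percent-encode the password so characters such as '@' do not break the URL.
    password = quote_plus(env["POSTGRES_ENV_PW"])
    # Assumption: POSTGRES_ENV_URI is the host, optionally followed by ':port'.
    host = env["POSTGRES_ENV_URI"]
    database = env["POSTGRES_ENV_DB"]
    return f"postgresql://{user}:{password}@{host}/{database}"


# Made-up example values standing in for the contents of env.list.
example_env = {
    "POSTGRES_ENV_USER": "crawler",
    "POSTGRES_ENV_PW": "p@ss",
    "POSTGRES_ENV_URI": "db.example.org:5432",
    "POSTGRES_ENV_DB": "frontier",
}
url = build_database_url(example_env)
```

Accessing the variables with `env[...]` rather than `env.get(...)` makes a missing entry fail immediately with a `KeyError`, which tends to be easier to diagnose than a malformed connection string.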