Micro Web Crawler in PHP & Manticore
Yo! is a super thin client-server crawler built on the Manticore full-text search engine.
It is compatible with different networks and includes flexible settings, snap history, CLI tools and an adaptive JS-less UI.
An alternative branch is available for the Gemini protocol!
- MIME-based crawler with flexible filter settings: regular expressions, selectors, external links, etc.
- Page snap history with support for local and remote mirrors (including FTP)
- CLI tools for index administration and crontab tasks
- JS-less frontend for running a local or public search web portal
- Manticore Server
- PHP library for Manticore
- PHP library for Network operations
- Symfony DOM crawler
- Symfony CSS selector
- FTP client for snap mirrors
- Hostname ident icons
- Captcha
- Bootstrap icons
wget https://repo.manticoresearch.com/manticore-repo.noarch.deb
dpkg -i manticore-repo.noarch.deb
apt update
apt install git composer manticore manticore-extra php-fpm php-curl php-mbstring php-gd
The Yo search engine uses Manticore as its primary database. If your server is sensitive to power loss,
change the default binlog flush strategy to binlog_flush = 1
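As a minimal sketch, the setting goes into the searchd section of the Manticore configuration file. The path /etc/manticoresearch/manticore.conf is an assumption based on the Debian package layout, so check where your installation keeps it; the rest of the section is left as it is.

```
searchd {
    # ... keep the existing settings of this section ...

    # flush and sync the binary log after every transaction
    # (safest option, at the cost of write speed)
    binlog_flush = 1
}
```

Manticore has to be restarted for the change to apply, for example with `systemctl restart manticore` if the package installed a systemd unit under that name.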
The project is in development; to create a new search project, use the dev-main
branch:
composer create-project yggverse/yo:dev-main
git clone https://github.com/YGGverse/Yo.git
cd Yo
composer update
git checkout -b pr-branch
git commit -m 'new fix'
git push
cd Yo
git pull
composer update
cp example/config.json config.json
php src/cli/index/init.php
php src/cli/document/add.php URL
php src/cli/document/crawl.php
php src/cli/document/search.php '*'
cd src/webui
php -S 127.0.0.1:8080
- open http://127.0.0.1:8080 in a browser
Create initial index
php src/cli/index/init.php [reset]
- reset - optional, reset the existing index
Change existing index
php src/cli/index/alter.php {operation} {column} {type}
- operation - operation name, supported values: add | drop
- column - target column name
- type - target column type, supported values: text | integer
php src/cli/document/add.php URL
- URL - new URL to add to the crawl queue
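The usage above shows add.php taking one URL per call, so seeding the queue from a list needs a loop. The sketch below is one way to do that from the project root. The function name `yo_add_seeds`, the seed file format (one URL per line, `#` for comments) and the `YO_PHP` override are all assumptions made for illustration, not part of Yo.

```shell
# Hypothetical helper: queue every URL listed in a seed file.
# Blank lines and lines starting with '#' are skipped.
# YO_PHP is an assumed override for the PHP binary (YO_PHP=echo gives a dry run).
yo_add_seeds() {
    seeds_file="$1"
    # '|| [ -n "$url" ]' also handles a last line without a trailing newline
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # stdin is redirected so the command cannot consume the seed file
        "${YO_PHP:-php}" src/cli/document/add.php "$url" < /dev/null
    done < "$seeds_file"
}
```

Running `YO_PHP=echo yo_add_seeds seeds.txt` prints the commands instead of executing them, which is a cheap way to check the list before queueing anything.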
php src/cli/document/crawl.php
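Since the CLI tools are meant for crontab tasks, the crawler can be scheduled instead of being started by hand. The entry below is a sketch only: the five-minute interval, the /path/to/Yo placeholder and the discarded output are assumptions to adapt to your setup.

```
# m   h  dom mon dow  command
*/5   *  *   *   *    cd /path/to/Yo && php src/cli/document/crawl.php > /dev/null 2>&1
```

The `cd` keeps the relative script path identical to the manual call above.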
Optimize the index and apply new configuration rules
php src/cli/document/clean.php [limit]
- limit - integer, number of documents per queue
php src/cli/document/search.php '@title "*"' [limit]
- query - required
- limit - optional, search results limit
Import index from YGGo database
php src/cli/yggo/import.php 'host' 'port' 'user' 'password' 'database' [unique=off] [start=0] [limit=100]
Required source DB fields:
- host
- port
- user
- password
- database
- unique - optional, check for unique URL (takes more time)
- start - optional, offset to start the queue
- limit - optional, queue limit
SQL text dumps can be useful for public index distribution, but require more computing resources.
Physical backups are better suited to infrastructure administration and include the original data binaries.
- http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yo/ - IPv6 0200::/7 addresses only | index
- http://yo.ygg
- http://yo.ygg.at
- http://ygg.yo.index