Update README.md #94

Merged
merged 1 commit on Nov 19, 2023
README.md: 20 changes (15 additions & 5 deletions)
@@ -2,7 +2,17 @@

### Why is this solution better than others?
Unlike other CommonCrawl extractors, this project allows the creation of custom extractors with a high level of modularity.
Unlike getting records from the CommonCrawl index through Amazon's Athena, this solution can be completely free of cost :)
It supports all the ways to access CommonCrawl:

For querying:
- [x] AWS Athena
- [x] CommonCrawl Index API

For download:
- [x] S3
- [x] CommonCrawl API

All of this is wrapped in a very easy to use CLI. While the CLI is the easiest way to get started, we also provide ways to use the library directly from Python.

### Installation
#### From PyPI
@@ -26,7 +36,7 @@
To create them, you need example HTML files that you want to extract from.
You can use the following command to get HTML files from the CommonCrawl dataset:

```bash
- $ cmon download --match_type=domain --limit=100 example.com html_output html
+ $ cmon download --match_type=domain --limit=100 html_output html example.com
```
This will download the first 100 HTML files from example.com and save them in html_output.

@@ -95,7 +105,7 @@
In our case, the config would look like this:
To test the extraction, you can use the following command:

```bash
- $ cmon extract config.json extracted_output html_output/*.html html
+ $ cmon extract config.json extracted_output html html_output/*.html
```

### Crawl the sites
@@ -106,7 +116,7 @@
To do this, you will proceed in two steps:
To do this, you can use the following command:

```bash
- cmon download --match_type=domain --limit=100000 example.com dr_output record
+ cmon download --match_type=domain --limit=100000 dr_output record example.com
```

This will download the first 100,000 records from example.com and save them in dr_output. By default it saves 100,000 records per file; you can change this with the --max_crawls_per_file option.
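For example, to produce smaller output files (which later gives the extraction step more files to parallelize over), you can lower that limit. The value below is only an illustration, and it assumes --max_crawls_per_file accepts an integer the same way --limit does:

```bash
# Assumption: --max_crawls_per_file takes an integer like the other limit options.
# Saves at most 10,000 records per file, so a 100,000-record download yields ~10 files.
$ cmon download --match_type=domain --limit=100000 --max_crawls_per_file=10000 dr_output record example.com
```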
@@ -115,7 +125,7 @@
Once you have the records, you can use the following command to extract them:

```bash
- $ cmon extract --n_proc=4 config.json extracted_output dr_output/*.jsonl record
+ $ cmon extract --n_proc=4 config.json extracted_output record dr_output/*.jsonl
```

Note that you can use the --n_proc option to specify the number of processes to use for the extraction. Multiprocessing is done at the file level, so if you have just one input file it will not be used.
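Because work is split per input file, --n_proc only pays off when the glob matches several files. A short sketch combining the two options discussed above; the specific numbers are illustrative, not required values:

```bash
# Download into several smaller files so the extraction step has more than one input to distribute
$ cmon download --match_type=domain --limit=100000 --max_crawls_per_file=25000 dr_output record example.com
# Each of the resulting .jsonl files can then be handled by its own process
$ cmon extract --n_proc=4 config.json extracted_output record dr_output/*.jsonl
```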