Add a filter by content option #585

Merged 2 commits on Aug 2, 2018
@dadoonet dadoonet commented Aug 2, 2018

You can restrict which documents get indexed by adding one or more
regular expressions that are matched against the extracted content.
Documents that do not match are ignored and not indexed.

For example, define the following `fs.filters` property in your
`~/.fscrawler/test/_settings.json` file:

```json
{
 "name" : "test",
 "fs": {
   "filters": [
     ".*foo.*",
     "^4\\d{3}([\\ \\-]?)\\d{4}\\1\\d{4}\\1\\d{4}$"
   ]
 }
}
```

With this example, only documents that contain the text `foo` and a VISA credit card number
in a form such as `4012888888881881`, `4012 8888 8888 1881` or `4012-8888-8888-1881`
are indexed.
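
To make the intended behaviour concrete, here is a minimal Python sketch of the matching rule described above. It is an illustration only, not the code added by this PR: `load_filters` and `should_index` are hypothetical names, the "every filter must match" rule is inferred from the description, and multi-line matching (so that `^` and `$` apply per line of content) is an assumption. The crawler's real matching may differ in detail.

```python
import json
import re

# Same settings as in the example above, kept as raw JSON text.
SETTINGS = r'''
{
  "name": "test",
  "fs": {
    "filters": [
      ".*foo.*",
      "^4\\d{3}([\\ \\-]?)\\d{4}\\1\\d{4}\\1\\d{4}$"
    ]
  }
}
'''


def load_filters(settings_text):
    """Compile every pattern listed under fs.filters (hypothetical helper)."""
    settings = json.loads(settings_text)
    # ASSUMPTION: multi-line mode, so ^ and $ anchor at line boundaries.
    return [re.compile(p, re.MULTILINE) for p in settings["fs"]["filters"]]


def should_index(content, filters):
    """Return True only if every filter matches somewhere in the content.

    ASSUMPTION: filters combine with AND, as the description implies.
    """
    return all(f.search(content) is not None for f in filters)


filters = load_filters(SETTINGS)
print(should_index("foo\n4012 8888 8888 1881\n", filters))
print(should_index("foo, but no card number here", filters))
```

Under these assumptions the first call prints `True` and the second prints `False`. The backreference `\1` also means the separator must be consistent, so a mixed form such as `4012 8888-8888 1881` would not match.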

Closes #463.
@dadoonet dadoonet added the `new` label (For new features or options) Aug 2, 2018
@dadoonet dadoonet added this to the 2.5 milestone Aug 2, 2018
@dadoonet dadoonet self-assigned this Aug 2, 2018
@dadoonet dadoonet merged commit 2517504 into master Aug 2, 2018
@dadoonet dadoonet deleted the fix/463-filter-text branch August 2, 2018 12:03