excludes doesn't appear to work with subdirectories/paths #553

Closed
Kevin-McKee opened this issue Jul 11, 2018 · 5 comments

@Kevin-McKee

For example if I have a directory /mypath/folder with subdirectories

/mypath/folder
├── folderA
│   ├── subfolderA
│   ├── subfolderB
│   └── subfolderC
├── folderB
│   ├── subfolderA
│   ├── subfolderB
│   └── subfolderC
└── folderC
    ├── subfolderA
    ├── subfolderB
    └── subfolderC

I would like to be able to start crawling at /mypath/folder and crawl everything except folderB/subfolderB for example.

As I understand it, "excludes" currently matches on directory names only. I could put "folderB", which would exclude all of folderB, or I could put "subfolderB", which would exclude folderA/subfolderB, folderB/subfolderB and folderC/subfolderB.

I would like to be able to put "excludes": ["folderB/subfolderB"] or even a wildcard like "excludes": ["folderB/subfolder*"].

@dadoonet dadoonet added the bug For confirmed bugs label Jul 23, 2018
@dadoonet
Owner

I think excludes and includes currently match only on the folder name, not on the path relative to the root folder.

Which means that folderB/subfolderB is never evaluated but subfolderB is.

I believe I need to rewrite this part. I started a proof of concept today and will see how it goes, though this could be seen as a breaking change.

That said, I'm not sure a lot of users are using includes or excludes for dir names so it might be OK to break this behavior.
Stay tuned.
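
The difference between matching on the folder name and matching on the relative path, as described in this comment, can be sketched in a few lines of Python. This is only an illustration, not FSCrawler's actual (Java) code: the helper names `excluded_by_name` and `excluded_by_relative_path` are hypothetical, and `fnmatchcase` is assumed as a stand-in for whatever wildcard matching FSCrawler really uses.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def excluded_by_name(rel_path, patterns):
    """Name-only matching: compare each pattern with the last path segment."""
    name = PurePosixPath(rel_path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


def excluded_by_relative_path(rel_path, patterns):
    """Path matching: compare each pattern with the path relative to the root."""
    return any(fnmatchcase(rel_path, pattern) for pattern in patterns)


dirs = ["folderA/subfolderB", "folderB/subfolderB", "folderC/subfolderB"]

# Name-only matching never sees the "folderB/" prefix, so nothing is excluded.
name_hits = [d for d in dirs if excluded_by_name(d, ["folderB/subfolderB"])]

# Relative-path matching excludes only the one directory that was asked for.
path_hits = [d for d in dirs if excluded_by_relative_path(d, ["folderB/subfolderB"])]

# A bare name excludes subfolderB under every parent folder.
bare_hits = [d for d in dirs if excluded_by_name(d, ["subfolderB"])]
```

With name-only matching, the pattern `folderB/subfolderB` can never match because only `subfolderB` is compared, which is the behavior reported in this issue.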

@dadoonet dadoonet self-assigned this Jul 23, 2018
dadoonet added a commit that referenced this issue Jul 23, 2018
For example if I have a directory /mypath/folder with subdirectories

```
/mypath/folder
├── folderA
│   ├── subfolderA
│   ├── subfolderB
│   └── subfolderC
├── folderB
│   ├── subfolderA
│   ├── subfolderB
│   └── subfolderC
└── folderC
    ├── subfolderA
    ├── subfolderB
    └── subfolderC
```

I would like to be able to start crawling at `/mypath/folder` and crawl everything except `folderB/subfolderB` for example.

I would like to be able to put `"excludes": ["folderB/subfolderB"]` or even a wildcard like `"excludes": ["folderB/subfolder*"]`.

Closes #553.
@Kevin-McKee
Author

Kevin-McKee commented Jul 23, 2018 via email

@dadoonet
Owner

@a344254 I removed some personal information from the last answer you sent (email signature).

I created this branch https://github.com/dadoonet/fscrawler/tree/fix/553-exclude-dirs which supports the new feature.
I did some tests and it seems to work, but I'd love it if you could try it out.

Compile the project with

mvn package -DskipTests

And get the zip distribution file.

Documentation about this change is:

Let me know! :)

@dadoonet dadoonet added this to the 2.5 milestone Jul 23, 2018
dadoonet added a commit that referenced this issue Jul 28, 2018
For example if I have a directory /mypath/folder with subdirectories

```
/mypath/folder
├── folderA
│   ├── subfolderA
│   ├── subfolderB
│   └── subfolderC
├── folderB
│   ├── subfolderA
│   ├── subfolderB
│   └── subfolderC
└── folderC
    ├── subfolderA
    ├── subfolderB
    └── subfolderC
```

I would like to be able to start crawling at `/mypath/folder` and crawl everything except `/folderB/subfolderB` for example.

I would like to be able to put `"excludes": ["/folderB/subfolderB"]` or even a wildcard like `"excludes": ["/folderB/subfolder*"]`.

Closes #553.
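
As a rough illustration of the behavior this commit message describes, the following Python sketch walks a directory tree and prunes any directory whose path relative to the crawl root, written with a leading `/`, matches an exclude pattern. It is not FSCrawler's implementation: `crawl` and `build_example_tree` are hypothetical helpers, and `fnmatchcase` is an assumed stand-in for the real matcher (its `*` also matches `/`, which may differ from FSCrawler).

```python
import os
import tempfile
from fnmatch import fnmatchcase


def crawl(root, excludes):
    """Return the directories visited under root as '/'-prefixed relative paths."""
    visited = []
    for current, subdirs, _files in os.walk(root):
        kept = []
        for name in sorted(subdirs):
            rel = os.path.relpath(os.path.join(current, name), root)
            rel = "/" + rel.replace(os.sep, "/")
            if any(fnmatchcase(rel, pattern) for pattern in excludes):
                continue  # excluded: do not record it and do not descend
            kept.append(name)
            visited.append(rel)
        subdirs[:] = kept  # prune in place so os.walk skips excluded directories
    return visited


def build_example_tree(root):
    """Create the folderA..C / subfolderA..C layout from the issue."""
    for folder in ("folderA", "folderB", "folderC"):
        for sub in ("subfolderA", "subfolderB", "subfolderC"):
            os.makedirs(os.path.join(root, folder, sub))


with tempfile.TemporaryDirectory() as tmp:
    build_example_tree(tmp)
    exact = crawl(tmp, ["/folderB/subfolderB"])
    wildcard = crawl(tmp, ["/folderB/subfolder*"])
```

In this sketch, `exact` contains every directory except `/folderB/subfolderB`, while `wildcard` still contains `/folderB` itself but none of its subfolders.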
@Kevin-McKee
Author

I did some testing this morning and it appears to work. Thanks!
P.S. - the download link on http://fscrawler.readthedocs.io/en/fscrawler-2.5/installation.html doesn't work - it's pointing at https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.5/fscrawler-2.5.zip but gives a 404 Not Found error.

@dadoonet
Owner

dadoonet commented Aug 7, 2018

Thanks @a344254. I just forgot to close and release the sonatype repo :)
It's done now. It can take some time to be synced to Maven Central though. Let's check again in a few hours.
