Skip to content

Commit

Permalink
Finish overhaul of glob matching.
Browse files Browse the repository at this point in the history
This commit completes the initial move of glob matching to an external
crate, including fixing up cross platform support, polishing the
external crate for others to use and fixing a number of bugs in the
process.

Fixes #87, #127, #131
  • Loading branch information
BurntSushi committed Oct 10, 2016
1 parent bc5accc commit e96d930
Show file tree
Hide file tree
Showing 13 changed files with 585 additions and 362 deletions.
7 changes: 0 additions & 7 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 0 additions & 3 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -46,8 +46,5 @@ winapi = "0.2"
[features]
simd-accel = ["regex/simd-accel"]

[dev-dependencies]
glob = "0.2"

[profile.release]
debug = true
5 changes: 0 additions & 5 deletions benches/README.md

This file was deleted.

4 changes: 4 additions & 0 deletions ci/script.sh
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,10 @@ run_test_suite() {
cargo clean --target $TARGET --verbose
cargo build --target $TARGET --verbose
cargo test --target $TARGET --verbose
cargo build --target $TARGET --verbose --manifest-path grep/Cargo.toml
cargo test --target $TARGET --verbose --manifest-path grep/Cargo.toml
cargo build --target $TARGET --verbose --manifest-path globset/Cargo.toml
cargo test --target $TARGET --verbose --manifest-path globset/Cargo.toml

# sanity check the file type
file target/$TARGET/debug/rg
Expand Down
7 changes: 7 additions & 0 deletions globset/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,17 @@ name = "globset"
version = "0.1.0"
authors = ["Andrew Gallant <[email protected]>"]

[lib]
name = "globset"
bench = false

[dependencies]
aho-corasick = "0.5.3"
fnv = "1.0"
lazy_static = "0.2"
log = "0.3"
memchr = "0.1"
regex = "0.1.77"

[dev-dependencies]
glob = "0.2"
122 changes: 122 additions & 0 deletions globset/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
globset
=======
Cross platform single glob and glob set matching. Glob set matching is the
process of matching one or more glob patterns against a single candidate path
simultaneously, and returning all of the globs that matched.

[![Linux build status](https://api.travis-ci.org/BurntSushi/ripgrep.png)](https://travis-ci.org/BurntSushi/ripgrep)
[![Windows build status](https://ci.appveyor.com/api/projects/status/github/BurntSushi/ripgrep?svg=true)](https://ci.appveyor.com/project/BurntSushi/ripgrep)
[![](https://img.shields.io/crates/v/globset.svg)](https://crates.io/crates/globset)

Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).

### Documentation

[https://docs.rs/globset](https://docs.rs/globset)

### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
globset = "0.1"
```

and this to your crate root:

```rust
extern crate globset;
```

### Example: one glob

This example shows how to match a single glob against a single file path.

```rust
use globset::Glob;

let glob = try!(Glob::new("*.rs")).compile_matcher();

assert!(glob.is_match("foo.rs"));
assert!(glob.is_match("foo/bar.rs"));
assert!(!glob.is_match("Cargo.toml"));
```

### Example: configuring a glob matcher

This example shows how to use a `GlobBuilder` to configure aspects of match
semantics. In this example, we prevent wildcards from matching path separators.

```rust
use globset::GlobBuilder;

let glob = try!(GlobBuilder::new("*.rs")
.literal_separator(true).build()).compile_matcher();

assert!(glob.is_match("foo.rs"));
assert!(!glob.is_match("foo/bar.rs")); // no longer matches
assert!(!glob.is_match("Cargo.toml"));
```

### Example: match multiple globs at once

This example shows how to match multiple glob patterns at once.

```rust
use globset::{Glob, GlobSetBuilder};

let mut builder = GlobSetBuilder::new();
// A GlobBuilder can be used to configure each glob's match semantics
// independently.
builder.add(try!(Glob::new("*.rs")));
builder.add(try!(Glob::new("src/lib.rs")));
builder.add(try!(Glob::new("src/**/foo.rs")));
let set = try!(builder.build());

assert_eq!(set.matches("src/bar/baz/foo.rs"), vec![0, 2]);
```

### Performance

This crate implements globs by converting them to regular expressions, and
executing them with the
[`regex`](https://github.com/rust-lang-nursery/regex)
crate.

For single glob matching, performance of this crate should be roughly on par
with the performance of the
[`glob`](https://github.com/rust-lang-nursery/glob)
crate. (`*_regex` correspond to benchmarks for this library while `*_glob`
correspond to benchmarks for the `glob` library.)
Optimizations in the `regex` crate may propel this library past `glob`,
particularly when matching longer paths.

```
test ext_glob ... bench: 425 ns/iter (+/- 21)
test ext_regex ... bench: 175 ns/iter (+/- 10)
test long_glob ... bench: 182 ns/iter (+/- 11)
test long_regex ... bench: 173 ns/iter (+/- 10)
test short_glob ... bench: 69 ns/iter (+/- 4)
test short_regex ... bench: 83 ns/iter (+/- 2)
```

The primary performance advantage of this crate is when matching multiple
globs against a single path. With the `glob` crate, one must match each glob
synchronously, one after the other. In this crate, many can be matched
simultaneously. For example:

```
test many_short_glob ... bench: 1,063 ns/iter (+/- 47)
test many_short_regex_set ... bench: 186 ns/iter (+/- 11)
```

### Comparison with the [`glob`](https://github.com/rust-lang-nursery/glob) crate

* Supports alternate "or" globs, e.g., `*.{foo,bar}`.
* Can match non-UTF-8 file paths correctly.
* Supports matching multiple globs at once.
* Doesn't provide a recursive directory iterator of matching file paths,
although I believe this crate should grow one eventually.
* Supports case insensitive and require-literal-separator match options, but
**doesn't** support the require-literal-leading-dot option.
64 changes: 28 additions & 36 deletions benches/bench.rs → globset/benches/bench.rs
Original file line number Diff line number Diff line change
Expand Up @@ -5,39 +5,52 @@ tool itself, see the benchsuite directory.
#![feature(test)]

extern crate glob;
extern crate globset;
#[macro_use]
extern crate lazy_static;
extern crate regex;
extern crate test;

use globset::{Candidate, Glob, GlobMatcher, GlobSet, GlobSetBuilder};

const EXT: &'static str = "some/a/bigger/path/to/the/crazy/needle.txt";
const EXT_PAT: &'static str = "*.txt";

const SHORT: &'static str = "some/needle.txt";
const SHORT_PAT: &'static str = "some/**/needle.txt";

const LONG: &'static str = "some/a/bigger/path/to/the/crazy/needle.txt";
const LONG_PAT: &'static str = "some/**/needle.txt";

#[allow(dead_code, unused_variables)]
#[path = "../src/glob.rs"]
mod reglob;

fn new_glob(pat: &str) -> glob::Pattern {
glob::Pattern::new(pat).unwrap()
}

fn new_reglob(pat: &str) -> reglob::Set {
let mut builder = reglob::SetBuilder::new();
builder.add(pat).unwrap();
builder.build().unwrap()
fn new_reglob(pat: &str) -> GlobMatcher {
Glob::new(pat).unwrap().compile_matcher()
}

fn new_reglob_many(pats: &[&str]) -> reglob::Set {
let mut builder = reglob::SetBuilder::new();
fn new_reglob_many(pats: &[&str]) -> GlobSet {
let mut builder = GlobSetBuilder::new();
for pat in pats {
builder.add(pat).unwrap();
builder.add(Glob::new(pat).unwrap());
}
builder.build().unwrap()
}

#[bench]
fn ext_glob(b: &mut test::Bencher) {
let pat = new_glob(EXT_PAT);
b.iter(|| assert!(pat.matches(EXT)));
}

#[bench]
fn ext_regex(b: &mut test::Bencher) {
let set = new_reglob(EXT_PAT);
let cand = Candidate::new(EXT);
b.iter(|| assert!(set.is_match_candidate(&cand)));
}

#[bench]
fn short_glob(b: &mut test::Bencher) {
let pat = new_glob(SHORT_PAT);
Expand All @@ -47,7 +60,8 @@ fn short_glob(b: &mut test::Bencher) {
#[bench]
fn short_regex(b: &mut test::Bencher) {
let set = new_reglob(SHORT_PAT);
b.iter(|| assert!(set.is_match(SHORT)));
let cand = Candidate::new(SHORT);
b.iter(|| assert!(set.is_match_candidate(&cand)));
}

#[bench]
Expand All @@ -59,7 +73,8 @@ fn long_glob(b: &mut test::Bencher) {
#[bench]
fn long_regex(b: &mut test::Bencher) {
let set = new_reglob(LONG_PAT);
b.iter(|| assert!(set.is_match(LONG)));
let cand = Candidate::new(LONG);
b.iter(|| assert!(set.is_match_candidate(&cand)));
}

const MANY_SHORT_GLOBS: &'static [&'static str] = &[
Expand Down Expand Up @@ -101,26 +116,3 @@ fn many_short_regex_set(b: &mut test::Bencher) {
let set = new_reglob_many(MANY_SHORT_GLOBS);
b.iter(|| assert_eq!(2, set.matches(MANY_SHORT_SEARCH).iter().count()));
}

// This is the fastest on my system (beating many_glob by about 2x). This
// suggests that a RegexSet needs quite a few regexes (or a larger haystack)
// in order for it to scale.
//
// TODO(burntsushi): come up with a benchmark that uses more complex patterns
// or a longer haystack.
#[bench]
fn many_short_regex_pattern(b: &mut test::Bencher) {
let pats: Vec<_> = MANY_SHORT_GLOBS.iter().map(|&s| {
let pat = reglob::Pattern::new(s).unwrap();
regex::Regex::new(&pat.to_regex()).unwrap()
}).collect();
b.iter(|| {
let mut count = 0;
for pat in &pats {
if pat.is_match(MANY_SHORT_SEARCH) {
count += 1;
}
}
assert_eq!(2, count);
})
}
Loading

0 comments on commit e96d930

Please sign in to comment.