v0.0.7 wrap up (#47)
* readme updates & show output path in configuration

* readme updates.

* reduce the size of empty files.

* update smash ignores.

* classified.

* show all dupes if lower than the top size.

* Bugfix: report shows initial duplicate.

* Final bits for v0.0.7.

* reduce the sizes.
thushan authored Dec 20, 2023
1 parent 7ca0166 commit bc98815
Showing 9 changed files with 109 additions and 25 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -29,6 +29,11 @@ go.work
/.code
/dist

# output files
smash
smash.exe
smash.out

# temporary files
report.json
analysis.json
Binary file added docs/artefacts/smash-v0.0.7-demo.gif
Binary file added docs/artefacts/smash-v0.0.7-hdd-photos-demo.gif
3 changes: 2 additions & 1 deletion docs/demos.md
@@ -3,10 +3,11 @@
Pre-recorded demos of `smash` in action.

# Examples
* [v0.0.7-hdd-photos](https://vhs.charm.sh/vhs-4OwN0BJfb3F3CTzGJCFHcs.gif) - `smash`ing an old portable USB HDD of photos & removing duplicates.
* [v0.0.5-hdd-photos](https://vhs.charm.sh/vhs-7B6XHxXq8VPvZ6AY9FpGIc.gif) - `smash`ing an old portable USB HDD of photos with excluded directories.

# Versions

* [v0.0.7](https://vhs.charm.sh/vhs-5uZbZAvk8Y6eq4dihLppbk.gif) - powered by [VHS](https://vhs.charm.sh)
* [v0.0.5](https://vhs.charm.sh/vhs-1zSMi9vYpmh0DivoB4E6g4.gif) - powered by [VHS](https://vhs.charm.sh)
* [v0.0.4](https://vhs.charm.sh/vhs-tgMXNRqo7UovLRd5iSlgF.gif) - powered by [VHS](https://vhs.charm.sh)
* [v0.0.3](https://vhs.charm.sh/vhs-1T6pqQivwvPAmudnDpwVQP.gif) - powered by [VHS](https://vhs.charm.sh)
17 changes: 16 additions & 1 deletion docs/vhs/demo-photos-hdd.tape
@@ -13,7 +13,22 @@ Set WindowBar Colorful
Set Theme "TokyoNight"

# smash Linux/drivers
Type "./smash /media/thushan/smash/photos/ --exclude-dir=_uploaded,sort,tmp,events"
Type "./smash /media/thushan/smash/photos/ --exclude-dir=sort,tmp,events -o report.json"
Sleep 500ms
Enter
Sleep 30s
Type "clear"
Sleep 500ms
Enter
Type `jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json`
Sleep 500ms
Enter
Sleep 5s
Type "clear"
Sleep 500ms
Enter
Type `jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json | xargs ls -lh`
Sleep 500ms
Enter
Sleep 5s

29 changes: 28 additions & 1 deletion docs/vhs/demo.tape
@@ -13,7 +13,34 @@ Set WindowBar Colorful
Set Theme "JetBrains Darcula"

# smash Linux/drivers
Type "./smash ~/linux/drivers"
Type "./smash ~/linux/drivers --exclude-dir=git -o report.json"
Sleep 500ms
Enter
Sleep 10s
Type "clear"
Sleep 500ms
Enter
Type `jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l`
Sleep 500ms
Enter
Sleep 5s
Type `jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json | xargs rm`
Sleep 500ms
Enter
Sleep 5s
Type "cd ~/linux/drivers"
Sleep 500ms
Enter
Sleep 2s
Type "git status -s"
Sleep 500ms
Enter
Sleep 3s
Type "git reset --hard"
Sleep 500ms
Enter
Sleep 3s
Type "git status -s"
Sleep 500ms
Enter
Sleep 5s
4 changes: 4 additions & 0 deletions internal/smash/configuration.go
@@ -39,6 +39,10 @@ func (app *App) printConfiguration() {
theme.Println(b.Sprint("Algorithm: "), theme.ColourConfig(algorithms.Algorithm(f.Algorithm)))
theme.Println(b.Sprint("Locations: "), theme.ColourConfig(strings.Join(app.Locations, ", ")))

if !f.HideOutput && f.OutputFile != "" {
theme.Println(b.Sprint("Output: "), theme.ColourConfig(f.OutputFile), "(json)")
}

if len(f.ExcludeDir) > 0 || len(f.ExcludeFile) > 0 {
theme.StyleBold.Println("Excluded")
if len(f.ExcludeDir) > 0 {
41 changes: 30 additions & 11 deletions internal/smash/export.go
@@ -41,18 +41,23 @@ type ReportTopFilesSummary struct {
}
type ReportFiles struct {
Fails []ReportFailSummary `json:"fails"`
Empty []ReportFileSummary `json:"empty"`
Empty []ReportFileBaseSummary `json:"empty"`
Dupes []ReportDuplicateSummary `json:"dupes"`
}

type ReportFailSummary struct {
Filename string `json:"filename"`
Error string `json:"error"`
}
type ReportFileSummary struct {

type ReportFileBaseSummary struct {
Filename string `json:"filename"`
Location string `json:"location"`
Path string `json:"path"`
}

type ReportFileSummary struct {
ReportFileBaseSummary
Hash string `json:"hash"`
Size uint64 `json:"size"`
FullHash bool `json:"fullHash"`
@@ -116,19 +121,23 @@ func getHostName() string {
if host, err := os.Hostname(); err == nil {
return host
}
return "Unknown"
return "Classified"
}

func summariseRunAnalysis(session *AppSession) ReportFiles {

fails := summariseSmashFails(session.Fails)
empty := summariseEmptyFiles(session.Empty.Files)
dupes := transformDupes(session.Dupes)

return ReportFiles{
Fails: summariseSmashedFails(session.Fails),
Empty: summariseSmashedFiles(session.Empty.Files),
Dupes: transformDupes(session.Dupes),
Fails: fails,
Empty: empty,
Dupes: dupes,
}
}

func summariseSmashedFails(fails *xsync.MapOf[string, error]) []ReportFailSummary {
func summariseSmashFails(fails *xsync.MapOf[string, error]) []ReportFailSummary {
summary := make([]ReportFailSummary, fails.Size())
var index = 0
fails.Range(func(key string, value error) bool {
@@ -147,16 +156,24 @@ func transformDupes(duplicates *xsync.MapOf[string, *DuplicateFiles]) []ReportDu
var index = 0
duplicates.Range(func(hash string, dupe *DuplicateFiles) bool {
root := dupe.Files[0]
rest := dupe.Files[1:]
dupes[index] = ReportDuplicateSummary{
ReportFileSummary: summariseSmashedFile(root),
Duplicates: summariseSmashedFiles(dupe.Files),
Duplicates: summariseSmashedFiles(rest),
}
index++
return true
})
return dupes
}

func summariseEmptyFiles(files []File) []ReportFileBaseSummary {
summary := make([]ReportFileBaseSummary, len(files))
for i, file := range files {
summary[i] = summariseSmashedFile(file).ReportFileBaseSummary
}
return summary
}
func summariseSmashedFiles(files []File) []ReportFileSummary {
summary := make([]ReportFileSummary, len(files))
for i, file := range files {
@@ -166,9 +183,11 @@ func summariseSmashedFiles(files []File) []ReportFileSummary {
}
func summariseSmashedFile(file File) ReportFileSummary {
return ReportFileSummary{
Filename: file.Filename,
Location: file.Location,
Path: filepath.Dir(file.Path),
ReportFileBaseSummary: ReportFileBaseSummary{
Filename: file.Filename,
Location: file.Location,
Path: filepath.Dir(file.Path),
},
Hash: file.Hash,
Size: file.FileSize,
FullHash: file.FullHash,
35 changes: 24 additions & 11 deletions readme.md
@@ -13,14 +13,13 @@ Tool to `smash` through to find duplicate files efficiently by slicing a file (o
and computing a hash using a fast non-cryptographic algorithm such as [xxhash](https://xxhash.com/) or [murmur3](https://en.wikipedia.org/wiki/MurmurHash).

<p align="center">
<img src="https://vhs.charm.sh/vhs-1zSMi9vYpmh0DivoB4E6g4.gif" alt="Made with VHS"><br/>
<img src="https://vhs.charm.sh/vhs-5uZbZAvk8Y6eq4dihLppbk.gif" alt="Made with VHS"><br/>
<sub>
<sup>Find duplicates in the <a href="https://github.com/torvalds/linux">linux/drivers</a> source tree with <code>smash</code> (see our <a href="docs/demos.md">🍿 other demos</a>). Made with <a href="https://vhs.charm.sh" target="_blank">vhs</a>!</sup>
</sub>
</p>

`smash` has a read-only view of the underlying filesystem and only reports duplicates - currently, we do not remove
duplicates and instead leave that for you to do via the output. We also don't support symlinks or NT Junction Points (Windows Symlinks) and ignore them.
`smash` has a read-only view of the underlying filesystem, outputs empty and duplicate files into a json report that you can use a tool like [jq](https://github.com/jqlang/jq) to operate on. See examples below or [this vhs tape](https://vhs.charm.sh/vhs-4OwN0BJfb3F3CTzGJCFHcs.gif).

The name comes from a prototype tool called SmartHash (written many years ago in C/ASM that's now lost in source &
too hard to modernise) which operated on a similar concept (with CRC32 then later MD5).
@@ -48,25 +47,26 @@ $ go install github.com/thushan/smash@latest
```
Usage:
smash [flags] [locations-to-smash]
Flags:
--algorithm algorithm Algorithm to use to hash files. Supported: xxhash, murmur3, md5, sha512, sha256 (full list, see readme) (default xxhash)
--base strings Base directories to use for comparison Eg. --base=/c/dos,/c/dos/run/,/run/dos/run
--disable-slicing Disable slicing & hash the full file instead
--disable-autotext Disable detecting text-files to opt for a full hash for those
--disable-meta Disable storing of meta-data to improve hashing mismatches
--disable-slicing Disable slicing & hash the full file instead
--exclude-dir strings Directories to exclude separated by comma Eg. --exclude-dir=.git,.idea
--exclude-file strings Files to exclude separated by comma Eg. --exclude-file=.gitignore,*.csv
-h, --help help for smash
--ignore-empty Ignore empty/zero byte files (default true)
--ignore-hidden Ignore hidden files & folders Eg. files/folders starting with '.' (default true)
--ignore-system Ignore system files & folders Eg. '$MFT', '.Trash' (default true)
-p, --max-threads int Maximum threads to utilise (default 16)
-w, --max-workers int Maximum workers to utilise when smashing (default 8)
-w, --max-workers int Maximum workers to utilise when smashing (default 16)
--nerd-stats Show nerd stats
--no-output Disable report output
--no-progress Disable progress updates
--no-top-list Hides top x duplicates list
-o, --output-file string Export as JSON
-o, --output-file string Export analysis as JSON (generated automatically otherwise)
--profile Enable Go Profiler - see localhost:1984/debug/pprof
--progress-update int Update progress every x seconds (default 5)
--show-duplicates Show full list of duplicates
@@ -84,18 +84,30 @@ Examples are given in Unix format, but apply to Windows as well.

### Basic

To check for duplicates in a single path (Eg. `~/media/photos`)
To check for duplicates in a single path (Eg. `~/media/photos`) & output report to `report.json`

```bash
$ ./smash ~/media/photos
$ ./smash ~/media/photos -o report.json
```

You can then look at `report.json` with [jq](https://github.com/jqlang/jq) to check duplicates:

```bash
$ jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l
```
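
For readers without `jq`, the same listing can be approximated in Go. This is a minimal sketch that assumes only what the filter above relies on: a top-level `analysis` object holding a `dupes` array whose entries expose `location`, `path` and `filename`. The `report` type and the `dupePaths` helper are hypothetical names, and the rest of the report schema is deliberately ignored.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// report models only the part of report.json that the jq filter touches.
// Any other keys in the file are skipped by the decoder.
type report struct {
	Analysis struct {
		Dupes []struct {
			Location string `json:"location"`
			Path     string `json:"path"`
			Filename string `json:"filename"`
		} `json:"dupes"`
	} `json:"analysis"`
}

// dupePaths returns one "location/path/filename" string per dupes entry,
// joined with "/" in the same way as the jq expression join("/").
func dupePaths(data []byte) ([]string, error) {
	var rep report
	if err := json.Unmarshal(data, &rep); err != nil {
		return nil, err
	}
	paths := make([]string, 0, len(rep.Analysis.Dupes))
	for _, d := range rep.Analysis.Dupes {
		paths = append(paths, strings.Join([]string{d.Location, d.Path, d.Filename}, "/"))
	}
	return paths, nil
}

func main() {
	// report.json is the file name used in the example above.
	data, err := os.ReadFile("report.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	paths, err := dupePaths(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

Unlike `jq` without `-r`, this prints the paths unquoted, so paths containing spaces would need extra care before being piped to `xargs`.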

### Show Empty Files

By default, `smash` ignores empty files but can report on them with the `--ignore-empty=false` argument:

```bash
$ ./smash ~/media/photos --ignore-empty=false
$ ./smash ~/media/photos --ignore-empty=false -o report.json
```

You can then look at `report.json` with [jq](https://github.com/jqlang/jq) to check empty files:

```bash
$ jq '.analysis.empty[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l
```

### Show Top 50 Duplicates
@@ -155,6 +167,7 @@ $ ./smash --algorithm:murmur3 ~/media/photos

This project was possible thanks to the following projects or folks.

* [@jqlang/jq](https://github.com/jqlang/jq) - without `jq` we'd be a bit lost!
* [@wader/fq](https://github.com/wader/fq) - countless nights of inspecting binary blobs!
* [@cespare/xxhash](https://github.com/cespare/xxhash) - xxhash implementation
* [@spaolacci/murmur3](https://github.com/spaolacci/murmur3) - murmur3 implementation
@@ -164,7 +177,7 @@ This project was possible thanks to the following projects or folks.
* [@golangci/golangci-lint](https://github.com/golangci/golangci-lint) - Go Linter
* [@dkorunic/betteralign](https://github.com/dkorunic/betteralign) - Go alignment checker

Testers - MarkB, JarredT, BenW, DencilW, JayT, ASV, TimW, RyanW, WilliamH
Testers - MarkB, JarredT, BenW, DencilW, JayT, ASV, TimW, RyanW, WilliamH, SpencerB, EmadA, ChrisE, AngelaB

# Licence

