groupcover

Staged deduplication.

Test drive

$ go install github.com/miku/groupcover/cmd/groupcover@latest

Alternatively, install via packages.

Usage

$ groupcover < input.csv > changes.csv

Where input.csv has three or more columns:

id, group, attribute, [key, key, ...]

Items from different groups (e.g. data sources) may share an attribute value (e.g. ISBN or DOI). Depending on a preference over groups (possibly per key), a number of keys may be dropped for an entry.

The CSV file must already be sorted by attribute.
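Why the sort order matters can be illustrated with a small pre-flight check. The Go sketch below is not part of groupcover; `firstSplit` is a hypothetical helper that reports an attribute value whose rows are not adjacent, which is the property sorting guarantees. It assumes the attribute is in the third column (the default for -f) and that only neighbouring rows with equal attributes are compared, which the requirement above suggests but does not state.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
)

// firstSplit returns the first value in the zero-based column col whose
// rows are not contiguous, i.e. a value that reappears after a different
// value was seen. The boolean is false if every value forms a single run.
// Rows too short to have the column are ignored in this sketch.
func firstSplit(rows [][]string, col int) (string, bool) {
	seen := make(map[string]bool)
	prev, hasPrev := "", false
	for _, row := range rows {
		if col >= len(row) {
			continue
		}
		v := row[col]
		if hasPrev && v == prev {
			continue // still inside the current run
		}
		if seen[v] {
			return v, true
		}
		seen[v] = true
		prev, hasPrev = v, true
	}
	return "", false
}

func main() {
	r := csv.NewReader(os.Stdin)
	r.FieldsPerRecord = -1 // rows may carry different numbers of keys
	rows, err := r.ReadAll()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if v, split := firstSplit(rows, 2); split {
		fmt.Fprintf(os.Stderr, "attribute %q is not contiguous, sort the input first\n", v)
		os.Exit(1)
	}
}
```

The check reads the whole file into memory, which is fine for a quick test but not for large inputs.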

$ groupcover -h
Usage of groupcover:
  -cpuprofile string
        pprof output file
  -f int
        column to use for grouping, one-based (default 3)
  -lower
        lowercase input
  -prefs string
        space separated string of preferences (most preferred first), e.g. 'B A C'
  -verbose
        more output
  -version
        show version

Examples

$ cat fixtures/sample.csv
id-1,group-1,value-1,Leipzig,Berlin
id-2,group-2,value-1,Berlin,Dresden

These two rows are duplicates, but only with respect to Berlin: id-1 and id-2 share the attribute value value-1, and the key Berlin appears in both. By default, the group with the higher lexicographic value is chosen, so after deduplication Berlin stays with id-2 and is dropped from id-1:

$ groupcover < fixtures/sample.csv 2> /dev/null
id-1,group-1,value-1,Leipzig
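The default rule can be sketched in a few lines of Go. This illustrates the behaviour described above and is not groupcover's actual code; `dropCovered` is a hypothetical name, and the sketch assumes that all rows passed in share one attribute value and have at least three columns.

```go
package main

import (
	"fmt"
	"strings"
)

// dropCovered takes rows of the form id, group, attribute, key... that
// all share one attribute value. For every key, the lexicographically
// greatest group carrying it keeps the key; the key is removed from rows
// of all other groups. Only rows that lost at least one key are returned.
func dropCovered(rows [][]string) [][]string {
	// winner maps each key to the greatest group that carries it.
	winner := make(map[string]string)
	for _, row := range rows {
		for _, key := range row[3:] {
			if g, ok := winner[key]; !ok || row[1] > g {
				winner[key] = row[1]
			}
		}
	}
	var changed [][]string
	for _, row := range rows {
		kept := append([]string{}, row[:3]...)
		for _, key := range row[3:] {
			if winner[key] == row[1] {
				kept = append(kept, key)
			}
		}
		if len(kept) != len(row) {
			changed = append(changed, kept)
		}
	}
	return changed
}

func main() {
	rows := [][]string{
		{"id-1", "group-1", "value-1", "Leipzig", "Berlin"},
		{"id-2", "group-2", "value-1", "Berlin", "Dresden"},
	}
	for _, row := range dropCovered(rows) {
		fmt.Println(strings.Join(row, ","))
	}
}
```

For the two sample rows this prints id-1,group-1,value-1,Leipzig, the same line groupcover writes above.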

Since 0.0.4, there is an experimental flag for setting preferences. Listing group-2 first reproduces the default result:

$ groupcover -prefs 'group-2 group-1' < fixtures/sample.csv 2> /dev/null
id-1,group-1,value-1,Leipzig

To override the default lexicographic order and prefer group-1 over group-2:

$ groupcover -prefs 'group-1 group-2' < fixtures/sample.csv 2> /dev/null
id-2,group-2,value-1,Dresden
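A preference list can be modelled as a comparison function that replaces the lexicographic default. The Go sketch below is an illustration only; `newPreference` and `dropWithPrefs` are hypothetical names. How groupcover ranks groups that are missing from the -prefs string is not documented here, so the sketch assumes they rank below all listed groups and fall back to lexicographic order among themselves.

```go
package main

import (
	"fmt"
	"strings"
)

// newPreference parses a space-separated preference string (most
// preferred first) and returns a function reporting whether group a
// wins over group b. Assumption: unlisted groups lose against listed
// ones and are ordered lexicographically among themselves.
func newPreference(prefs string) func(a, b string) bool {
	pos := make(map[string]int)
	for i, g := range strings.Fields(prefs) {
		if _, ok := pos[g]; !ok {
			pos[g] = i
		}
	}
	return func(a, b string) bool {
		ia, oka := pos[a]
		ib, okb := pos[b]
		switch {
		case oka && okb:
			return ia < ib
		case oka != okb:
			return oka
		default:
			return a > b
		}
	}
}

// dropWithPrefs removes each key from all rows except those of the
// winning group and returns only the rows that changed. All rows are
// assumed to share one attribute value and have at least three columns.
func dropWithPrefs(rows [][]string, wins func(a, b string) bool) [][]string {
	winner := make(map[string]string)
	for _, row := range rows {
		for _, key := range row[3:] {
			if g, ok := winner[key]; !ok || wins(row[1], g) {
				winner[key] = row[1]
			}
		}
	}
	var changed [][]string
	for _, row := range rows {
		kept := append([]string{}, row[:3]...)
		for _, key := range row[3:] {
			if winner[key] == row[1] {
				kept = append(kept, key)
			}
		}
		if len(kept) != len(row) {
			changed = append(changed, kept)
		}
	}
	return changed
}

func main() {
	rows := [][]string{
		{"id-1", "group-1", "value-1", "Leipzig", "Berlin"},
		{"id-2", "group-2", "value-1", "Berlin", "Dresden"},
	}
	for _, row := range dropWithPrefs(rows, newPreference("group-1 group-2")) {
		fmt.Println(strings.Join(row, ","))
	}
}
```

With the preference string 'group-1 group-2' the sketch prints id-2,group-2,value-1,Dresden, matching the output above.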

Another example.

$ cat fixtures/mini.csv
1,G1,A1,K1,K2
2,G1,A2,K1,K2
3,G2,A2,K1,K2,K3
4,G3,A2,K2
5,G1,A3,K1,K2,K3
6,G2,A3,K2,K3
7,G1,,K2,K3
8,G2,,K2,K3
9,G2,A4,K2,K3
A,G2,A4,K2,K3

To sort a CSV file by the attribute in the third column:

$ sort -t, -k3,3 fixtures/mini.csv

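Only the changed entries are written. Unchanged rows, such as rows 7 and 8 with an empty attribute or rows 9 and A from the same group, do not appear: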

$ groupcover < fixtures/mini.csv 2> /dev/null
2,G1,A2
3,G2,A2,K1,K3
5,G1,A3,K1

Finc Index

Licensing information is available, for example, in AILicensing (intermediate format). The following command extracts record ID, source ID, DOI and labels as CSV:

$ jq -r '[
    .["finc.record_id"],
    .["finc.source_id"],
    .["doi"],
    .["x.labels"][]?] | @csv' < <(unpigz -c /tmp/AILicensing/date-2016-11-28.ldj.gz)

"ai-48-QkVGT19fTTgzMDMxOTUzMzcwLU0tRklaVC1ET01BLVpERUUtQkVGTy1JVEVD","48",,"DE-J59"
"ai-48-QkVGT19fTTgzMDMxOTIwNjQ1LU0tRklaVC1ET01BLUJFRk8","48",,"DE-J59"
"ai-48-QkVGT19fTTgzMDMxOTE3NjQ1LU0tRklaVC1ET01BLUJFRk8","48",,"DE-J59"
...
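The same extraction could also be written in Go, for instance when the conversion should live next to other Go tooling. The sketch below is not part of groupcover; `record`, `toRow` and `run` are hypothetical names. The field names are taken from the jq filter above, while the Go types (strings for IDs and DOI, a list of strings for labels) are assumptions about the data.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// record holds the four fields selected by the jq filter. The types are
// assumptions; a missing or null field decodes to its zero value.
type record struct {
	ID     string   `json:"finc.record_id"`
	Source string   `json:"finc.source_id"`
	DOI    string   `json:"doi"`
	Labels []string `json:"x.labels"`
}

// toRow converts one JSON document into id, group, attribute, key...
func toRow(doc []byte) ([]string, error) {
	var rec record
	if err := json.Unmarshal(doc, &rec); err != nil {
		return nil, err
	}
	return append([]string{rec.ID, rec.Source, rec.DOI}, rec.Labels...), nil
}

// run reads a stream of JSON documents from r and writes CSV rows to w.
func run(r io.Reader, w io.Writer) error {
	dec := json.NewDecoder(r)
	cw := csv.NewWriter(w)
	for {
		var raw json.RawMessage
		err := dec.Decode(&raw)
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		row, err := toRow(raw)
		if err != nil {
			return err
		}
		if err := cw.Write(row); err != nil {
			return err
		}
	}
	cw.Flush()
	return cw.Error()
}

func main() {
	if err := run(os.Stdin, os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The program expects decompressed input on stdin, for example piped from unpigz -c. Unlike jq's @csv, Go's encoding/csv quotes a field only when necessary, so the output is equivalent CSV but not byte-identical. The result still has to be sorted by the third column before it is passed to groupcover.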
