Staged deduplication.
$ go install github.com/miku/groupcover/cmd/groupcover@latest
Or via packages.
$ groupcover < input.csv > changes.csv
Where input.csv has three or more columns:
id, group, attribute, [key, key, ...]
Items from different groups (e.g. data sources) may share an attribute value (e.g. ISBN or DOI). Depending on a preference over groups (possibly per key), a number of keys may be dropped for an entry.
The CSV file must already be sorted by attribute.
$ groupcover -h
Usage of groupcover:
-cpuprofile string
pprof output file
-f int
column to use for grouping, one-based (default 3)
-lower
lowercase input
-prefs string
space separated string of preferences (most preferred first), e.g. 'B A C'
-verbose
more output
-version
show version
$ cat fixtures/sample.csv
id-1,group-1,value-1,Leipzig,Berlin
id-2,group-2,value-1,Berlin,Dresden
This is a duplicate (but only for Berlin), because both id-1 and id-2 have the same value: value-1. The Berlin key is repeated. By default, the group with the higher lexicographic value is choosen, so after deduplication Berlin would stay at id-2, but would get dropped from id-1:
$ groupcover < fixtures/sample.csv 2> /dev/null
id-1,group-1,value-1,Leipzig
Since 0.0.4, there is an experimental flag for settings preferences:
$ groupcover -prefs 'group-2 group-1' < fixtures/sample.csv 2> /dev/null
id-1,group-1,value-1,Leipzig
Overwrite default lexicographic order, prefer group-1 over group-2.
$ groupcover -prefs 'group-1 group-2' < fixtures/sample.csv 2> /dev/null
id-2,group-2,value-1,Dresden
Another example.
$ cat fixtures/mini.csv
1,G1,A1,K1,K2
2,G1,A2,K1,K2
3,G2,A2,K1,K2,K3
4,G3,A2,K2
5,G1,A3,K1,K2,K3
6,G2,A3,K2,K3
7,G1,,K2,K3
8,G2,,K2,K3
9,G2,A4,K2,K3
A,G2,A4,K2,K3
To sort CSV by attribute:
$ sort -t, -k3 fixtures/mini.csv
Only the changed entries are written:
$ groupcover < fixtures/mini.csv 2> /dev/null
2,G1,A2
3,G2,A2,K1,K3
5,G1,A3,K1
The licensing information is available e.g. in AILicensing, as intermediate format.
$ jq -r '[
.["finc.record_id"],
.["finc.source_id"],
.["doi"],
.["x.labels"][]?] | @csv' < <(unpigz -c /tmp/AILicensing/date-2016-11-28.ldj.gz)
"ai-48-QkVGT19fTTgzMDMxOTUzMzcwLU0tRklaVC1ET01BLVpERUUtQkVGTy1JVEVD","48",,"DE-J59"
"ai-48-QkVGT19fTTgzMDMxOTIwNjQ1LU0tRklaVC1ET01BLUJFRk8","48",,"DE-J59"
"ai-48-QkVGT19fTTgzMDMxOTE3NjQ1LU0tRklaVC1ET01BLUJFRk8","48",,"DE-J59"
...