Feature Request: More unique uniqueness flag #13

Open
StaticPH opened this issue Feb 26, 2021 · 2 comments

@StaticPH

As it stands, both runiq and runiq --invert always include a single instance of each value that exists more than once within the inputs. There is not, however, an option to completely omit values that occur more than once.
I would like to see some sort of '--no-duped' flag (the name is open for debate), probably mutually exclusive with --invert, that filters out all occurrences of data with duplicates, rather than the current default behavior of leaving a single instance.
example:

$ cat fileA
a1
b7
c1
d3
$ cat fileB
a7
b3
d8
c1
d3

With the current behavior, runiq fileA fileB would produce:

a1
b7
c1
d3
a7
b3
d8

runiq --no-duped fileA fileB would then produce:

a1
b7
a7
b3
d8
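For comparison, the requested semantics can be emulated today with coreutils alone, without runiq: `uniq -u` prints only lines that occur exactly once, at the cost of sorting away the input order. A sketch, using the example files above:

```shell
# recreate the example files from this issue
printf 'a1\nb7\nc1\nd3\n' > fileA
printf 'a7\nb3\nd8\nc1\nd3\n' > fileB

# print only lines that appear exactly once across both files;
# note that sort discards the original input order
sort fileA fileB | uniq -u
```

This prints `a1 a7 b3 b7 d8`, the same set as the hypothetical `--no-duped` output, but in sorted order rather than input order.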
@whitfin (Owner) commented Feb 26, 2021

Hi @StaticPH!

Although this sounds reasonable, it's likely not viable because it requires storing every value in memory to emit at the end (seeing as you need to process all input to know whether there has been a duplicate or not).

This is probably not something that should be added, since it would be far too easy for unsuspecting users to blow up on memory. If there's some trick that would allow this to work more efficiently, I'm all ears, but on the face of it, it just doesn't seem feasible.

@StaticPH (Author)

I'd be fine with just running runiq twice for such tasks, once with --invert to find all the values with duplicates, and a second time with some flag indicating that values matching subsequent arguments (or even just all values in some FILE; hopefully the code will play nicely with command redirection) should be ignored entirely. Ideally the user would only have to enter the command a single time with a specific new flag, and the internal code would deal with the multiple passes, internally storing only the output from --invert. I assume that keeping access to the beginning of a data stream is possible with some form of peek operation, even if it induces some extra IO buffering.

If either of those particular methods still has the issue of memory blow-up, which I suspect they would, it wouldn't be a deal-breaker for the feature to work only with "permanent" files (no command redirection or pipelines).

I could probably get a similar effect by piping the output from the second run through some combination of grep, sed, and awk commands, but I don't think any of those easily supports a variable number of fixed-string patterns in an automated fashion. It'd be best for usability to have this all happen in one tool with a single run, but two runs of that one tool are acceptable considering the good point you raise.
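For what it's worth, the grep/sed/awk route can be collapsed into a single two-pass awk program: the first pass counts every line, the second prints only the lines counted exactly once, preserving input order. It still holds a count for every distinct line in memory, so the blow-up concern raised above applies, and it assumes the inputs are regular files (fileA/fileB from the example):

```shell
# concatenate the inputs so awk can read the same data twice
cat fileA fileB > all

# pass 1 (NR==FNR): count each line; pass 2: print lines counted once
awk 'NR==FNR { count[$0]++; next } count[$0] == 1' all all
```

This prints `a1 b7 a7 b3 d8`, matching the hypothetical `--no-duped` output in the original request.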
