Data 'diff' format #19
Comments
I would be most interested in this as well. On Sun, Mar 1, 2015 at 11:10 AM, Roger D. Peng [email protected]
|
Awesome! I'm not 100% sure how this would work, but I think for it to be useful it would have to sit on top of git and then maybe show how the dataset changes independent of git's own output. The issue there would then be efficiency.... |
@rdpeng, are you familiar with dat? It's a version control system for data, which feels very similar to git. At the moment, I think it tracks modifications by row only, but I am hoping that column-based diffs will be part of a future release. I agree that something that tracks a variety of transformations/modifications would be very useful. |
👍 I was just talking to @gvwilson about this exact thing earlier this week…. |
Also have a look at https://github.com/edwindj/daff
|
Thanks @ledell for mentioning Dat. I'm on my phone so this will be brief but I'll expand later. Dat can natively do diffs and ropensci has a rDat package in the works, waiting on Dat to come to beta (which is soon). I've invited the Dat project to join us and Karissa from their team will join us. There are some issues with rDat that I'm hoping Jeroen will help resolve. But 💯 to pursuing this idea. It should be easy to complete at the event. |
Daff looks quite good actually, and seems to implement most of what I was thinking about. One thing I was hoping to implement was something a bit more "intelligent" (and likely more constraining). So for example, if I transform a column by squaring it, is there a way to show that rather than just indicating that every value in the column changed? Perhaps a diff could be expressed via R code rather than something along the lines of the usual +/- diff format. |
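To make the squaring example concrete, here is a minimal sketch of a transformation-aware check. It is not how daff or dat work; `is_squared`, `before.csv` and `after.csv` are hypothetical names made up for illustration. It assumes both files have a header line, the same rows in the same order, and plain numeric fields with no quoted commas.

```shell
# Hypothetical helper: succeed if column COL of the new file is the square
# of the same column in the old file, row by row.
# Usage: is_squared OLD.csv NEW.csv COL
is_squared() {
  awk -F, -v col="$3" '
    NR == FNR { if (FNR > 1) old[FNR] = $col; next }    # first file: remember old values
    FNR > 1 && $col != old[FNR] * old[FNR] { bad = 1 }  # second file: compare with the square
    END { exit bad + 0 }
  ' "$1" "$2"
}

# Made-up example data: column 2 (x) is squared, column 1 (id) is untouched.
printf 'id,x\n1,2\n2,3\n3,4\n' > before.csv
printf 'id,x\n1,4\n2,9\n3,16\n' > after.csv

if is_squared before.csv after.csv 2; then
  # Report the change as a one-line R expression instead of a +/- diff.
  echo 'x <- x^2'
else
  echo 'column 2 changed, but not by squaring'
fi
```

A real tool would need a catalogue of candidate transformations and a tolerance for floating-point values; this sketch tests exactly one hypothesis with exact comparison.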
@rdpeng your latest comment sounds like a provenance-tracking problem. Are you thinking this will be applied in a system aware of what is done, or does it need to work like diff, i.e. given two datasets and no extra information, tell me the differences? |
Pardon if this is slightly off-topic but I want to park these links in a few relevant places, like this thread. Re: weaning people off of Excel for data inspection and cleaning. OpenRefine comes up a lot and is generally popular with people expecting a GUI. I had always thought it was only mouse driven, but that it wrote some sort of log file. Did not realize these logs are perhaps re-executable. But a recent Twitter conversation intrigues me and also alerted me to Ruby and Python wrappers around the underlying Refine API. @ostephens says:
|
I'm not sure, to be honest. I think I would need a brief discussion of the On Mon, Mar 9, 2015 at 5:22 PM, Gabe Becker [email protected]
Roger D. Peng | @rdpeng https://twitter.com/rdpeng | |
I'm interested in this topic. |
Me too! I really like this idea. In the Unix tradition, I think the best approach to an implementation might be a C or C++ library and command line tool (e.g. like curl). Then, we could maybe write a simple R wrapper. |
Maybe, though that requires us to write it in C/C++ instead of the much I would argue that - for prototyping algorithms and features, at least - Remember what Duncan always said: for every two lines of C you write, you ~G On Mon, Mar 23, 2015 at 10:18 PM, Vince Buffalo [email protected]
Gabriel Becker, PhD |
There's a slick little Chrome extension CSVHub that will visualize the daff-like differencing of a CSV from within GitHub: |
@bbest Wow! That's fantastic! Oddly it doesn't work with TSV files. I've opened an issue Data-Liberation-Front/csvhub#8 to request this feature. |
These are the git aliases that I use for diffing TSV and CSV files.
[alias]
wdiff = diff --word-diff=plain
wdiffc = diff --word-diff=color
wdiffcsv = diff --word-diff=color --word-diff-regex=[^,]+
See https://github.com/sjackman/dotfiles/blob/master/.gitconfig#L3-L5 |
👍 |
thanks for sharing @sjackman :) |
Nice, @sjackman! |
Very cool. So you mentioned |
I took some notes from our conversation today. Thanks for contributing to the workshop! okdistribute/knead#1 |
Good one @sjackman! Here's my little play session with trying out this technique...
# add alias to git's config
git config --global alias.diffcsv "diff --word-diff=color --word-diff-regex=[^,]+"
# initialize repo
git init test_csv; cd test_csv
# 1st commit of test csv
echo -e 'a,b,c\n1,2,3\n4,5,6' > x.csv; cat x.csv
git add x.csv; git commit -m 'initial csv'
# modify csv: b->c, c->d, 4->8
echo -e 'a,c,d\n1,2,3\n8,5,6' > x.csv; cat x.csv
# compare against previous commit
git diff x.csv
git diffcsv x.csv
# 2nd commit on modified csv: b->c, c->d, 4->8
git commit -a -m 'modified csv'
# modify csv: +e column with 0's (the value 2 in the first data row also becomes 4)
echo -e 'a,c,d,e\n1,4,3,0\n8,5,6,0' > x.csv
# compare against previous commit
git diffcsv x.csv
# 3rd commit on modified csv: +e column with 0's
git commit -a -m 'modified csv again'
# look at history of commits
git log
# compare between specific commits of the csv (swapping from your git log output)
git diffcsv 56515ac..97bfd69 -- x.csv |
daff works really well! |
Following up with @sjackman and @bbest's examples: I moved @bbest's script into R since for us this kind of visual differencing would need to be portable (i.e. to be able to share it with colleagues outside of your own terminal window). Unfortunately RStudio doesn't do the color differencing (would that even be possible?) and in fact the display is not useful. What would further options be? @Karissa? Examples and full R script below.
Comparing R translation of @bbest's bash script above
# add alias to git's config
system('git config --global alias.diffcsv "diff --word-diff=color --word-diff-regex=[^,]+"')
# initialize repo
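system('git init test_csv')
# system() runs each call in its own shell, so a `cd` there would not persist; change R's working directory instead
setwd('test_csv')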
# 1st commit of test csv
x = data.frame(a = c(1,4), b = c(2,5), c = c(3,6)); x
write.csv(x, 'x.csv', row.names = FALSE)
system("git add x.csv; git commit -m 'initial csv'")
# modify csv: b->c, c->d, 4->8
x = data.frame(a = c(1,8), c = c(2,5), d = c(3,6)); x
write.csv(x, 'x.csv', row.names = FALSE)
# compare against previous commit
system('git diff x.csv')
system('git diffcsv x.csv')
# 2nd commit on modified csv: b->c, c->d, 4->8
system("git commit -a -m 'modified csv'")
# modify csv: +e column with 0's
x = data.frame(a = c(1,8), c = c(2,5), d = c(3,6), e = c(0,0)); x
write.csv(x, 'x.csv', row.names = FALSE)
# compare against previous commit
system('git diffcsv x.csv')
# 3rd commit on modified csv: +e column with 0's
system("git commit -a -m 'modified csv again'")
# look at history of commits
system('git log')
# compare between specific commits of the csv (swapping from your git log output)
system('git diffcsv a4c1add0..5cf47e62 -- x.csv')
|
It's not as pretty, but you can use
|
It's possible that RStudio could render the ANSI colour code escape sequences. Certainly no harm in opening an issue with a feature request. |
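Until a console renders them, the colour codes can be stripped before the output is displayed. The following is a minimal sketch, assuming the escape sequences are of the simple `ESC[...m` colour form; `strip_ansi` is a hypothetical helper name, and the coloured line is made up for illustration rather than captured from git.

```shell
# Hypothetical helper: remove ANSI colour escape sequences (ESC [ ... m) from stdin.
strip_ansi() {
  esc=$(printf '\033')            # the literal ESC character
  sed "s/${esc}\[[0-9;]*m//g"
}

# Made-up coloured line that also carries plain-style [-old-]{+new+} markers.
printf '\033[31m[-4-]\033[m\033[32m{+8+}\033[m,5,6\n' | strip_ansi
```

Stripping only helps when the markers survive without colour. With `--word-diff=color` the colours are the only markers, so the `--word-diff=plain` form from the `wdiff` alias above is the one to use when the output has to be read as plain text.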
By the way, pushing the test_csv repo created above to https://github.com/bbest/test_csv and viewing in Google Chrome with CSVHub nicely renders the differences between the following csv commits:
And now for the comparisons using daff-style differencing (green add, red delete, blue modify) with the CSVHub Google Chrome extension:
|
One thing I've always wanted is a 'diff' type output for datasets (let's say tabular datasets for now). When I use git to manage projects, changes to the datasets I use are difficult to visualize using the standard diff output, which is line based. That works when rows are changed but not when columns are added/deleted or transformations are made. Is there a way to categorize the types of changes that can be made to a dataset and then visualize them in a useful way?
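As a starting point for categorizing changes, column additions and deletions can be detected from the header rows alone. The sketch below is one possible approach, not an existing tool; `column_changes`, `old.csv` and `new.csv` are hypothetical names, and it assumes the first line of each file is a header with no quoted commas.

```shell
# Hypothetical helper: list columns removed from (-) or added to (+) a CSV,
# judged by header names only.
# Usage: column_changes OLD.csv NEW.csv
column_changes() {
  old_cols=$(mktemp)
  new_cols=$(mktemp)
  # One column name per line, sorted, because comm requires sorted input.
  head -n 1 "$1" | tr ',' '\n' | sort > "$old_cols"
  head -n 1 "$2" | tr ',' '\n' | sort > "$new_cols"
  comm -23 "$old_cols" "$new_cols" | sed 's/^/- /'   # names only in the old header
  comm -13 "$old_cols" "$new_cols" | sed 's/^/+ /'   # names only in the new header
  rm -f "$old_cols" "$new_cols"
}

# Made-up example data for illustration.
printf 'a,b,c\n1,2,3\n4,5,6\n' > old.csv
printf 'a,c,d,e\n1,2,3,0\n8,5,6,0\n' > new.csv
column_changes old.csv new.csv
```

Because only names are compared, a renamed column shows up as one removal plus one addition; telling a rename apart from a genuine add/delete would require comparing the column contents as well.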