Hiya! I'm talking to Aaron Halfaker right now! We are thinking about using this again. Is this still an issue? He seems to remember you guys resolving this.
I believe it is, although the duplicates shouldn't be too many. Change "<=" in the last assertion in testSplitCompressed() to "==", and it won't pass (while it ideally should). According to the error I get there, the scale of duplicates looks like this: "expected: 93939, found: 93946".
The problem is in the way bzip2 files can be split: splits must be aligned to bzip2 blocks, which might end in the middle of a revision. To avoid losing any revisions, I had to implement the splitting so that some revisions are covered twice.
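
To illustrate the effect, here is a toy model in Python, not the project's actual reader. It assumes each split emits every record that overlaps its byte range, which is one way to avoid losing a record that straddles a block boundary. The record ids, byte offsets, and the function name `records_for_split` are all made up for illustration.

```python
def records_for_split(records, split_start, split_end):
    """Return ids of records overlapping the half-open range [split_start, split_end).

    `records` maps a hypothetical record id to its (start, end) byte offsets.
    """
    return [
        rid
        for rid, (start, end) in records.items()
        if start < split_end and end > split_start
    ]


# Four records; "r2" straddles the assumed block boundary at byte 100.
records = {"r1": (0, 40), "r2": (40, 110), "r3": (110, 150), "r4": (150, 200)}
splits = [(0, 100), (100, 200)]

emitted = []
for split_start, split_end in splits:
    emitted.extend(records_for_split(records, split_start, split_end))

print(emitted)                           # the straddling record appears twice
print(len(emitted), len(set(emitted)))   # emitted count vs. unique count
```

Under this policy no record is lost, but any record crossing a boundary is emitted by both neighbouring splits, which matches the small surplus seen in the test.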
It might make sense to solve this by adding another Hadoop job to the larger workflow to remove duplicates. (Looking back, I have a very vague memory of discussing a neater solution, but in any case it wasn't implemented in the end.)
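
A minimal sketch of what such a deduplication step could do, written in plain Python rather than as a Hadoop job. It assumes each revision carries a page id and a revision id that together identify it uniquely; the field names `page_id` and `rev_id` and the function `deduplicate_revisions` are hypothetical, not taken from the project.

```python
def deduplicate_revisions(revisions):
    """Yield each revision once, keeping the first occurrence of each key.

    Assumes every revision is a dict with hypothetical "page_id" and "rev_id"
    fields that together identify it uniquely.
    """
    seen = set()
    for revision in revisions:
        key = (revision["page_id"], revision["rev_id"])
        if key in seen:
            continue  # duplicate emitted by a neighbouring split
        seen.add(key)
        yield revision


revisions = [
    {"page_id": 1, "rev_id": 10, "text": "a"},
    {"page_id": 1, "rev_id": 11, "text": "b"},
    {"page_id": 1, "rev_id": 11, "text": "b"},  # duplicate around a split boundary
    {"page_id": 2, "rev_id": 12, "text": "c"},
]

unique = list(deduplicate_revisions(revisions))
print(len(revisions), len(unique))
```

In a MapReduce setting the same idea would presumably be expressed by keying the map output on the revision identifier and having the reducer emit a single value per key, so that no in-memory set of all keys is needed.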
Revisions around a page ending can be duplicated in the results when bzip2 input is used.