Skip to content

Commit

Permalink
Removed 58 lines of complex and duplicated lines from dogroups.c deal…
Browse files Browse the repository at this point in the history
…ing with .BY and list columns. Removed switch on R 3.1.0. Added simpler code to central memrecycle. Turned back on test 1691. All tests now pass R 3.0.0 (pre R 3.1.0) and R 3.3.1. Closes #1270. Simplifies #481.
  • Loading branch information
mattdowle committed Oct 18, 2016
1 parent 284404f commit d0f8dda
Show file tree
Hide file tree
Showing 4 changed files with 39 additions and 70 deletions.
4 changes: 2 additions & 2 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -185,7 +185,7 @@

44. `fread` respects order of columns provided to argument `select` in result, and also warns if the column(s) provided is not present, [#1445](https://github.com/Rdatatable/data.table/issues/1445).

45. `DT[, .BY, by=x]` and other variants of adding a column using `.BY` is handled correctly in R v3.1.0+, [#1270](https://github.com/Rdatatable/data.table/issues/1270).
45. `DT[, .BY, by=x]` and other variants of adding a column using `.BY` are now handled correctly, [#1270](https://github.com/Rdatatable/data.table/issues/1270).

46. `as.data.table.data.table()` method checks and restores over-allocation, [#473](https://github.com/Rdatatable/data.table/issues/473).

Expand Down Expand Up @@ -838,7 +838,7 @@
31. Fixed an edge case in `DT[order(.)]` internal optimisation to be consistent with base. Closes [#696](https://github.com/Rdatatable/data.table/issues/696). Thanks to Michael Smith and Garrett See for reporting.
32. `DT[, list(list(.)), by=.]` and `DT[, col := list(list(.)), by=.]` returns correct results in R >=3.1.0 as well. The bug was due to recent (welcoming) changes in R v3.1.0 where `list(.)` does not result in a *copy*. Closes [#481](https://github.com/Rdatatable/data.table/issues/481). Also thanks to KrishnaPG for filing [#728](https://github.com/Rdatatable/data.table/issues/728).
32. `DT[, list(list(.)), by=.]` and `DT[, col := list(list(.)), by=.]` now return correct results in R >= 3.1.0. The bug was due to a welcome change in R 3.1.0 where `list(.)` no longer copies. Closes [#481](https://github.com/Rdatatable/data.table/issues/481). Also thanks to KrishnaPG for filing [#728](https://github.com/Rdatatable/data.table/issues/728).
33. `dcast.data.table` handles `fun.aggregate` argument properly when called from within a function that accepts `fun.aggregate` argument and passes to `dcast.data.table()`. Closes [#713](https://github.com/Rdatatable/data.table/issues/713). Thanks to mathematicalcoffee for reporting [here](http://stackoverflow.com/q/24542976/559784) on SO.
Expand Down
8 changes: 4 additions & 4 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -7694,10 +7694,10 @@ test(1618.4, fread("a,c,b\n1,2,3", select=c(2:3)), data.table(c=2L, b=3L))
test(1618.5, fread("a,c,b\n1,2,3", select=c("b", "c"), col.names=c("q", "r")), data.table(q=3L, r=2L))
test(1618.6, fread("a,c,b\n1,2,3", select=c("b", "z")), data.table(b=3L), warning="Column name 'z' not found.*skipping")

# fix for #1270 - uncomment when #1270 re-fixed, so it can pass R 2.15.0
#DT = data.table(x=1:2, y=5:6)
#test(1619.1, DT[, .BY, by=x]$BY, as.list(1:2))
#test(1619.2, DT[, bycol := .BY, by=x]$bycol, as.list(1:2))
# fix for #1270. Have been problems with R before vs after 3.1.0 here. But now ok in all R versions.
DT = data.table(x=1:2, y=5:6)
test(1619.1, DT[, .BY, by=x]$BY, as.list(1:2))
test(1619.2, DT[, bycol := .BY, by=x]$bycol, as.list(1:2))

# fix for #473
DT = data.frame(x=1, y=2)
Expand Down
30 changes: 28 additions & 2 deletions src/assign.c
Original file line number Diff line number Diff line change
Expand Up @@ -472,7 +472,7 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
vlen = length(thisvalue);
if (length(rows)==0 && targetlen==vlen && (vlen>0 || nrow==0)) {
if ( NAMED(thisvalue)==2 || // set() protects the NAMED of atomic vectors from .Call setting arguments to 2 by wrapping with list
(TYPEOF(values)==VECSXP && i>LENGTH(values)-1)) { // recycled RHS would have columns pointing to others, #2298.
(TYPEOF(values)==VECSXP && i>LENGTH(values)-1)) { // recycled RHS would have columns pointing to others, #185.
if (verbose) {
if (NAMED(thisvalue)==2) Rprintf("RHS for item %d has been duplicated because NAMED is %d, but then is being plonked.\n",i+1, NAMED(thisvalue));
else Rprintf("RHS for item %d has been duplicated because the list of RHS values (length %d) is being recycled, but then is being plonked.\n", i+1, length(values));
Expand Down Expand Up @@ -709,16 +709,41 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
return(dt); // needed for `*tmp*` mechanism (when := isn't used), and to return the new object after a := for compound syntax.
}

static Rboolean anyNamed(SEXP x) {
if (NAMED(x)) return TRUE;
if (isNewList(x)) for (int i=0; i<LENGTH(x); i++)
if (anyNamed(VECTOR_ELT(x,i))) return TRUE;
return FALSE;
}

void memrecycle(SEXP target, SEXP where, int start, int len, SEXP source)
// like memcpy but recycles source and takes care of aging
// 'where' a 1-based INTEGER vector subset of target to assign to, or NULL or integer()
// assigns to target[start:start+len-1] or target[where[start:start+len-1]] where start is 0-based
{
int r = 0, w;
int r=0, w, protecti=0;
if (len<1) return;
int slen = length(source) > len ? len : length(source); // fix for 5647. when length(source) > len, slen must be len.
if (slen<1) return;
if (TYPEOF(target) != TYPEOF(source)) error("Internal error: TYPEOF(target)['%s']!=TYPEOF(source)['%s']", type2char(TYPEOF(target)),type2char(TYPEOF(source)));
if (isNewList(source)) {
// A list() column; i.e. target is a column of pointers to SEXPs rather than the much more common case
// where memrecycle copies the DATAPTR data to the atomic target from the atomic source.
// If any item within the list is NAMED then take a fresh copy. So far this has occurred from dogroups.c when
// j returns .BY or similar specials as-is within a list(). Those specials are static inside
// dogroups so if we don't copy now the last value written to them by dogroups becomes repeated in the result;
// i.e. the wrong result.
// If source is itself recycled later (many list() column items pointing to the same object) we are ok with that
// since we now have a fresh copy and := will not assign with a list() column's cell value; := only changes the
// SEXP pointed to.
// If source is already not named (because j already created a fresh unnamed vector within a list()) we don't want to
// duplicate unnecessarily, hence checking for named rather than duplicating always.
// See #481 and #1270
if (anyNamed(source)) {
source = PROTECT(duplicate(source));
protecti++;
}
}
size_t size = SIZEOF(target);
if (!length(where)) {
switch (TYPEOF(target)) {
Expand Down Expand Up @@ -794,6 +819,7 @@ void memrecycle(SEXP target, SEXP where, int start, int len, SEXP source)
}
// if slen>10 it may be worth memcpy, but we'd need to first know if 'where' was a contiguous subset
}
UNPROTECT(protecti);
}

SEXP allocNAVector(SEXPTYPE type, R_len_t n)
Expand Down
67 changes: 5 additions & 62 deletions src/dogroups.c
Original file line number Diff line number Diff line change
Expand Up @@ -30,10 +30,6 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
SEXP names, names2, xknames, bynames, dtnames, ans=NULL, jval, thiscol, SDall, BY, N, I, GRP, iSD, xSD, rownames, s, RHS, listwrap, target, source, tmp;
SEXP *nameSyms, *xknameSyms;
Rboolean wasvector, firstalloc=FALSE, NullWarnDone=FALSE, recycleWarn=TRUE;
#if defined(R_VERSION) && R_VERSION >= R_Version(3, 1, 0)
SEXP dupcol;
int named=0;
#endif
size_t size; // must be size_t, otherwise bug #5305 (integer overflow in memcpy)
clock_t tstart=0, tblock[10]={0}; int nblock[10]={0};

Expand All @@ -50,7 +46,6 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
SDall = findVar(install(".SDall"), env);

defineVar(install(".BY"), BY = allocVector(VECSXP, ngrpcols), env);
SET_NAMED(BY, 2); // for #1270
bynames = PROTECT(allocVector(STRSXP, ngrpcols)); protecti++; // TO DO: do we really need bynames, can we assign names afterwards in one step?
for (i=0; i<ngrpcols; i++) {
j = INTEGER(grpcols)[i]-1;
Expand Down Expand Up @@ -292,41 +287,18 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
// fix for #4990 - `:=` did not issue recycling warning during "by" operation.
if (vlen<grpn && vlen>0 && grpn%vlen != 0)
warning("Supplied %d items to be assigned to group %d of size %d in column '%s' (recycled leaving remainder of %d items).",vlen,i+1,grpn,CHAR(STRING_ELT(dtnames,INTEGER(lhs)[j]-1)),grpn%vlen);
// fix for issues/481 for := case
// missed it in commit: https://github.com/Rdatatable/data.table/commit/86276f48798491d328caa72f6ebcce4d51649440
// see that link (or scroll down for the non := version) for comments
#if defined(R_VERSION) && R_VERSION >= R_Version(3, 1, 0)
named=0;
if (isNewList(RHS)) {
if (NAMED(RHS) == 2) { named=2; RHS = PROTECT(duplicate(RHS)); } // for #1270
else {
dupcol = VECTOR_ELT(RHS, 0);
named = NAMED(dupcol);
while(isNewList(dupcol)) {
if (named == 2) break;
else {
dupcol = VECTOR_ELT(dupcol, 0);
named = NAMED(dupcol);
}
}
if (named == 2) RHS = PROTECT(duplicate(RHS));
}
}
memrecycle(target, order, INTEGER(starts)[i]-1, grpn, RHS);
if (named == 2) UNPROTECT(1);
#else

memrecycle(target, order, INTEGER(starts)[i]-1, grpn, RHS);
#endif

// fixes bug #2531. Got to set the class back. See comment below for explanation. This is the new fix. Works great!
// Also fix for #5437 (bug due to regression in 1.9.2+)
copyMostAttrib(RHS, target); // not names, otherwise test 778 would fail
copyMostAttrib(RHS, target); // not names, otherwise test 778 would fail.
/* OLD FIX: commented now. The fix below resulted in segfault on factor columns because I dint set the "levels"
Instead of fixing that, I just removed setting class if it's factor. Not appropriate fix.
Correct fix of copying all attributes (except names) added above. Now, everything should be alright.
Test 1144 (#5104) will provide the right output now. Modified accordingly.
OUTDATED: if (!isFactor(RHS)) setAttrib(target, R_ClassSymbol, getAttrib(RHS, R_ClassSymbol));
OUTDATED: // added !isFactor(RHS) to fix #5104 (side-effect of fixing #2531) */
OUTDATED: // added !isFactor(RHS) to fix #5104 (side-effect of fixing #2531)
See also #155 and #36 */

}
UNPROTECT(1);
continue;
Expand Down Expand Up @@ -444,36 +416,7 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
warning("Column %d of result for group %d is length %d but the longest column in this result is %d. Recycled leaving remainder of %d items. This warning is once only for the first group with this issue.",j+1,i+1,thislen,maxn,maxn%thislen);
recycleWarn = FALSE;
}
// fix for issues/481
#if defined(R_VERSION) && R_VERSION >= R_Version(3, 1, 0)
// added version because, for ex: DT[, list(list(unique(y))), by=x] gets duplicated
// because unique(y) returns NAMED(2). So, do it only if v>= 3.1.0. If <3.1.0,
// it gets duplicated anyway, so avoid copying twice!
named=0;
if (isNewList(source)) {
if (NAMED(source) == 2) { named=2; source = PROTECT(duplicate(source)); } // for #1270
else {
dupcol = VECTOR_ELT(source, 0);
named = NAMED(dupcol);
while(isNewList(dupcol)) {
// while loop basically peels each list() layer one by one until there's no
// list() wrapped anymore. Ex: consider DT[, list(list(list(sum(y)))), by=x] -
// here, we don't need to duplicate, but we won't know that until we reach
// 'sum(y)' and know that it's NAMED() != 2.
if (named == 2) break;
else {
dupcol = VECTOR_ELT(dupcol, 0);
named = NAMED(dupcol);
}
}
if (named == 2) source = PROTECT(duplicate(source));
}
}
memrecycle(target, R_NilValue, thisansloc, maxn, source);
if (named == 2) UNPROTECT(1);
#else
memrecycle(target, R_NilValue, thisansloc, maxn, source);
#endif
}
ansloc += maxn;
if (firstalloc) {
Expand Down

0 comments on commit d0f8dda

Please sign in to comment.