Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix naming in j=c() under by= queries with lapply() optimization #4883

Merged
merged 33 commits into from
Sep 6, 2024

Conversation

myoung3
Copy link
Contributor

@myoung3 myoung3 commented Jan 31, 2021

fixes #2311

@codecov
Copy link

codecov bot commented Jan 31, 2021

Codecov Report

Merging #4883 (819beaf) into master (7f0ce14) will increase coverage by 0.07%.
The diff coverage is 100.00%.

❗ Current head 819beaf differs from pull request most recent head 1c021f0. Consider uploading reports for the commit 1c021f0 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4883      +/-   ##
==========================================
+ Coverage   99.38%   99.46%   +0.07%     
==========================================
  Files          77       73       -4     
  Lines       14506    14427      -79     
==========================================
- Hits        14417    14350      -67     
+ Misses         89       77      -12     
Impacted Files Coverage Δ
R/data.table.R 99.94% <100.00%> (-0.01%) ⬇️
src/fastmean.c 96.82% <0.00%> (-3.18%) ⬇️
src/uniqlist.c 98.26% <0.00%> (-1.27%) ⬇️
src/fmelt.c 99.05% <0.00%> (-0.95%) ⬇️
src/between.c 99.21% <0.00%> (-0.79%) ⬇️
src/subset.c 99.50% <0.00%> (-0.50%) ⬇️
src/frollR.c 99.53% <0.00%> (-0.47%) ⬇️
src/forder.c 99.61% <0.00%> (-0.39%) ⬇️
src/dogroups.c 99.66% <0.00%> (-0.34%) ⬇️
src/assign.c 99.70% <0.00%> (-0.15%) ⬇️
... and 52 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7f0ce14...1c021f0. Read the comment docs.

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some tests on a prefix having dot in its name would be useful, as well as prefix with a whitespace inside.

@myoung3
Copy link
Contributor Author

myoung3 commented Feb 1, 2021

@jangorecki Good idea. I'll add those and regorganize the tests a bit. I also kind of want to run these all twice, once for option optimize=0 and once for optimize=3. Optimization is really the culprit here (particularly the difference between optimize=0 and optimize=1) and we want to be sure everything behaves the same regardless of optimization level. Is there a good way to loop through options, rerunning the same tests? Or should I just copy/paste then manually edit the test numbers?

@jangorecki
Copy link
Member

Copy paste will be fine, if there are many tests you can wrap them in { block so it is easier to fold and see where branches starts and finishes.

@myoung3
Copy link
Contributor Author

myoung3 commented Feb 2, 2021

I need to look into nested concatenation. From the current build:

library(data.table)
M <- as.data.table(mtcars)
names( M[, c(list(mpg,b=hp),lapply(.SD, mean)), by="cyl", .SDcols=c("vs", "am")])
#> [1] "cyl" "V1"  "b"   "vs"  "am"
names( M[, c(list(mpg,b=hp),c(lapply(.SD, mean))), by="cyl", .SDcols=c("vs", "am")])
#> [1] "cyl" ""    "b"   "vs"  "am"

@myoung3
Copy link
Contributor Author

myoung3 commented Feb 20, 2021

@jangorecki can you take a look at this? I added tests with "." and " " and seem to have fixed the nested c(c()) inconsistency.

Michael Young added 2 commits February 19, 2021 22:33
…iable name for the column created by x[1]. previously this could be an empty string column name in some circumstances
@myoung3
Copy link
Contributor Author

myoung3 commented Feb 20, 2021

@jangorecki yep, it's added. See 2164.019 and 2164.119

@myoung3
Copy link
Contributor Author

myoung3 commented Mar 27, 2021

@jangorecki Hey Jan is there any more testing you think should be done on this? This is ready from my perspective but if you or others have further suggestions I'm happy to implement them.

@legendre6891
Copy link

@jangorecki Just another very happy user of data.table chiming in: it would be really great to have this PR in the next release. I hit upon this problem this PR solves quite often.

@grantmcdermott
Copy link
Contributor

@myoung3 Any chance you could fix those merge conflicts (and hopefully trigger another look by Jan or other DT core member)?

For my part, I'll say that this PR looks to go on my local system. All of the new features in the latest dev version of data.table are fantastic... but this one is high up on my wish list. I'm hoping it makes the cut before CRAN submission time ;-)

@MichaelChirico MichaelChirico added this to the 1.14.1 milestone Aug 31, 2021
@MichaelChirico
Copy link
Member

Hi @myoung3, would you like to update this PR for inclusion in the next release?

@myoung3
Copy link
Contributor Author

myoung3 commented Feb 28, 2024

@MichaelChirico Yeah I can probably take this on. It will take some time to reacquaint myself with the changes since the naming code is a bit esoteric. What is the release timeline?

What are your thoughts on whether this change needs to be phased in gradually? See @jangorecki's comment above. On the one hand, it will change how columns are named which could break code. On the other hand, columns are currently named inconsistently (iirc, depending on whether there is a by statement) to the point of naming in this context being bugged, and I'm not sure a bug fix would warrant gradual introduction over multiple versions.

@MichaelChirico
Copy link
Member

MichaelChirico commented Feb 28, 2024

What are your thoughts on whether this change needs to be phased in gradually?

Sorry, I haven't read anything carefully yet, I'll give you a better answer when I return to review (ideally after it's updated to current master to get the cleanest understanding of the PR possible).

What is the release timeline?

O(months), no urgent rush. Per the new GOVERNANCE doc:

* Regular CRAN releases should ideally occur twice per year, and can include new features.

We are looking at roughly July.

@MichaelChirico MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
@TysonStanley
Copy link
Member

Hi @myoung3 ! Thanks for this PR. Looks like this is a highly requested feature. Some minor conflicts for DESCRIPTION, NEWS and tests are left. @MichaelChirico thoughts on phasing this one in? Feel like this does fix a bug that ppl should have avoided in practice (but it will likely break in some places).

@MichaelChirico MichaelChirico changed the title Named lapply Fix naming in j=c() under by= queries with lapply() optimization Sep 6, 2024
@MichaelChirico MichaelChirico self-requested a review as a code owner September 6, 2024 06:42
@MichaelChirico
Copy link
Member

Thanks for the PR! I've invited you to join Rdatatable/data.table as a member. That should make it easier to contribute+collaborate going forward, in particular by publishing branches directly to this repo.

@MichaelChirico
Copy link
Member

Re: possibility of breaking change -- lets see what revdeps tells us.

Copy link
Member

@MichaelChirico MichaelChirico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test suite looks good, new behavior looks correct, consistent, especially a huge improvement for the ability to do c(sum=lapply(.SD, sum), max=lapply(.SD, max)) and get unique names out of it just like that. Great work!

The code in [.data.table is a bit messy, but that region is already a huge mess that you've added to only marginally. It should be refactored for readability with some helpers, later.

@MichaelChirico
Copy link
Member

atime failure is unrelated, #6481 is for that.

@MichaelChirico MichaelChirico merged commit cf05b4a into Rdatatable:master Sep 6, 2024
7 of 8 checks passed
@grantmcdermott
Copy link
Contributor

Was just wishing for this functionality again, whilst working a new project this week. Huzzah!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

let j = c(prefix = lapply(.SD, f)) work when optimized
7 participants