-
Notifications
You must be signed in to change notification settings - Fork 997
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Join vignette #6478
Join vignette #6478
Conversation
vignettes/datatable-joins.Rmd
Outdated
) | ||
``` | ||
|
||
In this vignette you will learn how to perform any join operation using next resources available in the `data.table` syntax. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete "next"
set.seed(5415) | ||
|
||
ProductSales = data.table( | ||
id = 1:10, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here there are ten rows which may be overly complex for a first demonstration example
could it be reduced to two rows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be challenging, as I would need to modify the examples below, since they are related. However, I do not expect the user to pay much attention to that detail.
We can consider avoiding the table display.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is ok to keep all ten rows if necessary, but smaller is better in terms of learning examples usually.
vignettes/datatable-joins.Rmd
Outdated
``` | ||
x[i, on, nomatch] | ||
| | | | | ||
| | | ----> If NULL only returns rows linked in x and i tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use \ and _ instead of - ?
|
\__
vignettes/datatable-joins.Rmd
Outdated
| | | ----> If NULL only returns rows linked in x and i tables | ||
| | ----> a character vector o list defining match logict | ||
| ----> principal data.table, list or data.frame | ||
----> secundary data.table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
secondary
vignettes/datatable-joins.Rmd
Outdated
| | | | | ||
| | | ----> If NULL only returns rows linked in x and i tables | ||
| | ----> a character vector o list defining match logict | ||
| ----> principal data.table, list or data.frame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
primary instead of principal?
also where do these names, principal/secondary come from? I don't see them in ?data.table so I would suggest to keep the standard/docmented names, i
and x
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Toby, thanks for pointing this out, I will change it from principal to primary.
I am trying to keep using the documented reference.
I just thought it could be useful for a new user to understand that the i
table is more important in defining the number of rows to return, regardless of whether they are trying to make a left or right join.
vignettes/datatable-joins.Rmd
Outdated
----> secundary data.table | ||
``` | ||
|
||
An important difference between the regular `data.table` syntax is that _the only argument you can pass by position is the `i` argument_, the rest as you will see are going to be passed by name, so **feel free to change the argument order any time using the argument names** if it seems more convenient. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not true. In R you can pass any argument by name or position. on is 15th so it is difficult, but possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand your point.
Here is the new text
Please keep in mind that the standard argument order in data.table is
dt[i, j, by]
. For join operations, it is recommended to pass theon
andnomatch
arguments by name to avoid usingj
andby
when they are not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand "to avoid using j and by when they are not needed." can you please clarify?
Here is my attempt to clarify. Is this similar to what you meant? "For join operations using the data table square brackets, the first argument (i
=table or names) is often specified as a positional argument, and the on
and nomatch
arguments are often specified by name, for example: dt1[dt2, on="variable", nomatch=0L]
."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is a demonstration that it is possible to use on argument by position instead of name (I don't think this is a great idea to do in practice though haha)
> dt1=data.table(x=1:2, y=3:4)
> dt2=data.table(x=2)
> dt1[dt2,,,,,,,,,,,,,,"x"]
x y
<int> <int>
1: 2 4
vignettes/datatable-joins.Rmd
Outdated
|
||
### 3.1. Right join | ||
|
||
Use this method if you need to combine columns from 2 tables based on one or more references but ***keeping all rows present in the table located on the right***. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change right to i
? or at least mention i
in addition to right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "table in the square brackets" or "i
argument" should be mentioned in addition to "table located on the right"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change right to
i
? or at least mentioni
in addition to right
I think it can be confusing as the left join it's also an i
join, just exchanging the position of each table
vignettes/datatable-joins.Rmd
Outdated
|
||
```{r} | ||
Products[ProductReceived, | ||
on = c("id" = "product_id")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"id" quotes not necessary
vignettes/datatable-joins.Rmd
Outdated
Our recommendation is to use the second alternative if possible, as it is **faster** and uses **less memory** than the first one. | ||
|
||
|
||
##### 3.1.3.1. Managing shared column Names with the j argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
four levels of section numbering (3.1.3.1) is difficult to understand. can it be kept to 2 or 3 please? (3.1) or (3.1.3)
vignettes/datatable-joins.Rmd
Outdated
dt1 = | ||
ProductReceived[Products, | ||
on = c("product_id" = "id"), | ||
by = .EACHI, | ||
j = .(total_value_received = sum(price * count))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please consider alternative whitespace formatting below.
when each argument of square brackets is on its own line, each argument can be commented easily
dt1 = ProductReceived[
Products,
on = c("product_id" = "id"),
by = .EACHI,
j = .(total_value_received = sum(price * count))
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this format.
vignettes/datatable-joins.Rmd
Outdated
|
||
```r | ||
DT[ ... | ||
][ ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DT[
...
][
...
][
...
]
vignettes/datatable-joins.Rmd
Outdated
] | ||
``` | ||
|
||
So far, if after applying all that operations **we want to join new columns without removing any row**, we would need to stop the chaining process, save a temporal table and later apply the join operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
temporal -> temporary
vignettes/datatable-joins.Rmd
Outdated
allow.cartesian = TRUE] | ||
``` | ||
|
||
> `allow.cartesian` is defaulted to FALSE as this joins can can lead to a very large number of rows in the result.For example, if Table A has 100 rows and Table B has 50 rows, their Cartesian product would result in 5000 rows (100 * 50). This can quickly become memory-intensive for large datasets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add space after period (.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the highly significant contribution!
looks great overall, I left a few comments to think about.
is this vignette candidate too for translations ? |
After getting the approval, I can make a version in Spanish as it is my native language. |
Some tweaks. Related: Rdatatable#6478 Rdatatable#2181
Hi @AngelFelizR this is a highly non-trivial contribution of documentation, so can you please add your name to DESCRIPTION as contributor? After that I will merge. Thanks for your revisions! |
Thanks Toby |
great thanks! |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #6478 +/- ##
=======================================
Coverage 98.62% 98.62%
=======================================
Files 79 79
Lines 14450 14450
=======================================
Hits 14251 14251
Misses 199 199 ☔ View full report in Codecov by Sentry. |
also I invited you as a project member, so after you accept, please create new branches/PRs under Rdatatable org, instead of your fork. |
This pull request is my solution to issue #2181