-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add vertex attributes #67
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current set of patches is a very good start. We need to improve the implementation in various places, but this may take some more time.
util-covariates.R
Outdated
start.end = get.range.bounds(range) | ||
|
||
if(what == "cumulative" || what == "all") { | ||
start.end[1] = as.POSIXct("1970-01-01 00:00:00", "GMT") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second argument is optional and is also pre-defined in other parts of the code (see also here). So, just omit the second argument here.
util-covariates.R
Outdated
@@ -0,0 +1,96 @@ | |||
## (c) Felix Prasse, 2017 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename the file to util-networks-covariates.R
as we clearly have a relation to the functionality in util-networks.R
.
util-covariates.R
Outdated
} | ||
|
||
if(what == "all") { | ||
start.end[2] = as.POSIXct("9999-01-01 00:00:00", "GMT") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment on Line 22.
util-covariates.R
Outdated
#' @param project.data The entire project date | ||
#' @param compute.attr The function to compute the attribute to add. Must return a named list | ||
#' with the names being the name of the vertex. | ||
#' @param cumulative Should the attribute be calculated from project start to end of range? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation is outdated. Please also add documentation for all other functions in this file.
util-covariates.R
Outdated
|
||
#' Utility function to compute vertex attributes for a list of network | ||
#' | ||
#' Only network ranges with dates work! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Important: This function only works for lists of networks which have timestamps used in their range names."
util-covariates.R
Outdated
|
||
|
||
add.mail.attribute = function(list.of.networks, data, what = c("range", "cumulative", "all")) { | ||
compute.vertex.attributes(list.of.networks, data, "email.addresses", what, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the names of the attributes (in this case, "email.addresses"
), we should keep to the names that are already used in the network configuration for the edge attributes or in the reading functions as column headers (e.g., here).
Thus, we would use "author.email"
as the name for the vertex attribute here.
util-covariates.R
Outdated
add.mail.attribute = function(list.of.networks, data, what = c("range", "cumulative", "all")) { | ||
compute.vertex.attributes(list.of.networks, data, "email.addresses", what, | ||
function(range.data, net) { | ||
author.to.mail = range.data$get.author2commit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is obsolete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is not obsolete, but we can use something like get.author2mail
instead. In the end, that does not make a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes. Simply overlooked this... 🤦♂️ Then, we should use range.data$get.author2mail()
in Line 54 to keep consistent with other functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I've got this right, get.autor2mail only works if the same author has at least authored a commit and an email on the mailing list. But couldn't it happen (like in the test data) that an author never wrote an email?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point. That can definitively happen. However, this problem can also appear the other way round: There are authors who wrote e-mails but never have committed.
Should we you use both data sources together and make a unique union then? That should not be a problem as the e-mail addresses used in the commit data and in the mail data are identical for each author.
One additional remark: Here we can definitely remove the what
parameter as the e-mail address of a developer cannot be different for different network ranges due to the disambiguation algorithms of Codeface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the misunderstanding before. I talked with Thomas, opened an issue for this topic (#69), implemented the changes (254d4de), and, finally, opened a pull request (#71).
I suggest -- and Thomas confirmed -- that we add the attribute independent of the existence of the new email
column in the author data (ProjectData$get.authors()
): if the column exists, add the real e-mail addresses; if not, add NA
as the vertex attribute. So, just wait for #71 to be merged and then rebase your branch and use the functionality.
util-covariates.R
Outdated
add.commit.count = function(list.of.networks, data, what = c("range", "cumulative", "all")) { | ||
compute.vertex.attributes(list.of.networks, data, "commit.count", what, function(range.data, net) { | ||
commit.count.df = get.author.commit.count(range.data)[c("author.name", "freq")] | ||
commit.count.list = structure(commit.count.df$freq, names = commit.count.df$author.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please access columns in data.frames via the square-bracket notation. Here, you would use commit.count.df["freq"]
and commit.count.df["author.name"]
.
Note: I will add a corresponding line to the contribution guide.
util-covariates.R
Outdated
add.artifact.count = function(list.of.networks, data, what = c("range", "cumulative", "all")) { | ||
compute.vertex.attributes(list.of.networks, data, "artifact.count", what, | ||
function(range.data, net) { | ||
return (lapply(range.data$get.author2artifact(), length)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bockthom: Do we want to have the amount of unique touched artifacts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we do. The idea is, to use the amount of touched artifacts as a predictor for a kind of "importance". So, basically, we are interested in how many different artifacts a developer has changed, but (currently) not in how often the developer has changed certain artifacts.
@flx5 Please make sure that you have a unique list of artifacts before determining the length.
@clhunsen If one is interested in the individual artifacts changed by a certain developer, we could provide an additional vertex attribute which consists of a list of artifacts and an individual change count for each changed artifact. However, I am not sure whether we should provide something like that, too.
util-covariates.R
Outdated
|
||
add.author.classification = function(list.of.networks, data, what = c("range", "cumulative", "all"), | ||
type = c("network.degree", "network.eigen", "commit.count", "loc.count")) { | ||
compute.vertex.attributes(list.of.networks, data, "classification", what, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As most classification functions get additional parameters, we need to pass a classification result here and not call the classification function. In this place, we need to discuss how to use the what
parameter then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the end, it would be nice to have both: a function that simply adds the attributes according a given classification, but also a wrapper function which takes all the possible parameters, calls the classification function and also calls the further function.
The what
parameter can be used in the wrapper function somehow then - but then we have to carefully specify additional options...
util-covariates.R
Outdated
function(range.data, net) { | ||
return (lapply(range.data$get.author2commit(), | ||
function(x) { | ||
return ( c(start = min(x$date), end = max(x$date)) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether we really need this kind of activity-range information. We already have the date of the first activity as an extra attribute. The date of the last activity is not that interesting than the one of the first activity.
However, the 'active ranges' attribute, as it is called in #34 and also the example implementation, should check in which (resp. how much) of the ranges (e.g., threemonth windows) a developer was active at all.
So, I would like to have a boolean vector stating whether a developer was active or not. Depending on the what
parameter, this vector covers just the current range (which is always TRUE
), the current and the past ranges, or all ranges.
@clhunsen Any additional comments on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not right now.
Thomas and I made a lot of comments, so we should restrict this PR to not get overwhelmed with all the implementation-related comments and problems: I suggest, we start with the basic set of simple count attributes here that are relatively easy to implement and add the additional set of more sophisticated attributes in a later PR. @flx5, just stick to the following set at first (subset from #34):
Additionally, only implement it for range-level and project-level statistics. We will have a look at the cumulative stuff later. This way, we may have a clearer look for the essentials and potential problems. If you want, you can open a fresh and clean PR for that. Your decision. |
To fix your tests, if I understood correctly, you just need to move the function |
Add methods required to add vertex attributes to the graph. Add method to calculate commit count as vertex attribute Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
- Rename aggregation to aggregation.level and all to project - Rename add.vertex.attributes.* methods to add.vertex.attribute.* - Rename compute.vertex.attribute.with.n2r to add.vertex.attribute - Rename compute.vertex.attribute to split.and.add.vertex.attribute - Move networks.to.ranges to util-split and rename it to split.data.by.networks - Remove explicit return statement where possible - Add missing documentation Signed-off-by: Felix Prasse <[email protected]>
…nd issues Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Add test for add.vertex.attribute Add test for split.and.add.vertex.attribute Add test for add.vertex.attribute.commit.count Add test for add.vertex.attribute.author.email Add test for add.vertex.attribute.artifact.count Add test for add.vertex.attribute.first.activity Signed-off-by: Felix Prasse <[email protected]>
The tests were referencing to a wrong project data variable Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Add test for add.vertex.attribute.author.role.simple Add test for add.vertex.attribute.active.ranges Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
Add artifact attributes Add commit count based on committer Signed-off-by: Felix Prasse <[email protected]>
Signed-off-by: Felix Prasse <[email protected]>
If I didn't forget anything this should be complete. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional to the fact that much inline documentation should be added and the indentation fixed in as many places, the overall impression is really good. I found only one major bug: two function with the same name, resulting from a copy-paste action, obviously.
@flx5: Do you want to work to resolve my comments? Otherwise, I will take it.
nets.with.attr = split.and.add.vertex.attribute(list.of.networks, project.data, name, aggregation.level, default.value, | ||
function(range.data, net) { | ||
artifact.to.commit = get.key.to.value.from.df(range.data$get.commits.filtered.empty(), "artifact", "hash") | ||
artifact.change.count = lapply(artifact.to.commit, nrow) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When calling get.key.to.value.from.df
, we do not call unique
on the column indicated by value
parameter (here, "hash"
). Consequently, we should call unique
before nrow
in the lapply
function here.
Under the assumption that a commit is a cohesive change, a feature should be seen as changed only once although it might been touched in several files within a single commit.
@@ -451,6 +453,50 @@ get.commit.count.threshold = function(range.data) { | |||
return(threshold) | |||
} | |||
|
|||
#' Get the commit count per comitter in the given range data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the documentation here, we should state that we only count the commits of committers iff they are not the author.
#' @param range.data The data to count on | ||
#' @return A data frame in descending order by the commit count. | ||
get.committer.commit.count = function(range.data) { | ||
logging::logdebug("get.contributer.commit.count: starting.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change the tracing comment to: "get.committer.commit.count: starting."
.
## Execute a query to get the commit count per author | ||
res = sqldf::sqldf("select *, COUNT(*) as `freq` from `commits.df` group by `committer.name` order by `freq` desc") | ||
|
||
logging::logdebug("get.author.commit.count: finished.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change the tracing comment to: "get.committer.commit.count: finished."
.
} | ||
|
||
## Execute a query to get the commit count per author | ||
res = sqldf::sqldf("select *, COUNT(*) as `freq` from `commits.df` group by `committer.name` order by `freq` desc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should capitalize all SQL keywords as in the other sqldf
statements.
start.end = get.range.bounds(range) | ||
|
||
if(aggregation.level == "cumulative" || aggregation.level == "project") { | ||
start.end[1] = as.POSIXct("1970-01-01 00:00:00") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should better use list.of.ranges[1]
here as the list of networks can be a subset of all networks obtained from the (complete) project.data
object.
CF.SELECTION.PROCESS = "testing" | ||
CASESTUDY = "test" | ||
ARTIFACT = "feature" | ||
AGGREGATION.LEVELS = c("range", "cumulative", "project") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we define this somewhere in the main files (and not here) as we may want to use it also somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, I do not see any advantages in defining the aggregation level somewhere else -- we only use it for the node attributes. Or would you like to avoid the duplicates in the parameters of the vertex-attribute functions? We could do that and use a constant for the aggregation levels, but then the functions' parameters would be less intuitive, I guess.
networks.with.attr = add.vertex.attribute.commit.count.author(networks.and.data[["networks"]], | ||
networks.and.data[["project.data"]], | ||
aggregation.level = level) | ||
actual.attributes = lapply(networks.with.attr, function(net) igraph::vertex_attr(net, "commit.count")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use igraph::get.vertex.attribute
, here and anywhere else.
lapply(AGGREGATION.LEVELS, function(level) { | ||
networks.with.attr = add.vertex.attribute.commit.count.author(networks.and.data[["networks"]], | ||
networks.and.data[["project.data"]], | ||
aggregation.level = level) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation messed up.
network.covariates.test.build.expected = function(a,b,c) { | ||
arguments = list(a,b,c) | ||
|
||
names(arguments) = c('2016-07-12 15:00:00-2016-07-12 16:00:00', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bins should be derived dynamically. The solution would be to make mybins
in Line 37 a public constant and call construct.ranges
here.
@clhunsen I can fix those issues after my exams in two to three weeks. |
That would be great as I will be busy correcting exams. 😄 Thanks. |
I am working on this right now with the help of Thomas. I will open a separate pull request soon. |
This adds vertex attributes as discussed in #34 .
Usage example:
Keep in mind that this is an early PR lacking proper documentation.