Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce issue-based artifact-network creation, new tests and fix current tests + minor bug fixes #244

Merged
merged 20 commits into from
Dec 20, 2023

Conversation

maxloeffler
Copy link
Contributor

@maxloeffler maxloeffler commented Sep 5, 2023

Prerequisites

  • I adhere to the coding conventions (described here) in my code.
  • I have updated the copyright headers of the files I have modified.
  • I have written appropriate commit messages, i.e., I have recorded the goal, the need, the needed changes, and the location of my code modifications for each commit. This includes also, e.g., referencing to relevant issues.
  • I have put signed-off tags in all commits.
  • I have updated the changelog file NEWS.md appropriately.
  • I have checked whether I need to adjust the showcase file showcase.R with respect to my changes.
  • The pull request is opened against the branch dev.

Changelog

Added

  • Add issue-based artifact-networks
  • Add a new split.data.by.bins function (not to be confused with a previously existing function that had the same name and was renamed in this context), which splits data based on given activity-based bins

Changed/Improved

  • Enhance testing data by adding add_link and referenced_by issue events which connect issues to form edges in issue-based artifact-networks
  • Add input validation for the bins parameter in split.data.time.based and split.data.by.bins
  • Rename split.data.by.bins into split.dataframe.by.bins as this it what it does

Fixed

  • Reformat event.info.1 column of issue data according to the <issue-%source-%id> format, if the content of the event.info.1 field references another issue
  • Fix an issue in activity-based splitting where elements close to the border of bins might be assigned to the wrong bin. The issue was caused by the usage of split.data.time.based inside split.data.activity.based to split data into the previously derived bins, when elements close to bin borders share the same timestamps. It is fixed by replacing split.data.time.based by split.data.by.bins
  • Remove the last range when using a sliding-window approach and the last range's elements are fully contained in the second last range
  • Rename vertex attribute IssueEvent to Issue in multi-networks, to be consistent with bipartite-networks

Introduce edge construction into the 'get.artifact.network.issue' function.
Connect 'add_link' and 'referenced_by' issue-events by an edge.

This works towards se-sic#239.

Signed-off-by: Maximilian Löffler <[email protected]>
Add new 'add_link' and 'referenced_by' issue events to the testing data to
allow for new tests. Add a test for the construction of an issue-based
artifact-network with 'issues.only.comments' turned on and off.

This works towards se-sic#239.

Signed-off-by: Maximilian Löffler <[email protected]>
This works towards se-sic#239.

Signed-off-by: Maximilian Löffler <[email protected]>
For some issue-events the 'event.info.1' column is to be interpreted as
a reference to an issue. In this case it is more consistent to reformat
the column entry into the <issue-[issue.source]-[event.info.1]> format.

This works towards se-sic#239.

Signed-off-by: Maximilian Löffler <[email protected]>
To be consistent with bipartite networks it is necessary to rename the
vertex attribute 'IssueEvent' to 'Issue' in multi-networks.

This works towards se-sic#239.

Signed-off-by: Maximilian Löffler <[email protected]>
Modify 'split.data.time.based' to be able to split by activity-based bins.
Rename the function to 'split.data.by.time.or.bins'. Introduce wrapper
functions 'split.data.by.bins.vector' and 'split.data.time.based' to call
'split.data.by.time.or.bins'.

Add 'include.duplicate.ids' parameter in 'split.get.bins.activity.based'
to obtain bins covering all data elements from 'df' by which the split
is being performed, regardless of the elements ids uniqueness.

In 'split.data.activity.based', after calculating the bins to place data
elements into, replace the time-based splitting by
'split.data.by.bins.vector'. Time-based splitting is incorrect for the
case that the date of the last element in a bin is the same as the date
of the first element of the next bin.

Adjust calculation of 'offset.end' in 'split.data.activity.based' to fix
a bug where because of a short last window the end offset would cross
the border of the last window, overlapping into the second last. Because
of this overlap the last sliding windows would not be calculated as expected.

This works towards se-sic#239.

Signed-off-by: Maximilian Löffler <[email protected]>
@maxloeffler maxloeffler changed the title Dev Introduce issue-based artifact-network creation, new tests and fix current tests + minor bug fixes Sep 5, 2023
@bockthom bockthom added this to the v4.4 milestone Sep 5, 2023
@bockthom bockthom requested review from bockthom and hechtlC September 5, 2023 10:37
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MaLoefUDS for your work on this PR and fixing all the bugs you encountered during adjusting all the test data. Your changes look good to me, I only have a couple of questions regarding some implementation details, some of which need further investigations prior to discussing it - see my comments below.

Notice: I did not have a look at the adjusted tests yet, and I also will need another pass on the changes to the splitting functionality to be able to completely understand all of them...

util-networks.R Outdated Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
util-networks.R Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As promised, I had a closer look at the changed splitting functionality, and I also started to have a look at the tests (these were the two parts that were still missing from my last review.)

For the splitting functionality, I have a couple of questions that I would like to discuss in our next meeting to get a better understanding of what's going on there. Please have a look at my individual comments (most of them address just comments, but we should discuss them though). In general, the new functionality looks good to me, but I would like to make sure that I understand everything correctly and that we consider special corner cases correctly.

For the tests, I decided to skip reviewing the tests for now as I identified an unexpected change in the test data - we should discuss this in our next meeting before I will have a closer look at the tests themselves. Again, see my detailed comments below.

util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated
@@ -214,7 +270,7 @@ split.data.time.based = function(project.data, time.period = "3 months", bins =

## add splitting information to project configuration
project.conf.new$set.splitting.info(
type = "time-based",
type = if (split.by.time) "time-based" else "activity-based",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not related to this PR, but still interesting): Do we set the type for the splitting info also in other functions? Maybe it would be an idea to introduce string constants for the types, if there are just these two. (Does not need to be done in this PR, we can open another issue for that in take of this later. So just having a look whether these types appear also somewhere else be enough for now....

Copy link
Contributor Author

@maxloeffler maxloeffler Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are only set in one or two place (except in tests where expected results are built). I dont think its worth creating constants for them yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Then let's ignore that for now.

util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Show resolved Hide resolved
This works towards se-sic#239.

Signed-off-by: Maximilian Löffler <[email protected]>
This works towards se-sic#239.

Signed-off-by: Maximilian Löffler <[email protected]>
Instead of creating only undirected issue-based artifact-networks, we now
take the directedness information out of the network config. Edges are
already created in a way that they can be interpreted as directed
edges from the issue referencing to the referenced issue.

Replace for loop in edge creation by more efficient mclapply and fix
some minor formatting inconsistencies.

This works towards se-sic#239.

Signed-off-by: Maximilian Löffler <[email protected]>
Rework the algorithm to create sliding windows in activity-based splitting.
Instead of cutting off half a range many elements at the end before building
sliding windows (which creates a lot of edge cases), build sliding windows with
every element up to the last one. Then remove the last incomplete range.
The contents of the last incomplete range will be fully included in the second
last range and therefore redundant.
Sometimes the last incomplete range is a regular range. Previously the last
range always had to be a regular range. This means that removing the last
incomplete range requires updating the tests.

Additionally fix and improve documentation of splitting methods and fix
minor spelling bugs.

This works towards se-sic#239.

Signed-off-by: Maximilian Löffler <[email protected]>
This renaming makes sense as the method only splits dataframes. Rename
'split.data.by.bins.vector' to 'split.data.by.bins' as it is more readable
and easier to understand.

This works towards se-sic#239.

Signed-off-by: Maximilian Löffler <[email protected]>
util-split.R Outdated Show resolved Hide resolved
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. I still haven't managed to look at the tests. But I have had a look at the new changes of the main implementation. In general, the new changes look good. I only spotted a few comments that are still hard to understand and one implementation detail which I don't understand, making it hard for me to judge its validity. See my individual comments below.

I will have a look at the tests in the next iteration.

util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
@maxloeffler
Copy link
Contributor Author

There is now only one point open (excluding the NEWS.md). @hechtlC What is your opinion on adding input validation for the bins variable in the splitting function (as mentioned in an earlier comment)?

util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
@bockthom
Copy link
Collaborator

bockthom commented Nov 15, 2023

There is now only one point open (excluding the NEWS.md). @hechtlC What is your opinion on adding input validation for the bins variable in the splitting function (as mentioned in an earlier comment)?

@MaLoefUDS There is still one other point open:

"When we take the directed option from the network configuration, we also should have a test for the directed issue artifact network."

As far as I can see, the tests in tests/test-networks-artifact.R only test undirected networks.
Maybe it is possible to use the existing tests and just make parameterized tests out of it, using patrick - then you just need to pass directed as a parameter and use it everywhere you need it in the test

Edit: In an earlier version I wrote "packrat" instead of "patrick", sorry.

Check type of the 'bins' parameter and its components in the wrapper
functions 'split.data.by.time' and 'split.data.by.bins'. Move
'split.data.by.time.or.bins' into a new category for internal helper
functions to discourage direct invocation.

Signed-off-by: Maximilian Löffler <[email protected]>
@maxloeffler maxloeffler force-pushed the dev branch 2 times, most recently from a87782e to 1f62944 Compare November 28, 2023 09:28
@maxloeffler
Copy link
Contributor Author

(I just found out that I used paste0 is not what I want here so I needed to amend once more 💀)

tests/test-split-data-time-based.R Outdated Show resolved Hide resolved
tests/test-split-data-activity-based.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
util-split.R Outdated Show resolved Hide resolved
Check if the 'bins' parameter of the 'split.data.by.bins' actually
contains 'bins' component. Use 'get.date.from.string' instead of
accessing 'lubridate::ymd_hms' directly, to encapsulate date conversion.
Allow 'vector' component of 'bins' to be of any subclass of
'numeric' instead of explicitly 'numeric'.

Disallow lists that contain elements that are not representing a date
in 'split.data.time.based' as they do not comply with the expected
format of bins for 'split.data.by.time.or.bins'.

Signed-off-by: Maximilian Löffler <[email protected]>
@maxloeffler maxloeffler force-pushed the dev branch 2 times, most recently from cbc80e1 to 6e1beeb Compare December 6, 2023 09:22
tests/test-split-misc.R Show resolved Hide resolved
Add tests that call 'split.data.time.based' and 'split.data.by.bins'
with various malformed 'bins' parameters and expect failure.

Signed-off-by: Maximilian Löffler <[email protected]>
@bockthom
Copy link
Collaborator

Thanks @MaLoefUDS for addressing my last comment. Now, everything I've spotted seems to be fixed now.
I will do my best to have a look at the changed tests until our meeting on Wednesday, because I would like to merge this PR before Christmas.

So, given that I don't spot anything problematic in the changed tests (but we'll see), the only thing missing is the update of the NEWS.md. Could you please already collect the different changes from this PR and sort them into the different categories (fixed, added, changed/improved), so that we can discuss them on Wednesday? (Not every single change needs to be listed separately, many of them can be combined into one single item, I guess.) Also don't forget about this one here: #244 (comment)

Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found some time to have a look at the tests on a sample basis.
In general, the tests (at which I had a look at) look good. I spotted a few minor inconsistencies and a few changes that we should double-check and document.

In addition, I spotted one inconsistency in the test data 🙈

Moreover, after everything is fixed in the test data, we need to copy
test_feature/feature/issues-github.list to test_proximity/proximity/issues-github.list, and the same also for issues-jira.list, as our basic assumption is that the issue data for feature and proximity are the same.

tests/test-data.R Outdated Show resolved Hide resolved
tests/test-split-data-activity-based.R Outdated Show resolved Hide resolved
tests/test-split-data-activity-based.R Outdated Show resolved Hide resolved
tests/test-split-data-activity-based.R Outdated Show resolved Hide resolved
tests/test-networks-artifact.R Show resolved Hide resolved
Reverse order of reference from Jira issue 332 and Jira issue 328, since
previously, Jira issue 332 was referenced before its creation.
Copy GitHub and Jira issue data from 'test_feature/feature' also to
'test_proximity/proximity' to keep the data consistent.

Adjust all effected tests to comply with the changed testing data.

Signed-off-by: Maximilian Löffler <[email protected]>
Signed-off-by: Maximilian Löffler <[email protected]>
Signed-off-by: Maximilian Löffler <[email protected]>
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the NEWS.md. There is one major issue: You have added the changes to the v4.3 version, which has already been released. This is not a good idea, we will not change a past release. So, please move all your changes to a new section and just call it ## unversioned. We will add a version number when the next release will be scheduled. Also make sure that we have two empty lines when the version section ends, to be consistent with the previous versions.

I agree with the commits you have referenced, but sometimes I would like to add another commit in addition, if a change is not fully implemented in one commit (and here I only speak of real implementation commits that affect the behavior, not only tiny changes). I added suggestions for these commits below.

NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And one more comment: Please use backticks around function names and strings such as referenced_by or IssueEvent instead of ', such that it is properly highlighted by markdown.

Also use backticks instead of single ticks for proper markdown highlighting.
Improve changelog messages by clarifing and properly focusing on the
important changes as well as adding more relevant commit hashes.

Signed-off-by: Maximilian Löffler <[email protected]>
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, everything seems to be fine with this PR.
Many thanks @MaLoefUDS for your commitment and endurance in this PR. This was a really tough one, also due to plenty of necessary changes to the test data and splitting tests. But we also spotted bugs along this way and improved the splitting functionality substantially, which is a good result for a long PR that was supposed to "just" introduce issue-based artifact networks 😄

@bockthom bockthom merged commit 5febbda into se-sic:dev Dec 20, 2023
5 checks passed
maxloeffler added a commit to maxloeffler/coronet that referenced this pull request Jan 21, 2024
* Adapt the sliding-window data-splitting approach from PR se-sic#244 to network
splitting, especially the removal of the last (redundant) range.
* Fix an edge-case bug in said algorithm, where the wrong last split is removed.
This could happen when the subject to split can be exactly divided by the
desired size of a split.
* Fix a faulty test, that only became apparent after the adjustments to the
sliding-window algorithm in network-splitting

Signed-off-by: Maximilian Löffler <[email protected]>
maxloeffler added a commit to maxloeffler/coronet that referenced this pull request Jan 21, 2024
* Adapt the sliding-window data-splitting approach from PR se-sic#244 to network
splitting, especially the removal of the last (redundant) range.
* Fix an edge-case bug in said algorithm, where the wrong last split is removed.
This could happen when the subject to split can be exactly divided by the
desired size of a split.
* Fix a faulty test, that only became apparent after the adjustments to the
sliding-window algorithm in network-splitting

Signed-off-by: Maximilian Löffler <[email protected]>
maxloeffler added a commit to maxloeffler/coronet that referenced this pull request Jan 25, 2024
* Adapt the sliding-window data-splitting approach from PR se-sic#244 to network
splitting, especially the removal of the last (redundant) range.
* Fix an edge-case bug in said algorithm, where the wrong last split is removed.
This could happen when the subject to split can be exactly divided by the
desired size of a split.
* Fix a faulty test, that only became apparent after the adjustments to the
sliding-window algorithm in network-splitting

Signed-off-by: Maximilian Löffler <[email protected]>
maxloeffler added a commit to maxloeffler/coronet that referenced this pull request Feb 14, 2024
* Adapt the sliding-window data-splitting approach from PR se-sic#244 to network
splitting, especially the removal of the last (redundant) range.
* Fix an edge-case bug in said algorithm, where the wrong last split is removed.
This could happen when the subject to split can be exactly divided by the
desired size of a split.
* Fix a faulty test, that only became apparent after the adjustments to the
sliding-window algorithm in network-splitting

Signed-off-by: Maximilian Löffler <[email protected]>
@bockthom bockthom mentioned this pull request Apr 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants