Add commit message merge functionality #193

Merged
merged 43 commits into from
Feb 3, 2021
43 commits
9589e52
Add test commit message data
nlschn Dec 13, 2020
85b1d05
Add new function to read commit messages
nlschn Dec 13, 2020
61618db
Change commitMessages.list test files to have for the right line breaks
nlschn Dec 16, 2020
17a61ed
Adapt commit message read test to new test files
nlschn Dec 16, 2020
c624c90
Adapt read.commit.messages to handle line breaks correctly
nlschn Dec 16, 2020
fdc414a
Add functions that enable merging commit messages into data
nlschn Dec 21, 2020
5db90d8
Add new configuration option for commit messages
nlschn Dec 21, 2020
f80b24b
Replace seq with seq_along and add missing log statement in util-read.R
nlschn Dec 21, 2020
9414357
Add tests for merging and fix bug when merging only titles
nlschn Dec 28, 2020
359b12c
Add description of changes to unversioned section of NEWS.md
nlschn Jan 2, 2021
70c8395
Remove unnecessary empty lines from several files
nlschn Jan 7, 2021
89a6ea6
Fix a syntax error in util-read
nlschn Jan 7, 2021
6e9147e
Fix merging by hash instead of commit.id
nlschn Jan 8, 2021
c9c7ff7
Modify README and NEWS
nlschn Jan 13, 2021
0457dd5
Rename "message.body" column to "message" everywhere
nlschn Jan 13, 2021
7e61dcb
Fix style issues and improve message processing
nlschn Jan 13, 2021
8e28a1f
Put merge functionality into own function
nlschn Jan 13, 2021
703ab3e
Fix error when returning a variable that is not defined
nlschn Jan 13, 2021
7caaa8d
Simplify data frame creation in read.commit.messages
nlschn Jan 15, 2021
8dd410c
Reorder functions in util read and replace special functions
nlschn Jan 15, 2021
eb1cec8
Fix comments in and change order in 'set.commits'
nlschn Jan 15, 2021
d5c8c78
Add helper function to format 'commit.id' column
nlschn Jan 15, 2021
43e1894
Change commit message merge process
nlschn Jan 15, 2021
70b3cb6
Change order of data sources to be alphabetical
nlschn Jan 16, 2021
31e0f85
Update 'NEWS.md' with commit hashes
nlschn Jan 16, 2021
a0d5e32
Add package 'data.table' to coronet and refactor README
nlschn Jan 20, 2021
4c49269
Increase perfomance of commit message read
nlschn Jan 20, 2021
19655dd
Update my copyright notices
nlschn Jan 20, 2021
a36bde4
Fix spelling errors in 'README.md' and 'util-conf.R'
nlschn Jan 20, 2021
aab0751
Use new helper function in tests to format commit ids
nlschn Jan 25, 2021
0859b9a
Replace for-loop with lapply call in function to read commit messages
nlschn Jan 25, 2021
fc5d20f
Fix minor comment issues and add checks before updating commit messages
nlschn Jan 25, 2021
686459e
Initialize commit message data on RangeData-objects in 'util-split.R'
nlschn Jan 25, 2021
613a773
Fix minor spelling errors
nlschn Jan 25, 2021
98e83b0
Change all data split tests to include commit message data
nlschn Jan 25, 2021
2e42fca
Change all sliding window data tests to include commit message data
nlschn Jan 25, 2021
c052dfb
Fix minor comment issue in 'test-split-sliding-window.R'
nlschn Jan 26, 2021
d3bbae0
Add new cleanup functions for commit messages and synchronicity
nlschn Jan 30, 2021
9385084
Fix wrong variable name in 'cleanup.synchronicity'
nlschn Jan 30, 2021
63b6f79
Add cleanup functions to NEWS.md
nlschn Feb 1, 2021
c63a25a
Remove unnecassary function calls and add logging output
nlschn Feb 1, 2021
e1e1ba8
Fix regex when filtering out spaces and change data frame assignment
nlschn Feb 1, 2021
18843a8
Fix problems in CI pipeline for R-3.3
bockthom Feb 3, 2021
2 changes: 1 addition & 1 deletion .drone.yml
@@ -64,7 +64,7 @@ steps:

- name: R-3.3
pull: if-not-exists
image: r-base:3.3.3
image: r-base:3.3.2
commands: *runTests
depends_on: [clone]

4 changes: 4 additions & 0 deletions NEWS.md
@@ -2,6 +2,10 @@

## Unversioned

### Added
- Add functionality to read and process commit messages in order to merge them into the commit data (see issue #180). Three values are available for the new attribute `commit.messages` in `ProjectConf`: `none`, `title` and `messages` (PR #193, 85b1d0572c0fb9f4c062bceb1363b0398f98b85f, fdc414ade1a640f533e809a25cfe012e42b3cffa, 43e1894998e18faff3a65114fa65ee54e1d2f66e)
- Add functions `cleanup.commit.message.data` and `cleanup.synchronicity.data` to remove commit hashes that are no longer present in the commit data from the commit message data or synchronicity data (PR #193, 98e83b037ecc88d9a29e8e4ca93598a9978e85a2)
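
A minimal sketch of the idea behind the two cleanup functions, not the code added in this PR: rows of an additional data source are kept only if their hash still occurs in the commit data. The sample data frames, the helper name `cleanup.by.hash`, and the column name `hash` are assumptions made for illustration.

```r
## Hypothetical stand-in data; the real data frames have more columns.
commit.data = data.frame(
    hash = c("aaa111", "bbb222"),
    author.name = c("Alice", "Bob"),
    stringsAsFactors = FALSE
)
commit.message.data = data.frame(
    hash = c("aaa111", "bbb222", "ccc333"),
    title = c("Add stuff", "Add some more stuff", "Orphaned message"),
    stringsAsFactors = FALSE
)

## Keep only the rows whose hash still occurs in the commit data
## (hypothetical helper, not part of coronet).
cleanup.by.hash = function(additional.data, commits) {
    keep = additional.data[["hash"]] %in% commits[["hash"]]
    return(additional.data[keep, , drop = FALSE])
}

cleaned = cleanup.by.hash(commit.message.data, commit.data)
print(cleaned)
```

The same filter would apply unchanged to a synchronicity data frame, as long as it is keyed by the same hash column.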

### Changed/Improved
- Add `.drone.yml` to enable running our CI pipelines on drone.io (PR #191, 1c5804b59c582cf34af6970b435add51452fbd11)

97 changes: 53 additions & 44 deletions README.md
@@ -10,41 +10,41 @@ If you wonder: The name `coronet` derives as an acronym from the words "configur


## Table of contents

- [Integration](#integration)
* [Requirements](#requirements)
* [R](#r)
* [packrat (recommended)](#packrat)
* [Folder structure of the input data](#folder-structure-of-the-input-data)
* [Needed R packages](#needed-r-packages)
* [Submodule](#submodule)
* [Selecting the correct version](#selecting-the-correct-version)
- [Functionality](#functionality)
* [Configuration](#configuration)
* [Data sources](#data-sources)
* [Network construction](#network-construction)
* [Data sources for network construction](#data-sources-for-network-construction)
* [Types of networks](#types-of-networks)
* [Relations](#relations)
* [Edge-construction algorithms for author networks](#edge-construction-algorithms-for-author-networks)
* [Vertex and edge attributes](#vertex-and-edge-attributes)
* [Further functionalities](#further-functionalities)
* [Splitting data and networks based on defined time windows](#splitting-data-and-networks-based-on-defined-time-windows)
* [Cutting data to unified date ranges](#cutting-data-to-unified-date-ranges)
* [Handling data independently](#handling-data-independently)
* [How-to](#how-to)
* [File/Module overview](#filemodule-overview)
- [Configuration classes](#configuration-classes)
* [ProjectConf](#projectconf)
* [Basic information](#basic-information)
* [Artifact-related information](#artifact-related-information)
* [Revision-related information](#revision-related-information)
* [Data paths](#data-paths)
* [Splitting information](#splitting-information)
* [(Configurable) Data-retrieval-related parameters](#configurable-data-retrieval-related-parameters)
* [NetworkConf](#networkconf)
- [License](#license)
- [Work in progress](#work-in-progress)
- [Integration](#integration)
- [Requirements](#requirements)
- [`R`](#r)
- [`packrat` (recommended)](#packrat-recommended)
- [Folder structure of the input data](#folder-structure-of-the-input-data)
- [Needed R packages](#needed-r-packages)
- [Submodule](#submodule)
- [Selecting the correct version](#selecting-the-correct-version)
- [Functionality](#functionality)
- [Configuration](#configuration)
- [Data sources](#data-sources)
- [Network construction](#network-construction)
- [Data sources for network construction](#data-sources-for-network-construction)
- [Types of networks](#types-of-networks)
- [Relations](#relations)
- [Edge-construction algorithms for author networks](#edge-construction-algorithms-for-author-networks)
- [Vertex and edge attributes](#vertex-and-edge-attributes)
- [Further functionalities](#further-functionalities)
- [Splitting data and networks based on defined time windows](#splitting-data-and-networks-based-on-defined-time-windows)
- [Cutting data to unified date ranges](#cutting-data-to-unified-date-ranges)
- [Handling data independently](#handling-data-independently)
- [How-to](#how-to)
- [File/Module overview](#filemodule-overview)
- [Configuration classes](#configuration-classes)
- [ProjectConf](#projectconf)
- [Basic information](#basic-information)
- [Artifact-related information](#artifact-related-information)
- [Revision-related information](#revision-related-information)
- [Data paths](#data-paths)
- [Splitting information](#splitting-information)
- [(Configurable) Data-retrieval-related parameters](#configurable-data-retrieval-related-parameters)
- [NetworkConf](#networkconf)
- [Contributing](#contributing)
- [License](#license)
- [Work in progress](#work-in-progress)


## Integration
@@ -123,6 +123,7 @@ Alternatively, you can run `Rscript install.R` to install the packages.
- `parallel`: For parallelization
- `logging`: Logging
- `sqldf`: For advanced aggregation of `data.frame` objects
- `data.table`: For faster data processing
- `testthat`: For the test suite
- `patrick`: For the test suite
- `ggplot2`: For plotting of data
@@ -179,11 +180,16 @@ There are two distinguishable types of data sources that are both handled by the
* Issue data (called `"issues"` internally)

- Additional (orthogonal) data sources (augmentable to main data sources, not splittable)
* Commit messages are available through the parameter `commit.messages` in the [`ProjectConf`](#configurable-data-retrieval-related-parameters) class. Three values can be used:
1. `none` is the default value: no commit messages are merged and the commit data stays unchanged.
2. `title` merges the commit message titles (i.e., the first non-whitespace line of each commit message) into the commit data. This gives the data frame an additional column `title`.
3. `messages` merges both titles and message bodies into the commit data frame. This adds two new columns, `title` and `message`.
* [PaStA](https://github.com/lfd/PaStA/) data (patch-stack analysis, see also the parameter `pasta` in the [`ProjectConf`](#configurable-data-retrieval-related-parameters) class)
* Patch-stack analysis to link patches sent to mailing lists and upstream commits
* Synchronicity information on commits (see also the parameter `synchronicity` in the [`ProjectConf`](#configurable-data-retrieval-related-parameters) class)
* Synchronous commits are commits that change a source-code artifact that has also been changed by another author within a reasonable time-window.
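
The three `commit.messages` values can be illustrated with a small base-R sketch. It is not the implementation from this PR: the helper names `parse.commit.message` and `add.commit.messages`, the `raw` column, and the sample data are assumptions; only the column names `hash`, `title`, and `message` follow the description above.

```r
## Split a raw commit message into its title (first non-whitespace line)
## and its body (all remaining lines). Hypothetical helper.
parse.commit.message = function(raw.message) {
    lines = strsplit(raw.message, "\n", fixed = TRUE)[[1]]
    non.empty = which(grepl("[^[:space:]]", lines))
    if (length(non.empty) == 0) {
        return(list(title = "", message = ""))
    }
    first = non.empty[1]
    title = trimws(lines[first])
    body = trimws(paste(lines[-seq_len(first)], collapse = "\n"))
    return(list(title = title, message = body))
}

## Merge the selected message columns into the commit data by hash.
## Hypothetical helper mirroring the three configuration values.
add.commit.messages = function(commits, messages, mode = c("none", "title", "messages")) {
    mode = match.arg(mode)
    columns = switch(mode,
                     none = character(0),
                     title = "title",
                     messages = c("title", "message"))
    if (length(columns) == 0) {
        return(commits)
    }
    selected = messages[, c("hash", columns), drop = FALSE]
    return(merge(commits, selected, by = "hash", all.x = TRUE))
}

## Made-up sample data
commits = data.frame(hash = c("h1", "h2"), author.name = c("Alice", "Bob"),
                     stringsAsFactors = FALSE)
messages = data.frame(hash = c("h1", "h2"),
                      raw = c("Add stuff\n\nDetails", "  Fix bug  "),
                      stringsAsFactors = FALSE)
parsed = lapply(messages[["raw"]], parse.commit.message)
messages[["title"]] = vapply(parsed, function(p) p[["title"]], character(1))
messages[["message"]] = vapply(parsed, function(p) p[["message"]], character(1))

print(add.commit.messages(commits, messages, "title"))
print(add.commit.messages(commits, messages, "messages"))
```

The sketch merges by `hash` rather than by commit id, in line with the commit "Fix merging by hash instead of commit.id" in this PR.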



The important difference is that the *main data sources* are used internally to construct artifact vertices in relevant types of networks. Additionally, these data sources can be used as a basis for splitting `ProjectData` in a time-based or activity-based manner – obtaining `RangeData` instances as a result (see file `split.R` and the contained functions). Thus, `RangeData` objects contain only data of a specific period of time.

The *additional data sources* are orthogonal to the main data sources, can augment them by additional information, and, thus, are not split at any time.
@@ -532,16 +538,23 @@ There is no way to update the entries, except for the revision-based parameters.
- `commits.filter.untracked.files`
* Remove all information concerning untracked files from the commit data. The effect becomes visible when retrieving commits using `get.commits.filtered`, because the result then does not contain any commits that solely changed untracked files. Networks built on top of this `ProjectData` also do not contain any information about untracked files.
* [*`TRUE`*, `FALSE`]
- `mails.filter.patchstack.mails`
* Filter patchstack mails from the mail data. In a thread, a patchstack spans the first sequence of mails where each mail has been authored by the thread creator and has been sent within a short time window after the preceding mail. The mails spanned by a patchstack are called
'patchstack mails' and for each patchstack, every patchstack mail but the first one is filtered when `mails.filter.patchstack.mails = TRUE`.
* [`TRUE`, *`FALSE`*]
- `commit.messages`
* Read and add commit messages to commits. The column `title` will contain the first line of the message and, if selected, the column `message` will contain the rest.
* [*`none`*, `title`, `messages`]
- `issues.only.comments`
* Only use comments from the issue data on disk and no further events such as references and label changes
* [*`TRUE`*, `FALSE`]
- `issues.from.source`
* Choose from which sources the issue data on disk is read in. Multiple sources can be chosen.
* [*`github`, `jira`*]
- `mails.filter.patchstack.mails`
* Filter patchstack mails from the mail data. In a thread, a patchstack spans the first sequence of mails where each mail has been authored by the thread creator and has been sent within a short time window after the preceding mail. The mails spanned by a patchstack are called
'patchstack mails' and for each patchstack, every patchstack mail but the first one is filtered when `mails.filter.patchstack.mails = TRUE`.
* [`TRUE`, *`FALSE`*]
- `pasta`
* Read and integrate [PaStA](https://github.com/lfd/PaStA/) data with commit and mail data (columns `pasta` and `revision.set.id`)
* [`TRUE`, *`FALSE`*]
* **Note**: To include PaStA-based edge attributes, you need to give the `"pasta"` edge attribute for `edge.attributes`.
- `synchronicity`
* Read and add synchronicity data to commits (column `synchronicity`)
* [`TRUE`, *`FALSE`*]
@@ -550,10 +563,6 @@ There is no way to update the entries, except for the revision-based parameters.
* The time-window (in days) to use for synchronicity data if enabled by `synchronicity = TRUE`
* [1, *5*, 10, 15]
* **Note**: If, at least, one artifact in a commit has been edited by more than one developer within the configured time window, then the whole commit is considered to be synchronous.
- `pasta`
* Read and integrate [PaStA](https://github.com/lfd/PaStA/) data with commit and mail data (columns `pasta` and `revision.set.id`)
* [`TRUE`, *`FALSE`*]
* **Note**: To include PaStA-based edge attributes, you need to give the `"pasta"` edge attribute for `edge.attributes`.
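
The patchstack rule described for `mails.filter.patchstack.mails` can be sketched for a single thread as follows. This is an illustration under assumptions, not coronet's implementation: the helper name `remove.patchstack.mails`, the column names `author.name` and `date`, and the 300-second window are all hypothetical (the README only says "a short time window").

```r
## Mails of one thread, ordered by date; the first row is the thread creator's mail.
thread = data.frame(
    author.name = c("Alice", "Alice", "Alice", "Bob", "Alice"),
    date = as.POSIXct(c("2021-01-01 10:00:00", "2021-01-01 10:01:00",
                        "2021-01-01 10:02:00", "2021-01-01 12:00:00",
                        "2021-01-01 12:01:00"), tz = "UTC"),
    stringsAsFactors = FALSE
)

## Drop every patchstack mail except the first one (hypothetical helper).
remove.patchstack.mails = function(thread, window.secs = 300) {
    n = nrow(thread)
    if (n < 2) {
        return(thread)
    }
    creator = thread[["author.name"]][1]
    ## time gap between each mail and its predecessor
    gaps = as.numeric(difftime(thread[["date"]][-1], thread[["date"]][-n], units = "secs"))
    ## TRUE if a mail continues the patchstack started by the first mail
    continues = thread[["author.name"]][-1] == creator & gaps <= window.secs
    ## length of the leading run of TRUE values
    run.length = if (all(continues)) length(continues) else which(!continues)[1] - 1
    if (run.length == 0) {
        return(thread)
    }
    return(thread[-(seq_len(run.length) + 1), , drop = FALSE])
}

kept = remove.patchstack.mails(thread)
print(kept)
```

In the sample thread, the second and third mail follow the creator's first mail within the window and are removed; the later mail by the creator is kept because the sequence was interrupted by another author.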

### NetworkConf

1 change: 1 addition & 0 deletions install.R
@@ -32,6 +32,7 @@ packages = c(
"parallel",
"logging",
"sqldf",
"data.table",
"testthat",
"patrick",
"ggplot2",
@@ -0,0 +1,7 @@
32712;"72c8dd25d3dd6d18f46e2b26a5f5b1e2e8dc28d0";"Add stuff"
32713;"5a5ec9675e98187e1e92561e1888aa6f04faa338";" Add some more stuff "
32710;"3a0ed78458b3976243db6829f63eba3eead26774";" I added important things the things are nothing"
32714;"1143db502761379c2bfcecc2007fc34282e7ee61";" I wish it would work now"
32715;"418d1dc4929ad1df251d2aeb833dd45757b04a6f";"Wish intensifies"
32716;"d01921773fae4bed8186b0aa411d6a2f7a6626e6";" ... still doesn't work as expected "
32711;"0a1a5c523d835459c42f33e863623138555e2526";""
@@ -0,0 +1,7 @@
32712;"72c8dd25d3dd6d18f46e2b26a5f5b1e2e8dc28d0";"Add stuff"
32713;"5a5ec9675e98187e1e92561e1888aa6f04faa338";" Add some more stuff "
32710;"3a0ed78458b3976243db6829f63eba3eead26774";" I added important things the things are nothing"
32714;"1143db502761379c2bfcecc2007fc34282e7ee61";" I wish it would work now"
32715;"418d1dc4929ad1df251d2aeb833dd45757b04a6f";"Wish intensifies"
32716;"d01921773fae4bed8186b0aa411d6a2f7a6626e6";" ... still doesn't work as expected "
32711;"0a1a5c523d835459c42f33e863623138555e2526";""
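
Both test files use the layout `id;"hash";"message"`. A minimal sketch of reading such data with base R is shown below; it is not the `read.commit.messages` function from this PR, and the column names passed to `col.names` are assumptions. Three of the rows above are inlined so that the example is self-contained.

```r
raw.lines = c(
    '32712;"72c8dd25d3dd6d18f46e2b26a5f5b1e2e8dc28d0";"Add stuff"',
    '32713;"5a5ec9675e98187e1e92561e1888aa6f04faa338";" Add some more stuff "',
    '32711;"0a1a5c523d835459c42f33e863623138555e2526";""'
)

## Only the double quote is a quoting character, so apostrophes inside
## messages (e.g. "doesn't") are read literally; '#' is not a comment marker.
commit.message.data = read.table(
    text = raw.lines,
    sep = ";",
    quote = "\"",
    comment.char = "",
    col.names = c("commit.id", "hash", "message"),
    colClasses = c("integer", "character", "character"),
    stringsAsFactors = FALSE
)

## Strip the surrounding whitespace that some messages carry
commit.message.data[["message"]] = trimws(commit.message.data[["message"]])
print(commit.message.data)
```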