Additional Core/Peripheral Classification Methods #276

Open · wants to merge 8 commits into base: `dev`
1 change: 1 addition & 0 deletions NEWS.md
@@ -15,6 +15,7 @@
- Add commit network as a new type of network. It uses commits as vertices and connects them either via cochange or commit interactions. This includes adding new config parameters and the function `add.vertex.attribute.commit.network` for adding vertex attributes to a commit network (PR #263, ab73271781e8e9a0715f784936df4b371d64c338, cd9a930fcb54ff465c2a5a7c43cfe82ac15c134d)
- Add `remove.duplicate.edges` function that takes a network as input and conflates identical edges (PR #268, d9a4be417b340812b744f59398ba6460ba527e1c, 0c2f47c4fea6f5f2f582c0259f8cf23af985058a, c6e90dd9cb462232563f753f414da14a24b392a3)
- Add `cumulative` as an argument to `construct.ranges` which enables the creation of cumulative ranges from given revisions (PR #268, a135f6bb6f83ccb03ae27c735c2700fccc1ee0c8, 8ec207f1e306ef6a641fb0205a9982fa89c7e0d9)
- Add four new metrics that can be used for the classification of authors into core and peripheral: Betweenness, Closeness, Pagerank and Eccentricity (PR #276, 65d5c9cc86708777ef458b0c2e744ab4b846bdd1, b392d1a125d0f306b4bce8d95032162a328a3ce2, c5d37d40024e32ad5778fa5971a45bc08f7631e0)

### Changed/Improved

52 changes: 52 additions & 0 deletions README.md
@@ -34,6 +34,9 @@ If you wonder: The name `coronet` derives as an acronym from the words "configur
- [Splitting data and networks based on defined time windows](#splitting-data-and-networks-based-on-defined-time-windows)
- [Cutting data to unified date ranges](#cutting-data-to-unified-date-ranges)
- [Handling data independently](#handling-data-independently)
- [Core/Peripheral classification](#coreperipheral-classification)
- [Count-based metrics](#count-based-metrics)
- [Network-based metrics](#network-based-metrics)
- [How-to](#how-to)
- [File/Module overview](#filemodule-overview)
- [Configuration classes](#configuration-classes)
@@ -375,6 +378,55 @@ Analogously, the `NetworkConf` parameter `unify.date.ranges` enables this very f

In some cases, it is not necessary to build a network to get the information you need. Therefore, please remember that we offer the possibility to get the raw data or mappings between, e.g., authors and the files they edited. The data inside an instance of `ProjectData` can be accessed independently. Examples can be found in the file `showcase.R`.
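
For instance, a minimal sketch of such direct access (the getter name `get.commits` is an assumption here; see `showcase.R` for the authoritative examples):

```R
## access the raw commit data of a "ProjectData" instance directly,
## without building any network (getter name assumed, not confirmed here)
commit.data = proj.data$get.commits()
head(commit.data)
```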

#### Core/Peripheral classification

Core/Peripheral classification describes the process of dividing the authors of a project into `core` and `peripheral` developers, based on the principle that the core developers contribute most of the work in a given project. The concrete threshold can be configured in `CORE.THRESHOLD` and is set to 80% by default, a value commonly used in the literature. In practice, this is done by assigning scores to developers to approximate their importance in a project and then dividing the authors into `core` and `peripheral` based on these scores such that the desired split is achieved.
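
To illustrate this principle, consider the following minimal R sketch (illustrative only, not coronet's actual implementation; the author scores are made-up values): authors are sorted by descending score, and the smallest set of top-scoring authors that jointly reaches the threshold forms the core.

```R
CORE.THRESHOLD = 0.8 # default: core developers cover 80% of the work

## hypothetical importance scores per author (e.g., commit counts)
scores = c(Olaf = 50, Thomas = 30, Karl = 15, Udo = 5)
scores = sort(scores, decreasing = TRUE)

## cumulative share of the total score covered by the top-k authors
cumulative.share = cumsum(scores) / sum(scores)

## the smallest prefix of authors that reaches the threshold is the core
num.core = which(cumulative.share >= CORE.THRESHOLD)[1]
core = names(scores)[seq_len(num.core)]   # "Olaf", "Thomas"
peripheral = setdiff(names(scores), core) # "Karl", "Udo"
```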

##### Count-based metrics

In this section, we describe the count-based metrics we provide for classifying authors as core or peripheral; a usage sketch follows the list.
- `commit.count`
* calculates scores based on the number of commits per author
- `loc.count`
* calculates scores based on the number of lines of code changed by each author
- `mail.count`
* calculates scores based on the number of mails sent per author
- `mail.thread.count`
* calculates scores based on the number of mail threads each author participated in
- `issue.count`
* calculates scores based on the number of issues each author participated in
- `issue.comment.count`
* calculates scores based on the number of comments each author made in issues
- `issue.commented.in.count`
* calculates scores based on the number of issues each author commented in
- `issue.created.count`
* calculates scores based on the number of issues each author created
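
Each of these metrics corresponds to a classification function operating on the project data. As a hedged sketch (the function name `get.author.class.commit.count` is extrapolated from the naming scheme of the network-based classifiers exercised in the tests and is an assumption here; `proj.data` stands for a previously constructed `ProjectData` instance):

```R
## assumed call pattern for a count-based classifier; the name mirrors
## the network-based classifiers and is not confirmed by this diff
result = get.author.class.commit.count(proj.data)

## the classifiers return a list of two data.frames, "core" and
## "peripheral", each holding author names and their scores
result[["core"]]
result[["peripheral"]]
```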

##### Network-based metrics

In this section, we describe the metrics we provide for classifying authors as core or peripheral based on author networks; a usage sketch follows the list.
- `network.degree`
* calculates scores for authors based on the vertex degrees in an author network
* the degree of a vertex is the number of adjacent edges
- `network.eigen`
* calculates scores for authors based on the eigenvector centralities in an author network
* eigenvector centrality measures the importance of vertices within a network by assigning each vertex a score proportional to the sum of the scores of its neighbors
- `network.hierarchy`
* calculates scores for authors based on the hierarchy found within an author network
* hierarchical scores are calculated by dividing the vertex degree by the clustering coefficient of each vertex
- `network.betweenness`
* calculates scores for authors based on the betweenness of vertices in an author network
* betweenness counts, for each vertex, the number of shortest paths between other pairs of vertices that pass through it
- `network.closeness`
* calculates scores for authors based on the closeness of vertices in an author network
* closeness measures how close a vertex is to all other vertices; it is computed as the inverse of the sum of the shortest-path distances from the vertex to all other vertices
- `network.pagerank`
* calculates scores for authors based on the pagerank of vertices in an author network
* pagerank refers to the PageRank algorithm originally employed by Google's web search, which is closely related to eigenvector centrality
- `network.eccentricity`
* calculates scores for authors based on the eccentricity of vertices in an author network
* eccentricity is the shortest-path distance from a vertex to the vertex farthest away from it, i.e., the maximum shortest-path distance from that vertex
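
The corresponding classifier functions appear in the tests of this pull request. A minimal usage sketch, assuming `network` is an author network that has been built beforehand:

```R
## classify authors by the betweenness centrality of their vertices
result = get.author.class.network.betweenness(network)

## as the tests show, the result is a list of two data.frames,
## "core" and "peripheral", each with author names and metric values
core.authors = result[["core"]][["author.name"]]
peripheral.authors = result[["peripheral"]][["author.name"]]
```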

### How-to

In this section, we give a short example of how to initialize all needed objects and build a bipartite network.
112 changes: 112 additions & 0 deletions tests/test-core-peripheral.R
@@ -18,6 +18,7 @@
## Copyright 2019 by Christian Hechtl <[email protected]>
## Copyright 2021 by Christian Hechtl <[email protected]>
## Copyright 2023-2024 by Maximilian Löffler <[email protected]>
## Copyright 2024 by Leo Sendelbach <[email protected]>
Review comment (Collaborator): Please update the copyright header and include 2025 😉

## All Rights Reserved.


@@ -105,6 +106,117 @@ test_that("Eigenvector classification", {
expect_equal(expected, result, tolerance = 0.0001)
})

test_that("Hierarchy classification", {

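## Arrange: construct a small commit-interaction network with three authors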
vertices = data.frame(
name = c("Olaf", "Thomas", "Karl"),
kind = TYPE.AUTHOR,
type = TYPE.AUTHOR
)
edges = data.frame(
from = c("Olaf", "Thomas", "Karl", "Thomas"),
to = c("Thomas", "Karl", "Olaf", "Thomas"),
func = c("GLOBAL", "test2.c::test2", "GLOBAL", "test2.c::test2"),
hash = c("0a1a5c523d835459c42f33e863623138555e2526",
"418d1dc4929ad1df251d2aeb833dd45757b04a6f",
"5a5ec9675e98187e1e92561e1888aa6f04faa338",
"d01921773fae4bed8186b0aa411d6a2f7a6626e6"),
file = c("GLOBAL", "test2.c", "GLOBAL", "test2.c"),
base.hash = c("3a0ed78458b3976243db6829f63eba3eead26774",
"0a1a5c523d835459c42f33e863623138555e2526",
"1143db502761379c2bfcecc2007fc34282e7ee61",
"0a1a5c523d835459c42f33e863623138555e2526"),
base.func = c("test2.c::test2", "test2.c::test2",
"test3.c::test_function", "test2.c::test2"),
base.file = c("test2.c", "test2.c", "test3.c", "test2.c"),
artifact.type = c("CommitInteraction", "CommitInteraction", "CommitInteraction", "CommitInteraction"),
weight = c(1, 1, 1, 1),
type = c(TYPE.EDGES.INTRA, TYPE.EDGES.INTRA, TYPE.EDGES.INTRA, TYPE.EDGES.INTRA),
relation = c("commit.interaction", "commit.interaction", "commit.interaction", "commit.interaction")
)
test.network = igraph::graph_from_data_frame(edges, directed = FALSE, vertices = vertices)

## Act
result = get.author.class.network.hierarchy(test.network)
## Assert
Review comment (Collaborator): Could you please add a blank line before the comment?

expected.core = data.frame(author.name = c("Thomas"),
hierarchy = c(4))
expected.peripheral = data.frame(author.name = c("Olaf", "Karl"),
hierarchy = c(2, 2))
expected = list(core = expected.core, peripheral = expected.peripheral)
row.names(result[["core"]]) = NULL
row.names(result[["peripheral"]]) = NULL
expect_equal(expected, result)
})

test_that("Betweenness classification", {

## Act
result = get.author.class.network.betweenness(network)

## Assert
expected.core = data.frame(author.name = c("Olaf"),
betweenness.centrality = c(1))
expected.peripheral = data.frame(author.name = c("Björn", "udo", "Thomas", "Fritz [email protected]",
"georg", "Hans"),
betweenness.centrality = c(0, 0, 0, 0, 0, 0))
expected = list(core = expected.core, peripheral = expected.peripheral)
row.names(result[["core"]]) = NULL
row.names(result[["peripheral"]]) = NULL
expect_equal(expected, result)
})

test_that("Closeness classification", {

## Act
result = get.author.class.network.closeness(network)

## Assert
expected.core = data.frame(author.name = c("Olaf"),
closeness.centrality = c(0.5))
expected.peripheral = data.frame(author.name = c("Björn", "Thomas", "udo", "Fritz [email protected]",
"georg", "Hans"),
closeness.centrality = c(0.33333, 0.33333, 0.0, 0.0, 0.0, 0.0))
expected = list(core = expected.core, peripheral = expected.peripheral)
row.names(result[["core"]]) = NULL
row.names(result[["peripheral"]]) = NULL
expect_equal(expected, result, tolerance = 0.0001)
})

test_that("Pagerank classification", {

## Act
result = get.author.class.network.pagerank(network)

## Assert
expected.core = data.frame(author.name = c("Olaf"),
pagerank.centrality = c(0.40541))
expected.peripheral = data.frame(author.name = c("Björn", "Thomas", "udo", "Fritz [email protected]",
"georg", "Hans"),
pagerank.centrality = c(0.21396, 0.21396, 0.041667, 0.041667, 0.041667, 0.041667))
expected = list(core = expected.core, peripheral = expected.peripheral)
row.names(result[["core"]]) = NULL
row.names(result[["peripheral"]]) = NULL
expect_equal(expected, result, tolerance = 0.0001)
})

test_that("Eccentricity classification", {

## Act
result = get.author.class.network.eccentricity(network)

## Assert
expected.core = data.frame(author.name = c("Olaf"),
eccentricity = c(1))
expected.peripheral = data.frame(author.name = c("Björn", "udo", "Thomas", "Fritz [email protected]",
"georg", "Hans"),
eccentricity = c(0, 0, 0, 0, 0, 0))
expected = list(core = expected.core, peripheral = expected.peripheral)
row.names(result[["core"]]) = NULL
row.names(result[["peripheral"]]) = NULL
expect_equal(expected, result)
})

# TODO: Add a test for hierarchy classification
Review comment (@bockthom, Jan 14, 2025): Please remove this TODO comment as soon as you have added another test for hierarchy classification. Thanks!


test_that("Commit-count classification using 'result.limit'" , {