Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When to do timestamp extraction for issue data? Before or after event filtering? #185

Closed
bockthom opened this issue Nov 11, 2020 · 4 comments

Comments

@bockthom
Copy link
Collaborator

Description

During various debugging activities in the last days, I stumbled over the following issue:

When performing data cutting (get.data.cut.to.same.date) on issue data, the outcome sometimes looks weird. Then I found out why: The timestamps are extracted before the filtering on "issues.only.comments" takes place -- see the function call extract.timestamps in the following code excerpt:
https://github.com/se-passau/coronet/blob/91fc448bad64dd17fa96a720322fe7d46337cc43/util-data.R#L900-L915

Hence, the data cutting function uses timestamps that may be outside the data-cutting range if they are related to an issue event which is not a comment. This might be confusing if a user is not aware of that. But then I thought about it: How can this happen? Can it happen that there are timestamps in the issue data which date back to before the first comment? -- Each issue / PR starts with a comment - so how can this happen? After doing some research, I found answer for these questions: If we have a pull request, all the commits added to this pull-request are an event themselves, the date of those events is equal to the author date of the commit. Actually adding this commit to the PR is a separate event having the timestamp of the time at which the commit was pushed. So, everything is ok, but we should think about how we want to handle this -- there are several options:


Option 1: Extract timestamps after the event-filtering on "issues.only.comments" has taken place.

Advantage: We only get the timestamps we actually use (but then it would be necessary to also write the filtered data to the data object, reducing the needs for filtering again and again whenever calling get.issues).

Disadvantage: The timestamps won't be updatable anymore. That is, as soon we have filtered the comment-events once, there is no way to return to the original data. However, we could change the implementation of the "issues.only.comments" configuration option in a way that it re-reads the data completely whenever changing this configuration option.


Option 2: Filter events that may pre-date the actual issue data (like "commit_added" events which rather belong to commit data than to issue-data)

Here we would also have two sub-options: Either remove them completely while reading the issue data, or just remove them from timestamp extraction.

Advantage: Comparably easy to implement, not many changes necessary.

Disadvantage: We may loose information on those events (depending on the sub-option) -- but this may be wanted in some kind.


Option 3: Don't change anything.

Advantage: Nothing to do, stop discussing it.

Disadvantage: Confusing behavior when there are very old commits referenced in a pull request and the pull request was initiated many months later than the commit was authored. This somehow circumvents the data cutting procedure as such old commits lead us to believe that the issue data started months ago, but actually it was just an old commit that was added to a pull request yesterday and the actual issue data started also yesterday.


What do you folks think about that? Which option would you prefer? I am also open for new suggestions.

@bockthom bockthom added this to the v3.7 milestone Nov 11, 2020
@bockthom
Copy link
Collaborator Author

After talking about this issue with @hechtlC, we came up with a new solution, which circumvents all the disadvantages of the options stated above. Hence, this is our favorite option:

Option 4: Adjust dates of the commit_added events to not predate the issue creation date

When reading the issue data, set all the event dates which predate the creation of the issue (PR) to the beginning of the issue, as there cannot be any event taking place before the issue was created. To not lose the original date of such events, store the original date in one of the eventInfo columns, instead.

Advantages: We don't lose any data, the timestamps for the start of the issue data is determined correctly, and filtering is independent from all the date-specific changes.

Hence, this is to be implemented some day.

@bockthom bockthom modified the milestones: v3.7, v3.8 Dec 1, 2020
@clhunsen
Copy link
Collaborator

clhunsen commented Feb 7, 2021

The timestamps are extracted before the filtering on "issues.only.comments" takes place -- see the function call extract.timestamps in the following code excerpt:
https://github.com/se-passau/coronet/blob/91fc448bad64dd17fa96a720322fe7d46337cc43/util-data.R#L900-L915

I just want to share the thoughts that I had while revisiting this issue. In the end, I see a mixture of Options 1 and 4 as a resolution for this issue. And here is why and how:

  • Right now, the issue data is filtered on-the-fly and in any case when calling ProjectData$get.issues() – but the result is never stored persistently (i.e., in the ProjectData object). Maybe, this is an outdated mechanism. When looking at the corresponding documentation in the README file, I come to the conclusion that this ProjectConf parameter should be treated similarly to mails.filter.patchstack.mails for the mail data – that is: persistently. I am not sure why we implemented the issue filtering as it is, in the end, it is not consistent with the other data sources right now, isn't it?
  • In either way, the timestamp extraction should always be executed after filtering to yield a consistent behavior, for any data source!
  • I see the problem with commit-related data inside the issue data andI support the idea of an timestamp adjustment to circumvent any weirdly shaped data (and also networks). The reasoning for Option 4 fits. Not sure whether this needs to be configurable (turned on by default, for example) to ensure that users can access the original data somehow.

Not sure whether I miss something here. Just my thoughts on this anyway. What are yours?

@clhunsen clhunsen mentioned this issue Feb 7, 2021
7 tasks
@bockthom
Copy link
Collaborator Author

bockthom commented Feb 7, 2021

Thanks for your thoughts @clhunsen! I completely agree with you.

I am not sure why we implemented the issue filtering as it is, in the end, it is not consistent with the other data sources right now, isn't it?

That's true. We just stumbled over this inconsistency a few days ago in PR #194, as we need to be able to access both filtered and unfiltered issue data there. We agreed that issue filtering should be implemented similarly to commit filtering and @JoJoDeveloping is currently working on that for PR #194. Our goal is to adjust the issue filtering such that it behaves like commit filtering (and make sure that the configuration option "isues.only.comments" still behaves as before).

@bockthom
Copy link
Collaborator Author

bockthom commented Feb 7, 2021

  • In either way, the timestamp extraction should always be executed after filtering to yield a consistent behavior, for any data source!

I thought about that for a couple of minutes and still agree with you that we need a consistent behavior across all data sources. However, I am not sure if we really need to perform the timestamp extracting after filtering. The purpose of the timestamp extraction is to find out for which time periods we have data at all. Filtering just excludes certain events/files/etc. which are out of interest. But the data is, though, available for the complete, unfiltered period, even if there is no event of interest in this unfiltered period, there could potentially be one. Therefore, one could also argue that timestamp extraction before filtering is sufficient. Nevertheless, I would like to keep the behavior as it is currently implemented for commit timestamp extraction (to keep backwards compatibility) but adjust the issue timestamp extraction to be consistent. In the end, it should make (almost) no difference whether timestamps are extracted before or after filtering as filtering should have no effect on the timestamps any more (when we go with option 4 to adjust the weirdly shaped pull-request data.)

JoJoDeveloping added a commit to JoJoDeveloping/coronet that referenced this issue Mar 10, 2021
…t.issues to make it similar to get.commits[.filtered]

Close se-sic#185.
JoJoDeveloping added a commit to JoJoDeveloping/coronet that referenced this issue Mar 10, 2021
…t.issues to make it similar to get.commits[.filtered]

Close se-sic#185.

Signed-off-by: Johannes Hostert <[email protected]>
JoJoDeveloping added a commit to JoJoDeveloping/coronet that referenced this issue Mar 23, 2021
…t.issues to make it similar to get.commits[.filtered]

Close se-sic#185.

Signed-off-by: Johannes Hostert <[email protected]>
@bockthom bockthom mentioned this issue Sep 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants