Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong bins attribute after sliding-window-based network splitting & inconsistent behavior of construct.ranges #256

Closed
bockthom opened this issue Mar 23, 2024 · 6 comments

Comments

@bockthom
Copy link
Collaborator

Unfortunately, we have to postpone the planned release of coronet until we have fixed the following bug(s) related to sliding-window splitting, which I spotted during pre-release testing of analysis scripts with real empirical data:

Description

(1) The network attribute "bins" is incorrect after splitting.networks.time.based using sliding.window = TRUE.

(2) construct.ranges from given bins with sliding.window = TRUE is not in line with the recent changes of sliding-window splitting. I am still not sure whether this is an actual bug or whether we need additional workarounds for the specific case in which I noticed the reported problem, or whether (1) and (2) are just mutually dependent.

Steps to Reproduce or Minimal Working Example (MWE)

Consider the following sequence of statements:

(1) Split networks time based:
Assume that global.net.list is a list that contains exactly one author network that covers the whole time period.

global.net.list = list(network.builder$get.author.network())
split.networks <- split.networks.time.based(global.net.list, time.period = "6months", sliding.window = TRUE)
split.network.bins <- attr(split.networks[[1]], "bins")

When having data from 2018-10-05 to 2020-02-24, the we will get 5 ranges. However, the "bins" attribute that is attached to the resulting list of networks (see variable split.networks.bins in the above example) is incorrect:

[1] "2018-10-05 16:58:32 UTC" "2019-01-05 00:28:32 UTC"
[3] "2019-04-06 07:58:32 UTC" "2019-07-06 15:28:32 UTC"
[5] "2019-10-05 22:58:32 UTC" "2020-02-24 09:02:51 UTC"

The fourth range starts at 2019-07-06 and ends at 2020-01-05. However, 2020-01-05 is not part of the returned bins.
The fifth ranges starts at 2019-10-05 and ends at 2020-02-24. While the networks are built correctly, the bins attribute misses the end of the second last range in this case (as the range from 2020-01-05 to 2020-02-24 is dropped since it is fully contained in 2019-10-05 to 2020-02-24.)


(2) The above behavior leads to another problem: Range names cannot be constructed anymore from the bins attribute, so construct.ranges produces weird results:

ranges <- construct.ranges(split.network.bins, sliding.window = TRUE, raw = FALSE)
names(split.networks[[1]]) <- construct.ranges(split.network.bins, sliding.window = TRUE, raw = TRUE)

Then, ranges is:
[1] "2018-10-05 16:58:32-2019-04-06 07:58:32"
[2] "2019-01-05 00:28:32-2019-07-06 15:28:32"
[3] "2019-04-06 07:58:32-2019-10-05 22:58:32"
[4] "2019-07-06 15:28:32-2020-02-24 09:02:51"

And names(split.networks[[1]]) is:
[1] "2018-10-05 16:58:32-2019-04-06 07:58:32"
[2] "2019-01-05 00:28:32-2019-07-06 15:28:32"
[3] "2019-04-06 07:58:32-2019-10-05 22:58:32"
[4] "2019-07-06 15:28:32-2020-02-24 09:02:51"
[5] NA

What we can clearly see is: Although we have 5 range networks, we can construct only the names of 4 ranges.
On the one hand, this comes from the bug above: Due to 2020-01-05 being missing in the bins attribute, the fourth range is too long and, thus, no fifth range is computed.

On the other hand, even if 2020-01-05 would be present, I assume that construct.ranges would return 6 ranges then (but I have not tested this assumption yet), because then it will also construct the range 2020-01-05-2020-02-24 which is fully contained in the previous range. Thus, range construction in construct.ranges leads to different results than the ranges that result from network splitting. I think, both should somehow be consistent.

For this second problem, I have two ideas:

  • If there is no reason why construct.ranges should result in different results than splitting, we could just fix construct.ranges accordingly.
  • But what if the behavior of construct.ranges is intended? What should happen if I construct these ranges by hand and I want to keep the 6th range in the example that is fully contained in the data. In such a case, one would need to differentiate between "keeping the fully contained range" and "removing the fully contained range" within construct.ranges. If we want to support both options, construct.ranges would need an additional boolean flag (e.g., remove.fully.contained.sliding.range or similar). If so, what would be the default value then? Should the default value support the splitting functionality? Or should the default value support user demands since construct.ranges does not belong to the splitting module. To be honest, I don't know yet which of both arguments does sound more convincing to me...

Any thoughts on this?

Versions

This bug affects the current unversioned dev branch, which is scheduled to become version 4.4 soon. Thus, this bug needs to be fixed before we can release version 4.4.

Additional Information

Regarding bug (1): I just used split.networks.time.based, but potentially the incomplete bins attribute could also be produced via similar splitting functions. Thus, we need to check all network splitting and data splitting functions.

Regarding bug (2): We should also check all variants of construct.ranges and related functions.

@bockthom
Copy link
Collaborator Author

@MaLoefUDS This issue should have highest priority now. Please address this issue before working on the other issues that we have assigned to you. Thanks!

@maxloeffler
Copy link
Contributor

Oh that is something that slipped through PR #249. From my investigations the problem originates here, where we reconstruct the bins attribute from ranges, to minimize the locations of code where we pass bins as an attribute down the splitting functions: (split.network.time.based.by.ranges)

coronet/util-split.R

Lines 817 to 822 in 1d3d1a3

## convert ranges to bins
bins.starts = sapply(ranges.bounds, function(range) range[1])
bins.end = ranges.bounds[[length(ranges.bounds)]][2]
bins.date = get.date.from.unix.timestamp(c(bins.starts, bins.end))
attr(nets.split, "bins") = bins.date

The applied algorithm fails, since it does not account for the fact that the last sliding range does have a different ending time than the last regular range. The end time of the last sliding range is also not taken as the start of any range, since no sliding range follows the last sliding range obviously. Therefore, we lose the end date here.

I have not found a fix for the algorithm yet, however, I will think about it. Thank you for noticing this issue of mine!

@maxloeffler
Copy link
Contributor

I have just tried this simple "fix":

    ## convert ranges to bins
    bins.starts = sapply(ranges.bounds, function(range) range[1])
    bins.ends = sapply(ranges.bounds, function(range) range[2])
    bins.date = get.date.from.unix.timestamp(unique(c(bins.starts, bins.ends)))

It only breaks 1 test and manual debugging that test showed that it should probably break, i.e., the test follows the same scenario that you described in the MWE. I think, in conclusion, this problem really is showing a lack of clear definition of what the bins attribute really should semantically represent.

For now, I did not have a look at problem 2) you described, in the hope that it is directly correlated to 1). However, I will have a look probably tomorrow morning!

@maxloeffler
Copy link
Contributor

Okay now I have also had a look at problem 2) and I can calm down many of your worries, I believe.

Yes the problem in construct.ranges does stem from the problem in split.networks.time.based.by.ranges. Applying my fix resolves the problem and we obtain the following ranges (which are in line with what we assume to be correct):

[1] "2018-10-05 16:58:32-2019-04-06 07:58:32"
[2] "2019-01-05 00:28:32-2019-07-06 15:28:32"
[3] "2019-04-06 07:58:32-2019-10-05 22:58:32"
[4] "2019-07-06 15:28:32-2020-01-05 06:28:32"
[5] "2019-10-05 22:58:32-2020-02-24 09:02:51"

We also do not cause any further complications for the remainder of your MWE code. Everything just looks to be doing good! After this conclusion, I will prepare a commit including a fix for split.networks.time.based.by.ranges as described above which will include a fix to the one test that indeed does depict a similar scenario to your MWE, which we must have overlooked in the past.

Cheers!

@bockthom
Copy link
Collaborator Author

bockthom commented Mar 29, 2024

I have just tried this simple "fix":

    ## convert ranges to bins
    bins.starts = sapply(ranges.bounds, function(range) range[1])
    bins.ends = sapply(ranges.bounds, function(range) range[2])
    bins.date = get.date.from.unix.timestamp(unique(c(bins.starts, bins.ends)))

Great! I've tried your fix in the script which breaks without this fix. Fortunately, using your fix, the script does not end up in an error any more. So, your fix looks good to me and seems to work in the tested scenario with real-world data.

It only breaks 1 test and manual debugging that test showed that it should probably break, i.e., the test follows the same scenario that you described in the MWE. I think, in conclusion, this problem really is showing a lack of clear definition of what the bins attribute really should semantically represent.

True. I never did communicate clearly enough what the bins attribute should semantically represent. And I also did not explain what the actual use cases for the bins are. In the particular case, we use the bins attribute to re-construct ranges and range names. And what's also true is that I have not spotted the error in the particular test case you are talking about.
So, indeed, this is something that slipped through PR #249.

@bockthom
Copy link
Collaborator Author

Okay now I have also had a look at problem 2) and I can calm down many of your worries, I believe.

Yes the problem in construct.ranges does stem from the problem in split.networks.time.based.by.ranges. Applying my fix resolves the problem and we obtain the following ranges (which are in line with what we assume to be correct):

[1] "2018-10-05 16:58:32-2019-04-06 07:58:32" [2] "2019-01-05 00:28:32-2019-07-06 15:28:32" [3] "2019-04-06 07:58:32-2019-10-05 22:58:32" [4] "2019-07-06 15:28:32-2020-01-05 06:28:32" [5] "2019-10-05 22:58:32-2020-02-24 09:02:51"

We also do not cause any further complications for the remainder of your MWE code. Everything just looks to be doing good!

Great, thanks for looking at this problem!

After this conclusion, I will prepare a commit including a fix for split.networks.time.based.by.ranges as described above which will include a fix to the one test that indeed does depict a similar scenario to your MWE, which we must have overlooked in the past.

Sounds good. Please add the commit hash of the fix and the issue id of this issue (#256) to the corresponding and already existing "Added" item in the NEWS.md. As this is something that is still not released, the commit still belongs to the "Added" category. Thanks!

maxloeffler added a commit to maxloeffler/coronet that referenced this issue Mar 30, 2024
Occasionally, the 'bins' attribute did not include the end-date
of the second-last range, as described in Issue se-sic#256.

In addition, fix one faulty test.
maxloeffler added a commit to maxloeffler/coronet that referenced this issue Mar 30, 2024
Occasionally, the 'bins' attribute did not include the end-date
of the second-last range, as described in Issue se-sic#256.

In addition, fix one faulty test.

Signed-off-by: Maximilian Löffler <[email protected]>
maxloeffler added a commit to maxloeffler/coronet that referenced this issue Mar 30, 2024
Occasionally, the 'bins' attribute did not include the end-date
of the second-last range, as described in Issue se-sic#256.

In addition, fix one faulty test.

Signed-off-by: Maximilian Löffler <[email protected]>
@bockthom bockthom mentioned this issue Apr 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants