-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Technology share values persist being off due to aggregation by name_ald #265
Comments
Hey @ab-bbva I have made a reprex below using the development version of the code (what is currently on github), and the results seem consistent with what you show above (Note: below I have "given" two loans to Company A1, and Company A2 and a third to Company B, I think this reflects the issue you see). Please let me know if the below is what you expect (In particular, note that the resulting library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(r2dii.data)
library(r2dii.analysis)
matched <- tibble::tribble(
~id_loan, ~loan_size_outstanding, ~loan_size_outstanding_currency, ~loan_size_credit_limit, ~loan_size_credit_limit_currency, ~id_2dii, ~level, ~score, ~sector, ~name_ald, ~sector_ald, ~name_direct_loantaker,
"L1", 1, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A1",
"L2", 1, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A2",
"L3", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company b", "automotive", "Company B"
)
ald <- tibble::tribble(
~name_company, ~sector, ~technology, ~year, ~production, ~emission_factor, ~plant_location, ~is_ultimate_owner,
"company a", "automotive", "ice", 2020, 8, 1, "BF", TRUE,
"company a", "automotive", "electric", 2020, 2, 1, "BF", TRUE,
"company b", "automotive", "ice", 2020, 15, 1, "BF", TRUE,
"company b", "automotive", "electric", 2020, 5, 1, "BF", TRUE
)
scenario <- tibble::tribble(
~scenario, ~sector, ~technology, ~region, ~year, ~tmsr, ~smsp, ~scenario_source,
"sds", "automotive", "ice", "global", 2020, 0.5, -0.08, "demo_2020",
"sds", "automotive", "electric", "global", 2020, 0.5, -0.08, "demo_2020"
)
out <- target_market_share(
matched,
ald,
scenario,
region_isos_demo,
by_company = TRUE,
weight_production = FALSE
)
out %>%
arrange(name_ald) %>%
filter(metric == "projected") %>%
select(name_ald, metric, technology, technology_share)
#> # A tibble: 4 x 4
#> name_ald metric technology technology_share
#> <chr> <chr> <chr> <dbl>
#> 1 company a projected electric 0.2
#> 2 company a projected ice 0.8
#> 3 company b projected electric 0.25
#> 4 company b projected ice 0.75 Created on 2021-01-18 by the reprex package (v0.3.0) |
Indeed! Ok, I see the point. Working on it now :-) |
For the record, and in case others are interested, the following reprex highlights the bug: library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(r2dii.data)
library(r2dii.analysis)
matched_split_loans <- tibble::tribble(
~id_loan, ~loan_size_outstanding, ~loan_size_outstanding_currency, ~loan_size_credit_limit, ~loan_size_credit_limit_currency, ~id_2dii, ~level, ~score, ~sector, ~name_ald, ~sector_ald, ~name_direct_loantaker,
"L1", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A1",
"L2", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A2",
"L3", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company b", "automotive", "Company B",
"L4", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A3",
"L5", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A4"
)
matched_agg_loans <- tibble::tribble(
~id_loan, ~loan_size_outstanding, ~loan_size_outstanding_currency, ~loan_size_credit_limit, ~loan_size_credit_limit_currency, ~id_2dii, ~level, ~score, ~sector, ~name_ald, ~sector_ald, ~name_direct_loantaker,
"L1", 8, "EUR", 8, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A",
"L2", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company b", "automotive", "Company B"
)
ald <- tibble::tribble(
~name_company, ~sector, ~technology, ~year, ~production, ~emission_factor, ~plant_location, ~is_ultimate_owner,
"company a", "automotive", "ice", 2020, 8, 1, "BF", TRUE,
"company a", "automotive", "electric", 2020, 2, 1, "BF", TRUE,
"company b", "automotive", "ice", 2020, 15, 1, "BF", TRUE,
"company b", "automotive", "electric", 2020, 5, 1, "BF", TRUE
)
scenario <- tibble::tribble(
~scenario, ~sector, ~technology, ~region, ~year, ~tmsr, ~smsp, ~scenario_source,
"sds", "automotive", "ice", "global", 2020, 0.5, -0.08, "demo_2020",
"sds", "automotive", "electric", "global", 2020, 0.5, -0.08, "demo_2020"
)
out_split <- target_market_share(
matched_split_loans,
ald,
scenario,
region_isos_demo,
by_company = FALSE,
weight_production = TRUE
)
out_split %>%
filter(
metric == "projected",
year == 2020,
technology == "ice"
) %>%
select(metric, technology, technology_share)
#> # A tibble: 1 x 3
#> metric technology technology_share
#> <chr> <chr> <dbl>
#> 1 projected ice 0.775
out_agg <- target_market_share(
matched_agg_loans,
ald,
scenario,
region_isos_demo,
by_company = FALSE,
weight_production = TRUE
)
out_agg %>%
filter(
metric == "projected",
year == 2020,
technology == "ice"
) %>%
select(metric, technology, technology_share)
#> # A tibble: 1 x 3
#> metric technology technology_share
#> <chr> <chr> <dbl>
#> 1 projected ice 0.790 Created on 2021-01-19 by the reprex package (v0.3.0) |
@jdhoffa, |
@jdhoffa @maurolepore Hi. Just to let you know, this issue is still active on the latest release. Values keep being off, and it is caused by the same problem. As long as aggregations in |
Here I show a reprex, with the latest release of the package, The fix was achieved in the body of library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(r2dii.data)
library(r2dii.analysis)
packageVersion("r2dii.analysis")
#> [1] '0.1.5'
matched_split_loans <- tibble::tribble(
~id_loan, ~loan_size_outstanding, ~loan_size_outstanding_currency, ~loan_size_credit_limit, ~loan_size_credit_limit_currency, ~id_2dii, ~level, ~score, ~sector, ~name_ald, ~sector_ald, ~name_direct_loantaker,
"L1", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A1",
"L2", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A2",
"L3", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company b", "automotive", "Company B",
"L4", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A3",
"L5", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A4"
)
matched_agg_loans <- tibble::tribble(
~id_loan, ~loan_size_outstanding, ~loan_size_outstanding_currency, ~loan_size_credit_limit, ~loan_size_credit_limit_currency, ~id_2dii, ~level, ~score, ~sector, ~name_ald, ~sector_ald, ~name_direct_loantaker,
"L1", 8, "EUR", 8, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A",
"L2", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company b", "automotive", "Company B"
)
ald <- tibble::tribble(
~name_company, ~sector, ~technology, ~year, ~production, ~emission_factor, ~plant_location, ~is_ultimate_owner,
"company a", "automotive", "ice", 2020, 8, 1, "BF", TRUE,
"company a", "automotive", "electric", 2020, 2, 1, "BF", TRUE,
"company b", "automotive", "ice", 2020, 15, 1, "BF", TRUE,
"company b", "automotive", "electric", 2020, 5, 1, "BF", TRUE
)
scenario <- tibble::tribble(
~scenario, ~sector, ~technology, ~region, ~year, ~tmsr, ~smsp, ~scenario_source,
"sds", "automotive", "ice", "global", 2020, 0.5, -0.08, "demo_2020",
"sds", "automotive", "electric", "global", 2020, 0.5, -0.08, "demo_2020"
)
out_split <- target_market_share(
matched_split_loans,
ald,
scenario,
region_isos_demo,
by_company = FALSE,
weight_production = TRUE
)
out_split %>%
filter(
metric == "projected",
year == 2020,
technology == "ice"
) %>%
select(metric, technology, technology_share)
#> # A tibble: 1 x 3
#> metric technology technology_share
#> <chr> <chr> <dbl>
#> 1 projected ice 0.790
out_agg <- target_market_share(
matched_agg_loans,
ald,
scenario,
region_isos_demo,
by_company = FALSE,
weight_production = TRUE
)
out_agg %>%
filter(
metric == "projected",
year == 2020,
technology == "ice"
) %>%
select(metric, technology, technology_share)
#> # A tibble: 1 x 3
#> metric technology technology_share
#> <chr> <chr> <dbl>
#> 1 projected ice 0.790 Created on 2021-02-19 by the reprex package (v1.0.0) |
Since And can you confirm that you see similar behaviour locally when you run the above reprex, using the function |
Function On the other hand, I can confirm that both reprex examples on this thread output now the expected values. The problem persists when using the real loanbook, as it is more complex, I guess. Let me see if I can provide you with a more complex example that recreates the loanbook, so that you can also identify the issue. |
So I believe I figured it out, the issue is coming from library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(r2dii.data)
library(r2dii.analysis)
packageVersion("r2dii.analysis")
#> [1] '0.1.5.9000'
matched_diff_levels <- tibble::tribble(
~id_loan, ~loan_size_outstanding, ~loan_size_outstanding_currency, ~loan_size_credit_limit, ~loan_size_credit_limit_currency, ~id_2dii, ~level, ~score, ~sector, ~name_ald, ~sector_ald, ~name_direct_loantaker,
"L1", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A1",
"L2", 2, "EUR", 2, "EUR", "UP1", "direct_loantaker", 1, "automotive", "company a", "automotive", "Company A2",
"L3", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company b", "automotive", "Company B"
)
matched_same_levels <- tibble::tribble(
~id_loan, ~loan_size_outstanding, ~loan_size_outstanding_currency, ~loan_size_credit_limit, ~loan_size_credit_limit_currency, ~id_2dii, ~level, ~score, ~sector, ~name_ald, ~sector_ald, ~name_direct_loantaker,
"L1", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A1",
"L2", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company a", "automotive", "Company A2",
"L3", 2, "EUR", 2, "EUR", "UP1", "ultimate_parent", 1, "automotive", "company b", "automotive", "Company B"
)
ald <- tibble::tribble(
~name_company, ~sector, ~technology, ~year, ~production, ~emission_factor, ~plant_location, ~is_ultimate_owner,
"company a", "automotive", "ice", 2020, 8, 1, "BF", TRUE,
"company a", "automotive", "electric", 2020, 2, 1, "BF", TRUE,
"company b", "automotive", "ice", 2020, 15, 1, "BF", TRUE,
"company b", "automotive", "electric", 2020, 5, 1, "BF", TRUE
)
scenario <- tibble::tribble(
~scenario, ~sector, ~technology, ~region, ~year, ~tmsr, ~smsp, ~scenario_source,
"sds", "automotive", "ice", "global", 2020, 0.5, -0.08, "demo_2020",
"sds", "automotive", "electric", "global", 2020, 0.5, -0.08, "demo_2020"
)
out_diff <- target_market_share(
matched_diff_levels,
ald,
scenario,
region_isos_demo,
by_company = FALSE,
weight_production = TRUE
)
out_diff %>%
filter(
metric == "projected",
year == 2020,
technology == "ice"
) %>%
select(metric, technology, technology_share)
#> # A tibble: 1 x 3
#> metric technology technology_share
#> <chr> <chr> <dbl>
#> 1 projected ice 0.775
out_same <- target_market_share(
matched_same_levels,
ald,
scenario,
region_isos_demo,
by_company = FALSE,
weight_production = TRUE
)
out_same %>%
filter(
metric == "projected",
year == 2020,
technology == "ice"
) %>%
select(metric, technology, technology_share)
#> # A tibble: 1 x 3
#> metric technology technology_share
#> <chr> <chr> <dbl>
#> 1 projected ice 0.783 Created on 2021-02-22 by the reprex package (v1.0.0) |
Hi, @jdhoffa. I tried the modification that you proposed, which actually makes sense, but without changing the grouping in the calculation of |
Hey, so I appreciate why you think this (and beyond that, I genuinely appreciate the amount of time/ thought you are putting into this), but I also just want to explain why I disagree. data <- aggregate_by_loan_id(data)
data <- join_ald_scenario(data, ald, scenario, region_isos)
... (some more code)
if (weight_production) {
data <- summarize_weighted_production(
data,
!!!rlang::syms(summary_groups),
use_credit_limit = use_credit_limit
)
} else {
data <- summarize_unweighted_production(
data,
!!!rlang::syms(summary_groups)
)
}
Rather than add I hope this makes sense! |
I know it must be frustrating to keep repeating exactly what you think should happen, only to have me not do that, so I just wanted to give the context of why I don't believe this is the appropriate solution! |
Okay, thank you so much for putting it like this. I just didn't understand why you didn't want to modify that function. Also, through your explanation I have correctly modified
Great! |
Fantastic! The fix is already in this open PR #282 so it should soon be reflected on CRAN :) |
Function
add_technology_share()
from module summarize_weighted_production.R groups data by name_ald and this generates the technology mix or technology share for each company -- it will later be multiplied by the loan weights and summed over to give the final (weighted) technology share.add_technology_share <- function(data) { data %>% group_by(.data$sector_ald, .data$year, .data$scenario, .data$name_ald) %>% mutate(technology_share = .data$production / sum(.data$production)) %>% group_by(!!!dplyr::groups(data)) }
The issue arises when several name_direct_loantakers are matched to one unique name_ald (this happens when matches are performed at the ultimate_parent level). Function
add_technology_share()
will aggregate all repeated instances of name_ald (there will be as many as the number of direct loantakers assigned to the ALD company) and sum production, yielding a bigger total production and thus a lower technology share than the real one.Substituting name_ald by name_direct_loantaker in this function solves the issue. I'm attaching a simplified example for visualization purposes.
The text was updated successfully, but these errors were encountered: