Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Risk factor refactor #195

Merged
merged 7 commits into from
Aug 7, 2023
Merged

Risk factor refactor #195

merged 7 commits into from
Aug 7, 2023

Conversation

jamesturner246
Copy link
Contributor

Fixes #166
(not really)

This has been somewhat of a lacklustre attempt that fell short of my ambition... All that was really achieved is the removal of duplicate risk factor listings from France.EBHLM.json, and changing a couple of data manager methods to fail on missing data, rather than warn and continue. As such, it should be quite quick to review. The main issue still seems to be the presence of code for handling Hierarchical static and dynamic models in the wider codebase, which is resisting refactor efforts.

In a moment of catharsis, I've now convinced myself that the best solution going forward is to merge static and dynamic RF model into a single model, which encapsulates all this extra HLM-specific code within it.

See #194. I'd appreciate your input, as I'm itching to get started merging them, rather than doing more piecemeal refactors that aren't solving the underlying problem.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 2, 2023

clang-tidy review says "All clean, LGTM! 👍"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is autogenerated from a folder with 50+ files and the country data file, not sure we gain much by hand editing this, perhaps changing the script that generate this file would be more beneficial. Alternatively, you could remove this section from the config file and read from here instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the respective healthgps-tools script should also be updated. I'm considering a Python rewrite of the scripts, to sidestep the Windows R binding dependency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @israel-vieira.

Though we do want the config files in this repo to be up to date and working with the current version of the code for testing purposes. Whether we edit it directly or just the script for generating it isn't so important I think, as long as it's correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made an issue for later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list of diseases can be long, 30+. The current validation will list all the invalid diseases key, if any, before exiting. The exception alternative will break at the first invalid key, the use will need to keep on trying until all are correct, in the end, both methods are accomplishing the required validation, it is just a matter of style.

Mapping the config with the back-end in a case-sensitive world is never user friend, the identifier type tries to minimize the pain by making all keys lowercase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case-sensitivity: fair point. In which case, the task then becomes removing the parts of the codebase that are case-sensitive. I've spotted a few.

Diseases: agree, a style choice. My concern in these cases has always been that by warning and continuing for dodgy input, that it makes the problem slightly less obvious to diagnose when it fails later on in the program, than if it were to immediately fail on loading. Perhaps this can be replaced with a function which reads all diseases, detects missing ones with printed debug info, before throwing an exception for any missing inputs, as opposed to a case-by-case exception.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm with @israel-vieira on this - collecting all that is invalid and providing a full list to the user (within meaningful indication of what is wrong) results on a better user experience than having to try again and again until all issues are sorted. We have another project where this is the approach and is indeed much better.

All in all, having a strong validation and input pre-processing steps where things can fail early and inconsistencies fixed (casing, proxis, etc.) is a good thing that will result in simpler, easier to understand (and diagnose) code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I don't consider users input validation as an exceptional circumstance, the validation should focus on helping the users to fix the wrong or missing information provided, perhaps the system should show valid options before exiting gracefully.

In general, users don't read or understand stack traces with a message hidden inside a large block of text, this is kind of information is more relevant for developers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two issues here:

  1. Whether to "fast fail" when loading the config or provide a list of failures at the end
  2. Whether to implement this using exceptions or some other mechanism (e.g. std::optional)

My view is:

  1. Yes, if the validation fails, we should continue attempting to load the file and provide a list of errors at the end, rather than aborting immediately (for the most part)
  2. @jamesturner246's refactor of get_country and get_disease_info to use exceptions is sensible and makes things clearer

The problem with at least some of the config loading code atm is that if validation fails, it doesn't abort immediately, but it also doesn't abort after the validation step either, so while you may get a warning on the console, the program will blithely plough on using default values, which is not good (see #161).

I'm a big believer in using exceptions to propagate errors in C++; it can make code much cleaner and safer. I'll give an example of how you might do "lazy failure" when loading disease info using exceptions:

std::vector<core::DiseaseInfo> get_diseases_info(core::Datastore &data_api, Configuration &config) {
    const auto diseases = data_api.get_diseases();
    fmt::print("\nThere are {} diseases in storage, {} selected.\n", diseases.size(),
               config.diseases.size());

    std::vector<core::DiseaseInfo> result;
    std::set<std::string> missing_diseases;
    for (const auto &code : config.diseases) {
        try {
            result.emplace_back(data_api.get_disease_info(code));
        } catch (const std::invalid_argument &) {
            missing_diseases.emplace(code);
        }
    }

    if (!missing_diseases.empty()) {
        std::cerr << "ERROR: The following diseases were not found:\n";
        for (const auto &disease : missing_diseases) {
            std::cerr << "\t" << disease << "\n";
        }

        throw std::runtime_error{"Some invalid diseases were given"};
    }

    return result;
}

(There might be things we could tweak -- e.g. I haven't used the fmt library -- but this is just to give an idea.)

This code assumes that the caller will appropriately handle any thrown std::runtime_error, which it absolutely should do (we might want to use a bespoke error type rather than std::runtime_error though). So it means you a) tell the user what errors have occurred and b) propagate the errors up the call stack. Ultimately, the main config loading function should be wrapped in a try...catch and if an error occurs, it just tells the user and exits the program (the user will already have been informed of the specific issues with the config loading by this point).

Note that this code is also safer, because a std::runtime_error will also be thrown if the "disease" value is missing, meaning that the caller knows that something has gone wrong, as opposed to the current state of affairs where just a warning message is displayed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with this ^. The user gets helpful human readable messages noting which keys were missing, and execution stops before there's a chance of accidental continuation on default values. I'm happy to merge, and happy to leave it with you if you already have the idea in mind.

Copy link
Contributor

@alexdewar alexdewar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the right approach on the whole. I've made some minor suggestions.

I agree with @dalonsoa and @israel-vieira that we do probably want to do "lazy failure" when loading the config and print errors as we go along, rather than aborting as soon as an error is found. I think what you've done is an improvement though, so I'd be inclined to merge this as is for now and worry about that later. As I was planning on tackling this myself (see #161), I'm happy for you to leave this task for me, if you like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @israel-vieira.

Though we do want the config files in this repo to be up to date and working with the current version of the code for testing purposes. Whether we edit it directly or just the script for generating it isn't so important I think, as long as it's correct.

src/HealthGPS.Console/configuration.cpp Outdated Show resolved Hide resolved
src/HealthGPS.Datastore/datamanager.cpp Outdated Show resolved Hide resolved
src/HealthGPS.Datastore/datamanager.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two issues here:

  1. Whether to "fast fail" when loading the config or provide a list of failures at the end
  2. Whether to implement this using exceptions or some other mechanism (e.g. std::optional)

My view is:

  1. Yes, if the validation fails, we should continue attempting to load the file and provide a list of errors at the end, rather than aborting immediately (for the most part)
  2. @jamesturner246's refactor of get_country and get_disease_info to use exceptions is sensible and makes things clearer

The problem with at least some of the config loading code atm is that if validation fails, it doesn't abort immediately, but it also doesn't abort after the validation step either, so while you may get a warning on the console, the program will blithely plough on using default values, which is not good (see #161).

I'm a big believer in using exceptions to propagate errors in C++; it can make code much cleaner and safer. I'll give an example of how you might do "lazy failure" when loading disease info using exceptions:

std::vector<core::DiseaseInfo> get_diseases_info(core::Datastore &data_api, Configuration &config) {
    const auto diseases = data_api.get_diseases();
    fmt::print("\nThere are {} diseases in storage, {} selected.\n", diseases.size(),
               config.diseases.size());

    std::vector<core::DiseaseInfo> result;
    std::set<std::string> missing_diseases;
    for (const auto &code : config.diseases) {
        try {
            result.emplace_back(data_api.get_disease_info(code));
        } catch (const std::invalid_argument &) {
            missing_diseases.emplace(code);
        }
    }

    if (!missing_diseases.empty()) {
        std::cerr << "ERROR: The following diseases were not found:\n";
        for (const auto &disease : missing_diseases) {
            std::cerr << "\t" << disease << "\n";
        }

        throw std::runtime_error{"Some invalid diseases were given"};
    }

    return result;
}

(There might be things we could tweak -- e.g. I haven't used the fmt library -- but this is just to give an idea.)

This code assumes that the caller will appropriately handle any thrown std::runtime_error, which it absolutely should do (we might want to use a bespoke error type rather than std::runtime_error though). So it means you a) tell the user what errors have occurred and b) propagate the errors up the call stack. Ultimately, the main config loading function should be wrapped in a try...catch and if an error occurs, it just tells the user and exits the program (the user will already have been informed of the specific issues with the config loading by this point).

Note that this code is also safer, because a std::runtime_error will also be thrown if the "disease" value is missing, meaning that the caller knows that something has gone wrong, as opposed to the current state of affairs where just a warning message is displayed.

src/HealthGPS.Tests/Datastore.Test.cpp Show resolved Hide resolved
src/HealthGPS.Tests/Datastore.Test.cpp Outdated Show resolved Hide resolved
src/HealthGPS.Tests/Datastore.Test.cpp Outdated Show resolved Hide resolved
src/HealthGPS.Tests/Datastore.Test.cpp Outdated Show resolved Hide resolved
src/HealthGPS.Tests/Datastore.Test.cpp Outdated Show resolved Hide resolved
@github-actions
Copy link
Contributor

github-actions bot commented Aug 3, 2023

clang-tidy review says "All clean, LGTM! 👍"

1 similar comment
@github-actions
Copy link
Contributor

github-actions bot commented Aug 3, 2023

clang-tidy review says "All clean, LGTM! 👍"

@github-actions
Copy link
Contributor

github-actions bot commented Aug 3, 2023

clang-tidy review says "All clean, LGTM! 👍"

Copy link
Contributor

@dalonsoa dalonsoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, nothing to add that @alexdewar has not explained much, much better - and with examples.

Copy link
Contributor

@alexdewar alexdewar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jamesturner246 jamesturner246 merged commit 19b4519 into main Aug 7, 2023
@jamesturner246 jamesturner246 deleted the rf_refactor_166 branch August 7, 2023 11:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Decouple risk factor code from DataTable backend
4 participants