Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitise input data #200

Merged
merged 48 commits into from
Aug 21, 2023
Merged

Sanitise input data #200

merged 48 commits into from
Aug 21, 2023

Conversation

alexdewar
Copy link
Contributor

@alexdewar alexdewar commented Aug 9, 2023

So I think this is finally ready for review. I'm afraid it's rather on the long side if you go by the diffstat, however a lot of the change is just moving things from one file to another and much of the rest is tests so I hope you'll forgive me 😉.

The goal of this PR was to refactor the config parsing code to make loading JSON files safer. Currently, exceptions are only used intermittently and in many places, if the JSON file is malformed then a warning will be emitted, but the simulation will still run, just using default values. I've tried to make these config parsing errors as informative as possible, by printing error messages whenever malformed values are found and only throwing an exception at the last possible point. I've tried not to modify the file format in any way and the existing example data files load (seemingly) correctly.

I've also added tests for all of the config parsing code as well as (I think) any other bits I've changed, so hopefully we won't break any of this going forward.

I've rearranged files somewhat too. In particular, configuration.cpp ended up rather long by the time I'd finished with it, so I moved the code for parsing command line options into its own files as well as the lower-level config parsing code.

Finally, while I was working on this I introduced smatterings of type changes all over to make the types more descriptive and safer, e.g. converting std::strings to std::filesystem::paths where appropriate.

Fixes #161.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 9, 2023

clang-tidy review says "All clean, LGTM! 👍"

@alexdewar alexdewar force-pushed the sanitise_input_data branch from 6aa55d0 to 0e35a28 Compare August 9, 2023 12:20
@github-actions
Copy link
Contributor

github-actions bot commented Aug 9, 2023

clang-tidy review says "All clean, LGTM! 👍"

@alexdewar alexdewar force-pushed the sanitise_input_data branch from 0e35a28 to 57e0a72 Compare August 9, 2023 12:44
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@alexdewar alexdewar force-pushed the sanitise_input_data branch from 57e0a72 to dc1a050 Compare August 9, 2023 13:01
@github-actions
Copy link
Contributor

github-actions bot commented Aug 9, 2023

clang-tidy review says "All clean, LGTM! 👍"

@jamesturner246 jamesturner246 mentioned this pull request Aug 9, 2023
@alexdewar alexdewar force-pushed the sanitise_input_data branch from dc1a050 to 7931f38 Compare August 9, 2023 14:37
@github-actions
Copy link
Contributor

github-actions bot commented Aug 9, 2023

clang-tidy review says "All clean, LGTM! 👍"

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/HealthGPS.Console/configuration_parsing.cpp Outdated Show resolved Hide resolved
src/HealthGPS.Console/configuration_parsing.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/HealthGPS.Tests/Configuration.Test.cpp Outdated Show resolved Hide resolved
src/HealthGPS.Tests/Configuration.Test.cpp Outdated Show resolved Hide resolved
@alexdewar alexdewar force-pushed the sanitise_input_data branch 2 times, most recently from 6e1313b to d857633 Compare August 11, 2023 10:09
@alexdewar alexdewar marked this pull request as ready for review August 15, 2023 15:58
Copy link
Contributor

@dalonsoa dalonsoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just commented on a couple of very high level points. Otherwise, I think it looks good and, as far as I can tell, the split of functionality in new files makes sense.

I don't know if it is common in C++, but it might be useful to add a documentation header in each file explaining what the contents of that file are about, so developers - old and new - can learn what to expect in each case when reading the code. We should do the same in our python projects, by the way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same way you have done with the configuration files, I miss some more thorough docstrings in this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope poco is a clever acronym of something meaningful in Health-GPS. In Spanish, it means little, so it always looks odd to me when reading the code :P

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently it means "plain old class object": https://imperialchepi.github.io/healthgps/architecture

I'm not sure that this naming is terribly informative though. And I don't know why these structs live in host::poco rather than hgps::poco either 🤷. Maybe we should rename it at some point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, some docstrings - at least for the functions you have touched - will be useful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there's no header file for this one? Not that matters, but I'm curious about why sometimes there is and sometimes there isn't, being rather novice in C++.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's basically because there are no other .cpp that need any of the functions in program.cpp, so we don't need to bother. It seems like most functions in HealthGPS are defined in headers, but a lot of these definitions could be dropped if we wanted because the functions aren't actually being used outside of a single .cpp file.

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

1 similar comment
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

Copy link
Contributor

@jamesturner246 jamesturner246 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a bunch safer, with some nice reuse of existing core code. 💯

There's quite a lot going on here, though. I wasn't able to go through with as much rigour as I'd like, so I've probably missed some things. I think breaking up these changes and planning/synchronising should be in our minds going forward in these kinds of big PRs.

I've picked some things that jumped out at me, and perhaps I'll have a few more once I digest it tomorrow.

Comment on lines 118 to 137
template <typename T>
void to_json(nlohmann::json &j, const std::optional<hgps::core::Interval<T>> &p) {
if (p) {
j = *p;
} else {
// Null interval expressed as empty JSON array
j = nlohmann::json::array();
}
}

template <typename T>
void from_json(const nlohmann::json &j, std::optional<hgps::core::Interval<T>> &p) {
// Treat null JSON values and empty arrays as a null Interval
if (j.is_null() || (j.is_array() && j.empty())) {
p = std::nullopt;
} else {
p = j.get<hgps::core::Interval<T>>();
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't needed, since they will be handled by the Interval<T> to/from functions indirectly though the optional<T> to/from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, but currently the code uses empty arrays to indicate a null interval rather than a JSON null value, which is really what it should be, so I added these helpers to avoid introducing a breaking change. I meant to add a comment saying as much, but forgot.

That said, I don't like this and I think the right solution is to do away with this special-casing and just use null values instead, though that would mean changing config files too. Shall I open an issue for it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that the null values are preferred. I think I already started using null values elsewhere for things. /Go ahead.

@@ -5,16 +5,15 @@ find_package(TBB CONFIG REQUIRED)
find_path(RAPIDCSV_INCLUDE_DIRS "rapidcsv.h")
find_package(Threads REQUIRED)

add_executable(HealthGPS.Console "")
target_compile_features(HealthGPS.Console PUBLIC cxx_std_${CMAKE_CXX_STANDARD})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why no target_compile_features(HealthGPS.Console PUBLIC cxx_std_${CMAKE_CXX_STANDARD})?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's redundant because targets are already built using ${CMAKE_CXX_STANDARD} anyway.

std::map<std::string, std::string> columns;

auto operator<=>(const FileInfo &rhs) const = default;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the spaceship operators?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new C++20 thing. If you add a default spaceship operator, then it automatically generates other comparison functions too (i.e. operator==, operator< etc.): https://en.cppreference.com/w/cpp/language/default_comparisons

Oddly you don't get a default operator== otherwise, which seems a little odd to me.

Comment on lines +7 to +11
// Data file information
void to_json(json &j, const FileInfo &p) {
j = json{
{"name", p.name}, {"format", p.format}, {"delimiter", p.delimiter}, {"columns", p.columns}};
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's currently used by one of the unit tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, mu bad.

Comment on lines +94 to +96
// SES Model Information
void to_json(json &j, const SESInfo &p) {
j = json{{"function_name", p.function}, {"function_parameters", p.parameters}};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... in fact, I don't think any of the POCO-type to_json functions are actually needed anywhere. The only JSON writing is internal hgps event messages, and done in result_file_writer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the same and went through deleting a bunch of them, but then when I got round
to writing unit tests I found they were useful for that.

src/HealthGPS/mapping.h Show resolved Hide resolved
}

// NOLINTNEXTLINE(bugprone-exception-escape)
void rebase_valid_path_to(const json &j, const std::string &key, std::filesystem::path &out,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the return-by-reference also needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not, really. It's just that out might not be updated. Maybe in that case,
returning an empty std::filesystem::path wouldn't be terrible though.

Alternatively, this function could just return a bool indicating whether it's
succeeded, which is what I've done elsewhere (e.g. for the get_to function). That
would at least be consistent.

What do you think?

Copy link
Contributor

@jamesturner246 jamesturner246 Aug 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant that there's two rebase_valid_path_to, one on ln 34 which returns success as the return value, and another which returns success via a bool reference on ln 51. Are both necessary?

Yeah, for the sake of consistency, the boolean flag method is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overload returning a bool isn't actually used anywhere, but I figured we might as
well keep it because we might do in future and it kind of makes sense to implement the
other overload using it anyway. That said, it's up to you. I could get rid of it and it
would save us a few lines of code.

Copy link
Contributor

@jamesturner246 jamesturner246 Aug 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this header is more for risk factor model JSON parsing stuff, right? Was the plan to have a separate header for the parsing interfaces of each model type? Maybe renaming this something like configuration_parsing_risk_factor.h would clarify this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. The reason I put these bits into their own header was really just because it felt like configuration_parsing.h was getting a bit cluttered and these functions only needed to be exported so we could access them for the tests.

@alexdewar
Copy link
Contributor Author

Thanks for the review!

There's quite a lot going on here, though. I wasn't able to go through with as much rigour as I'd like, so I've probably missed some things. I think breaking up these changes and planning/synchronising should be in our minds going forward in these kinds of big PRs.

Yeah, sorry about that 😞. If I'd split this up into smaller PRs it also would have made it less likely we'd end up with merge conflicts too. For the next big refactor I'll plan ahead a bit more rather than just winging it.

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

We don't need the extra option to be able to use an empty array. Let's make users express null Intervals with a JSON null value.
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

Copy link
Contributor

@jamesturner246 jamesturner246 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go ahead 👍

@alexdewar alexdewar enabled auto-merge August 21, 2023 09:48
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@alexdewar alexdewar merged commit 44c8aea into main Aug 21, 2023
@alexdewar alexdewar deleted the sanitise_input_data branch August 21, 2023 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sanitisation of input data
3 participants