- Minor text-updates in README, explain texts and javadoc. No new functionality or changes.
- Refactor parts of metrics, by removing the conformal metrics that only used a single confidence level - instead all confidence dependent metrics are now of PlotMetric type.
- Refactor `GridSearch` so that it works for both `SingleValueMetric` and `PlotMetric` types (i.e. all metrics from CPSign).
- Fix various minor bugs for secondary metrics based on the above two updates to make e.g. the `GridResultCSVWriter` output the correct values for secondary metrics.
- Replace some metrics and update names, e.g. new `ModelCalibration` (for conformal models) replacing `CPRegressionCalibrationPlotBuilder` and `CPClassificationCalibrationPlotBuilder`. Remove the "PlotBuilder" suffix for all plot metrics, to have class names similar or identical to the `getName()` of the given class.
- Massive improvement in runtime for duplicate resolving transformations - e.g. new cpLogD with 2.2M compounds took 19.5 hours with old algorithm and now less than 1 second using the same machine.
- Update slf4j and logback deps due to dependabot alert for new vulnerability.
- Improve output of CLI `model-info` after multiple save/load of a model, as the "data" nesting previously would have 0 as number of observations (no training examples saved in serialized models) - now overwriting with the info from the predictor.
- General fix of typos and improvements in output of e.g. toString methods.
- Enforce a consistent number of columns in CSV files: parsing now fails unless every line has the same number of columns as the header row. This may be a breaking change for some files but will likely produce more robust results.
- Changed the default delimiter of `CSV` to `,`, from previously being tab (i.e. `\t`). For CLI users there is now an alias `TSV` for CSV input - which has the default delimiter set to tab.
- Update CDK from version 2.8 to 2.9. This reduces the 'fatjar' size and removes some vulnerabilities flagged by the GitHub build system.
- Resolved bug when using the `--echo` flag from the CLI - which apparently changed the internal state of the picocli parser and caused issues unless it was placed as the last parameter.
- Added API methods for getting `FeatureInfo` - i.e. mappings from feature index to descriptive statistics as well as a check for missing values.
- Improved the `list-features` CLI method to include descriptive statistics of each feature as well as checking for missing values. This information is written in `--verbose` mode.
- Improved the config selection from the CLI as well as fixing general spelling errors.
- Solved major bug for generating prediction images using the CLI, which was completely turned off but should now work as expected.
- Improved error messages from the CLI to indicate the parameter flags correctly. Solved minor bugs for that lookup as well as extended it for parameters based on method annotations.
- Solved minor bug with swapped `get[Min|Max]SplitIndex` values for the `LOOSplitter` class.
- Solved bug with filtering of property files, which caused the project version not being correctly injected into `confai.properties`, which is used for keeping track of the CPSign version. Thus the version could not be found either at API or CLI level.
- Moved the encrypt-api to a separate project, which simplifies and decouples that API from the remaining code. Now the other components can rely on a fixed version (as it is not likely to change often or at all) and not have issues with incompatibility due to transitive dependencies to the `parent` project pom (i.e. encrypt-api would rely on e.g. the 2.0.0-rc3 version whilst the remaining cpsign modules will rely on a later version of the parent pom).
- Added two callback interfaces for progress information and monitoring while running `GridSearch` jobs, one that only gets information (`ProgressCallback`) and the other one (`ProgressMonitor`) that can choose to exit the grid search e.g. due to time constraints. Note that these are synchronous calls at this point (CPSign is not multithreaded otherwise so no use adding that for this one task). Added an implementation tailored for the CLI so that information is printed in the terminal (e.g. " - Finished 2/10 grid points") to provide more feedback about the progress so e.g. too long jobs can be aborted earlier.
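
The split between an information-only callback and a monitor that may stop the search can be pictured with the following minimal sketch. It is not CPSign code: the interface names `SketchProgressCallback` and `SketchProgressMonitor`, their method signatures, and the `GridSearchSketch` loop are all hypothetical and only illustrate synchronous callbacks invoked after each grid point.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical information-only callback (not the CPSign signature). */
interface SketchProgressCallback {
    void onGridPointFinished(int finished, int total);
}

/** Hypothetical monitor that may request an early exit (not the CPSign signature). */
interface SketchProgressMonitor {
    boolean shouldStop(int finished, int total, long elapsedMillis);
}

final class GridSearchSketch {

    /**
     * Evaluates grid points one by one on the calling thread, notifies the
     * callback after each point, and stops as soon as the monitor asks for it.
     * Returns the number of grid points that were evaluated.
     */
    static <P> int run(List<P> gridPoints,
                       ToDoubleFunction<P> evaluate,
                       SketchProgressCallback callback,
                       SketchProgressMonitor monitor) {
        long start = System.currentTimeMillis();
        int finished = 0;
        for (P point : gridPoints) {
            evaluate.applyAsDouble(point); // score is ignored in this sketch
            finished++;
            callback.onGridPointFinished(finished, gridPoints.size());
            long elapsed = System.currentTimeMillis() - start;
            if (monitor.shouldStop(finished, gridPoints.size(), elapsed)) {
                break; // early exit requested, e.g. a time budget was exceeded
            }
        }
        return finished;
    }

    public static void main(String[] args) {
        List<Double> costValues = List.of(0.1, 1.0, 10.0, 100.0);
        int done = run(
                costValues,
                cost -> cost * 2, // stand-in for training and evaluating a model
                (finished, total) ->
                        System.out.println(" - Finished " + finished + "/" + total + " grid points"),
                (finished, total, elapsedMs) -> elapsedMs > 60_000); // stop after one minute
        System.out.println("Evaluated " + done + " grid points");
    }
}
```

Because the calls are synchronous, a slow callback delays the search itself, which is one reason to keep such callbacks cheap.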
Note: due to a minor mistake in updating the encrypt-api version to the updated 2.0.1 version in the test-utils module, the build failed at GitHub and RC1 is amended directly by release candidate 3 (RC3).
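
The stricter CSV handling listed above (every line must have the same number of columns as the header row) can be illustrated with a small stand-alone check. This is only a sketch under simplifying assumptions, not the parser CPSign uses: the class `CsvColumnCheck` and its method are hypothetical, and fields are split naively on the delimiter without any support for quoted values.

```java
import java.util.List;
import java.util.regex.Pattern;

final class CsvColumnCheck {

    /**
     * Throws IllegalArgumentException if any line has a different number of
     * fields than the header (the first line). Naive split, no quoting support.
     */
    static void requireConsistentColumns(List<String> lines, char delimiter) {
        if (lines.isEmpty()) {
            return; // nothing to validate
        }
        String regex = Pattern.quote(String.valueOf(delimiter));
        // limit -1 keeps trailing empty fields, so "a,b," counts as 3 columns
        int expected = lines.get(0).split(regex, -1).length;
        for (int i = 1; i < lines.size(); i++) {
            int actual = lines.get(i).split(regex, -1).length;
            if (actual != expected) {
                throw new IllegalArgumentException(
                        "Line " + (i + 1) + " has " + actual
                                + " columns, expected " + expected);
            }
        }
    }

    public static void main(String[] args) {
        // Comma-delimited input with a trailing empty field: accepted
        requireConsistentColumns(
                List.of("id,smiles,activity", "1,CCO,0.5", "2,CCN,"), ',');
        // Tab-delimited input, as for the TSV alias: accepted
        requireConsistentColumns(List.of("id\tsmiles", "1\tCCO"), '\t');
        System.out.println("All rows match the header");
    }
}
```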
- Introduced a common `ProgressTracker` that is shared among the processing stack while loading and processing new data. This way a specific threshold of "number of allowed failures" can be adhered to instead of having individual thresholds for each level of processing. This trickles down into `--early-termination` of the CLI now stopping when it should.
- Fixed bug where `MolAndActivityConverter` would stop when encountering the maximum specified number of conversion failures, and do it silently without causing the expected exception. This bug was also present when using the CLI `precompute` program - and is now solved.
- Remove the possibility to use clustered POSIX flags to the CLI, which resulted in strange behavior that was hard to find the cause of.
- Update the fuzzy matching logic to (hopefully) improve matching of parameters to the CLI.
- Update dependency version of Picocli.
- Added logic for trying to deduce the user error to improve error messages to CLI users, specifically for running `precompute`. Also streamlined the output to be more coherent among `predict`, `predict-online` and `validate`.
- Extracted prediction logic of `predict` and `predict-online` into a common utility class, again to streamline the behavior seen by the user.
- Added more information to the `model-info` program of the CLI, i.e. whether a model can be used for generating prediction images and the number of attributes/features used.
- Added new interface `ChemFilter` that replaces the `[set|get]MinHAC` of a `ChemDataset` (API level). Added two implementations: `HACFilter` (allows setting both minimum and maximum HAC) and `MolMassFilter` (min/max molecular mass). At the CLI level the `--min-hac` parameter is now deprecated and will give a deprecation message if used, and is replaced by the `--chem-filters` parameter that allows setting multiple filters. To stay consistent with earlier versions there is a default `HACFilter` using the same minimum heavy atom count of 5. In order to not apply any chemical filters the user can give `--chem-filters none`.
- At CLI level there is a new `explain chem-filters` that gives some general information as well as lists the currently available filters and sub-parameters of these.
- Improved aggregation of metrics from multi-test splits (i.e. when computing mean +/- standard deviation), and improved output from the CLI to the user (i.e. removed the wording "Calibration plot" when only having a single confidence level).
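
As an illustration of the "mean +/- standard deviation" aggregation over multiple test splits mentioned above, here is a minimal sketch. It is not the CPSign implementation: the class `SplitAggregation` is hypothetical, and it assumes the sample standard deviation (n - 1 denominator), which the changelog does not specify.

```java
import java.util.Locale;

final class SplitAggregation {

    /** Arithmetic mean of one metric value per test split. */
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("At least one split result is required");
        }
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /**
     * Sample standard deviation (n - 1 denominator, an assumption of this
     * sketch). Returns 0 for a single split, where no spread can be estimated.
     */
    static double sampleStd(double[] values) {
        if (values.length < 2) {
            return 0.0;
        }
        double m = mean(values);
        double squaredDiffs = 0;
        for (double v : values) {
            squaredDiffs += (v - m) * (v - m);
        }
        return Math.sqrt(squaredDiffs / (values.length - 1));
    }

    /** Formats the aggregate as "mean +/- std" with three decimals. */
    static String format(double[] values) {
        return String.format(Locale.ROOT, "%.3f +/- %.3f", mean(values), sampleStd(values));
    }

    public static void main(String[] args) {
        double[] metricPerSplit = {1.0, 2.0, 3.0}; // one value per test split
        System.out.println(format(metricPerSplit)); // prints "2.000 +/- 1.000"
    }
}
```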