-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cross reference mode #235
cross reference mode #235
Conversation
from ml4cvd.explorations import sample_from_char_model, mri_dates, ecg_dates, predictions_to_pngs, sort_csv | ||
from ml4cvd.explorations import tabulate_correlations_of_tensors, test_labels_to_label_map, infer_with_pixels, explore | ||
from ml4cvd.explorations import plot_heatmap_of_tensors, plot_while_learning, plot_histograms_of_tensors_in_pdf, cross_reference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you format these import statements so they are cleaner? I like the style set by Black
:
from ml4sts.models import (
get_feature_values,
train_model_parallel,
evaluate_predictions,
format_results_as_df,
initialize_model,
threshold_predictions,
generate_crossfold_indices,
add_fold_and_model_info_to_results,
)
@paolodi what is the convention for ML4CVD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments. Will provide more detailed review soon. Can you please request review from a Broadie? Thanks and outstanding work.
@paolodi would you mind taking a look at this? thank you for the time to review so many of my PRs 😅 |
I think it would be good if cross-ref can take
or
or
etc
|
We also need the functionality to cross-reference between two dates that are specified by keys in If the CSV at
we would like to count an ECG as a "hit" if it it falls between (but not on; be conservative) Several possible windows could be specified:
Lastly, we need an option
to count as a hit. For context, this feature is important for a project where we want to assess cohort ∩ ECG before vs. after the initiation of an anti-cancer therapy, and are interested in patients have paired ECGs pre and post. Is this feasible? |
Edit: old suggestion in this comment outdated, see updated #235 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@StevieSong great job on this PR! I really like the new functionality, and my main comments are suggestions on making it blend even more with the rest of the codebase in terms of argument naming. Also, I wonder whether by using df.merge
, we may avoid one more dependency and generalize the join
to extend over multiple TMAPS. This might also help consolidating the filtering, which is now split in two parts instead.
Again great job overall!
ml4cvd/arguments.py
Outdated
@@ -204,6 +204,18 @@ def parse_args(): | |||
parser.add_argument('--num_workers', default=multiprocessing.cpu_count(), type=int, help="Number of workers to use for every tensor generator.") | |||
parser.add_argument('--cache_size', default=3.5e9/multiprocessing.cpu_count(), type=float, help="Tensor map cache size per worker.") | |||
|
|||
# Cross reference arguments | |||
parser.add_argument('--tensors_name', default='Tensors', help='Name of dataset at tensors.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we really need this. Can we infer the name somehow from --tensors
? The problem with these "naming" arguments tend to be a bit too generic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would have to be pretty fancy, for example to cross reference the tensorized ECG hd5 files, we use the path /data/partners_ecg/mgh/explore/tensors_all_union.csv
or /data/partners_ecg/mgh/hd5/
and for reference_tensors
, we'd use something like /data/sts/mgh-all-features-labels.csv
if xref on STS cohort -- doable to extract some name but would either be difficult or not pretty name in output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we dont need name? tensors and references are decent enough descriptors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
args.id
lets user name output directory to stay organized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erikr the names are specifically just for having more descriptive names in the output of summary cohort counts and on the graph e.g. ECG in STS (as opposed to Tensors in Reference)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. Suggest clarifying the help field, e.g. help='Name of dataset at tensors, e.g. ECG. Adds contextual detail to summary CSV and plots.'
@paolodi this arg enables the output of cross_reference
to be understandable, as it fills the x-axis label of a count plot. Much more informative for a plot x-axis label to say "ECG in STS" versus "Tensors in Reference". Just adding context to Steven's reply.
ml4cvd/arguments.py
Outdated
parser.add_argument('--tensors_join', default='partners_ecg_patientid_clean', help='Name of value in tensors to match data in reference.') | ||
parser.add_argument('--tensors_time', default='partners_ecg_datetime', help='Name of value in tensors to perform time cross-ref on. Optional') | ||
parser.add_argument('--reference_tensors', help='Either a csv or directory of hd5 containing a reference dataset.') | ||
parser.add_argument('--reference_name', default='Reference', help='Name of dataset at reference.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we really need this. Can we infer the name somehow from --reference_tensors
? The problem with these "naming" arguments tend to be a bit too generic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above. This arg is helpful. Can we make this arg and its help message clearer?
@erikr advanced time windowing!
for ICU between tAdmit and tDischarge:
for 30 days before and after some event at time tEvent:
for 30 days before tAdmit and 5 days before tDischarge:
so many options! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fantastic work! nitpicks, questions, style, etc.
I defer to Paolo for final say on "ml4cvd" conventions
ml4cvd/arguments.py
Outdated
@@ -204,6 +204,18 @@ def parse_args(): | |||
parser.add_argument('--num_workers', default=multiprocessing.cpu_count(), type=int, help="Number of workers to use for every tensor generator.") | |||
parser.add_argument('--cache_size', default=3.5e9/multiprocessing.cpu_count(), type=float, help="Tensor map cache size per worker.") | |||
|
|||
# Cross reference arguments | |||
parser.add_argument('--tensors_name', default='Tensors', help='Name of dataset at tensors.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. Suggest clarifying the help field, e.g. help='Name of dataset at tensors, e.g. ECG. Adds contextual detail to summary CSV and plots.'
@paolodi this arg enables the output of cross_reference
to be understandable, as it fills the x-axis label of a count plot. Much more informative for a plot x-axis label to say "ECG in STS" versus "Tensors in Reference". Just adding context to Steven's reply.
ml4cvd/arguments.py
Outdated
parser.add_argument('--tensors_name', default='Tensors', help='Name of dataset at tensors.') | ||
parser.add_argument('--tensors_join', default='partners_ecg_patientid_clean', help='Name of value in tensors to match data in reference.') | ||
parser.add_argument('--tensors_time', default='partners_ecg_datetime', help='Name of value in tensors to perform time cross-ref on. Optional') | ||
parser.add_argument('--reference_tensors', help='Either a csv or directory of hd5 containing a reference dataset.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paolodi Our semantics are also overloaded. "Tensor" refers to both 1) HD5 files and 2) output of TMaps. This is confusing.
We could use tensor
to refer to HD5 files, and tmap
to refer to both the output of tmaps, and the tmap structure itself, but that is confusing since tmaps return tensors :(
If so, would rename --join_tensors
→ --join_tmaps
, and --time_tensor
→ --time_tmaps
.
ml4cvd/arguments.py
Outdated
parser.add_argument('--tensors_join', default='partners_ecg_patientid_clean', help='Name of value in tensors to match data in reference.') | ||
parser.add_argument('--tensors_time', default='partners_ecg_datetime', help='Name of value in tensors to perform time cross-ref on. Optional') | ||
parser.add_argument('--reference_tensors', help='Either a csv or directory of hd5 containing a reference dataset.') | ||
parser.add_argument('--reference_name', default='Reference', help='Name of dataset at reference.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above. This arg is helpful. Can we make this arg and its help message clearer?
ml4cvd/arguments.py
Outdated
parser.add_argument('--reference_join', help='Name of value in reference to match data in tensors.') | ||
parser.add_argument('--reference_time', help='Name of value in reference to perform time cross-ref on. Optional') | ||
parser.add_argument('--reference_time_range', help='Either the name of a value in reference or an integer describing the time window relative to reference time to perform time cross-ref on. Optional') | ||
parser.add_argument('--reference_label', help='Name of value in reference to report distribution on.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's follow PEP8 line lengths. Can black
format this long help message into multiline?
What is an example of --reference_label
? Add to help message?
binwidth = 5 | ||
ax.hist(day_diffs, bins=range(day_diffs.min(), day_diffs.max() + binwidth, binwidth)) | ||
ax.set_xlabel('Days relative to event') | ||
ax.set_ylabel('Number of patients') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea for next PR: make this flexible if we are interested in number of records, rather than number of patients
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I think this is perfect, thanks so much for the changes @StevieSong and @erikr. Great job!
The only thing that I am not a huge fan of is the csv
option for --tensors
. "Philosphically", I think that we should always encourage users to provide TMAPs and thus specify helpful metainformation (e.g., a good name to use in plots, data types, category channels, mesh features etc.). At the end of the day, that's the main core feature of ML4CVD. The approach you are proposing potentially allows to bypass TMAPs altogether (e.g., if you pass CSVs for both tensors
and reference_tensors
). I understand that you might want to save time by precomputing TMAPs, and the way we allowed that in the past was through _build_tensor_from_file
in tensor_from_file.py
. The advantage is more flexibility (e.g., you can normalize, validate, specify delimiters) and you could use the metainformation also to enrich the analysis.
Anyways, I think this adds a very nice weapon to the TMAP statistics arsenal, and can be merged as is. I'd encourage, though, not to use the CSV-only option without going through the definition of a specific TMAP with _build_tensor_from_file
. In the future we may decide to prevent --tensors
to be anything else than a path to a directory containing hd5
files.
@paolodi I'll open a new issue for converting cross reference data access in csvs to tmaps, thanks for the time and review! |
* xref base * rename clean cols, escape col names * source -> tensors, add help to args * plot changes, require time range with time * add fpath to xref output, if found * cleanup logic, remove pandasql * advanced time windowing * cleanup
* xref base * rename clean cols, escape col names * source -> tensors, add help to args * plot changes, require time range with time * add fpath to xref output, if found * cleanup logic, remove pandasql * advanced time windowing * cleanup
* xref base * rename clean cols, escape col names * source -> tensors, add help to args * plot changes, require time range with time * add fpath to xref output, if found * cleanup logic, remove pandasql * advanced time windowing * cleanup
* xref base * rename clean cols, escape col names * source -> tensors, add help to args * plot changes, require time range with time * add fpath to xref output, if found * cleanup logic, remove pandasql * advanced time windowing * cleanup
* xref base * rename clean cols, escape col names * source -> tensors, add help to args * plot changes, require time range with time * add fpath to xref output, if found * cleanup logic, remove pandasql * advanced time windowing * cleanup
Create a new mode that performs cross referencing on two datasets and saves summary information, distribution of labels, and plots the occurrence of cross referenced data points relative to some time window
specify the dataset to cross reference with
tensors
argumentsspecify the dataset to use as a reference with
reference
argumentscloses #158
closes #186
closes #188
closes #200