cross reference mode #235

StevenSong · 2020-04-26T23:43:39Z

Create a new mode that cross-references two datasets, saves summary information and the distribution of labels, and plots the occurrence of cross-referenced data points relative to some time window.

Specify the dataset to cross-reference with the tensors arguments.
Specify the dataset to use as a reference with the reference arguments.

closes #158
closes #186
closes #188
closes #200

ml4cvd/arguments.py

ml4cvd/plots.py

erikr · 2020-04-28T22:59:30Z

ml4cvd/recipes.py

+from ml4cvd.explorations import sample_from_char_model, mri_dates, ecg_dates, predictions_to_pngs, sort_csv
+from ml4cvd.explorations import tabulate_correlations_of_tensors, test_labels_to_label_map, infer_with_pixels, explore
+from ml4cvd.explorations import plot_heatmap_of_tensors, plot_while_learning, plot_histograms_of_tensors_in_pdf, cross_reference


Can you format these import statements so they are cleaner? I like the style set by Black:

from ml4sts.models import (
    get_feature_values,
    train_model_parallel,
    evaluate_predictions,
    format_results_as_df,
    initialize_model,
    threshold_predictions,
    generate_crossfold_indices,
    add_fold_and_model_info_to_results,
)

@paolodi what is the convention for ML4CVD?

erikr

A few comments. I will provide a more detailed review soon. Can you please request a review from a Broadie? Thanks, and outstanding work.

StevenSong · 2020-04-29T00:52:45Z

@paolodi would you mind taking a look at this? thank you for the time to review so many of my PRs 😅

erikr · 2020-04-29T01:41:04Z

I think it would be good if cross-ref can take

--window_start -365
--window_end 0

or

--window_start -30
--window_end 30

or

--window_start 0
--window_end 60

etc

if the value is missing, default is 0
if window_start > window_end, raise ValueError
if both are missing, could xref on MRNs and ignore datetimes
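
A minimal sketch of the rules proposed above, assuming `--window_start` and `--window_end` are integer day offsets relative to the reference event. The flag names come from this comment and the helper `parse_window` is hypothetical; neither is claimed to be the merged interface.

```python
import argparse


def parse_window(argv=None):
    """Return (start, end) in days relative to the event, or None to match on MRN only."""
    parser = argparse.ArgumentParser()
    # Hypothetical flags taken from the suggestion above; not the merged API.
    parser.add_argument('--window_start', type=int, default=None)
    parser.add_argument('--window_end', type=int, default=None)
    args = parser.parse_args(argv)

    # Both bounds missing: ignore datetimes and cross-reference on MRN alone.
    if args.window_start is None and args.window_end is None:
        return None

    # A single missing bound defaults to 0 (the day of the event).
    start = 0 if args.window_start is None else args.window_start
    end = 0 if args.window_end is None else args.window_end
    if start > end:
        raise ValueError(f'window_start ({start}) must not exceed window_end ({end})')
    return start, end
```

For example, `parse_window(['--window_start', '-365', '--window_end', '0'])` yields a one-year look-back window, while `parse_window([])` signals an MRN-only cross-reference.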

erikr · 2020-04-29T01:55:57Z

We also need the functionality to cross-reference between two dates that are specified by keys in args.reference_tensors.

If the CSV at args.reference_tensors contains:

| mrn   | dt_event_1 | dt_event_2 |
|-------|------------|------------|
| 12345 | 01/22/2019 | 04/01/2019 |
| 45241 | 11/05/2015 | 03/14/2016 |
| 87277 | 08/08/2016 | 11/12/2016 |

We would like to count an ECG as a "hit" if it falls between (but not on; be conservative) dt_event_1 and dt_event_2.

Several possible windows could be specified:

[args.reference_label1, args.reference_label2]
[args.reference_label1, d days]
[d days, args.reference_label1]
[args.reference_label2, d days]
[d days, args.reference_label2]

Lastly, we need an option args.cross_ref_match_before_and_after_event. If True, a patient counts as a hit only when ECGs are found in both of these time periods:

[d days, args.reference_label1]
[args.reference_label1, d days]

For context, this feature is important for a project where we want to assess cohort ∩ ECG before vs. after the initiation of an anti-cancer therapy, and are interested in patients who have paired ECGs pre and post.

Is this feasible?
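
A rough sketch of the two checks requested above, using only the standard library. The helper names `is_hit` and `has_paired_ecgs` are hypothetical, and the date format is assumed from the example table; this is one way to express the strict "between but not on" rule and the before-and-after pairing, not the implementation in this PR.

```python
from datetime import datetime, timedelta

DATE_FORMAT = '%m/%d/%Y'  # assumed from the example table (e.g. 01/22/2019)


def parse_date(text):
    """Parse a date string in the assumed MM/DD/YYYY format."""
    return datetime.strptime(text, DATE_FORMAT)


def is_hit(ecg_date, start, end):
    """True only if the ECG falls strictly between start and end (endpoints excluded)."""
    return start < ecg_date < end


def has_paired_ecgs(ecg_dates, event, days):
    """True if at least one ECG falls in the d days before the event and one in the d days after."""
    window = timedelta(days=days)
    before = any(is_hit(d, event - window, event) for d in ecg_dates)
    after = any(is_hit(d, event, event + window) for d in ecg_dates)
    return before and after
```

With the first row of the table, an ECG dated 02/15/2019 is a hit for MRN 12345, whereas an ECG dated exactly 01/22/2019 is not, because the endpoints are excluded.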

StevenSong · 2020-05-01T13:34:42Z

We also need the functionality to cross-reference between two dates that are specified by keys in args.reference_tensors.

we would like to count an ECG as a "hit" if it falls between (but not on; be conservative) dt_event_1 and dt_event_2.

Lastly, we need an option args.cross_ref_match_before_and_after_event. If True, look for hits in two time periods:

Edit: old suggestion in this comment outdated, see updated #235 (comment)

paolodi

@StevieSong great job on this PR! I really like the new functionality, and my main comments are suggestions for making it blend even better with the rest of the codebase in terms of argument naming. Also, I wonder whether using df.merge would let us avoid one more dependency and generalize the join to extend over multiple TMAPs. This might also help consolidate the filtering, which is currently split into two parts.

Again great job overall!

docker/vm_boot_images/config/tensorflow-requirements.txt

ml4cvd/arguments.py

paolodi · 2020-05-01T16:13:04Z

ml4cvd/arguments.py

@@ -204,6 +204,18 @@ def parse_args():
    parser.add_argument('--num_workers', default=multiprocessing.cpu_count(), type=int, help="Number of workers to use for every tensor generator.")
    parser.add_argument('--cache_size', default=3.5e9/multiprocessing.cpu_count(), type=float, help="Tensor map cache size per worker.")

+    # Cross reference arguments
+    parser.add_argument('--tensors_name', default='Tensors', help='Name of dataset at tensors.')


Not sure we really need this. Can we infer the name somehow from --tensors? The problem with these "naming" arguments is that they tend to be a bit too generic.

This would have to be pretty fancy. For example, to cross reference the tensorized ECG hd5 files, we use the path /data/partners_ecg/mgh/explore/tensors_all_union.csv or /data/partners_ecg/mgh/hd5/, and for reference_tensors we'd use something like /data/sts/mgh-all-features-labels.csv if xref on the STS cohort -- extracting some name is doable, but it would either be difficult or give a not-so-pretty name in the output.

Maybe we don't need a name? tensors and references are decent enough descriptors.

args.id lets user name output directory to stay organized

@erikr the names are specifically just for having more descriptive names in the output of summary cohort counts and on the graph e.g. ECG in STS (as opposed to Tensors in Reference)

Understood. Suggest clarifying the help field, e.g. help='Name of dataset at tensors, e.g. ECG. Adds contextual detail to summary CSV and plots.'

@paolodi this arg enables the output of cross_reference to be understandable, as it fills the x-axis label of a count plot. Much more informative for a plot x-axis label to say "ECG in STS" versus "Tensors in Reference". Just adding context to Steven's reply.
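
To illustrate the point of this thread, here is a small sketch of how the two naming arguments could feed a plot label. The `--tensors_name` help text is the one suggested above; the `--reference_name` help text and the `cohort_label` helper are assumptions made for this example only.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--tensors_name',
    default='Tensors',
    help='Name of dataset at tensors, e.g. ECG. Adds contextual detail to summary CSV and plots.',
)
parser.add_argument(
    '--reference_name',
    default='Reference',
    # Assumed wording, mirroring the suggestion for --tensors_name.
    help='Name of dataset at reference, e.g. STS. Adds contextual detail to summary CSV and plots.',
)


def cohort_label(args):
    """Hypothetical helper: build the x-axis label for the count plot."""
    return f'{args.tensors_name} in {args.reference_name}'
```

With the defaults, the label reads "Tensors in Reference"; passing `--tensors_name ECG --reference_name STS` gives "ECG in STS".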

ml4cvd/arguments.py

paolodi · 2020-05-01T16:18:48Z

ml4cvd/arguments.py

+    parser.add_argument('--tensors_join', default='partners_ecg_patientid_clean', help='Name of value in tensors to match data in reference.')
+    parser.add_argument('--tensors_time', default='partners_ecg_datetime', help='Name of value in tensors to perform time cross-ref on. Optional')
+    parser.add_argument('--reference_tensors', help='Either a csv or directory of hd5 containing a reference dataset.')
+    parser.add_argument('--reference_name', default='Reference', help='Name of dataset at reference.')


Not sure we really need this. Can we infer the name somehow from --reference_tensors? The problem with these "naming" arguments is that they tend to be a bit too generic.

See comment above. This arg is helpful. Can we make this arg and its help message clearer?

ml4cvd/explorations.py

ml4cvd/plots.py

StevenSong · 2020-05-04T22:56:49Z

@erikr advanced time windowing!
Specify the time tensor and offsets to use - for STS 30 day pre-op

--reference_start_time_tensor surgdt -30
--reference_end_time_tensor surgdt

for ICU between tAdmit and tDischarge:

--reference_start_time_tensor tAdmit
--reference_end_time_tensor tDischarge

for 30 days before and after some event at time tEvent:

--reference_start_time_tensor tEvent -30
--reference_end_time_tensor tEvent 30

for 30 days before tAdmit and 5 days before tDischarge:

--reference_start_time_tensor tAdmit -30
--reference_end_time_tensor tDischarge -5

so many options!
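
The examples above suggest that each flag takes a time tensor name followed by an optional day offset. The sketch below shows one way such arguments could be parsed and resolved; `nargs='+'`, `parse_time_spec`, and `resolve_window` are assumptions for illustration and may differ from the code in this PR.

```python
import argparse
from datetime import datetime, timedelta


def build_parser():
    parser = argparse.ArgumentParser()
    # Assumed shape: a tensor name, optionally followed by an integer day offset.
    parser.add_argument('--reference_start_time_tensor', nargs='+')
    parser.add_argument('--reference_end_time_tensor', nargs='+')
    return parser


def parse_time_spec(values):
    """Split ['tAdmit', '-30'] into ('tAdmit', -30); the offset defaults to 0 days."""
    if not 1 <= len(values) <= 2:
        raise ValueError(f'expected a name and an optional day offset, got {values}')
    name = values[0]
    offset_days = int(values[1]) if len(values) == 2 else 0
    return name, offset_days


def resolve_window(row, start_spec, end_spec):
    """Turn (name, offset) pairs into concrete datetimes for one reference row."""
    start = row[start_spec[0]] + timedelta(days=start_spec[1])
    end = row[end_spec[0]] + timedelta(days=end_spec[1])
    if start > end:
        raise ValueError(f'window start {start} is after window end {end}')
    return start, end
```

Here `row` stands for a mapping from time tensor name to a `datetime`, for example one row of the reference dataset.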

erikr

fantastic work! nitpicks, questions, style, etc.

I defer to Paolo for final say on "ml4cvd" conventions

erikr · 2020-05-05T12:44:40Z

ml4cvd/arguments.py

@@ -204,6 +204,18 @@ def parse_args():
    parser.add_argument('--num_workers', default=multiprocessing.cpu_count(), type=int, help="Number of workers to use for every tensor generator.")
    parser.add_argument('--cache_size', default=3.5e9/multiprocessing.cpu_count(), type=float, help="Tensor map cache size per worker.")

+    # Cross reference arguments
+    parser.add_argument('--tensors_name', default='Tensors', help='Name of dataset at tensors.')


Understood. Suggest clarifying the help field, e.g. help='Name of dataset at tensors, e.g. ECG. Adds contextual detail to summary CSV and plots.'

@paolodi this arg enables the output of cross_reference to be understandable, as it fills the x-axis label of a count plot. Much more informative for a plot x-axis label to say "ECG in STS" versus "Tensors in Reference". Just adding context to Steven's reply.

erikr · 2020-05-05T12:54:51Z

ml4cvd/arguments.py

+    parser.add_argument('--tensors_name', default='Tensors', help='Name of dataset at tensors.')
+    parser.add_argument('--tensors_join', default='partners_ecg_patientid_clean', help='Name of value in tensors to match data in reference.')
+    parser.add_argument('--tensors_time', default='partners_ecg_datetime', help='Name of value in tensors to perform time cross-ref on. Optional')
+    parser.add_argument('--reference_tensors', help='Either a csv or directory of hd5 containing a reference dataset.')


@paolodi Our semantics are also overloaded. "Tensor" refers to both 1) HD5 files and 2) output of TMaps. This is confusing.

We could use tensor to refer to HD5 files, and tmap to refer to both the output of tmaps, and the tmap structure itself, but that is confusing since tmaps return tensors :(

If so, would rename --join_tensors → --join_tmaps, and --time_tensor → --time_tmaps.

erikr · 2020-05-05T12:57:03Z

ml4cvd/arguments.py

+    parser.add_argument('--tensors_join', default='partners_ecg_patientid_clean', help='Name of value in tensors to match data in reference.')
+    parser.add_argument('--tensors_time', default='partners_ecg_datetime', help='Name of value in tensors to perform time cross-ref on. Optional')
+    parser.add_argument('--reference_tensors', help='Either a csv or directory of hd5 containing a reference dataset.')
+    parser.add_argument('--reference_name', default='Reference', help='Name of dataset at reference.')


See comment above. This arg is helpful. Can we make this arg and its help message clearer?

erikr · 2020-05-05T13:04:53Z

ml4cvd/arguments.py

+    parser.add_argument('--reference_join', help='Name of value in reference to match data in tensors.')
+    parser.add_argument('--reference_time', help='Name of value in reference to perform time cross-ref on. Optional')
+    parser.add_argument('--reference_time_range', help='Either the name of a value in reference or an integer describing the time window relative to reference time to perform time cross-ref on. Optional')
+    parser.add_argument('--reference_label', help='Name of value in reference to report distribution on.')


Let's follow PEP8 line lengths. Can black format this long help message into multiline?

What is an example of --reference_label? Add to help message?
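
As an illustration of the line-length request, a long help string can be split with implicit string concatenation inside parentheses, which keeps each line short without changing the text. The constant name below is hypothetical; the help text itself is copied from the diff.

```python
import argparse

# Adjacent string literals are concatenated, so the help text is unchanged.
REFERENCE_TIME_RANGE_HELP = (
    'Either the name of a value in reference or an integer describing '
    'the time window relative to reference time to perform time '
    'cross-ref on. Optional'
)

parser = argparse.ArgumentParser()
parser.add_argument(
    '--reference_time_range',
    help=REFERENCE_TIME_RANGE_HELP,
)
```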

ml4cvd/explorations.py

erikr · 2020-05-05T13:12:12Z

ml4cvd/plots.py

+    binwidth = 5
+    ax.hist(day_diffs, bins=range(day_diffs.min(), day_diffs.max() + binwidth, binwidth))
+    ax.set_xlabel('Days relative to event')
+    ax.set_ylabel('Number of patients')


Idea for next PR: make this flexible if we are interested in number of records, rather than number of patients
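
One possible shape for that follow-up, sketched with the standard library only. `bin_edges` reproduces the `range(...)` expression from the diff; `bin_counts` and its `per_patient` switch are hypothetical and assume each record is a `(patient_id, day_diff)` pair with integer day differences.

```python
from collections import Counter


def bin_edges(day_diffs, binwidth=5):
    """Same edges as range(min, max + binwidth, binwidth) in the diff above."""
    return list(range(min(day_diffs), max(day_diffs) + binwidth, binwidth))


def bin_counts(records, binwidth=5, per_patient=True):
    """Count records per bin, or unique patients per bin when per_patient is True.

    Bins are half-open, [edge, edge + binwidth). matplotlib closes the last bin,
    so counts can differ for a value that lands exactly on the final edge.
    """
    lowest = min(day_diff for _, day_diff in records)
    seen = set()
    counts = Counter()
    for patient_id, day_diff in records:
        left_edge = lowest + ((day_diff - lowest) // binwidth) * binwidth
        if per_patient:
            if (patient_id, left_edge) in seen:
                continue  # this patient is already counted in this bin
            seen.add((patient_id, left_edge))
        counts[left_edge] += 1
    return dict(counts)
```

A patient with two ECGs in the same five-day bin contributes one count with `per_patient=True` and two counts with `per_patient=False`.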

paolodi

Ok I think this is perfect, thanks so much for the changes @StevieSong and @erikr. Great job!

The only thing that I am not a huge fan of is the csv option for --tensors. "Philosophically", I think that we should always encourage users to provide TMAPs and thus specify helpful metainformation (e.g., a good name to use in plots, data types, category channels, mesh features, etc.). At the end of the day, that's the core feature of ML4CVD. The approach you are proposing potentially allows users to bypass TMAPs altogether (e.g., if you pass CSVs for both tensors and reference_tensors). I understand that you might want to save time by precomputing TMAPs, and the way we allowed that in the past was through _build_tensor_from_file in tensor_from_file.py. The advantage is more flexibility (e.g., you can normalize, validate, specify delimiters), and you could also use the metainformation to enrich the analysis.

Anyway, I think this adds a very nice weapon to the TMAP statistics arsenal, and can be merged as is. I'd encourage, though, not using the CSV-only option without first defining a specific TMAP with _build_tensor_from_file. In the future we may decide to prevent --tensors from being anything other than a path to a directory containing hd5 files.

StevenSong · 2020-05-07T21:14:52Z

@paolodi I'll open a new issue for converting cross reference data access in csvs to tmaps, thanks for the time and review!

* xref base
* rename clean cols, escape col names
* source -> tensors, add help to args
* plot changes, require time range with time
* add fpath to xref output, if found
* cleanup logic, remove pandasql
* advanced time windowing
* cleanup

StevenSong added 3 commits April 23, 2020 11:35

xref base

def6c9d

rename clean cols, escape col names

c20df7f

source -> tensors, add help to args

fa18947

StevenSong added the enhancement New feature or request label Apr 26, 2020

StevenSong requested a review from erikr April 26, 2020 23:46

erikr reviewed Apr 28, 2020

erikr reviewed Apr 28, 2020

erikr reviewed Apr 28, 2020

erikr reviewed Apr 28, 2020

erikr suggested changes Apr 28, 2020

erikr assigned StevenSong Apr 28, 2020

plot changes, require time range with time

ccde4c5

StevenSong requested a review from paolodi April 29, 2020 00:51

paolodi suggested changes May 1, 2020

StevenSong added 4 commits May 1, 2020 15:18

add fpath to xref output, if found

04426ca

Merge branch 'master' into ss_partners_xref

8ea2241

cleanup logic, remove pandasql

adcb6ee

advanced time windowing

77f691e

StevenSong requested review from paolodi and erikr May 4, 2020 22:57

erikr suggested changes May 5, 2020

cleanup

6ea3424

paolodi approved these changes May 7, 2020

erikr approved these changes May 7, 2020

StevenSong merged commit 3f20b3a into master May 7, 2020

StevenSong deleted the ss_partners_xref branch May 7, 2020 21:16

StevenSong mentioned this pull request May 7, 2020

convert cross reference access to data in csv to use a TMap #256

Closed
