Imbalance and noise handling #9

dhyoon0527 · 2020-09-16T14:11:17Z

Hello Ed,

First of all, thank you for this amazing work. I have two questions and would really appreciate your thoughts on them.

How do you think RETAIN handles class imbalance? It handled imbalance well on my dataset, but I couldn't work out why.
Patients often have common medical codes such as hypertension or diabetes, which can act as noise when predicting other diseases. Can RETAIN learn to give such codes less weight even though they appear frequently?

Best Regards,

mp2893 · 2020-09-17T03:55:53Z

Hi dhyoon0527,

Thanks for taking an interest in our work. As for your questions:

There is no specialized component in RETAIN designed to handle class imbalance (it uses a simple cross-entropy loss). I'm not sure why it works well on your dataset, but I'm sure other sequence models would work equally well.
Suppose you want to predict breast cancer with RETAIN. Then both hypertension and diabetes would occur about the same number of times in cases and controls (since hypertension and diabetes are not really relevant to breast cancer). In this case, RETAIN would learn that hypertension and diabetes are not useful features, and assign low attention to both. But if you are predicting kidney disease with RETAIN, then I would assume diabetes would occur much more often in cases, so RETAIN would learn to assign higher attention to diabetes.

Best,
Ed
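The case/control argument above can be checked on a dataset before training anything. The snippet below is a minimal sketch, not part of RETAIN: it only counts how often each code appears among cases versus controls. The function name `code_rate_by_label` and the toy records are hypothetical. A code with similar rates in both groups carries little signal for the label, which is the situation in which a model has no reason to attend to it.

```python
from collections import Counter


def code_rate_by_label(patients, labels):
    """Return {code: (rate_in_cases, rate_in_controls)}.

    patients: list of patients, each a list of visits, each a list of codes.
    labels:   list of 0/1 labels, one per patient (1 = case, 0 = control).
    """
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    if n_case == 0 or n_ctrl == 0:
        raise ValueError("need at least one case and one control")

    case_counts, ctrl_counts = Counter(), Counter()
    for visits, y in zip(patients, labels):
        # Count each code once per patient, however often it recurs.
        codes = {code for visit in visits for code in visit}
        (case_counts if y == 1 else ctrl_counts).update(codes)

    return {
        code: (case_counts[code] / n_case, ctrl_counts[code] / n_ctrl)
        for code in set(case_counts) | set(ctrl_counts)
    }


# Toy, made-up records: two cases followed by two controls.
patients = [
    [["hypertension"], ["diabetes"]],
    [["hypertension", "diabetes"]],
    [["hypertension"]],
    [["hypertension"], ["cough"]],
]
labels = [1, 1, 0, 0]

rates = code_rate_by_label(patients, labels)
print(rates["hypertension"])  # (1.0, 1.0): equally common, uninformative
print(rates["diabetes"])      # (1.0, 0.0): concentrated in cases
```

This is only a data diagnostic. It says nothing about what attention weights a trained RETAIN model will actually produce.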

dhyoon0527 · 2020-09-18T13:18:42Z

Ed, thank you for your answer!

For an imbalanced dataset, say with a 9:1 ratio, I see the model predict final scores like (0.9, 0.1) in general, and even (0.85, 0.15) for positive events. I expected (or hoped) that for those rare positive events it would show something like (0.2, 0.8).

I tried scaling only the positive risk scores. Could you explain why this happens, and do you have any ideas for solving it?
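One possible contributing factor, offered as a sketch rather than a diagnosis: under plain cross-entropy, the score that minimizes the loss when the inputs carry little signal is the base rate, so a 9:1 dataset pulls positive scores toward 0.1. The snippet finds the best constant prediction by grid search and then repeats it with a class weight on the positives. Class weighting is a generic remedy and is not something RETAIN provides. The names `mean_cross_entropy`, `best_constant` and `pos_weight` are hypothetical.

```python
import math


def mean_cross_entropy(p, n_pos, n_neg, pos_weight=1.0):
    """Mean binary cross-entropy when every example gets positive score p."""
    loss = -(pos_weight * n_pos * math.log(p) + n_neg * math.log(1.0 - p))
    return loss / (n_pos + n_neg)


def best_constant(n_pos, n_neg, pos_weight=1.0, steps=999):
    """Grid-search the constant score in (0, 1) with the lowest loss."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda p: mean_cross_entropy(p, n_pos, n_neg, pos_weight))


# 9:1 imbalance: 90 negatives, 10 positives.
unweighted = best_constant(n_pos=10, n_neg=90)
weighted = best_constant(n_pos=10, n_neg=90, pos_weight=9.0)
print(unweighted)  # 0.1, the base rate
print(weighted)    # 0.5, positives now count as much as negatives
```

Weighting shifts the scores upward but also means they no longer read as calibrated probabilities, so whether it helps depends on how the scores are used downstream.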

I saw that your other paper (Doctor AI) tries to predict which code(s) will appear in the next visits, and when. I'm trying to use this idea to predict the probability that a patient develops disease X within certain time buckets (within 90 days or [-180, 90) days), so discretization might be needed. What do you think is the correct pipeline for RETAIN on this kind of problem?

I currently split each patient's visits cumulatively: [[a]] / [[a],[b,c]] / [[a],[b,c],[d]], with the following time buckets: [1,0,0] / [0,1,0] / [0,0,1], so that RETAIN can explain the usefulness of codes/visits in multi-visit cases.
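The cumulative split described here can be sketched as follows. This is a minimal illustration, not code from the RETAIN repository. The helper names (`cumulative_prefixes`, `one_hot`, `build_examples`) and the `bucket_ids` argument are hypothetical, and how each prefix's time bucket is derived from visit dates is assumed to be decided elsewhere.

```python
def cumulative_prefixes(visits):
    """Return [first visit], [first two visits], ... for one patient."""
    return [visits[: i + 1] for i in range(len(visits))]


def one_hot(index, size):
    """One-hot list of length `size` with a 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("bucket index out of range")
    return [1 if i == index else 0 for i in range(size)]


def build_examples(visits, bucket_ids, n_buckets):
    """Pair each cumulative prefix with its one-hot time-bucket label.

    bucket_ids[i] is the (assumed, precomputed) time bucket for the prefix
    ending at visit i; deriving it from dates is left to the caller.
    """
    if len(visits) != len(bucket_ids):
        raise ValueError("need one bucket id per visit")
    prefixes = cumulative_prefixes(visits)
    return [(p, one_hot(b, n_buckets)) for p, b in zip(prefixes, bucket_ids)]


visits = [["a"], ["b", "c"], ["d"]]
examples = build_examples(visits, bucket_ids=[0, 1, 2], n_buckets=3)
for prefix, label in examples:
    print(prefix, label)
# [['a']] [1, 0, 0]
# [['a'], ['b', 'c']] [0, 1, 0]
# [['a'], ['b', 'c'], ['d']] [0, 0, 1]
```

Each prefix becomes its own training example, so a patient with n visits contributes n sequences of increasing length.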

Best Regards,
