Imbalance and noise handling #9

Open
dhyoon0527 opened this issue Sep 16, 2020 · 2 comments

Comments

@dhyoon0527

Hello Ed,

First of all, thank you for this amazing work. I would really appreciate it if you could answer the following questions.

  1. How do you think RETAIN handles class imbalance? It handled imbalance well on my dataset, but I couldn't work out why it worked so well.

  2. Patients can have common medical codes like hypertension or diabetes, which can be regarded as noise when predicting other diseases. Can RETAIN learn to give such codes less weight even though they appear a lot?

Best Regards,

@mp2893
Owner

mp2893 commented Sep 17, 2020

Hi dhyoon0527,

Thanks for taking an interest in our work. As for your questions:

  1. There is no specialized component in RETAIN that is designed to handle class imbalance (it uses a simple cross-entropy loss). I'm not sure why it would work well on your dataset, but I'm sure other sequence models would work equally well.
  2. Suppose you want to predict breast cancer with RETAIN. Then both hypertension and diabetes would occur about the same number of times in both cases and controls (since hypertension and diabetes are not really relevant to breast cancer). In this case, RETAIN would learn that hypertension and diabetes are not useful features, and assign low attention to both. But if you are predicting kidney disease with RETAIN, then I would assume diabetes would occur much more often in cases. So RETAIN would learn to assign higher attention to diabetes.
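
The point in item 2 can be illustrated with a small sketch. This is not RETAIN's attention mechanism; it only measures the signal such a model can pick up on, namely how differently a code is distributed between cases and controls. The function name `code_log_odds_ratio`, the toy patients, and the `smoothing` constant are all assumptions made for illustration.

```python
import math

def code_log_odds_ratio(patients, labels, code, smoothing=0.5):
    """Log odds ratio of `code` appearing in cases (label 1) vs controls (label 0).

    patients: list of patients; each patient is a list of visits,
              and each visit is a list of medical codes.
    smoothing: small constant added to every cell to avoid division by zero
               (an illustrative choice, not taken from the thread).
    """
    # counts[label] = [patients with the code, patients without the code]
    counts = {0: [0, 0], 1: [0, 0]}
    for visits, label in zip(patients, labels):
        has_code = any(code in visit for visit in visits)
        counts[label][0 if has_code else 1] += 1
    case_with, case_without = (n + smoothing for n in counts[1])
    ctrl_with, ctrl_without = (n + smoothing for n in counts[0])
    return math.log((case_with * ctrl_without) / (case_without * ctrl_with))

# Toy data (hypothetical): 4 cases followed by 4 controls.
patients = [
    [["hypertension"], ["diabetes"]],
    [["hypertension", "diabetes"]],
    [["diabetes"], ["hypertension"]],
    [["diabetes"]],
    [["hypertension"]],
    [["hypertension"], ["flu"]],
    [["hypertension", "diabetes"]],
    [["flu"]],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypertension is frequent but equally common in both groups: no signal.
hypertension_signal = code_log_odds_ratio(patients, labels, "hypertension")
# Diabetes is concentrated in the cases: a strong positive signal.
diabetes_signal = code_log_odds_ratio(patients, labels, "diabetes")
print(hypertension_signal, diabetes_signal)
```

In this toy data a frequent code carries no information when it is equally frequent in cases and controls, which matches the reasoning above for why a model would learn to down-weight it.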

Best,
Ed

@dhyoon0527
Author

dhyoon0527 commented Sep 18, 2020

Ed, Thank you for your answer!

  1. For an imbalanced dataset, let's say with a 9:1 ratio, I see the model predict final scores like 0.9 0.1 in general, and even 0.85 0.15 for positive events. I expected (or hoped) that for those rare positive events it would show something like 0.2 0.8.

I tried scaling only the positive risk scores, but could you explain why this happens? I'm also wondering if you have any ideas for solving it.
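
One general observation, not confirmed anywhere in this thread: with an unweighted cross-entropy loss, the best constant prediction is the base rate, so scores clustered near 0.9 0.1 are what a weakly discriminating model would produce on 9:1 data. A common remedy is to up-weight the positive class in the loss. The sketch below uses plain Python; `weighted_bce` and `balanced_pos_weight` are hypothetical helper names, and nothing here comes from the RETAIN code base.

```python
import math

def weighted_bce(probs, labels, pos_weight=1.0):
    """Mean binary cross-entropy with the positive term scaled by pos_weight."""
    eps = 1e-12  # clamp probabilities to keep log() finite
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def balanced_pos_weight(labels):
    """Weight that makes positives and negatives contribute equally overall."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos

# Toy 9:1 dataset: nine negatives, one positive.
labels = [0] * 9 + [1]
pos_weight = balanced_pos_weight(labels)

# Compare two constant predictions for the positive-class probability.
base_rate = [0.1] * len(labels)
neutral = [0.5] * len(labels)

plain_base = weighted_bce(base_rate, labels)
plain_neutral = weighted_bce(neutral, labels)
weighted_base = weighted_bce(base_rate, labels, pos_weight)
weighted_neutral = weighted_bce(neutral, labels, pos_weight)
print(plain_base, plain_neutral, weighted_base, weighted_neutral)
```

Without weighting, predicting the base rate gives the lower loss; with the balanced weight, the neutral 0.5 prediction does. The weighted scores are no longer calibrated probabilities, so this trades calibration for sensitivity to the rare class.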

  2. I saw that your other paper (Doctor AI) tries to predict which code(s) will appear in the next visits, and when. I'm trying to use this idea to predict the probability that a patient develops disease X within certain time buckets (within 90 days or [-180, 90) days), so discretization might be needed. What do you think is the correct pipeline for RETAIN on such a problem?

I currently split each patient's visits cumulatively: [[a]] / [[a],[b,c]] / [[a],[b,c],[d]], with the following time buckets: [1,0,0] / [0,1,0] / [0,0,1], and RETAIN would then explain the usefulness of codes/visits in the multi-visit cases.
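
The cumulative split described above can be sketched as follows. The function name `cumulative_samples` and its arguments are hypothetical, and the bucket index assigned to each prefix is supplied by the caller, because how a prefix maps to a time bucket depends on the dataset.

```python
def cumulative_samples(visits, bucket_per_prefix, num_buckets):
    """Build (visit prefix, one-hot time bucket) pairs from one patient's visits.

    visits: list of visits, each a list of codes, in chronological order.
    bucket_per_prefix: bucket index for each prefix; entry i labels visits[:i + 1].
    num_buckets: length of the one-hot label vector.
    """
    if len(bucket_per_prefix) != len(visits):
        raise ValueError("need exactly one bucket index per visit prefix")
    samples = []
    for end, bucket in enumerate(bucket_per_prefix, start=1):
        if not 0 <= bucket < num_buckets:
            raise ValueError(f"bucket index {bucket} is out of range")
        prefix = [list(visit) for visit in visits[:end]]  # copy, do not alias
        one_hot = [1 if i == bucket else 0 for i in range(num_buckets)]
        samples.append((prefix, one_hot))
    return samples

# The example from this comment: three visits, three time buckets.
visits = [["a"], ["b", "c"], ["d"]]
samples = cumulative_samples(visits, bucket_per_prefix=[0, 1, 2], num_buckets=3)
for prefix, label in samples:
    print(prefix, label)
```

Each prefix becomes its own training sequence, so a patient with three visits contributes three samples.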

Best Regards,
