Tuesday, April 09, 2013

Imbalanced predictions from an imbalanced data set

This post is more of a question for you. Any thoughts or ideas are more than welcome.

You may be familiar with machine learning classifiers (including logistic regression).
A classifier builds a prediction model from training data (in most cases, historical data) and then uses that model to predict the response of a system, or its class.

Classifiers are used in a wide range of applications, such as image recognition, bioinformatics, marketing (to find the customers who will make a purchase), finance (to estimate the risk of default), and so on.

However, because these models are typically trained to maximize overall accuracy, a problem arises when the data are highly imbalanced, which is common in practice.

What do we mean by 'imbalanced'? In a binary classification problem, it means that most of the responses in the data set are 0 and only a very small number are 1.

My current work is about a 'descriptive' Accept/Decline (A/D) decision model.
What kind of acceptance model?

I am working on a simulation model for evaluating organ allocation policies, SAM (Simulated Allocation Method).

In SAM, there is an A/D model. Why?
Surprisingly, more than 50% of the time the first candidate (recipient/patient) says, "No, I am not going to take that organ. I am going to wait for a better one." The offer then goes to the next candidate, and the next, until one of them says 'Yes' or the OPO (Organ Procurement Organization) gives up.

So, in the simulation, it is important to know who would say 'Yes' and take the organ.
That's why we are working on a descriptive model rather than a prescriptive one.

Imbalanced data: Since the offers stop at the first Yes, most of the responses (Y/N or A/D) in the data set are 'No's. Ex) N, N, N, N, N, Y (Stop)

With data this imbalanced (in our case the ratio is 19.5:1), a prediction model tends to ignore the minority label (Y). The predictor will be very accurate on the majority but very poor on the minority. Even a dumb predictor that predicts everything as No reaches a high accuracy (around 95%).
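
To make the 95% figure concrete, here is a minimal sketch of the arithmetic. It assumes only the 19.5:1 ratio quoted above; the function name is made up for illustration.

```python
# Class ratio taken from the post: 19.5 "No" responses per "Yes".
no_per_yes = 19.5

def majority_baseline_accuracy(no_per_yes):
    """Accuracy of a predictor that always answers 'No'.

    It is right on every 'No' and wrong on every 'Yes', so its accuracy
    is simply the share of 'No' responses in the data.
    """
    return no_per_yes / (no_per_yes + 1.0)

accuracy = majority_baseline_accuracy(no_per_yes)
print(f"always-No accuracy: {accuracy:.1%}")  # 19.5 / 20.5
```

The same predictor finds none of the Yes responses, which is why accuracy alone says little here.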

Of course, there are several techniques to correct this imbalance, and measures other than plain % accuracy. Under-sampling, over-sampling, and different error costs can all be used. What all of them do is emphasize the minority. By doing so, we get a prediction model that is somewhat good at predicting both the majority and the minority. You can find a lot of literature dealing with these techniques.
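
As a rough illustration of how these three techniques emphasize the minority, here is a small sketch using only the Python standard library. The helper names and the toy data are assumptions for illustration, not the code used in our project, and real work would normally rely on an existing library.

```python
import random

def undersample(labels, rng):
    """Keep every minority (1) index plus an equal-size random subset of majority (0) indices."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return sorted(pos + rng.sample(neg, len(pos)))

def oversample(labels, rng):
    """Keep every index and redraw minority indices with replacement until the classes match."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return sorted(pos + neg + extra)

def class_weights(labels):
    """Error-cost weights inversely proportional to class frequency."""
    n = len(labels)
    n_pos = sum(labels)
    return {0: n / (2 * (n - n_pos)), 1: n / (2 * n_pos)}

rng = random.Random(0)           # fixed seed so the sketch is repeatable
labels = [1] * 20 + [0] * 390    # toy data at the 19.5:1 ratio
under = undersample(labels, rng)
over = oversample(labels, rng)
weights = class_weights(labels)

print(len(under), len(over), weights)
```

All three end up treating Yes and No as if they were equally common, which is exactly the emphasis described above.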

Here is where the problem comes in. Because our model was built with a technique that emphasizes the minority, it has a stronger tendency to predict the minority label (Y) than it should.
Earlier, I said we have 19.5 No labels per Yes label. But with these techniques we predict about 3~4 times more Yes labels; the predicted label ratio becomes about 6:1.
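
For reference, the two ratios can be compared directly. This is only a sketch of the arithmetic, assuming that "6:1" means six predicted No labels per predicted Yes; the function name is made up.

```python
def yes_share(no_per_yes):
    """Fraction of all labels that are 'Yes' when there are no_per_yes 'No's per 'Yes'."""
    return 1.0 / (no_per_yes + 1.0)

actual = yes_share(19.5)       # share of Yes in the data
predicted = yes_share(6.0)     # share of Yes among the predictions
inflation = predicted / actual

print(f"actual Yes share:    {actual:.1%}")
print(f"predicted Yes share: {predicted:.1%}")
print(f"predicted / actual:  {inflation:.2f}")
```

With exactly 6:1 the factor comes out a little under 3; the 3~4 range quoted above reflects what we see across our runs.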

This emphasized prediction model might be fine for some purposes, like marketing. When a firm sends out catalogs, it may use a prediction model to find the customers who will buy through the catalog. It doesn't matter much if the firm sends out more catalogs than there are actual purchases.

But in some applications, it is important to predict the labels (responses) in the right proportion.
We are developing an evaluation method that is specific to our problem, but it will not work for general cases. (If you are interested in my current work, come to Chicago in June, at the INFORMS Healthcare conference ^^)

And I could not find any literature that tells me what to do with a model that has this emphasized tendency.

What can we do with imbalanced data when the distorted tendency is not good for the problem?


  1. One of my friends complained that he cannot post a comment on my blog.
    If you have a problem posting your comment, please let me know by email or Twitter.

  2. My first thought is to ask why, in the context of your problem, it is important to know who will accept an organ. I suspect it has to do with the probability that an organ goes to waste. If you can turn that into expected costs for false positives and false negatives, you can form an expected cost function for prediction error and look for the fit that minimizes it over the training sample.

  3. Thanks for the comment, Paul. I didn't explain well enough why it matters who accepts an organ. It is because the LYFT (Life Years gained from Transplant) depends on the matching characteristics of the donor and the patient. We want to find patients who are at least similar, in terms of characteristics, to the real patients who accepted.
    And, of course, the discard rate is one of the most important statistics here. But we exclude the discard cases because a discard happens when the OPO gives up. Unfortunately, I have no idea how they decide when to give up.
    Your idea of costs for false negatives and false positives is great. However, we don't know what the costs should be, and our client doesn't know either.
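
The expected-cost idea from comment 2 could be sketched roughly as follows. The scores, labels, and both cost values are made up for illustration (the real costs are unknown, as noted above), and the function names are hypothetical.

```python
def expected_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Average misclassification cost when predicting Yes for score >= threshold."""
    total = 0.0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            total += cost_fp      # predicted Yes, actually No
        elif predicted == 0 and label == 1:
            total += cost_fn      # predicted No, actually Yes
    return total / len(labels)

def best_threshold(scores, labels, cost_fp, cost_fn):
    """Try each observed score (plus 'never predict Yes') and keep the cheapest cut-off."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: expected_cost(scores, labels, t, cost_fp, cost_fn),
    )

# Toy model scores and true labels; costs are placeholders, not real values.
scores = [0.05, 0.10, 0.20, 0.30, 0.60, 0.80]
labels = [0, 0, 0, 1, 0, 1]
threshold = best_threshold(scores, labels, cost_fp=1.0, cost_fn=5.0)
print(threshold)
```

Changing the two costs moves the cut-off, and with it the proportion of predicted Yes labels, so the approach only helps once somebody can state those costs.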