Tuesday, April 09, 2013

Imbalanced predictions from imbalanced data set

This time, it is more of a question for you. Any thoughts or ideas are more than welcome.

You probably know about machine learning classifiers (including logistic regression).
A classifier builds a prediction model from training data (in most cases, historical data) and then uses that model to predict the response of a system, or its class.

Classifiers are used in a wide range of applications, such as image recognition, bioinformatics, marketing (to find the customers who will make a purchase), finance (to estimate the risk of default), and so on.

However, since the models are usually trained to maximize overall accuracy, a problem arises when the data are highly imbalanced, which is common in practice.

What do we mean by 'imbalanced'? In a binary classification problem, it means that most of the responses in the data set are 0 and only a very small number are 1.

My current work is about a 'descriptive' Accept/Decline (A/D) decision model.
What kind of acceptance model?

I am working on SAM (Simulated Allocation Method), a simulation model for evaluating organ allocation policies.

SAM contains an A/D model. Why?
Surprisingly, more than 50% of the time the first candidate (recipient/patient) says, "No, I am not going to take that organ. I am going to wait for a better one." The offer keeps going to the next candidate, and the next, until one of them says 'Yes' or the OPO (Organ Procurement Organization) gives up.

So, in the simulation, it is important to know who would say 'Yes' and take the organ.
That's why we are working on a descriptive model rather than a prescriptive one.

Imbalanced data: since the offers stop at the first Yes, most of the responses (Y/N or A/D) in the data set are 'No's. Ex) N, N, N, N, N, Y (Stop)

With data this imbalanced (we have a case with a 19.5:1 ratio), almost any kind of prediction model tends to ignore the minority label (Y). The predictor will be very accurate on the majority but very poor on the minority. Even a dumb predictor that predicts everything as No reaches high accuracy (around 95%).
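The 95% figure follows directly from the 19.5:1 ratio. Here is a small sketch of the arithmetic; the function name `majority_baseline_accuracy` is just made up for this illustration.

```python
def majority_baseline_accuracy(no_per_yes: float) -> float:
    """Accuracy of a predictor that always answers 'No'.

    With `no_per_yes` No labels for every Yes label, the always-No
    predictor is right on every No and wrong on every Yes.
    """
    return no_per_yes / (no_per_yes + 1.0)


accuracy = majority_baseline_accuracy(19.5)
print(f"always-No accuracy: {accuracy:.1%}")
```

Note that this predictor never finds a single Yes, so its accuracy says nothing about how well it handles the minority.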

Of course, there are several techniques for correcting this imbalance, as well as measures other than plain % accuracy. Undersampling, oversampling, and different error costs can all be used. What all of them do is emphasize the minority. By doing so, we get a prediction model that is reasonably good at predicting both the majority and the minority. You can find a lot of literature on these techniques.
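To make the three ideas concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy rows (39 No, 2 Yes, which gives the 19.5:1 ratio), the function names, and the inverse-frequency rule for the error costs, which is only one common heuristic and not necessarily what we use in our model.

```python
import random
from collections import Counter

# Hypothetical toy data: 39 'N' rows and 2 'Y' rows, i.e. a 19.5:1 ratio.
rows = [(i, "N") for i in range(39)] + [(i, "Y") for i in range(39, 41)]


def split_by_label(rows, minority="Y"):
    """Separate rows into (minority_rows, majority_rows)."""
    minority_rows = [r for r in rows if r[1] == minority]
    majority_rows = [r for r in rows if r[1] != minority]
    return minority_rows, majority_rows


def undersample(rows, seed=0):
    """Keep every minority row and an equally sized random subset of the majority."""
    rng = random.Random(seed)
    minority_rows, majority_rows = split_by_label(rows)
    return minority_rows + rng.sample(majority_rows, len(minority_rows))


def oversample(rows, seed=0):
    """Keep every row and duplicate random minority rows until the classes match."""
    rng = random.Random(seed)
    minority_rows, majority_rows = split_by_label(rows)
    extra = rng.choices(minority_rows, k=len(majority_rows) - len(minority_rows))
    return rows + extra


def inverse_frequency_costs(rows):
    """Per-class error cost proportional to 1 / class frequency (one heuristic)."""
    counts = Counter(label for _, label in rows)
    return {label: len(rows) / (len(counts) * n) for label, n in counts.items()}


under_counts = Counter(label for _, label in undersample(rows))
over_counts = Counter(label for _, label in oversample(rows))
costs = inverse_frequency_costs(rows)

print("undersampled:", dict(under_counts))
print("oversampled: ", dict(over_counts))
print("error costs: ", costs)
```

All three end up in the same place: the learner sees (or is charged for) Yes and No as if they were equally frequent, which is exactly the emphasis on the minority described above.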

Here is where the problem comes in. Since our model was built with a technique that emphasizes the minority, it has a higher tendency to predict the minority label (Y) than it should.
Earlier, I said we have 19.5 No labels per Yes label. But with these techniques we predict about 3 to 4 times more Yes labels; the predicted label ratio becomes about 6:1.
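For a rough feel of the distortion, the two ratios can be converted into shares of Yes labels. This is only a sketch of the arithmetic: the function name `yes_share` is made up, and since the 6:1 ratio is a rounded figure, the resulting factor is approximate too.

```python
def yes_share(no_per_yes: float) -> float:
    """Fraction of labels that are 'Yes' when there are `no_per_yes` No per Yes."""
    return 1.0 / (no_per_yes + 1.0)


observed = yes_share(19.5)   # share of Yes in the data
predicted = yes_share(6.0)   # share of Yes among the predictions
inflation = predicted / observed

print(f"observed Yes share:  {observed:.1%}")
print(f"predicted Yes share: {predicted:.1%}")
print(f"inflation factor:    {inflation:.1f}x")
```

With these rounded inputs the factor comes out at roughly three.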

This emphasized prediction model might be fine for some purposes, like marketing. When a firm sends out its catalogs, it may use a prediction model to find the customers who will buy a product through the catalog. It doesn't matter much if the firm sends out more catalogs than there are actual purchases.

But in some applications, it is important to predict the labels (responses) in the right proportion.
We are developing an evaluation method that is specific to our problem, but it will not work for general cases. (If you are interested in my current work, come to Chicago in June, @ the INFORMS Healthcare conference ^^)

And I could not find any literature that tells me what to do with a model that has this emphasized tendency.

What can we do with imbalanced data when the distorted tendency is bad for the problem?

Thursday, April 04, 2013

Stochasticity, Randomness and Uncertainty

The field of Operations Research can be divided into two parts, deterministic and stochastic.
The line between these two areas is getting blurry these days.
Stochastic programming is one example that sits in between.

Some people use 'stochastic', 'random', and 'uncertain' interchangeably.
People outside the OR community may do so, and they may not even know the word 'stochastic'.
But I have seen people make this mistake even within our OR community.

When we solve an optimization problem, we need to know the system and its parameters, such as demand and lead time in a supply chain system.
When we know the exact numbers, or when it is OK to treat them as fixed, we use them as known parameters, and the problem is deterministic.
However, when we don't know the exact figures, and it is not acceptable to fix them, what should we do?
Yes, we probably need a stochastic model.

Then, a question: do we employ a 'stochastic' model when we lack information, because we don't know what a value will be?
No. In fact, it is the opposite.

A stochastic model requires much more information about the system than a deterministic one does, assuming we are dealing with exactly the same problem.
We need to know the probability distributions of the random factors, and a distribution cannot be expressed by a couple of numbers. (The mean and standard deviation are not enough unless we know the type of the distribution.)
When we don't know what a variable will be, yes, it is 'uncertain'.
When we have no idea at all what it will be, it can NOT be 'stochastic'.
Uncertainty includes stochasticity, obviously.
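To see why the mean and standard deviation alone are not enough, here is a small sketch comparing two demand distributions that share both numbers. The figures (mean 100, standard deviation 20, threshold 140) and the choice of a normal versus a uniform distribution are made up for illustration.

```python
import math
from statistics import NormalDist

mean, sd = 100.0, 20.0   # hypothetical demand parameters
threshold = 140.0        # e.g. the stock on hand

# Case 1: demand is normal with this mean and standard deviation.
p_normal = 1.0 - NormalDist(mu=mean, sigma=sd).cdf(threshold)

# Case 2: demand is uniform with the SAME mean and standard deviation.
# A uniform distribution on [mean - h, mean + h] has sd = h / sqrt(3).
half_width = sd * math.sqrt(3.0)
low, high = mean - half_width, mean + half_width
p_uniform = max(0.0, min(1.0, (high - threshold) / (high - low)))

print(f"P(demand > {threshold:.0f}) if normal:  {p_normal:.4f}")
print(f"P(demand > {threshold:.0f}) if uniform: {p_uniform:.4f}")
```

Under the normal assumption there is a small but real chance of demand exceeding 140, while under the uniform assumption it cannot happen at all, because the upper end of the range is below 140. Same mean, same standard deviation, different answer.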

Of course, there are research areas within what we call 'stochastic' OR that do not assume exact distributions, such as Bayesian updating and worst-case scenarios. But you still need to know something about the uncertain variables.
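A Bayesian update shows what "knowing something" means here. The sketch below uses a Beta prior on an unknown probability and updates it with observed outcomes; the uniform Beta(1, 1) prior, the counts, and the function names are all assumptions for illustration. Even this "I know nothing" prior is a piece of information we have to supply before we can compute anything.

```python
def beta_update(alpha: float, beta: float, successes: int, failures: int):
    """Posterior Beta parameters after observing Bernoulli outcomes."""
    return alpha + successes, beta + failures


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


prior = (1.0, 1.0)  # assumed prior: uniform on [0, 1]
posterior = beta_update(*prior, successes=1, failures=5)

print("prior mean:    ", beta_mean(*prior))
print("posterior mean:", beta_mean(*posterior))
```

The exact distribution of the unknown probability is never given, but the model family (Bernoulli outcomes) and the prior are, and without them the update cannot start.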

I don't have a clear distinction between 'random' and 'uncertain' or 'stochastic' either.
My feeling is that 'random' is closer to stochastic than to uncertain (non-stochastic).
Any ideas on the definition of 'randomness', or any other thoughts, are welcome.

P.S. Some people say a stochastic problem is more difficult than a deterministic one. That is true if the system and the models are exactly the same. But that is also why research problems in deterministic OR are far more difficult than research problems in stochastic OR, if you look at the models themselves and ignore whether the variables are deterministic or stochastic.