Information leakage occurs when the data used to create (train) a prediction model contains information that should not be used for prediction. A typical case is information that cannot be known until the value of the variable being predicted is itself known.
Let us illustrate information leakage with a specific example. The Withdrawal prediction tutorial provides the following dataset for predicting membership withdrawal.
Row number | Entry data | Customer rank | … | Subscription route | Withdraw or not |
---|---|---|---|---|---|
00001 | 2018/5/21 | Gold | … | Direct | (b) Continuation |
00002 | 2017/1/8 | Platinum | … | Via agency | (b) Continuation |
00003 | 2017/9/3 | Gold | … | Direct | (a) Withdrawal |
In this tutorial, we want to predict whether a customer will withdraw or not.
Suppose that this data also contains a column, “Sending a withdrawal confirmation email”, that records whether a withdrawal confirmation email was sent to the customer.
Row number | Entry data | Customer rank | … | Subscription route | Sending a withdrawal confirmation email | Withdraw or not |
---|---|---|---|---|---|---|
00001 | 2018/5/21 | Gold | … | Direct | No | (b) Continuation |
00002 | 2017/1/8 | Platinum | … | Via agency | No | (b) Continuation |
00003 | 2017/9/3 | Gold | … | Direct | Yes | (a) Withdrawal |
If you use this data to predict whether a customer will withdraw, the prediction model will appear to be very accurate. This is because a withdrawal confirmation email is sent only to customers who have withdrawn from the service, so a model that uses the “Sending a withdrawal confirmation email” column can trivially tell who withdrew.
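As a minimal sketch of this effect, the following Python snippet applies a one-rule "model" to the three rows shown in the table above. The key names (`email_sent`, `withdraw`) and the function `predict_from_email` are hypothetical names chosen for illustration; they are not part of the tutorial dataset or of any product API.

```python
# The three example rows from the table above (elided columns are omitted).
rows = [
    {"row": "00001", "email_sent": "No", "withdraw": "(b) Continuation"},
    {"row": "00002", "email_sent": "No", "withdraw": "(b) Continuation"},
    {"row": "00003", "email_sent": "Yes", "withdraw": "(a) Withdrawal"},
]


def predict_from_email(row):
    """Hypothetical one-rule 'model': predict withdrawal only if the email was sent."""
    return "(a) Withdrawal" if row["email_sent"] == "Yes" else "(b) Continuation"


# Fraction of rows where the rule reproduces the recorded outcome.
correct = sum(predict_from_email(r) == r["withdraw"] for r in rows)
accuracy = correct / len(rows)
print(f"Accuracy using only the leaked column: {accuracy:.2f}")
```

The rule matches every row, but only because the email column is recorded as a consequence of the outcome it is supposed to predict.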
In actual operation, however, measures are taken for customers who are likely to withdraw before any withdrawal confirmation email is sent. A prediction model that relies on “Sending a withdrawal confirmation email” is therefore not useful in practice: this information becomes available only after the customer has withdrawn, so it does not exist at the time a prediction is needed.
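One way to guard against this, sketched below under stated assumptions, is to record for each column whether its value is known at the time of prediction and to train only on the columns that are. The mapping `AVAILABLE_AT_PREDICTION`, the column keys, and the function `usable_features` are all hypothetical and introduced only for this illustration.

```python
# Hypothetical metadata: is each column's value known before the customer withdraws?
AVAILABLE_AT_PREDICTION = {
    "entry_date": True,
    "customer_rank": True,
    "subscription_route": True,
    "withdrawal_email_sent": False,  # only recorded after the withdrawal
}

TARGET = "withdraw"


def usable_features(columns):
    """Keep columns that are known at prediction time and are not the target.

    Columns missing from the metadata are excluded, so that nothing is used
    until someone has confirmed when its value becomes available.
    """
    return [
        c for c in columns
        if c != TARGET and AVAILABLE_AT_PREDICTION.get(c, False)
    ]


columns = [
    "entry_date",
    "customer_rank",
    "subscription_route",
    "withdrawal_email_sent",
    "withdraw",
]
print(usable_features(columns))
```

Excluding unlisted columns by default is a deliberately conservative choice: a new column has to be reviewed before it can enter the model.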
When a prediction model has very high accuracy, Easy Predictive Analytics prompts you to check that the model is not using variables that are unavailable at the time of prediction.
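A similar manual check can be sketched in Python. The snippet below is not how Easy Predictive Analytics works internally; it is an illustrative heuristic that flags any single column which, on its own, almost perfectly determines the target. The function names, the key names, and the `threshold` value are assumptions made for this example.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(rows, feature, target):
    """Accuracy of predicting the most common target for each value of one feature."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[target]] += 1
    # For each feature value, the majority target is right max(counts) times.
    correct = sum(max(counts.values()) for counts in by_value.values())
    return correct / len(rows)


def suspicious_features(rows, features, target, threshold=0.99):
    """Return features that predict the target suspiciously well by themselves."""
    return [
        f for f in features
        if single_feature_accuracy(rows, f, target) >= threshold
    ]


# The three example rows from the tables above (elided columns are omitted).
rows = [
    {"customer_rank": "Gold", "subscription_route": "Direct",
     "email_sent": "No", "withdraw": "(b) Continuation"},
    {"customer_rank": "Platinum", "subscription_route": "Via agency",
     "email_sent": "No", "withdraw": "(b) Continuation"},
    {"customer_rank": "Gold", "subscription_route": "Direct",
     "email_sent": "Yes", "withdraw": "(a) Withdrawal"},
]

features = ["customer_rank", "subscription_route", "email_sent"]
print(suspicious_features(rows, features, "withdraw"))
```

A flagged column is only a candidate for review, not proof of leakage. Columns with a distinct value on almost every row, such as IDs or dates, also score highly under this heuristic, so each flagged column still needs to be checked against when its value actually becomes known.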