zaro

What is PSI in Modelling?

Published in Model Monitoring

The Population Stability Index (PSI) is a metric that quantifies how much the distribution of a variable has shifted between two samples, typically a training dataset and a validation or production dataset. It is used to assess whether the population a model scores still resembles the population it was built on, and therefore whether the model is likely to remain stable over time.

Understanding Population Stability Index (PSI)

PSI is particularly valuable in predictive modeling, especially in areas like credit risk, fraud detection, and marketing, where data distributions can change significantly over time, leading to model performance degradation. A high PSI value indicates a significant shift in the population distribution, suggesting the model might need retraining or recalibration.

How PSI Works

PSI compares the distribution of a variable in two different datasets, usually:

  • Expected Distribution: The distribution from the original training dataset (also known as the "base" or "reference" distribution).
  • Actual Distribution: The distribution from a more recent dataset (also known as the "test" or "current" distribution).

The process involves:

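  1. Binning: Divide the variable into a set of bins (e.g., deciles or custom ranges). Define the bin edges on the expected distribution and apply the same edges to the actual distribution.
  2. Calculating Percentage Distribution: Determine the share of observations falling into each bin for both the expected and actual distributions, expressed as a proportion between 0 and 1.
  3. Calculating PSI for Each Bin: For each bin, calculate (Actual % - Expected %) * ln(Actual % / Expected %). Both shares must be greater than zero, so empty bins are usually merged with a neighbor or given a small floor value.
  4. Calculating Total PSI: Sum the per-bin values across all bins to obtain the total PSI score.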
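The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name population_stability_index, the quantile-based binning, and the small floor applied to empty bins are all assumptions made for this example.

```python
import math
from bisect import bisect_right
from statistics import quantiles


def population_stability_index(expected, actual, n_bins=10, floor=1e-6):
    """Return the PSI between two numeric samples (hypothetical helper)."""
    # Step 1: cut points come from quantiles of the expected sample only.
    # Duplicate cut points (caused by tied values) are collapsed.
    cuts = sorted(set(quantiles(expected, n=n_bins)))

    def proportions(values):
        # Step 2: share of observations per bin, using the same cut points.
        counts = [0] * (len(cuts) + 1)
        for value in values:
            counts[bisect_right(cuts, value)] += 1
        # Assumption: empty bins get a small floor so ln() stays finite.
        return [max(count / len(values), floor) for count in counts]

    expected_pct = proportions(expected)
    actual_pct = proportions(actual)

    # Steps 3 and 4: per-bin contribution, then the sum over all bins.
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_pct, actual_pct)
    )
```

Because the cut points are taken from the expected sample, values in the actual sample that fall outside the original range simply land in the first or last bin. Other choices (fixed-width bins, merging sparse bins) are equally common and can give different scores.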

Interpreting PSI Values

Generally, PSI values are interpreted as follows:

PSI Value | Interpretation | Action
Less than 0.1 | Insignificant change; the population distribution is stable. | No action typically required. Continue monitoring.
Between 0.1 and 0.2 | Moderate change; some shift in the population distribution. | Investigate the potential reasons for the shift. Consider retraining or recalibrating the model if the shift is deemed concerning for the model's performance.
Greater than 0.2 | Significant change; a substantial shift in the population distribution. | Retrain or recalibrate the model immediately. Investigate the root cause of the shift.

Important Notes:

  • These thresholds are guidelines and can be adjusted based on the specific context and business requirements.
  • PSI is a univariate measure and doesn't capture complex interactions between variables.
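In a monitoring job, the bands above are often encoded as a small lookup with adjustable thresholds. The sketch below is one way to do that; the function name interpret_psi, its parameter names, and the choice to treat a value of exactly 0.2 as "moderate" are assumptions for illustration, since the guideline does not say which band the boundary values belong to.

```python
def interpret_psi(psi, moderate=0.1, significant=0.2):
    """Map a PSI score to one of the three guideline bands.

    The default thresholds follow the common 0.1 / 0.2 rule of thumb and
    can be overridden to suit a specific context.
    """
    if psi < moderate:
        return "stable"
    if psi <= significant:  # assumption: the upper boundary counts as moderate
        return "moderate shift"
    return "significant shift"


for score in (0.03, 0.15, 0.40):
    print(score, "->", interpret_psi(score))
```

Passing stricter values, for example interpret_psi(score, moderate=0.05, significant=0.1), is how a team might tighten the bands for a high-stakes model.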

Example Scenario

Imagine you have a credit risk model that predicts the probability of default for loan applicants. The model was trained on data from 2022, and in 2024 you want to check whether it is still being applied to a similar population. You calculate the PSI for several input variables, such as "income" and "credit score," comparing the 2022 training data with recent 2024 applications. A PSI greater than 0.2 for the "income" variable suggests that the income distribution of loan applicants has changed significantly between 2022 and 2024, potentially affecting the model's accuracy.
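The scenario can be mimicked with synthetic data. Everything in the sketch below is invented for illustration: the incomes and credit scores are random draws, the shift in income is put in by hand, and the helper psi (decile bins taken from the 2022 sample, with a small floor for empty bins) is just one reasonable way to compute the index.

```python
import math
import random
from bisect import bisect_right
from statistics import quantiles


def psi(expected, actual, n_bins=10, floor=1e-6):
    """PSI with quantile bins defined on the expected sample (hypothetical helper)."""
    cuts = sorted(set(quantiles(expected, n=n_bins)))

    def proportions(values):
        counts = [0] * (len(cuts) + 1)
        for value in values:
            counts[bisect_right(cuts, value)] += 1
        return [max(count / len(values), floor) for count in counts]

    pairs = zip(proportions(expected), proportions(actual))
    return sum((a - e) * math.log(a / e) for e, a in pairs)


rng = random.Random(42)  # fixed seed so the synthetic data is reproducible

# Synthetic 2022 training data versus synthetic 2024 applications.
income_2022 = [rng.gauss(50_000, 10_000) for _ in range(5_000)]
income_2024 = [rng.gauss(60_000, 10_000) for _ in range(5_000)]  # shifted upward
credit_score_2022 = [rng.gauss(680, 50) for _ in range(5_000)]
credit_score_2024 = [rng.gauss(680, 50) for _ in range(5_000)]  # unchanged

income_psi = psi(income_2022, income_2024)
credit_score_psi = psi(credit_score_2022, credit_score_2024)

print(f"income PSI:       {income_psi:.3f}")
print(f"credit score PSI: {credit_score_psi:.3f}")
```

With a shift this large, the income PSI lands well above the 0.2 guideline, while the credit score PSI stays close to zero because both samples come from the same distribution.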

Benefits of Using PSI

  • Early Warning System: Identifies potential model decay due to data drift.
  • Objective Measurement: Provides a quantifiable measure of population stability.
  • Easy to Calculate and Interpret: Simple formula and clear thresholds make it accessible.
  • Model Monitoring: Essential component of ongoing model monitoring and maintenance.

Limitations of PSI

  • Univariate: Only considers individual variables and not interactions.
  • Binning Sensitivity: Results can be affected by the number and width of the bins.
  • Threshold Dependence: The interpretation of PSI values depends on the chosen thresholds, which may need to be adjusted for different applications.
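The binning sensitivity is easy to see on synthetic data. In the sketch below, both samples are drawn from the same distribution, so any non-zero PSI is sampling noise; the sample size of 500, the bin counts, and the helper psi are assumptions chosen for illustration.

```python
import math
import random
from bisect import bisect_right
from statistics import quantiles


def psi(expected, actual, n_bins=10, floor=1e-6):
    """PSI with quantile bins defined on the expected sample (hypothetical helper)."""
    cuts = sorted(set(quantiles(expected, n=n_bins)))

    def proportions(values):
        counts = [0] * (len(cuts) + 1)
        for value in values:
            counts[bisect_right(cuts, value)] += 1
        return [max(count / len(values), floor) for count in counts]

    pairs = zip(proportions(expected), proportions(actual))
    return sum((a - e) * math.log(a / e) for e, a in pairs)


rng = random.Random(7)
# Two small samples from the SAME distribution: there is no real shift.
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]
current = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Recompute the PSI on identical data while changing only the bin count.
psi_by_bins = {n: psi(reference, current, n_bins=n) for n in (5, 10, 20, 50)}
for n_bins, score in psi_by_bins.items():
    print(f"{n_bins:>2} bins -> PSI {score:.4f}")
```

Finer bins leave fewer observations per bin, so random fluctuations contribute more to the total and the score tends to rise even though nothing has changed. Keeping the binning scheme fixed across monitoring runs makes scores comparable over time.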

In conclusion, PSI is a practical tool for assessing model stability: it detects distributional shifts in data that can undermine model performance over time. By monitoring PSI values, data scientists and modelers can proactively identify and address issues caused by changing data patterns.