Supervised Machine Learning for Credit Default Prediction

Federico Mariani

3/24/2026

This report by Sefirot Financial Research applies a supervised learning framework to one of the most operationally relevant problems in quantitative finance: predicting borrower default. Using the German Credit dataset (1,000 observations, 19 predictors, and a binary outcome), the study builds and validates a logistic regression model through a pipeline of exploratory analysis, variable transformation, stepwise selection, and diagnostic testing. The result is an interpretable model that separates the predictors that drive default risk from those that contribute little.

Key topics covered:

  • Exploratory data analysis: distributional properties, bivariate relationships, and correlation structure across numerical and categorical predictors

  • Variable transformation via Box-Tidwell testing: polynomial terms for Duration and Age, logarithmic scaling for Credit Amount

  • Stepwise AIC-based predictor selection and logistic regression model fitting with odds ratio interpretation

  • Full model diagnostics: VIF multicollinearity assessment, Cook's distance, binned residual analysis, linearity checks, and overdispersion testing

  • Classification performance: confusion matrix, ROC curve, and AUC = 0.78

  • Key takeaway: credit risk is driven primarily by liquidity conditions and repayment history, not demographics; age exhibits a non-linear risk profile that flat scoring models systematically fail to capture
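
To make the modelling steps listed above concrete, here is a minimal sketch of a logistic regression with a log-transformed amount, a quadratic age term, odds ratios, a confusion matrix, and AUC. It is an illustration under stated assumptions, not the report's code: the data are synthetic (not the German Credit dataset), the "true" coefficients are invented, and plain gradient descent in pure Python stands in for whatever fitting routine the report uses. All variable and function names are hypothetical.

```python
import math
import random

random.seed(0)  # fixed seed so the synthetic sample is reproducible

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def standardize(values):
    # Center to mean 0 and scale to unit standard deviation
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def linear_predictor(w, row):
    # w[0] is the intercept, w[1:] are the slopes
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))

def fit_logistic(X, y, lr=0.5, epochs=500):
    # Batch gradient descent on the mean negative log-likelihood
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for row, yi in zip(X, y):
            err = sigmoid(linear_predictor(w, row)) - yi
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def auc(y_true, scores):
    # Probability that a random defaulter outranks a random non-defaulter
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def confusion_matrix(y_true, scores, threshold=0.5):
    # Returns (tp, fp, tn, fn) for the rule "predict default if score >= threshold"
    tp = fp = tn = fn = 0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Synthetic borrowers (assumed distributions, for illustration only)
n = 500
duration = [random.uniform(6, 60) for _ in range(n)]        # months
amount = [random.lognormvariate(8, 0.8) for _ in range(n)]  # currency units
age = [random.uniform(20, 70) for _ in range(n)]            # years

dur_z = standardize(duration)
log_amt_z = standardize([math.log(a) for a in amount])      # log scaling
age_z = standardize(age)

names = ["duration", "log_amount", "age", "age_squared"]
X = [[d, a, g, g * g] for d, a, g in zip(dur_z, log_amt_z, age_z)]

# Invented data-generating coefficients: risk rises with duration and amount,
# and is U-shaped in age through the squared term
true_w = [-1.5, 0.8, 0.5, 0.0, 0.6]
y = [1 if random.random() < sigmoid(linear_predictor(true_w, row)) else 0
     for row in X]

w = fit_logistic(X, y)
scores = [sigmoid(linear_predictor(w, row)) for row in X]

# Odds ratio per one-standard-deviation increase in each predictor
odds_ratios = {name: math.exp(coef) for name, coef in zip(names, w[1:])}
model_auc = auc(y, scores)
confusion = confusion_matrix(y, scores)

for name, value in odds_ratios.items():
    print(f"{name:12s} odds ratio = {value:.2f}")
print(f"in-sample AUC = {model_auc:.3f}")
print("tp, fp, tn, fn =", confusion)
```

Because the predictors are standardized, each odds ratio reads as the multiplicative change in the odds of default for a one-standard-deviation increase in that predictor. The AUC here is computed in-sample on simulated data, so it says nothing about the 0.78 reported for the actual model; a held-out split would be needed for an honest performance estimate.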