Customer Loyalty Prediction with Python

I built this regression model as part of my Data Science Certificate at the University of Toronto.

The goal? Predict customer loyalty using age, income, region, and purchase behavior, and turn that insight into strategy for marketing and CRM teams.

This project blends what I love most: data analysis, business impact, and storytelling through visuals. With only 238 rows of data, the challenge was less about scale and more about extracting insights that matter.

I used Python, scikit-learn, and good old data cleaning to train both linear regression and random forest models. The Random Forest model reached an R² score of 99.9%, with annual income emerging as the top loyalty driver.

To me, this isn’t just a model; it’s a glimpse into how data can drive smarter targeting, better retention, and real business impact.

The Business Problem –

Companies invest heavily in loyalty programs, but which customers are truly worth the investment?

My goal was to build a model that helps business and marketing teams:

  • Identify loyalty-driving factors
  • Tailor retention strategies
  • Improve targeting and resource allocation

While the dataset was limited in size and scope (no date/time data, possible single-store bias), the approach can be adapted for larger-scale customer segmentation.

Stakeholders Involved –

  • Marketing Teams: to segment audiences and personalize outreach
  • CRM Managers: to optimize loyalty programs and offers
  • Business Strategists: to identify which regions or income segments yield higher lifetime value (LTV)

Methodology –

1. Data Cleaning & Preprocessing

  • Checked for missing values with .info()

  • Encoded Region as a categorical variable

  • Used .describe() to identify outliers and get summary stats
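A minimal sketch of how these cleaning steps could look in pandas. The real dataset is not shown in this post, so the data below is synthetic and the column names (`age`, `annual_income`, `region`, `purchase_amount`) are assumptions. One-hot encoding is just one possible way to encode Region; it is not necessarily the encoding used in the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: values and column names are illustrative only.
rng = np.random.default_rng(0)
n = 238
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "annual_income": rng.normal(55_000, 12_000, size=n).round(2),
    "region": rng.choice(["North", "South", "East", "West"], size=n),
    "purchase_amount": rng.normal(400, 120, size=n).round(2),
})

# .info() prints dtypes and non-null counts per column;
# isna().sum() expresses the same missing-value check as numbers.
df.info()
missing_per_column = df.isna().sum()

# Summary statistics for the numeric columns. A min or max far away from
# the quartiles is a hint that outliers may be present.
summary = df.describe()

# One-hot encode Region (assumed encoding), dropping the first level
# to avoid a redundant column.
encoded = pd.get_dummies(df, columns=["region"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping the first level keeps the encoded columns from being perfectly collinear, which matters more for linear regression than for tree-based models.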

2. Exploratory Data Analysis

  • Used Seaborn, Matplotlib, and Pandas to:

    • Visualize distributions (histograms, boxplots)

    [Figure: Pair Plot, by Sharon Jacob]

    • Build a correlation matrix (heatmap)

    [Figure: Heat Map Correlation, by Sharon Jacob]

    • Check for outliers and feature correlations

    [Figure: Box Plot]

Key Finding:

  • Annual Income and Purchase Amount showed a strong positive correlation with Loyalty Score
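To make the correlation and outlier checks concrete, here is a small sketch using pandas only. The data is synthetic, the column names (including `loyalty_score`) are assumptions, and the 1.5 × IQR rule in the hypothetical `iqr_outliers` helper is a common convention rather than the exact check used in the project.

```python
import numpy as np
import pandas as pd

# Synthetic data in which the target depends on income and purchase amount.
# The coefficients are invented purely for illustration.
rng = np.random.default_rng(1)
n = 238
income = rng.normal(55_000, 12_000, size=n)
df = pd.DataFrame({
    "annual_income": income,
    "purchase_amount": 0.006 * income + rng.normal(0, 30, size=n),
    "age": rng.integers(18, 70, size=n),
})
df["loyalty_score"] = (
    0.0001 * df["annual_income"]
    + 0.002 * df["purchase_amount"]
    + rng.normal(0, 0.3, size=n)
)

# Pearson correlation matrix. This is the table a heatmap visualizes,
# e.g. with seaborn: sns.heatmap(corr, annot=True)
corr = df.corr()
target_corr = (
    corr["loyalty_score"].drop("loyalty_score").sort_values(ascending=False)
)
print(target_corr)


def iqr_outliers(series, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the boxplot whisker rule."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]


print(len(iqr_outliers(df["purchase_amount"])), "possible outliers")
```

The IQR helper flags the same points a boxplot draws beyond its whiskers, so it is a handy numeric companion to the plot.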

Model Building –

Used two regression techniques:

  • Linear Regression

  • Random Forest Regressor

Training/Testing Strategy:

  • Split dataset (train/test)

  • Scaled features

  • Evaluated with MSE and R²
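The steps above can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic, the column names are assumed, and the 80/20 split, `random_state` values, and `n_estimators=200` are arbitrary choices. R² is included as the second metric because the introduction reports an R² score.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (illustrative only).
rng = np.random.default_rng(42)
n = 238
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "annual_income": rng.normal(55_000, 12_000, size=n),
    "purchase_amount": rng.normal(400, 120, size=n),
    "purchase_frequency": rng.integers(1, 30, size=n),
})
y = (
    0.0001 * X["annual_income"]
    + 0.003 * X["purchase_amount"]
    + rng.normal(0, 0.2, size=n)
)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Putting the scaler inside a pipeline means it is fitted on the training
# rows only, so no test-set statistics leak into training.
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "random_forest": make_pipeline(
        StandardScaler(),
        RandomForestRegressor(n_estimators=200, random_state=42),
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
    print(f"{name}: MSE={results[name]['mse']:.4f}, R2={results[name]['r2']:.4f}")
```

Scaling does not change what a random forest learns, but keeping both models in the same pipeline shape makes the comparison easier to read. With only a few hundred rows, cross-validation (for example `cross_val_score`) would give a steadier estimate than a single split.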

Feature Importance

Annual income is the strongest loyalty predictor, followed by age, purchase amount, and frequency.
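For readers who want to see how a ranking like this can be produced, here is a short sketch using the `feature_importances_` attribute of a fitted `RandomForestRegressor`. The data is synthetic and deliberately built so that income dominates the target, which means the ranking it prints is true by construction and is not evidence about the real dataset. Column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data where the target is driven mostly by income (by design).
rng = np.random.default_rng(0)
n = 238
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "annual_income": rng.normal(55_000, 12_000, size=n),
    "purchase_amount": rng.normal(400, 120, size=n),
    "purchase_frequency": rng.integers(1, 30, size=n),
})
y = 0.0001 * X["annual_income"] + rng.normal(0, 0.1, size=n)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances: one value per feature, summing to 1.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be misleading when features are correlated with each other; `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.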

[Figure: Feature Importance, by Sharon Jacob]

Risks & Limitations –

  • Only 238 entries, so the results may not generalize without more data
  • Region may be biased (could all be from one store or area)
  • No date info → we can’t tell if these purchases happened during holidays or sales

Despite these limitations, the model provides a great framework for teams to start testing loyalty predictions with real business data.

To Conclude

This project reflects my strengths as a Data Analyst, Visualizer, Data Scientist, and Business Intelligence Strategist.

If I had more time, I’d expand the dataset, include time-based variables, and integrate LTV scoring. But even as a prototype, this model shows how a few features can tell a big story.
