Customer Loyalty Prediction with Python

I built this regression model as part of my Data Science Certificate at the University of Toronto.

The goal? Predict customer loyalty using age, income, region, and purchase behavior, and turn that insight into strategy for marketing and CRM teams.

This project blends what I love most: data analysis, business impact, and storytelling through visuals. With only 238 rows of data, the challenge was less about scale and more about extracting insights that matter.

I used Python, scikit-learn, and good old data cleaning to train both linear regression and random forest models. The Random Forest model reached an R² of 0.999 (99.9%), with annual income emerging as the top loyalty driver.

To me, this isn’t just a model; it’s a glimpse into how data can drive smarter targeting, better retention, and real business impact.

The Business Problem –

Companies invest heavily in loyalty programs, but which customers are truly worth the investment?

My goal was to build a model that helps business and marketing teams:

Identify loyalty-driving factors
Tailor retention strategies
Improve targeting and resource allocation

While the dataset was limited in size and scope (no date/time data, possible single-store bias), the approach can be adapted for larger-scale customer segmentation.

Stakeholders Involved –

Marketing Teams: to segment audiences and personalize outreach
CRM Managers: to optimize loyalty programs and offers
Business Strategists: to identify which regions or income segments yield higher lifetime value (LTV)

Methodology –

1. Data Cleaning & Preprocessing

Checked for missing values with .info()
Encoded Region as a categorical variable
Used .describe() to identify outliers and get summary stats
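
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the DataFrame is synthetic stand-in data, the column names are hypothetical, and one-hot encoding is only one assumed way to encode Region.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real 238-row dataset; column names are assumed.
rng = np.random.default_rng(0)
n = 238
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "annual_income": rng.normal(60_000, 15_000, size=n).round(2),
    "purchase_amount": rng.normal(400, 120, size=n).round(2),
    "purchase_frequency": rng.integers(1, 30, size=n),
    "region": rng.choice(["North", "South", "East", "West"], size=n),
})

# .info() prints dtypes and non-null counts per column;
# isna().sum() gives the same missing-value check as numbers.
df.info()
missing_per_column = df.isna().sum()

# .describe() summarizes numeric columns; min, max, and quartiles
# give a first hint of outliers.
summary = df.describe()

# One-hot encode the categorical region column (an assumed encoding choice).
# drop_first=True removes one redundant dummy column.
encoded = pd.get_dummies(df, columns=["region"], drop_first=True)
print(encoded.head())
```

Dropping the first dummy column matters mostly for linear regression, where a full set of dummies is perfectly collinear with the intercept; tree-based models are indifferent to it.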

2. Exploratory Data Analysis

Used Seaborn, Matplotlib, and Pandas to:
- Visualize distributions (histograms, boxplots)
- Plot a correlation matrix (heatmap)
- Check for outliers and feature correlations
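
Here is a rough sketch of what those EDA steps can look like. It is illustrative only: the data is synthetic, the column names are hypothetical, and to keep dependencies small it draws the heatmap with plain Matplotlib rather than Seaborn (`seaborn.heatmap(corr, annot=True)` would be the closer match to what I used). The 1.5 × IQR outlier rule is an assumption on my part; it is the same rule boxplot whiskers use by default.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names and relationships are assumed.
rng = np.random.default_rng(1)
n = 238
income = rng.normal(60_000, 15_000, size=n)
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "annual_income": income,
    "purchase_amount": 0.005 * income + rng.normal(0, 40, size=n),
})
df["loyalty_score"] = 0.0001 * df["annual_income"] + rng.normal(0, 0.5, size=n)

# Distributions: one histogram and one boxplot per numeric column.
df.hist(bins=20, figsize=(8, 6))
df.plot(kind="box", subplots=True, layout=(1, 4), sharey=False, figsize=(10, 3))

# Correlation matrix drawn as a heatmap.
corr = df.corr()
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
fig.tight_layout()

# Outlier check: count values beyond 1.5 * IQR from the quartiles.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outlier_mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()
print(outlier_counts)
```

Call `plt.show()` in an interactive session, or `fig.savefig(...)` in a script, to actually view the figures.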

Key Finding:

Annual Income and Purchase Amount showed a strong positive correlation with the loyalty score.

Model Building –

Used two regression techniques:

Linear Regression
Random Forest Regressor

Training/Testing Strategy:

Split dataset (train/test)
Scaled features
Evaluated with MSE and R²
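
The training and evaluation loop above can be sketched like this. Treat it as an outline under stated assumptions: the data is synthetic, the column names are hypothetical, and the 80/20 split, `StandardScaler`, and `n_estimators=200` are illustrative choices rather than the exact settings from the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 238-row dataset; names and coefficients are assumed.
rng = np.random.default_rng(42)
n = 238
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "annual_income": rng.normal(60_000, 15_000, size=n),
    "purchase_amount": rng.normal(400, 120, size=n),
    "purchase_frequency": rng.integers(1, 30, size=n),
})
y = 0.0001 * X["annual_income"] + 0.02 * X["age"] + rng.normal(0, 0.3, size=n)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits,
# so no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

# Train each model and score it on the held-out test split.
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    preds = model.predict(X_test_scaled)
    results[name] = {
        "mse": mean_squared_error(y_test, preds),
        "r2": r2_score(y_test, preds),
    }
    print(f"{name}: MSE={results[name]['mse']:.4f}, R2={results[name]['r2']:.4f}")
```

Scaling mainly helps the linear model; random forests split on thresholds and are not affected by feature scale, so the shared scaling step is there for a like-for-like comparison.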

Feature Importance

Annual income is the strongest loyalty predictor, followed by age, purchase amount, and frequency.
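
A ranking like this can be read from a fitted random forest's `feature_importances_` attribute. The sketch below uses synthetic data with hypothetical column names, and the target is deliberately built so that income dominates; the ordering it prints illustrates the mechanics only and is not the project's result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; the target is constructed so income matters most.
rng = np.random.default_rng(7)
n = 238
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "annual_income": rng.normal(60_000, 15_000, size=n),
    "purchase_amount": rng.normal(400, 120, size=n),
    "purchase_frequency": rng.integers(1, 30, size=n),
})
y = 0.0001 * X["annual_income"] + 0.02 * X["age"] + rng.normal(0, 0.3, size=n)

forest = RandomForestRegressor(n_estimators=200, random_state=7)
forest.fit(X, y)

# Impurity-based importances: non-negative and summing to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor continuous, high-variance features, so `sklearn.inspection.permutation_importance` on a held-out split is a reasonable cross-check.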

Risks & Limitations –

Only 238 entries, so the results may not generalize without more data
Region may be biased (could all be from one store or area)
No date info → we can’t tell if these purchases happened during holidays or sales

Despite these limitations, the model provides a useful framework for teams to start testing loyalty predictions with real business data.

To Conclude

This project reflects my strengths as a Data Analyst, Visualizer, Data Scientist, and Business Intelligence Strategist.

If I had more time, I’d expand the dataset, include time-based variables, and integrate LTV scoring. But even as a prototype, this model shows how a few features can tell a big story.

Sharon Jacob