How to build a propensity model: A comprehensive guide

STEP 1: Create a dataset

STEP 2: Create a model

STEP 3: Explore your dataset

  • Is the sample size adequate? In most cases you will need at least a few thousand rows to produce meaningful results. In our example using the marketing campaign dataset, there are more than 2,000 records, so we should be fine.
  • Did you include known major business drivers in the dataset? In our current example, we know from prior experience that marital status and education affect response rates, and we included these in the dataset.
  • Are you aware of meaningful changes over time that would affect the learning process? In other words, are you confident we can infer today’s behavior using yesterday’s data? In our example, the answer is “yes,” because we do not expect customer behavior to change.
  • Does the outcome you are predicting occur frequently or not? In our example, since only about 15% of prospects respond to marketing campaigns, we will consider using Synthetic Minority Oversampling Technique (SMOTE) pre-processing at a later stage. SMOTE rebalances the training set by generating synthetic examples of the rare (positive) outcome, so that the model does not learn almost exclusively from negative outcomes.
  • Last but not least, did you ingest the correct dataset properly? Does it look right in terms of number of rows, values appearing, etc.? It is fairly common for data to be corrupted if it is not handled using a production-grade process.
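
The checks above (sample size, frequency of the outcome, and whether the data looks right) can be sketched in a few lines of code. This is a minimal illustration rather than a definitive implementation: it uses only the Python standard library, the explore function is a hypothetical helper, and the sample rows and the "Education" and "Marital_Status" column names are made up for the example. Only the "Response" outcome column comes from the marketing campaign dataset described in this guide.

```python
def explore(rows, outcome):
    """Summarize a dataset given as a list of dicts (one dict per record)."""
    if not rows:
        raise ValueError("dataset is empty")
    n = len(rows)
    # Collect every column name that appears in at least one record.
    columns = sorted({key for row in rows for key in row})
    # Fill rate: share of records where the column is neither None nor "".
    fill_rate = {
        col: sum(1 for row in rows if row.get(col) not in (None, "")) / n
        for col in columns
    }
    # Positive rate: share of records where the outcome equals 1.
    positive_rate = sum(1 for row in rows if row.get(outcome) == 1) / n
    return {"rows": n, "positive_rate": positive_rate, "fill_rate": fill_rate}


# Hypothetical sample records, for illustration only.
sample = [
    {"ID": 1, "Education": "PhD", "Marital_Status": "Married", "Response": 1},
    {"ID": 2, "Education": "", "Marital_Status": "Single", "Response": 0},
    {"ID": 3, "Education": "Master", "Marital_Status": None, "Response": 0},
    {"ID": 4, "Education": "PhD", "Marital_Status": "Single", "Response": 0},
]

summary = explore(sample, outcome="Response")
print(summary)
```

A low row count, a very low positive rate, or unexpectedly low fill rates in this summary are each a signal to revisit the dataset before training.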

STEP 4: Configure and train your model

  • A dependent variable: This is the outcome you are trying to predict. For most propensity models, it should be a simple yes/no or 1/0 variable: the outcome either happened or it did not. In our example using the marketing campaign dataset, the outcome we are modeling is whether a prospect responded to a campaign. Our dependent variable will be “Response,” the variable in the dataset that captures this outcome and consists of ones and zeros.
  • Independent variables: These are your model inputs, sometimes called drivers or attributes, which will determine the outcome. These can be of any type (e.g., numerical, categorical, Boolean). You don’t have to know beforehand which variables are relevant inputs; that’s what the training stage will determine. You only need to include any variable that you think might be relevant.
  • An index variable: While this is not strictly required, in practice it’s always an excellent idea to designate a record index that uniquely identifies each record you are processing. It usually is the account ID or customer ID. This will allow you to audit your results and join them back to your original dataset.
  • What is the fill rate for the variables you selected? If you combine too many variables with low fill rates (i.e., a lot of missing values), the resulting dataset will have very few complete rows unless you fill in missing values (see below).
  • If you have missing values, what in-fill strategy should you consider? The most common strategies are (i) exclude records with empty values, (ii) replace empty values with zeros, (iii) replace empty values with the median of the variable if it is numerical, or (iv) replace empty values with the “Unknown” label if the variable is categorical. Choosing between option (ii) and option (iii) usually requires considering what the variable actually represents. For instance, if the variable represents the number of times a customer contacted support, it would be reasonable to assume an empty value means no contact took place and replace empty values with zeros. If the variable represents the length of time a customer has been with the company, using a median might be more appropriate. You will need to use your domain knowledge for this.
  • Make sure you split your data into a training set and a testing set. You will train your model on the training set and validate results using the testing set. There are several cross-validation techniques to do so efficiently; these are beyond the scope of this guide. Modeling tools such as Analyzr will handle training and validation seamlessly and transparently.
  • Logistic regression. The Logistic Regression algorithm is a great place to start. It’s very efficient, and it’s an easy way to compute propensities quickly on any dataset. However, it tends to be less accurate with datasets that have a large number of variables, complex or non-linear relationships, or collinear variables. In those cases, try other algorithms such as Random Forest.
  • Random Forest. Use the Random Forest algorithm if you’ve already tried the Logistic Regression. While it tends to be a bit slower, Random Forest is a robust algorithm that does a great job classifying a variety of tabular datasets.
  • XGBoost. The XGBoost classifier algorithm performs similarly to Random Forest. In general, XGBoost will be more prone to over-fitting and less tolerant of messy data, but it will do better when the dataset is unbalanced (i.e., when the outcome you are trying to predict is infrequent).
  • Gradient Boosting classifier. In most cases, the Gradient Boosting classifier will not perform as well as Random Forest. It is slower and more sensitive to over-fitting. However, it occasionally does better in cases where Random Forest may be biased or limited (e.g., with categorical variables with many levels).
  • AdaBoost classifier. The AdaBoost (adaptive boosting) classifier is both slower and more sensitive to noise than Random Forest or XGBoost. However, it can occasionally perform better with high-quality datasets when over-fitting is a concern.
  • Extra trees. The Extra Trees (extremely randomized trees) classifier is similar to Random Forest. It tends to be significantly faster but will not do as well as Random Forest with noisy datasets and a large number of variables.
  • Accuracy: The accuracy score tells you how often the model was correct overall. It is defined as the number of correct predictions divided by the total number of predictions made. It is usually a meaningful number when the dataset is balanced (i.e., the number of positive outcomes is roughly equal to the number of negative outcomes). However, let’s take a dataset with only 10% of positive outcomes. A naive model predicting the same zero outcome every time would have an accuracy of 90% (all the negatives are predicted correctly, none of the positives are), even though it never detects a single positive outcome. Clearly accuracy has its limitations.
  • Precision: The precision score tells you how good your model is at predicting positive outcomes. It is defined as the number of correct positive predictions divided by the total number of positive predictions. It will help you understand how reliable your positive predictions are.
  • Recall: The recall score tells you how often your model can detect a positive outcome. It is defined as the number of correct positive predictions divided by the total number of actual positive outcomes. It will help you understand how good you are at actually detecting positive outcomes, and it is especially helpful with unbalanced datasets, i.e., datasets for which the positive outcome occurs rarely (think 10% of the time or less).
  • F1 score: The F1 score is the harmonic mean of the precision and recall scores. Think of it as a composite of precision and recall. It is also a good overall metric to understand your model’s performance. Generally it will be a better indicator than accuracy.
  • AUC: The Area Under Curve refers to the area under the Receiver Operating Characteristic (ROC) curve. It will be 1 for a perfect model and 0.5 for a random (terrible) model.
  • Gini coefficient: The Gini coefficient is a scaled version of the AUC that ranges from -1 to +1. It is defined as 2 times the AUC minus 1.
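
The metric definitions above can be made concrete with a short sketch. It assumes 1/0 labels and uses only the Python standard library; the function names are hypothetical helpers written for this illustration, not part of any particular tool. The AUC here is computed as the share of (positive, negative) pairs in which the positive record receives the higher score, with ties counted as half, which is equivalent to the area under the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 1/0 labels and 1/0 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correct positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct negatives
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(auc_value):
    """Gini coefficient: 2 times the AUC minus 1."""
    return 2 * auc_value - 1


# A dataset with 10% positive outcomes and a model that always predicts zero.
y_true = [1] * 10 + [0] * 90
baseline = classification_metrics(y_true, [0] * 100)
baseline_auc = auc(y_true, [0.0] * 100)  # constant scores carry no ranking
print(baseline, baseline_auc, gini(baseline_auc))
```

The always-zero model scores well on accuracy but has zero recall, an AUC of 0.5, and a Gini coefficient of 0, which is why precision, recall, F1, and AUC are worth checking alongside accuracy on unbalanced datasets.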

STEP 5: Predict using your model

How can we help?



Analyzr makes machine learning analytics simple and secure for midmarket and enterprise customers that may not have a full-fledged data science team

Analyzr: Insights, Achieved