Civis Data Sampling and Modeling Training 101

So You Want to Build a Model?

We're excited to share some best practices for how to clean, sample, and merge data to build great models. We assume if you're at the model building stage, you're already familiar with Civis imports, queries, and data structures. If not, brush up on the necessary steps in the help center.

This introduction focuses on lookalike modeling: estimating how likely a person who does not currently use your services is to use them, based on their behavioral and demographic attributes. We do this by setting a target variable and building a sparse logistic model on the known data. Once that model is built, we score the Civis National File to show how closely the rest of the US population resembles the ideal target outcome (e.g. are they likely to be a voter, a customer, or really loyal). Reach out to support@civisanalytics.com to request more information about Civis National File access. Get started with these 7 steps:

Step 1: Identify your data and the target you’d like to predict

For a basic lookalike model, your target will be binary (0 or 1) or categorical (e.g. 0/1/2/3). For this overview, we’ll use the customer vs. non-customer example, but this could apply to voters, loyalty, or any type of preference.

Import the file containing your customers into Civis. If this file does not have a voterbase_id column serving as a shared identifier between your file and the Civis base file, standardize the variable names, addresses, and other contact information (CASS/NCOA), and then perform a people match on Civis’ base file. For more guidance on people matching, check out an explanation of our people matching algorithm here.

Step 2: Append your target variable to the base file

Once you have completed the people match and the crosswalk is established between your file and the base file, create a new table for modeling by joining your table to the modeling file. This appends your target variable to the modeling set. This unlocks a modeling-ready data set of over 700 attributes about your customers. This additional demographic and behavioral information about your customers is crucial when building highly accurate and descriptive lookalike models. Make sure you don’t append text fields from your file that are not formatted for modeling. Your variables should be numeric or categorical in nature, and have as few null values as possible.

Example query:

CREATE TABLE [your_new_modeling_table] DISTKEY (voterbase_id) AS
SELECT a.[your_target_variable], b.*
FROM (SELECT voterbase_id, [your_target_variable] FROM [your_matched_table] WHERE voterbase_id IS NOT NULL) a
LEFT JOIN [noncommercial/commercial_modeling_new] b ON a.voterbase_id = b.voterbase_id

Step 3: Create a well-distributed data set

To build a lookalike model, the distribution of your data set should look like the national distribution of your target. Based on what you know about your target variable, you can adjust the number of records to reach that distribution. If you imported, matched, and joined a table of your customers, include a good proportion of non-customers as well, so that the modeling file reflects your current market share of customers and non-customers. To do this, select a randomly ordered list of Civis base file records that did not match as a customer and add them to your modeling table. Target records (1s, or customers) should make up no less than 5% of the records in your file. Also make sure every record in your set is unique. A file that contains duplicates cannot be used for modeling, so if you find duplicates, decide which ones to remove based on your understanding of the data before you continue.
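As a rough illustration of the two checks in this step (target share and uniqueness), here is a minimal Python sketch. The record layout (a list of dicts with `voterbase_id` and `target` keys) and both function names are assumptions made for this example; they are not part of the Civis platform.

```python
from collections import Counter


def target_share(records):
    """Return the fraction of records whose target equals 1."""
    if not records:
        return 0.0
    return sum(1 for r in records if r["target"] == 1) / len(records)


def duplicate_ids(records):
    """Return the voterbase_id values that appear more than once, sorted."""
    counts = Counter(r["voterbase_id"] for r in records)
    return sorted(vid for vid, n in counts.items() if n > 1)


# Hypothetical sample: 2 customers, 3 non-customers, one duplicated ID.
sample = [
    {"voterbase_id": "A1", "target": 1},
    {"voterbase_id": "A2", "target": 0},
    {"voterbase_id": "A3", "target": 0},
    {"voterbase_id": "A3", "target": 0},
    {"voterbase_id": "A4", "target": 1},
]

print(target_share(sample))   # 0.4
print(duplicate_ids(sample))  # ['A3']
```

In this sample the target share is well above the 5% floor, but the duplicated ID would need to be resolved before modeling.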

If your target variable makes up less than 5% of your file, you can balance the distribution by randomly removing non-target records (0s), or by randomly duplicating target records (1s) until they exceed 5% of the full file. If you are using a large file, you may also want to slim down the modeling file by randomly selecting 25-50% of the original file. Civis needs only about 100,000 records to build an accurate model, so running millions of records may make the modeling process unnecessarily long.
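The arithmetic behind removing non-target records can be sketched as follows. With T target records and a minimum share of p percent, the file can hold at most T * (100 - p) / p non-target records. The function names, the integer-percent parameter, and the fixed random seed are all assumptions made for this illustration.

```python
import random


def max_non_targets(n_targets, min_share_pct=5):
    """Largest number of 0s that keeps the 1s at or above min_share_pct percent."""
    # n_targets / (n_targets + zeros) >= pct / 100
    #   =>  zeros <= n_targets * (100 - pct) / pct
    return n_targets * (100 - min_share_pct) // min_share_pct


def downsample_non_targets(targets, non_targets, min_share_pct=5, seed=0):
    """Randomly drop non-target records until the target share is met."""
    keep = max_non_targets(len(targets), min_share_pct)
    if len(non_targets) <= keep:
        return list(targets) + list(non_targets)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return list(targets) + rng.sample(non_targets, keep)


# Hypothetical file: 100 customers among 10,000 records (a 1% target share).
targets = [("cust", i) for i in range(100)]
non_targets = [("non", i) for i in range(9900)]

balanced = downsample_non_targets(targets, non_targets)
print(max_non_targets(100))  # 1900
print(len(balanced))         # 2000
```

After downsampling, the 100 customers make up exactly 5% of the 2,000 remaining records.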

Example query:

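-- LIMIT cannot take an aggregate such as count(), so this version filters on RANDOM() instead.
-- Each row is kept with probability 0.25, which returns roughly 25% of the table.
SELECT * FROM [your_modeling_table] WHERE RANDOM() < 0.25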

Step 4: Finalize your table for modeling

Start your modeling workflow in Civis by selecting your modeling table. Civis will analyze the table as one final checkpoint to make sure it is fit for modeling. Once the table has been checked, step through the process of selecting a target variable, predictors, and parameters for your model. Select the algorithm that best suits your modeling goal. For a lookalike model, sparse logistic is the Civis standard. We’ll go into more depth about Civis models and how to create custom models in the Civis Sampling and Modeling Training 102 and 103 documents.

Step 5: Building the best model

If you’re not sure how the predictors will impact the model accuracy, go ahead and build the model to see these relationships. However, if you know certain predictors are highly correlated with your target variable, you should deselect these before building the model. For example, if you know that a person’s home state strongly correlates with their likelihood to be a customer, you should remove the home state variable as a predictor to create a more accurate and unbiased model.

Once you’ve built your model, review the accuracy summary (ROC curve, Confusion Matrix, and Decile Plot) and the predictor strength list. Notice which predictors are strongest, and whether any are unexpectedly biasing the result. While all of these statistics are valuable for assessing model performance (we’ll go into more detail about how to read these in Civis Sampling and Modeling Training 102), your primary indicator of model accuracy is AUC, the Area Under the (ROC) Curve. A good model should have an AUC between 0.75 and 0.95, depending on what you’re modeling. If the AUC is too high, chances are you’ve included too many highly correlated predictors in your data set, which is biasing the outcome. At Civis, our data scientists are suspicious of the “perfect model”. Learn more about evaluating your model here.
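To make the AUC figure more concrete, the sketch below computes it from its standard pairwise definition: the probability that a randomly chosen target record (1) receives a higher score than a randomly chosen non-target record (0), with ties counted as one half. This is a textbook calculation written for illustration, not necessarily how Civis computes the value it reports, and the labels and scores are made up.

```python
def auc(labels, scores):
    """Pairwise ROC AUC: share of (1, 0) pairs where the 1 is scored higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one 1 and one 0")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical model scores for six records.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

print(round(auc(labels, scores), 3))  # 0.889
```

Here 8 of the 9 customer/non-customer pairs are ordered correctly. A value of 0.5 corresponds to random ordering and 1.0 to a perfect separation, which is the case that should raise suspicion.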

Step 6: Iterate on your modeling file

While we recommend following the best practices in steps 1-5, modeling is both a science and an art. Based on the result of your first lookalike model, you can fine-tune your model by experimenting with your data distribution, variable selection, and population size. This fine-tuning is where our Civis data scientists specialize and experiment. Cutting edge data science requires mastery of the data, the algorithms, and the modeling objectives. If you find that you can’t quite get your model to run the way you need it to, our client success team is here to help.

Step 7: Score data

Once you’re satisfied with the accuracy of the model, score the Civis national consumer file. The outcome is a list of 240+ million people ranked from the most similar to your target to the least similar. From this list, you can determine which people/emails/addresses/phone numbers to include/purchase for your marketing campaigns, which cities have the most potential customers, and what types of content will resonate most with these people.
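Working with the ranked list might look like the following minimal sketch, which sorts scored records from most to least similar and keeps the top share. The (person ID, score) tuple layout, the function name, and the integer-percent parameter are assumptions for illustration only.

```python
def top_percent(scored, percent):
    """Return the highest-scoring `percent` of (person_id, score) pairs, best first."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    n_keep = len(ranked) * percent // 100
    return ranked[:n_keep]


# Hypothetical scored records.
scored = [("p1", 0.12), ("p2", 0.91), ("p3", 0.47), ("p4", 0.78), ("p5", 0.05)]

print(top_percent(scored, 40))  # [('p2', 0.91), ('p4', 0.78)]
```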

You built a model! Now what?

Now that you’ve built your lookalike model with Civis, please let us know your thoughts on the Civis Modeler. As you continue to build new models and fine-tune the existing ones, keep in mind that we refresh the Civis National Consumer base file regularly with recent population data as people’s contact information and preferences change. As a Civis client, you have access to the most recent base file to refresh your models and obtain the most accurate/recent list of scored targets. We believe that a person-level understanding of your market potential is invaluable for making informed business decisions.
