The purpose of this document is to introduce users to Civis Data, to describe best practices to consider when using our data, and to provide answers to common questions.
Overview
In order to facilitate a variety of analyses of individual-level data in Platform, Civis offers the “Civis Data” data product. Civis Data incorporates information about American adults from consumer files, voter registration files, and other government and public sources. It also provides state-of-the-art model scores for key demographic attributes such as ethnicity and education.
Data Sources
Civis Data is created from multiple data sources. These are listed in the “source” columns in the data dictionary. Below are descriptions of each.
- Consumer File: Civis licenses a consumer marketing file from TargetSmart that includes individual-level data about demographic and behavioral characteristics from multiple consumer data sources. While these attributes provide some coverage for a variety of potentially useful characteristics (e.g., “dog_enthusiast”, “business_owner”), there are some limitations to be aware of: a) coverage is incomplete, thus requiring imputation as noted in the data dictionary, b) collecting the data requires matching, which is never perfectly accurate, and c) the construct being represented may not be precisely defined (e.g., “dog_enthusiast” is based on “consumer purchase history from direct marketing and retail channels, surveys and websites”).
- State Voter File or Voterbase: Civis licenses a collection of state voter registration data from TargetSmart. This data is highly reliable since it comes from government sources, but most attributes from this source are only available in the noncommercial version of Civis Data.
- Civis Model: Civis produces and maintains state-of-the-art models for key demographic attributes and provides various columns based on scores from these models (e.g., “race_pr_asian_commercial”). See “Civis Demographic Models” below for more details.
- Consumer Files / Civis Model: Some columns are based on Consumer File data and Civis models, depending on what is available for each individual record. For example, “coalesced_noncommercial_age” uses consumer file age or date of birth data when it is available but falls back on the score from Civis’s age model if needed.
- Census 2020: Civis Data includes some useful aggregate-level statistics from the 2020 decennial census.
- County Business Patterns: This includes economic data from the County Business Patterns survey from the Census Bureau.
- American Community Survey: This includes data from the wide-ranging, annually conducted American Community Survey from the Census Bureau.
- IRS: This includes tax return data from the Internal Revenue Service.
- Geocoding: Some fields are computed by geocoding street addresses to latitude and longitude and then identifying which area the address is located in (e.g., which Census tract). These fields can be based on either the Consumer File address or the State Voter File address.
- Additional Data Source: Civis has collected data from various public sources to cover additional demographic and behavioral attributes not covered by other sources.
Civis Demographic Models
While many data sources are included in Civis Data, none of them provide complete, individual-level data about key demographic variables such as race and educational attainment. Therefore, Civis builds and maintains state-of-the-art, proprietary statistical models to infer a number of demographic attributes. Statistical models provide a way to impute missing attributes while also conveying uncertainty. For example, the education model might assign an "education_pr_advanced_degree_commercial" score of 0.95 to an individual if known attributes about them and the area they live in strongly indicate that they have an advanced degree.
Civis leverages its own survey data about self-reported demographics to train these models, and it has been building and improving these models since 2013.
See this page for a list of the available models along with model cards for each.
Data Preparation
Many variables are subject to pre-processing steps so as to facilitate downstream analyses:
- Imputing missing values to ensure full coverage
- Transforming certain data into numeric values
- Bucketing certain variables into useful ranges
The data dictionary describes these steps in more detail for individual columns.
File Structure
Civis Data is provided in two files, the “basic” file and the “modeling” file. Both files contain records for a majority of U.S. adults, but they are intended for different use cases.
To copy the latest versions of these files to your Redshift cluster, use the Basefile Delivery Tool.
The Modeling File
Civis data scientists and engineers curate and design modeling features in order to maximize predictive accuracy and streamline the modeling process for data scientists. The modeling file (e.g., “ts.modeling_commercial_client”) is commonly used to fuel a variety of individual-level analytics projects such as model-based ad targeting, look-alike modeling, and segmentation.
Most fields in the Civis modeling file have complete coverage, which is achieved by imputing missing values for gender, age, and race using Civis models. The few exceptions are fields such as state codes and a few other geographic identifiers.
An example: Civis provides individuals’ reported age when available but imputes a modeled value for records where age doesn’t exist on file.
Columns are labeled so users can better understand what's driving their models, but are not guaranteed to sum to values that match population counts.
An example: College education is listed at an individual level (1/0), but the sum of the 1’s for an area may not match the number of people in that area who actually have a college education.
Many features of the Civis modeling file are provided in the format of a probability score. This takes into account the variance of the underlying data due to missing or imputed data, matching errors, or stale underlying data. This data format may not always be appropriate for other applications, especially for literal interpretation or aggregation counts.
Example: If an individual's age is reported as 61 on the voter/consumer files, the column age_pr_50_64 will equal 1 and the is_modeled field will be set to 0. However, if an individual's age is not reported and the Civis age model predicts it to be 34, then each of the age_pr columns will have a predicted score to capture variance in the model. These scores will approximately sum to 1.
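The following Python sketch illustrates the two cases in the example above on toy data. It is not Civis code: the bucket names other than age_pr_50_64, the bucket boundaries, the modeled scores, and the assumption that is_modeled is 1 for modeled records are all hypothetical.

```python
# Toy illustration only. Bucket names other than age_pr_50_64 and all
# bucket boundaries are hypothetical, not taken from the data dictionary.
AGE_BUCKETS = {
    "age_pr_18_34": (18, 34),
    "age_pr_35_49": (35, 49),
    "age_pr_50_64": (50, 64),
    "age_pr_65_plus": (65, 150),
}


def age_columns(reported_age=None, modeled_scores=None):
    """Return age_pr_* columns plus is_modeled for one record."""
    if reported_age is not None:
        # A reported age puts all of the probability in a single bucket.
        row = {
            name: 1.0 if low <= reported_age <= high else 0.0
            for name, (low, high) in AGE_BUCKETS.items()
        }
        row["is_modeled"] = 0
        return row
    if modeled_scores is None:
        raise ValueError("need either a reported age or modeled scores")
    # Otherwise the model's scores are spread across the buckets.
    row = dict(modeled_scores)
    row["is_modeled"] = 1  # assumption: the flag is 1 for modeled records
    return row


reported = age_columns(reported_age=61)
modeled = age_columns(
    modeled_scores={
        "age_pr_18_34": 0.55,
        "age_pr_35_49": 0.30,
        "age_pr_50_64": 0.10,
        "age_pr_65_plus": 0.05,
    }
)
# Sum only the score columns, not the is_modeled flag.
modeled_total = sum(v for k, v in modeled.items() if k.startswith("age_pr_"))
print(reported)
print(modeled_total)
```

In both cases the age_pr columns for a record add up to roughly 1, which is why they can be averaged across records to estimate an age distribution.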
While many columns in the Modeling File are best reserved for use in predictive models, many are useful for reporting as well. For example, one can analyze demographic distributions for a population of interest by averaging the demographic model scores in the Modeling File. This can be particularly effective with models like education that are shifted to align well with aggregate benchmark data from the Census Bureau. See the model cards for more details.
Basic File
Civis also provides the Basic File (e.g., “ts.basic_commercial_client”) for convenience in non-modeling applications such as reporting, cross-tabbing, mapping, and other descriptive analytics. Columns in these files are formatted for easier interpretation, and discrete demographic features like race are provided in categorical form instead of as raw scores. Other features like income, net worth, and home value are bucketed. Reporting a bucket or a range rather than an exact estimate often makes results produced from this data more accurate.
An example: “Our high value targets likely have an income between $60k and $80k” rather than “Our high value targets likely make $68k.”
Some of the columns in the Basic File are computed from Civis’s demographic model scores. For example, “race5way_commercial” has string values (e.g., “Asian”) indicating the category to which the model assigned the highest probability for an individual. In these cases, we generally recommend using the probabilistic scores from the Modeling File instead (e.g., “race_pr_hispanic_commercial”) since those scores provide more precision and will be less biased in the aggregate.
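To see why the probabilistic scores tend to be less biased in the aggregate, consider this minimal Python sketch on made-up records. The scores are invented, and only two of the race score columns are used to keep the example short. Counting the highest-probability category discards every individual's remaining probability, while averaging the scores keeps it.

```python
# Toy records with invented scores; real records carry more race_pr_* columns.
records = [
    {"race_pr_asian_commercial": 0.45, "race_pr_white_commercial": 0.55},
    {"race_pr_asian_commercial": 0.40, "race_pr_white_commercial": 0.60},
    {"race_pr_asian_commercial": 0.35, "race_pr_white_commercial": 0.65},
    {"race_pr_asian_commercial": 0.40, "race_pr_white_commercial": 0.60},
]
COLUMN = "race_pr_asian_commercial"


def top_category(record):
    """Return the column with the highest score, as a categorical field would."""
    return max(record, key=record.get)


# Share estimated by counting each person's single most likely category.
share_from_categories = sum(top_category(r) == COLUMN for r in records) / len(records)

# Share estimated by averaging the probability scores directly.
share_from_scores = sum(r[COLUMN] for r in records) / len(records)

print(share_from_categories, share_from_scores)
```

In this toy population no individual has this category as their most likely one, so the categorical estimate is zero, while the averaged scores imply a share of about 40%.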
The basic file also has additional features pertaining to various geography levels and districts for spatial analysis and mapping. These include county name/FIPS, DMA name/id, zip code, and districts which can be used when joining to another dataset using a geographic feature. This file also has other ids for householding and tracking individuals over time.
Data Permissions and Access
Access to Civis Data is controlled by respective client database administrators via Redshift groups. A standard ‘civis_data_access’ group now exists on each cluster. With each data refresh, Civis will grant SELECT access on the updated data to this group. Civis strongly recommends that all database administrators review the list of users in this group regularly to ensure accessibility and security standards. For more information about managing groups to add/remove users, please see Redshift ALTER GROUP documentation.
Example Use Cases
This section describes some example use cases for Civis Data, along with some simple examples of SQL queries that could be used as starting points for certain types of analyses.
Individual-level Outreach
One can use Civis Data to focus marketing campaigns on particular groups of interest. One could, for example, use Civis Data and Civis Platform to build a model to predict who might be a high-value customer, score all records within a geography of interest, and then use voterbase_id identifiers for outreach with the Data Appends tool.
As a simpler example, one might focus on individuals likely to be in particular demographic groups using Civis demographic model scores or other fields. One might, for example, want to select men without Bachelor’s degrees in Illinois, as in the following SQL query.
SELECT voterbase_id,
       education_pr_less_than_bachelors_commercial AS pr_less_than_bachelors,
       (1 - gender_pr_commercial) AS pr_gender_male
FROM ts.modeling_commercial
WHERE state_code = 'IL'
-- Rank by the product of the two scores, which treats them as independent.
ORDER BY education_pr_less_than_bachelors_commercial * (1 - gender_pr_commercial) DESC;
Outreach by Location or ZIP Code
One can also use Civis Data to identify areas of interest for marketing or other operations. Again, one might use a custom model built with Civis Data and Civis Platform, but here is a simple example using demographic fields directly to identify Illinois ZIP codes with high proportions of individuals who are under age 40 and prefer to communicate in Spanish.
SELECT zip,
       AVG(spanish_pr_commercial
           * CASE WHEN modeling_commercial.coalesced_commercial_age < 40
                  THEN 1 ELSE 0 END) AS pr_under_40_and_prefers_spanish
FROM ts.modeling_commercial
JOIN ts.basic_commercial USING (voterbase_id)
WHERE modeling_commercial.state_code = 'IL'
GROUP BY 1
ORDER BY 2 DESC;
Reporting on Aggregate Demographics
Civis Data can be used to analyze the demographic attributes of a target audience and compare that audience to a broader audience. For example, one might match a member or customer database to Civis Data and compare the demographics of members to those of the full file. For a simpler example, one might compare the demographics of Illinois residents with Bachelor’s degrees to the population at large, as in the following SQL query.
WITH illinois_bachelors AS (
    SELECT *, (1 - education_pr_less_than_bachelors_commercial) AS pr_bachelors_plus
    FROM ts.modeling_commercial
    WHERE state_code = 'IL'
), illinois_all AS (
    SELECT *
    FROM ts.modeling_commercial
    WHERE state_code = 'IL'
)
SELECT
    'Illinois adults with bachelor''s degrees' AS dataset,
    -- Take weighted averages with the probability of having a Bachelor's degree.
    SUM(pr_bachelors_plus * race_pr_afam_commercial) / SUM(pr_bachelors_plus) AS race_afam,
    SUM(pr_bachelors_plus * race_pr_asian_commercial) / SUM(pr_bachelors_plus) AS race_asian,
    SUM(pr_bachelors_plus * race_pr_hispanic_commercial) / SUM(pr_bachelors_plus) AS race_hispanic,
    SUM(pr_bachelors_plus * race_pr_native_commercial) / SUM(pr_bachelors_plus) AS race_native,
    SUM(pr_bachelors_plus * race_pr_white_commercial) / SUM(pr_bachelors_plus) AS race_white,
    SUM(pr_bachelors_plus * gender_pr_commercial) / SUM(pr_bachelors_plus) AS female,
    SUM(pr_bachelors_plus * coalesced_commercial_age) / SUM(pr_bachelors_plus) AS age,
    SUM(pr_bachelors_plus * ts_homeowner) / SUM(pr_bachelors_plus) AS homeowner,
    SUM(pr_bachelors_plus * has_credit_card) / SUM(pr_bachelors_plus) AS has_credit_card
FROM illinois_bachelors
UNION ALL
SELECT
    'Illinois adults' AS dataset,
    AVG(race_pr_afam_commercial) AS race_afam,
    AVG(race_pr_asian_commercial) AS race_asian,
    AVG(race_pr_hispanic_commercial) AS race_hispanic,
    AVG(race_pr_native_commercial) AS race_native,
    AVG(race_pr_white_commercial) AS race_white,
    AVG(gender_pr_commercial) AS female,
    -- Integer fields need to be cast as FLOAT for AVG to work.
    AVG(coalesced_commercial_age::FLOAT) AS age,
    AVG(ts_homeowner::FLOAT) AS homeowner,
    AVG(has_credit_card::FLOAT) AS has_credit_card
FROM illinois_all;
Note that aggregated model scores are estimates. While Civis works to ensure that the model scores are as accurate and well-calibrated as possible, and that the scores aggregate to estimates that agree with trusted benchmarks (e.g., from the Census Bureau), statistical models don't provide perfectly accurate estimates. See the model cards for additional details on the quality and the limitations of specific models.
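The weighted averages in the query above can be checked by hand on a small example. The Python sketch below uses three invented records and a hypothetical helper, weighted_mean, to mirror the SUM(weight * value) / SUM(weight) pattern, where the weight is each person's probability of holding a Bachelor's degree.

```python
def weighted_mean(values, weights):
    """Mirror SUM(weight * value) / SUM(weight) from the SQL query."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Invented scores for three records.
pr_bachelors_plus = [0.9, 0.1, 0.5]     # 1 - education_pr_less_than_bachelors_commercial
gender_pr_commercial = [1.0, 0.0, 1.0]  # used as the female share in the query above

# Estimated female share among people with Bachelor's degrees.
female_among_bachelors = weighted_mean(gender_pr_commercial, pr_bachelors_plus)

# Unweighted average, as in the 'Illinois adults' half of the query.
female_overall = sum(gender_pr_commercial) / len(gender_pr_commercial)

print(female_among_bachelors, female_overall)
```

Records with a high probability of holding a degree count almost fully toward the first estimate, while records with a low probability contribute very little, so no hard cutoff on the education score is needed.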
FAQ
How do I match my data to Civis Data?
Civis provides two tools that can use personal information (name, address, etc.) to find out which records in your data correspond to the same people as records in Civis Data: Civis Data Match and Identity Resolution. Civis Data Match matches one input table to Civis Data. Identity Resolution supports matching multiple input tables to Civis Data and also to each other.
Can I get additional contact information from a VoterBase ID?
Civis Data uses “voterbase_id” as a unique identifier. The Data Appends tool can take these IDs as input and return contact information that can be used for marketing outreach, etc. as output.
Is there a data dictionary for the Basic and Modeling files?
Yes, it can be found here. Each tab covers one of the commercial or noncommercial versions of the basic and modeling files. We include:
- Feature name
- Definition
- Source
- Date the definition was last updated
- SQL Data Type
- Other evaluation statistics (e.g., min/max values)
Column definitions are also included in the table details page for each Basic and Modeling table in Platform.
Are there any known limitations of Civis data?
Yes, there are some known limitations based on the sources of our data:
- Poor coverage of young people
- Under-representation of people of color relative to their share of the population
- The presence of multiple records for some individuals. Empirically, the duplication rate may be as high as 20% in some states
- Unreliable birth date information, depending on the state (e.g., January 1st is a very common birth date)
Are there any guidelines on how to use the scores from Civis models?
Please see the individual model cards for each model.
Does Civis have recommendations for removing duplicates?
Yes, Civis includes an is_dupe column on each file, which indicates that a given record has been identified as a duplicate by our vendor. Within a group of duplicates, the record that was most recently updated is flagged as is_dupe=0. For queries that should exclude all duplicate records, users can include the following filter:
WHERE is_dupe = 0
What does the two-letter state abbreviation at the beginning of voterbase_id represent?
For registered records, this corresponds to the state in which the person is registered on the voter file. For unregistered records, it is the state where they live.
What do ‘Inactive’ and ‘Unknown’ mean in the voter_status column? Are these related to eligibility/ineligibility? Registered/unregistered?
These values come from Secretaries of State (SOS) and can mean different things from state to state. Most typically, inactive means the person hasn't voted in recent elections and/or the SOS has reason to believe the voter has moved (e.g., mail was returned or a National Change of Address record was picked up). In some states, inactive voters can vote normally; in others, they can only vote via provisional ballot. To filter on registered voters, Civis recommends finding records where the voter_status column is not null.
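As a rough illustration of this filter, the Python sketch below builds a tiny in-memory SQLite table and applies it together with the is_dupe filter described earlier. The table name basic_toy, the voterbase_id values, and the rows are invented for the example. The WHERE clause itself is standard SQL, but the real tables live in Redshift and have many more columns.

```python
import sqlite3

# Invented toy table standing in for a basic file; not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE basic_toy (voterbase_id TEXT, voter_status TEXT, is_dupe INTEGER)"
)
conn.executemany(
    "INSERT INTO basic_toy VALUES (?, ?, ?)",
    [
        ("IL-1", "Active", 0),
        ("IL-2", "Inactive", 0),   # inactive records still count as registered
        ("IL-3", None, 0),         # NULL voter_status: not registered
        ("IL-4", "Active", 1),     # flagged as a duplicate
        ("IL-5", "Unknown", 0),
    ],
)

query = """
    SELECT voterbase_id
    FROM basic_toy
    WHERE voter_status IS NOT NULL
      AND is_dupe = 0
    ORDER BY voterbase_id
"""
registered_ids = [row[0] for row in conn.execute(query)]
conn.close()

print(registered_ids)
```

Note that the filter keeps records with any non-null status, including ‘Inactive’ and ‘Unknown’, and drops only the unregistered and duplicate records.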
Is there information on people in US territories (e.g., Guam, Puerto Rico) in the basefile?
Yes, but only people who have moved there who were previously in the TargetSmart file.
Do modeled scores change over time with each data refresh?
Yes, Civis modeled scores can fluctuate with each data refresh. Civis monitors this regularly to ensure that the scores remain accurate.
Are all people on the basefile registered voters?
No, people with NULL voter_status are not registered voters.
What is the difference between a Monthly and a Quarterly basefile?
There is no difference. A quarterly basefile is a monthly basefile that was produced in the month preceding the start of a quarter. For example, the Q2 quarterly basefile is the monthly file built in March. Access to the monthly or quarterly files is based on your Data Subscription.
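The rule above can be written out as a small calculation. This Python sketch uses a hypothetical helper, quarterly_build_month, and assumes calendar quarters that start in January, April, July, and October.

```python
def quarterly_build_month(year, quarter):
    """Return (year, month) of the monthly file that serves as a quarterly basefile.

    Assumes calendar quarters: Q1 starts in January, Q2 in April, and so on.
    """
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    start_month = 3 * (quarter - 1) + 1  # 1, 4, 7, or 10
    if start_month == 1:
        # The month before January belongs to the previous year.
        return year - 1, 12
    return year, start_month - 1


q2_build = quarterly_build_month(2024, 2)
q1_build = quarterly_build_month(2024, 1)
print(q2_build, q1_build)
```

Under that assumption, the Q2 file corresponds to the March build, as in the example above, and the Q1 file corresponds to the December build of the previous year.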
What are the data sources for the age column?
The column coalesced_[non]commercial_age_source describes the upstream data source for values in the corresponding age column coalesced_[non]commercial_age. The listed values mean the following:
- Voterfile: The age data is provided on the voterfile record associated with this individual. This data is self-reported by the individual and as such should be viewed as the gold standard for age data.
- TargetSmart Commercial Data Match: The age data is provided by TargetSmart, and is derived from their process of matching records to commercial/consumer datasets.
- TargetSmart Intellibase Match: The age data is provided by TargetSmart, and is derived from their process of matching records to their national marketing dataset.
- TargetSmart Infutor Match: The age data is provided by TargetSmart, and is derived from their process of matching to Infutor data.
- Civis Model: The age data is generated by a Civis model.
Is the Analytics file available to Data Subscribers?
The analytics file is a consumer marketing file from TargetSmart that includes individual-level data about demographic and behavioral characteristics from multiple consumer data sources. The raw analytics file can be made available to Civis Data Subscribers for purchase in the form of a Research file. It includes unprocessed fields from TargetSmart but does not include PII. Please note, this is not a Civis-curated product but rather a pass-through. Therefore, we are unable to provide support for this file.
Data Source Attribution
Some of the data sources used in Civis Data require attribution as part of the terms of use. See the model cards for additional data sources related specifically to the demographic models.
- This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau.
- Data retrieved from Data.gov.
- Data retrieved from data.cms.gov.
- Data provided courtesy of FedsDataCenter.com.