This training video and document serve as a usage guide for Identity Resolution, which uses Civis Data and a combination of probabilistic and deterministic matching to connect your person-level data sources together.
Identity Resolution Training and Best Practices
Below you'll find written instructions that pair with our video training for using Civis Identity Resolution.
Prerequisite: Import your Data for Matching
Start by importing your data into the Civis Platform. You can select from multiple data sources under the "DATA" item in the top navigation menu. See our Imports documentation for further instructions on importing your data into the Civis Platform. If a connection you need is not currently available, please contact our Client Success team at support@civisanalytics.com.
Preparing your Data
Each of your sources will need a unique ID associated with each record. If you don’t have a unique identifier with no duplicates and no null values from each source, you will need to add one before proceeding. You can do this as part of the import process or separately as a SQL script. See the AWS Redshift documentation for how one might do this.
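As a quick pre-flight check, the uniqueness requirement above can be verified on an exported sample of your data before you configure a source. The following is a minimal Python sketch and not part of the Civis Platform: the function name check_unique_id, the column name customer_id, and the sample rows are all hypothetical, and treating empty strings as nulls is an assumption.

```python
from collections import Counter


def check_unique_id(rows, id_column):
    """Return (null_count, duplicated_ids) for the chosen ID column.

    `rows` is a list of dicts, one per record. Empty strings are
    treated as nulls here, which is an assumption of this sketch.
    """
    values = [row.get(id_column) for row in rows]
    null_count = sum(1 for v in values if v is None or v == "")
    counts = Counter(v for v in values if v is not None and v != "")
    duplicated = sorted(v for v, n in counts.items() if n > 1)
    return null_count, duplicated


# Hypothetical sample data for illustration only.
sample_rows = [
    {"customer_id": "A1", "first_name": "Ana"},
    {"customer_id": "A2", "first_name": "Ben"},
    {"customer_id": "A2", "first_name": "Benjamin"},
    {"customer_id": None, "first_name": "Cy"},
]

null_count, duplicated = check_unique_id(sample_rows, "customer_id")
print(null_count, duplicated)
```

A source is ready to use as-is only when the null count is zero and the list of duplicated IDs is empty. Otherwise, add or repair the identifier first.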
You may also want to clean your data before running it through IDR. See here for an example workflow showing how you can configure pre-processing jobs to run automatically in sequence with IDR.
Create an Identity Resolution Pipeline
Once you have your data imported into Civis Platform, you’re ready to begin building out your Identity Resolution pipeline! To create a new pipeline, click Identity Resolution in the top navigation menu under Tools, then click the “New Pipeline” button.
The first step in creating a new pipeline is to specify the person-level data you’d like to unify. These are your Sources.
To configure and add your first data source to your Identity Resolution pipeline, click “Add Source”.
Enter your source description and details as instructed. Select a source name that will help you identify that table. This is how that table will be referred to moving forward.
Next, specify the type of information in the columns of the source you are configuring. This is the information that we will use to match records from your sources together. For example, one file may list someone’s address in a column labeled ‘Address’ while another may call it ‘Street.’ These mappings allow us to compare the right columns to each other.
Under the required fields, first enter your Unique ID. As discussed above, in the “Preparing your Data” section, the unique ID, or primary key, field for a source indicates which column is the source-level identifier for your records.
For the Secondary fields, map at least one of them. You may choose to enter the address, the state, or the birth day fields more flexibly by selecting the toggles above these fields. It is important to select fields for first and last name. If a record has a null first or last name, whether because the data is missing or because those fields were not mapped, the IDR system will not group that record with other records; instead, it will receive its own resolved ID. Please note: Omitting a middle name field, or including null values for middle name, is acceptable.
A Note about Primary Keys
The Primary Key or Unique ID field that you select for your Identity Resolution (IDR) sources should be a unique identifier of records about an individual person. If multiple records from the same source have the same primary key, the IDR system will automatically group them together, and the output Customer Graph will only have one record for that primary key. IDR will use the strongest matches to one of those records in order to cluster records together and assign resolved IDs.
This functionality can be used if you have multiple Personally Identifiable Information (PII) values for the same source record. For example, if you have two addresses for a record with an ID "FOO123" and name "John Smith", you can include two records with "FOO123" as the primary key. If another record with primary key "BAR456" has a matching name and address, then "FOO123" and "BAR456" would likely receive the same resolved ID even though only one of the addresses for "FOO123" matched the address for "BAR456".
Note that the IDR interface will provide a warning for non-unique primary keys to avoid the accidental inclusion of duplicate primary keys.
Formatting your columns
The Identity Resolution matching system expects column values to match certain formats for the given PII data categories. Please review our documentation on formatting columns for matching enhancements.
When you are finished formatting columns, select "Confirm". You’ll come to the Data Sources review page. Here, you can click "Add New Source" to add an additional source, select a Match Target from the dropdown, or click "Next" to further configure your pipeline.
Match Targets
A Match Target is data that has been optimized for matching. Match Targets are usually Civis Data, although custom targets can also be created. If you would like to match to a table with more than 50 million records, please reach out to your Civis Account Manager to learn more.
Links
Link Score Threshold
The Link Score Threshold is a value that determines the extent to which similar records get assigned the same resolved ID. The threshold defaults to 0.8, which is our recommended value.
Higher values may result in fewer cases where records about different individuals erroneously receive the same resolved ID, but also more cases where records about the same individual receive different resolved IDs.
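To build intuition for this trade-off, here is a simplified Python sketch that groups records by keeping only the links whose score is greater than a threshold. It is an illustration under stated assumptions, not the actual IDR clustering algorithm: the connected-components grouping, the function name cluster_by_threshold, and the scores are all hypothetical.

```python
def cluster_by_threshold(record_ids, scored_links, threshold=0.8):
    """Group records connected by links scoring above `threshold`.

    Uses a small union-find structure. This is a stand-in for
    illustration only, not the method IDR uses internally.
    """
    parent = {r: r for r in record_ids}

    def find(r):
        # Walk up to the root, compressing the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b, score in scored_links:
        if score > threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical records and link scores.
records = ["A", "B", "C", "D"]
links = [("A", "B", 0.95), ("B", "C", 0.85), ("C", "D", 0.60)]

print(cluster_by_threshold(records, links, threshold=0.8))
print(cluster_by_threshold(records, links, threshold=0.9))
```

With the threshold at 0.8, records A, B, and C share one group and D stands alone. Raising it to 0.9 splits C away from A and B, which shows how a higher threshold trades fewer false merges for more missed matches.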
Enforced Links (Optional)
Once you are done adding sources, you may optionally input known links between your sources. In situations where two columns from different sources share the same value, and you know that this indicates a match, this is a good configuration option.
For example, if you have an existing, known unique identifier between any two sources, such as a username, you can input it here. Not every row in the sources has to have a known enforced link for you to try to establish one. The algorithm will fall back to a probabilistic match for those rows that are missing data or do not otherwise match.
If you input more than one enforced link between any two sources, the links are treated as an OR condition: if any one of the linked fields matches, an exact match between the two records will be made.
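The OR behavior described above can be sketched in a few lines of Python. This is only an illustration: the function name has_enforced_link, the column names, and the rule that null values never count as a match are assumptions, not documented IDR behavior.

```python
def has_enforced_link(record_a, record_b, column_pairs):
    """Return True if any configured column pair holds the same value.

    `column_pairs` is a list of (column_in_a, column_in_b) tuples.
    Null (None) values are never treated as a match in this sketch.
    """
    for col_a, col_b in column_pairs:
        value_a = record_a.get(col_a)
        value_b = record_b.get(col_b)
        if value_a is not None and value_a == value_b:
            return True
    return False


# Hypothetical enforced links between a CRM source and a web source.
pairs = [("username", "user_name"), ("loyalty_id", "loyalty_number")]

crm_record = {"username": "jsmith", "loyalty_id": None}
web_record = {"user_name": "jsmith", "loyalty_number": "L-9"}
other_record = {"user_name": "mlee", "loyalty_number": None}

print(has_enforced_link(crm_record, web_record, pairs))
print(has_enforced_link(crm_record, other_record, pairs))
```

In the second call no pair matches, which corresponds to the case where the pipeline falls back to probabilistic matching for that pair of records.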
Golden Table Configuration
In addition to the customer graph, IDR also optionally produces a golden PII table from your data. The Golden Table creates a targeting-ready table from your data by taking only the best pieces of contact information for each record. To configure your golden table, proceed to the golden table section of the configuration wizard.
The algorithm for choosing the best contact info is configurable between an automatic algorithm developed by our engineers and one which takes a prioritized list of sources into account.
Our automatic algorithm for choosing the best value for a piece of contact information takes into consideration validity (for example, does this email contain an @ and a .), frequency of that value in the cluster, and completeness of information. We also place constraints on which fields have to come from the same record. These constraints ensure, for example, that an address doesn’t take the street from one record and the state from another.
The preferred sources option shares the validation and constraints with the automatic option, but rather than taking the most common of those values, it takes into account a rank-ordered list of sources. This allows you to specify which sources are most reliable for targeting and have that logic applied consistently across all records.
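The difference between the two options can be illustrated with a small Python sketch for a single email field. It is a simplification under assumptions, not the production algorithm: it models only the validity check and the frequency or source-rank step, leaves out completeness and the same-record constraints, and all function names, source names, and values are hypothetical.

```python
from collections import Counter


def looks_like_email(value):
    # Simple validity check mirroring the "has an @ and ." example.
    return bool(value) and "@" in value and "." in value


def pick_automatic(candidates):
    """Pick the most frequent valid value among (source, value) pairs."""
    valid = [value for _, value in candidates if looks_like_email(value)]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def pick_preferred(candidates, source_ranking):
    """Pick the first valid value from the highest-ranked source."""
    for source in source_ranking:
        for candidate_source, value in candidates:
            if candidate_source == source and looks_like_email(value):
                return value
    return None


# Hypothetical email values for one resolved ID, tagged by source.
cluster = [
    ("crm", "jo@example.com"),
    ("web", "jo@example.com"),
    ("store", "joanna@example.org"),
    ("events", "not-an-email"),
]

print(pick_automatic(cluster))
print(pick_preferred(cluster, ["store", "crm", "web"]))
```

The automatic sketch returns the value seen most often, while the preferred-sources sketch returns the value from the top-ranked source even though it appears only once.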
Output Destinations
Finally, you must specify destinations for your output tables. Select your cluster, schema, and table. This will be where your pipeline outputs its customer graph and, optionally, the golden table and/or link scores table. The link scores output can be useful for understanding why records get grouped together.
Review Your Configuration
Once you’re done, review your configuration. At this point, you can either save the pipeline configuration, to return to it later or to have it run on a pre-configured schedule, or select "Save & Run" to run your pipeline now.
Looking at and Understanding Identity Resolution Results
An Identity Resolution job may take anywhere from 30 minutes to several hours to run, depending on the size and number of your input tables. Once it completes, you can view your outputs (customer graph, golden table, and link scores table) in the destination(s) you specified earlier.
Customer Graph Structure
The Customer Graph contains a record of every ID that has ever been resolved by an Identity Resolution pipeline that writes to this table.
Your customer graph will include the following columns:
- resolved_id: Civis Identity Resolution adds this identifier for records across sources that represent the same person. For example, if two records from a single source and another record from a second source are determined to be the same person, these three records would be listed as three rows in the customer graph, with the same resolved ID. The resolved ID itself is based on one of the source record IDs. Note that the resolved ID for a record can change from run to run if the input data changes. In practice this is rare, but if multiple records change substantially, it becomes difficult to determine that they still represent the same person. We are exploring other strategies for ID assignment with customers.
- source: The source name for this record, as specified during pipeline configuration.
- source_primary_key: The primary key from the specified source.
- job_id: An internal identifier used to keep track of identity resolution configuration.
- run_id: An internal identifier used to keep track of identity resolution runs.
- run_start_time: The time at which these records were resolved into a single ID. The most recent run time for any source record will hold the most up to date resolved ID.
The resolved identities in this table are determined by links whose scores were greater than the link score threshold specified in the configuration. The threshold provides guidance for how similar two records need to be to be considered a match. For example, when using a high threshold, two records will need to have very similar attributes to be considered a match. For more information on our methodology, please request a copy of our whitepaper through support@civisanalytics.com.
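Because the customer graph keeps rows from earlier runs, a common first step is to keep only the most recent resolved ID for each source record, using the run_start_time column described above. Here is a minimal Python sketch, assuming the rows have been loaded as dictionaries. The function name latest_resolved_ids and the sample values, including the format of the resolved IDs, are hypothetical.

```python
from datetime import datetime


def latest_resolved_ids(graph_rows):
    """Map (source, source_primary_key) to its most recent resolved_id."""
    latest = {}
    for row in graph_rows:
        key = (row["source"], row["source_primary_key"])
        # Keep the row with the latest run_start_time for each record.
        if key not in latest or row["run_start_time"] > latest[key]["run_start_time"]:
            latest[key] = row
    return {key: row["resolved_id"] for key, row in latest.items()}


# Hypothetical customer graph rows from two runs.
graph_rows = [
    {"resolved_id": "crm:1", "source": "crm", "source_primary_key": "1",
     "run_start_time": datetime(2024, 1, 1)},
    {"resolved_id": "crm:1", "source": "web", "source_primary_key": "9",
     "run_start_time": datetime(2024, 1, 1)},
    {"resolved_id": "web:7", "source": "web", "source_primary_key": "9",
     "run_start_time": datetime(2024, 2, 1)},
]

current = latest_resolved_ids(graph_rows)
print(current)
```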
Golden Table Structure
Your Golden Table will contain one row per resolved ID, with a single best PII value for each field that exists in any of your input sources. The golden table will only contain records from the most recent run.
Link Scores Structure
The Link Scores table includes a list of records which were matched to each other.
Note that the link scores table only includes links in one direction since links are symmetrical. For example, it might include the link between records 'A' and 'B' with 'A' as the source record but not with 'B' as the source record.
Columns:
- source_name: The source name for this record, as specified during pipeline configuration.
- source_primary_key: The primary key from the specified source.
- match_name: The source or match target name for the matched record, as specified during pipeline configuration.
- match_primary_key: The primary key from the matched source in that row.
- score: The match score assigned to the two records.
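Since each link is stored in only one direction, a lookup on this table needs to check both orderings of a pair. The Python sketch below shows one way to do that, assuming the link scores rows have been loaded as dictionaries with the columns listed above. The function names and sample values are hypothetical.

```python
def build_score_lookup(link_rows):
    """Index link scores by an unordered pair of (name, primary_key) records."""
    lookup = {}
    for row in link_rows:
        source = (row["source_name"], row["source_primary_key"])
        match = (row["match_name"], row["match_primary_key"])
        # A frozenset key makes (source, match) and (match, source) equivalent.
        lookup[frozenset((source, match))] = row["score"]
    return lookup


def get_score(lookup, record_a, record_b):
    """Return the link score for two records in either order, or None."""
    return lookup.get(frozenset((record_a, record_b)))


# Hypothetical link scores row: only the crm -> web direction is stored.
link_rows = [
    {"source_name": "crm", "source_primary_key": "A",
     "match_name": "web", "match_primary_key": "B", "score": 0.93},
]

lookup = build_score_lookup(link_rows)
print(get_score(lookup, ("web", "B"), ("crm", "A")))
```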
Sample Records Review
After your Identity Resolution Pipeline runs, your customer graph of resolved identities will exist in the destination table you specified during configuration.
Optionally, you can use our “Sample Records” tab to examine and QC the composition of your Resolved Identities.
We provide the query text that retrieves the sample records in case you want to build this query out further or create a table that stores sample records.
What Happens Next?
You can now use the customer graph and golden table as a crosswalk to combine your data sources as needed. For example, you can create a unique table of records for activation, reporting, customer engagement, and segmentation. Please work with your Civis representative for additional support turning your customer graph into another format for your needs.
Consider automating your Identity Resolution pipeline by scheduling it or adding it to a workflow.
Are you finding unexpected results in your output tables? Please contact support@civisanalytics.com, and we will be happy to take a look. Reports of errors will help us improve our matching algorithm.
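Returning to the crosswalk idea at the start of this section, the Python sketch below shows how customer graph rows could be used to gather records from several sources under one resolved ID. It is a rough illustration only: the function name combine_by_resolved_id, the in-memory table layout, and the sample data are assumptions, and in practice this would more likely be done as a join in SQL.

```python
def combine_by_resolved_id(graph_rows, source_tables):
    """Group source records under their resolved ID.

    graph_rows: dicts with resolved_id, source, and source_primary_key.
    source_tables: {source name: {primary key: record dict}}.
    """
    combined = {}
    for row in graph_rows:
        table = source_tables.get(row["source"], {})
        record = table.get(row["source_primary_key"])
        if record is not None:
            combined.setdefault(row["resolved_id"], []).append((row["source"], record))
    return combined


# Hypothetical sources and customer graph rows.
source_tables = {
    "crm": {"1": {"email": "jo@example.com"}},
    "web": {"9": {"last_visit": "2024-02-01"}},
}
crosswalk_rows = [
    {"resolved_id": "crm:1", "source": "crm", "source_primary_key": "1"},
    {"resolved_id": "crm:1", "source": "web", "source_primary_key": "9"},
]

combined = combine_by_resolved_id(crosswalk_rows, source_tables)
print(combined)
```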