The Person Data Standardization tool takes a table with personally identifiable information (PII; e.g., names, addresses, and dates of birth), and standardizes the PII so that you have higher quality PII data for tasks such as CASS/NCOA.
The Person Data Standardization tool can be found on the top navigation bar under the Tools > Enhancements > More Enhancements... tab. It can also be used via this Template.
Expected Input and Output
Your input PII table is either a Redshift table on Civis Platform or a Civis file.
If your input table is a Redshift table, then the output will be a new Redshift table. Your input table is not altered in any way.
If your input table is a Civis file, it has to be parsable as a CSV file that has a header with column names and uses the comma as the delimiter. The output table will be a new Civis file (in the CSV format, with the comma as the delimiter and all cells quoted by the double quote). Your input table is not altered in any way.
The output table contains only the fields specified in your field mapping (more on this below), plus the additional, derived fields from those you have mapped.
The column names of the output table are the recognized field names listed below.
If your input table has extra columns irrelevant for person data standardization, these columns are simply ignored, and are not found in the output table.
Field Mapping
The Person Data Standardization tool recognizes a set of PII field names. You will have to input the mapping from this set of recognized field names to the column names of your input table.
Recognized Fields
Here are the full set of the recognized PII field names:
Field | Notes |
primary_key
|
A unique identifier. This field is required |
full_name |
Composed of either first name + last name, or first name + middle name + last name |
first_name |
|
middle_name |
|
last_name |
|
full_address |
For a column that appears to combine house number, street, and apartment/suite number |
address_1 |
address line 1 |
address_2 |
address line 2 |
house_number |
|
street |
|
unit |
apartment/suite/unit number |
city |
|
zip |
Either the 5-digit or 9-digit form |
zip4 |
The zip+4 code. You may have it from the output of CASS/NCOA, for example |
state |
Full state name, e.g., Illinois |
state_code |
2-letter state code, e.g., IL |
birth_date |
Any common US date formats such as "19601230", "1960-12-30", "12-30-1960", "12/30/1960", "December 30, 1960", etc. |
birth_year |
|
birth_month |
Numerical forms such as "9", "09" or full/abbreviated forms such as "September", "Sept", "Sep" are all acceptable |
birth_day |
|
phone |
|
gender |
|
email
|
|
lat |
|
lon |
Conflicting Fields
Certain fields may potentially provide conflicting information, and therefore these combinations in the field mapping are disallowed and would fail the job immediately.
If you use this field... | You cannot use.... |
full_name |
first_name middle_name, and last_name |
full_address |
address_1 or address_2 |
full_address |
house_number, street, and unit |
address_1 or address_2 |
house_number, street, and unit |
state |
state_code |
birth_date |
birth_year, birth_month, and birth_day |
Inputting The Field Mapping
In the "FIELD MAPPING" box, input your field mapping in YAML format, where the keys are the recognized PII field names from the list above, and the values are the column names of your table. For instance:
primary_key: your_primary_key_column
first_name: your_first_name_column
last_name: your_last_name_column
You do not have to map all of the recognized fields. Simply map those you would like to use based on the columns of your input table.
Multiple Columns Per Field
It is possible to map multiple fields per field. Currently, this is supported by three additional recognized field names: phones, emails, and addresses.
phones and emails each accept an array of column names. For example, if you would like to map both "mobilephone" and "homephone" columns from your input table, your field mapping should contain the following (please note the use of a dash for each column name; the space between the dash and column name is important):
phones:
- mobilephone
- homephone
For this example, the fields in the output table will contain "phone_mobilephone" and "phone_homephone", following a template in the form of "<recognized_field_name>_<input_table_column_name>" for differentiation.
addresses must be associated with an array of address field mappings. Within each address field mapping, conflicting address-related fields (as explained above in the section "Conflicting Fields") cannot be used. Also, derived fields apply in each address field mapping (please see the section "Derived Fields" below). To illustrate, a common use case for addresses is that you have both billing addresses and mailing addresses. The field mapping should look like the following:
addresses:
- address_1: billing_addr_1
address_2: billing_addr_2
- address_1: mailing_addr_1
address_2: mailing_addr_2
In this example, there are two address field mappings under addresses, where each address field mapping has its own address_1 and address_2. Please note the two dashes (and the surrounding spaces) that signal this - each dash begins a new address field mapping. In the output table, because of derived fields as well as the template "<recognized_field_name>_<input_table_column_name>" for output field names, the output fields for this example will include:
- Based on the input ones:
"address_1_billing_addr_1", "address_2_billing_addr_2",
"address_1_mailing_addr_1", "address_2_mailing_addr_2"
- Address-related derived fields:
- For the 1st address field mapping (with suffix "_set1"):
"full_address_set1", "house_number_set1", "street_set1", "unit_set1", "other_address_components_set1"
- For the 2nd address field mapping (with suffix "_set2"):
"full_address_set2", "house_number_set2", "street_set2", "unit_set2", "other_address_components_set2"
What standardization steps are taken?
Basic Standardization
For all fields (except primary_key
-- it is untouched):
- Strip leading and trailing whitespace.
- Force everything into uppercase.
- Convert all non-ASCII characters (e.g., characters with accents) into their closest ASCII counterpart (essentially characters available in the US keyboard layout).
Derived Fields
Certain fields are parsed to create derived fields. For instance, if you use birth_date
, your output table has birth_date
as well as its derived fields birth_year
, birth_month
, and birth_day
. Here is the complete list of derived fields:
Field(s) | Derived Field(s) |
full_name |
first_name , middle_name , and last_name
|
first_name and last_name
|
full_name and middle_name
|
state |
state_code |
birth_date |
birth_year , birth_month , and birth_day |
birth_year , birth_month , and birth_day
|
birth_date |
zip |
zip4 |
For example, if the address is a USPS PO Box, then very likely the address fields will all be empty strings, but other_address_components will be something like '{"USPSBoxType": "PO BOX", "USPSBoxID": "1234"}'. The Redshift function JSON_EXTRACT_PATH_TEXT may be useful for this field in the output table.
|
Field-specific Standardization
-
zip
: All non-digit characters are removed for a 5-digit form; zeros are left-padded if there are fewer than 5 digits. If a 9-digit form is detected, thenzip
only takes the first 5 digits, andzip4
will take the last 4 digits (ifzip4
is not already available). -
phone
: All non-digit characters are removed for a 10-digit form. If a 11-digit form with an initial "1" (= country code for the US and Canada) is detected, then the initial "1" is removed, resulting in a 10-digit form. -
gender
: For how gender is encoded in the sources of the Civis data assets, "F" or "Female" (case-insensitive) is converted to "F", and "M" or "Male" (case-insensitive) to "M", and all other non-empty strings are converted to "U". -
email
: If a Gmail address is detected, all periods "." in the local-part (in "local-part@domain") are removed. A plus sign followed by an alphabetic string in the local-part is also removed (e.g., "foo+bar@gmail.com" becomes "foo@gmail.com"). - All fields except
primary_key
,email
,unit
,address_2
,lat
, andlon
: Leading and trailing punctuation characters (i.e.,!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~
) are stripped.
Data Quality Statistics
When your job finishes, data quality statistics (together with your job's metadata) are shown in the run log pane. These statistics summarize the standardization operations and flags by field.
For flags in the form of "not a X", "invalid X", "parsing error", and "unsupported value", the data is left unchanged in the output table (so not replaced by an empty string or anything), unless it is unreasonably long and is therefore truncated for storage optimization. For instance, a PO Box as the address triggers the "not a street address" flag.
Anything gibberish as the address likely triggers the "parsing error" flag; the unparsable address data is unaltered and kept in the output table as-is, while other non-address fields of the records in question are still standardized if possible.
The flag "number of records skipped" indicates the numbers of rows that were skipped due to row parsing errors or the primary key being empty. The relevant rows are shown in the run log pane.
FAQs
- Q: Should Person Data Standardization be used before Civis Data Match or Identity Resolution (IDR)?
A: Civis Data Match and IDR automatically perform standardization, so we do not generally recommend running Person Data Standardization before those jobs. However, if one has a workflow that performs CASS/NCOA before matching, then Person Data Standardization can be used before the CASS/NCOA job to improve the CASS/NCOA results. -
Q: Does the Person Data Standardization job also perform CASS/NCOA and geocoding?
A: No, as far as address standardization goes, the Person Data Standardization job attempts to parse the addresses for the various components (especiallyhouse_number
,street
, andunit
) to the extent possible, but does not run CASS/NCOA or geocoding. -
Q: Is it possible to run a Person Data Standardization job through the API?
A: Yes. You can use the POST /scripts/custom API endpoint, which is documented here in Platform, along with the Template ID 11154 to create a new Person Data Standardization Custom Script.
Comments
0 comments
Please sign in to leave a comment.