The Civis AI: Taxonomer script template uses AI to assign categories of interest to text data in your database in Civis Platform. Unlike traditional supervised machine learning methods for text categorization (a.k.a. text classification) that require labeling hundreds or thousands of examples, the approach used by the template only requires you to specify your categories of interest and provide brief descriptions of each.
The template works by embedding each category description into a vector space using an AI model. It then embeds each input text into the same vector space and finds which category description is most similar by comparing positions in the vector space. In machine learning parlance, this may be called “zero-shot learning,” a reference to the fact that no labeled training data is necessary.
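The mechanism can be sketched in a few lines of Python. This is an illustration only: the toy three-dimensional vectors below stand in for a real embedding model's output, and cosine similarity is a common choice for comparing embeddings, though Taxonomer's actual model and similarity measure may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings of each category description (stand-ins for a real model's output).
category_vectors = {
    "positive": [0.9, 0.1, 0.0],
    "negative": [0.1, 0.9, 0.0],
    "other":    [0.0, 0.1, 0.9],
}

def classify(text_vector):
    # Pick the category whose description embedding is closest to the text's embedding.
    return max(category_vectors, key=lambda c: cosine(category_vectors[c], text_vector))

print(classify([0.8, 0.2, 0.1]))  # -> positive
```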
When using Civis AI: Taxonomer, you agree to abide by Cohere’s Acceptable Use Policy and Usage Guidelines, which may be revised and/or updated.
Note that Civis provides another template (Civis AI: Core) for interacting more directly with a generative AI model. You can use that template for text categorization as well, since you can write whatever prompt you want; however, Civis AI: Taxonomer focuses solely on text categorization and is arguably easier to use for that purpose. Because the two methods differ, consider trying both if you have a text categorization task.
This script template is currently available as a trial release. You may request access by emailing support@civisanalytics.com.
Usage
To use the script template, click the New Custom Script button from this page.
On the resulting custom script page, specify the following parameters:
- Categories: Category labels and descriptions for your task, one per line, like “label: description”. Technically, this parameter expects a YAML-formatted dictionary (i.e., associative array). See the example below.
- Model ID: The ID for the embedding model to use. If your input data is predominantly in English, use the default value for this. If your data includes a substantial amount of non-English text, you may want to try the multilingual model.
- Database Name: Specify the name of the database (i.e., the name that shows up in the dropdown menu on the right of the top navigation bar) for the input and output tables.
- Input Schema and Table: Specify the table where input texts should be read from.
- Input Text Column: Specify a column in the input table that has text values to categorize.
- Input ID Column: Specify a column with unique ID values so that you can join back the output to your input table. If you don’t have an ID column, see below.
- Output Schema and Table: Specify where the output should go (e.g., “myschema.mytable”). The output table will contain the ID column you specified as well as “category” and “confidence” columns.
- If Output Table Exists: What to do if the output table exists. The options are “fail”, “append” (to add rows to the table), “drop” (to drop and recreate the table), and “truncate” (to remove all existing rows before adding the current run’s output). The default is “fail”.
- Sample Size: Specify the number of texts to randomly sample from the table. By default, not all the texts will be categorized, so that you can test your category descriptions on a small sample before processing a whole table. To categorize all of the texts, set this to 0.
Here is an example value for the “Categories” parameter for a hypothetical task of categorizing responses to a question about supporting funding for national parks:
positive: A positive response in support of funding of national parks.
negative: A negative response opposing funding of national parks.
other: An unclear answer that doesn't state a clear position about funding national parks, or is nonsense.
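Because this parameter is a YAML-formatted dictionary, each line maps a label to its description. For illustration, simple "label: description" lines like those above can be read with plain string splitting; this sketch is not Taxonomer's actual parser, which uses a full YAML loader.

```python
def parse_categories(text):
    """Parse "label: description" lines into a dict (simplified YAML subset)."""
    categories = {}
    for line in text.strip().splitlines():
        label, description = line.split(":", 1)  # maxsplit=1 keeps colons in descriptions
        categories[label.strip()] = description.strip()
    return categories

example = """\
positive: A positive response in support of funding of national parks.
negative: A negative response opposing funding of national parks.
other: An unclear answer that doesn't state a clear position.
"""
print(sorted(parse_categories(example)))  # -> ['negative', 'other', 'positive']
```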
You are likely to get better results by writing category descriptions that give some context about the data rather than just, e.g., “a positive text”. You may also want to include a category like “other”, as in the example above. We recommend trying a few versions of the categories and descriptions on a sample of your data (see the “Sample Size” parameter) before processing all of it.
The outputs include both the predicted category as well as an estimate of confidence: low, medium, or high. Confidence levels provide a measure of how uncertain the model was in choosing the category for a text. Depending on your application, consider excluding outputs with low or medium confidence to avoid using results where the model’s predictions are less trustworthy. Note: high confidence predictions can still be wrong!
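For example, after reading the output table back, low- and medium-confidence rows can be dropped before analysis. This sketch uses in-memory rows whose keys match the output columns described above; in practice you would apply the same filter in SQL or in your analysis tool.

```python
# Hypothetical rows from the output table ("id" is your specified ID column).
rows = [
    {"id": 1, "category": "positive", "confidence": "high"},
    {"id": 2, "category": "other",    "confidence": "low"},
    {"id": 3, "category": "negative", "confidence": "high"},
    {"id": 4, "category": "positive", "confidence": "medium"},
]

# Keep only the predictions the model was most certain about.
high_confidence = [r for r in rows if r["confidence"] == "high"]
print([r["id"] for r in high_confidence])  # -> [1, 3]
```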
If you don’t have an ID column, you can create a table with one using a SQL statement like the following:
create table schema123.table_with_ids as
select row_number() over () as id, *
from schema123.source_table;

Here, schema123.source_table stands in for your existing input table; replace it (and the output table name) with your own.
Limitations
There are some important limitations of this tool to consider.
- Usage is capped for your organization. Each month, you will have a specific number of credits available to use across Civis AI tools. As an example, we estimate that using Taxonomer to categorize 10,000 short texts will take about 15 credits, though the exact number will depend on your data. The job logs include information about the usage of the current run as well as your organization’s current usage level. Please reach out to support@civisanalytics.com if you are interested in increasing the limit.
- AI models can occasionally generate inaccurate information. We advise caution when using AI tools, especially for high-stakes tasks, and recommend human review of output.
- The maximum length for category descriptions is 2,048 characters, which is about a page. Input texts can be longer, and in practice will be limited by the maximum length for a VARCHAR field in the database (about 65,000 characters for Redshift).
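The usage estimate above (roughly 15 credits per 10,000 short texts) can be turned into a back-of-the-envelope calculator. The ratio is the approximation given in this article, not a billing formula; actual consumption depends on your data.

```python
CREDITS_PER_10K_SHORT_TEXTS = 15  # rough estimate from this article; varies with data

def estimate_credits(n_texts):
    """Back-of-the-envelope credit estimate for categorizing n_texts short texts."""
    return n_texts / 10_000 * CREDITS_PER_10K_SHORT_TEXTS

print(estimate_credits(10_000))  # -> 15.0
```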