If you prefer not to use the R or Python API clients, you can build and score CivisML models using the script templates. The templates are named Model Training v#.# and Model Prediction v#.#. Unless otherwise specified, use the most recent version of both templates. This document describes the parameters of the training and prediction templates and how users can interact with them.
For more information on how to use CivisML, visit the documentation here.
Training Template
- MODEL NAME (required)
- This can either be an integer corresponding to a platform file ID of a serialized scikit-learn estimator or a string corresponding to a model type. A list of prespecified models can be found here, along with a more detailed description and links to scikit-learn documentation. Note, users can use `pickle` or `joblib` to serialize estimators.
- DEPENDENT VARIABLE COLUMN NAME (required)
- This is either a single column name in the data or a whitespace-separated list of column names corresponding to multiple target columns. Note that rows with a null target will be removed for single-target models.
- COLUMN NAME OF PRIMARY KEY (optional)
- Primary key column of the data. This will be added to the out-of-sample scores. If no primary key is passed, an incrementing integer index will be used.
- TRAINING DATA SCHEMA.TABLE (optional)
- Name of table to use for modeling, in the format `schema.table`.
- DATABASE WITH TRAINING DATA (optional)
- Dropdown menu used to select the database name where table is stored. Note, the option for a credential ID is ignored.
- WHERE CLAUSE (optional)
- SQL string to limit the table export to certain rows. Note, omit the `WHERE` keyword from the string (e.g., `state_code = 'IL'`).
- LIMIT ON TRAINING DATA (optional)
- Integer used to limit the number of rows in the table export. Note that the same limit value (e.g., `1000`) will result in the same export, but it is best not to rely on this behavior.
- CIVIS FILE ID (optional)
- Integer corresponding to a platform file ID for the training data. CSVs and feather format files are supported for training data.
- ESTIMATOR PARAMETERS (optional)
- JSON parameter dictionary for the estimator. For example, you could pass `{"n_estimators": 500}` to `random_forest_classifier`. See the scikit-learn documentation to better understand the parameter options.
- OPTIONS FOR GRID SEARCH (optional)
- Takes different options for hyperparameter tuning. To use hyperband, pass the string “hyperband”. This is generally faster and more efficient than grid search, but can only be used with ensemble models and MLPs. To use standard grid search, pass a JSON parameter grid dictionary for hyperparameter searching. For example, you could pass `{"n_estimators": [100, 200, 500]}` to a `random_forest_classifier`. Again, see the scikit-learn documentation to better understand the cross-validation parameter options.
- COLUMNS TO EXCLUDE FROM TRAINING (optional)
- Whitespace-separated list of columns to drop from the data for modeling.
- CALIBRATION METHOD FOR CLASSIFICATION (optional)
- If specified, one of ‘sigmoid’ or ‘isotonic’. This option is only available for classifiers. It ensures that scores reflect the actual probability of the event occurring rather than simply ranking the rows (e.g., records with a score of 0.1 should be in the positive class about 10% of the time).
- ESTIMATOR FIT PARAMETERS (optional)
- JSON dictionary of data-dependent parameters, corresponding to column names in the data, to pass to the `fit()` method of the estimator. For example, `{"sample_weight": "survey_weight"}`.
- OOS PREDICTION OUTPUT SCHEMA.TABLE (optional)
- Redshift table name for out-of-sample scores. Only written to Redshift if argument is provided.
- DATABASE FOR OOS PREDICTIONS (optional)
- Dropdown menu of Redshift database name for out-of-sample scores. Will default to the database holding the training data (above) if not specified. Note, the option for a credential ID is ignored.
- ACTION IF OOS TABLE EXISTS (optional)
- Behaviour to adopt if the out-of-sample scores table already exists in the specified Redshift database. One of “fail”, “append”, “drop”, “truncate”.
- CUSTOM ETL ESTIMATOR (optional)
- If overriding the default CivisML ETL, an integer file ID pointing to a scikit-learn-compatible ETL pipeline or transformer. See the scikit-learn documentation for more details on transformers.
- WHERE TO GET VALIDATION DATA (optional)
- Takes options for handling validation data. Defaults to “train”, to cross validate over the training data. Pass the string “skip” to skip the validation step entirely.
- PACKAGES TO INSTALL FROM PYPI OR GITHUB (optional)
- A whitespace-separated list of package names or URLs. These packages may either be hosted on PyPI or on GitHub, and may be public or private. See the Python API client docs for more details.
- PLATFORM CREDENTIAL FOR API TOKEN FOR REMOTE GIT HOSTING SERVICE (optional)
- The name of the custom Civis Platform credential containing the API token needed for private repositories. The token should be stored in the password field of the credential.
- MAXIMUM CONCURRENT JOBS (optional)
- The maximum number of child jobs to run concurrently during training and validation. Different hyperparameter combinations and cross-validation folds will be run in separate child jobs. Current default is 500 for large (>20 CPUs available) EC2 clusters and 1 for smaller EC2 clusters. Training a model with no hyperparameter tuning will use up to 4 jobs. A model with hyperparameter tuning will use up to 4 jobs per hyperparameter combination.
- CPU SHARES REQUIRED (optional)
- CPU shares required for the training job. Note, 1024 shares = 1 full process.
- RAM REQUIRED (optional)
- RAM (in MiB) required for training job. Increasing the available RAM for a training job will be necessary for larger datasets.
- DISK SPACE REQUIRED (optional)
- Disk space (in GB) required to store training data in container. RAM is the limiting resource for most jobs, so this parameter is likely not necessary to change.
- VERBOSE LOG OUTPUTS (optional)
- Boolean flag to show hidden jobs (e.g., exports) and print debug-level logs in platform. Note, all debug-level logs will be written to the `log.txt` file and stored as a run output in any event.
- ACTIVE MODEL BUILD FOR SCORING
- The ‘active model build’ that should be used for scoring. This is the training run used to link to scoring jobs. Defaults to “latest” (i.e., the most recent training run); however, individual run IDs can be set.
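The JSON-valued and whitespace-separated training fields above can be prepared programmatically rather than typed by hand. The following is a minimal sketch, assuming the fields accept standard JSON text. The helper names (`to_template_json`, `to_whitespace_list`) and the excluded column names are hypothetical and are not part of CivisML.

```python
import json


def to_template_json(value):
    """Serialize a dict to JSON text for a template field (hypothetical helper)."""
    text = json.dumps(value)
    # Round-trip check: the text must parse back to the same structure.
    if json.loads(text) != value:
        raise ValueError("value does not survive a JSON round trip")
    return text


def to_whitespace_list(names):
    """Join column names with single spaces (hypothetical helper)."""
    for name in names:
        # A name containing whitespace would be split into two columns.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid column name: {name!r}")
    return " ".join(names)


# ESTIMATOR PARAMETERS, using the example from the description above.
estimator_params_field = to_template_json({"n_estimators": 500})

# OPTIONS FOR GRID SEARCH, using the example grid from the description above.
grid_search_field = to_template_json({"n_estimators": [100, 200, 500]})

# ESTIMATOR FIT PARAMETERS: maps a fit() argument to a column name.
fit_params_field = to_template_json({"sample_weight": "survey_weight"})

# COLUMNS TO EXCLUDE FROM TRAINING: the column names here are made up.
excluded_field = to_whitespace_list(["first_name", "last_name"])

print(estimator_params_field)
print(grid_search_field)
print(fit_params_field)
print(excluded_field)
```

Generating the text with `json.dumps` guarantees double-quoted keys and strings, which is what strict JSON parsers require.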
Prediction Template
- TRAINING JOB ID (required)
- This is the ID of the training job.
- TRAINING RUN ID (optional)
- This is the training job run ID. Can pass “active” to use the ‘active model build’ set on the training job (this is the default), “latest” to use the latest training job run, or pass the run ID directly.
- COLUMN NAME OF PRIMARY KEY (optional)
- The column in the data corresponding to the primary key. If no primary key is provided CivisML will default to the primary key provided to the training job.
- PREDICTION DATA SCHEMA.TABLE (optional)
- Name of table to predict on. Note, should be `schema.table`.
- DATABASE WITH PREDICTION DATA (optional)
- Dropdown menu used to select the database name where table is stored. Note, the option for a credential ID is ignored.
- FILE ID OF MANIFEST FROM MULTIPART EXPORT (optional)
- Civis Platform file ID for a manifest stored at the files endpoint that will be used for scoring. The manifest can either be an AWS manifest file or the output of a Civis Platform export with `csvSettings["forceMultifile"]=true` (see the Civis API documentation for details).
- CIVIS FILE ID (optional)
- Integer corresponding to a Civis Platform file ID for the input scoring data. CSVs and feather format files are supported for prediction.
- PREDICTION OUTPUT SCHEMA.TABLE (optional)
- Output table on Redshift to write scores to. Will use primary key if passed. If left blank, scores will only be stored as Civis Files.
- DATABASE FOR PREDICTION OUTPUTS (optional)
- Redshift database for the prediction output table. If not provided, defaults to the database holding the prediction data (above).
- WHERE CLAUSE (optional)
- SQL string to limit the table export to certain rows. Note, omit the `WHERE` keyword from the string (e.g., `state_code = 'IL'`).
- LIMIT ON DATA FOR PREDICTION (optional)
- Integer used to limit the number of rows in the table export. Note that the same limit value (e.g., `1000`) will result in the same export, but it is best not to rely on this behavior.
- ACTION IF OUTPUT TABLE EXISTS (optional)
- Behaviour to adopt if the scores table already exists in the specified Redshift database. One of “fail”, “append”, “drop”, “truncate”.
- CPU SHARES TO ALLOCATE TO ONE JOB (optional)
- CPU shares to allocate to each child job (1024 shares = 1 full process). CivisML dynamically determines these resources by default, so this argument is only useful if dynamically calculated resources are not sufficient.
- RAM TO ALLOCATE TO ONE JOB, IN MiB (optional)
- MiB of RAM to allocate to each child job. CivisML dynamically determines these resources by default, so this argument is only useful if dynamically calculated resources are not sufficient.
- DISK SPACE TO ALLOCATE TO ONE JOB, IN GB (optional)
- GB of disk space to allocate to each child job. CivisML dynamically determines these resources by default, so this argument is only useful if dynamically calculated resources are not sufficient.
- CREDENTIAL FOR GITHUB API TOKEN (optional)
- Platform credential storing your GitHub API token. The token should be stored as your password.
- MAXIMUM CONCURRENT JOBS (optional)
- Maximum number of concurrent subjobs to launch for distributed scoring. Defaults to as many jobs as possible without overloading the cluster. The average number of jobs launched is ~100 for a compressed Civis National Consumer Database table.
- VERBOSE LOG OUTPUTS (optional)
- Boolean flag to show hidden jobs (e.g., prediction sub-jobs and exports) and print debug-level logs in platform. Note, all debug-level logs will be written to the `log.txt` file and stored as a run output in any event.
- AWS (optional)
- AWS credential to allow prediction from AWS manifest files.
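The resource fields in both templates use different units: CPU is given in shares (1024 shares = 1 full process), RAM in MiB, and disk space in GB. The following is a small sketch of the arithmetic for turning more familiar quantities into field values. The function names and the example quantities are hypothetical, and CivisML sizes prediction jobs dynamically by default, so overriding is only needed when those defaults are not sufficient.

```python
SHARES_PER_PROCESS = 1024  # from the template text: 1024 shares = 1 full process
MIB_PER_GIB = 1024         # standard binary conversion


def cpu_shares(processes):
    """Convert a number of full processes to CPU shares (hypothetical helper)."""
    if processes <= 0:
        raise ValueError("processes must be positive")
    return int(round(processes * SHARES_PER_PROCESS))


def ram_mib(gib):
    """Convert GiB of RAM to the MiB value the RAM fields expect (hypothetical helper)."""
    if gib <= 0:
        raise ValueError("gib must be positive")
    return int(round(gib * MIB_PER_GIB))


# Example: request two full processes and 8 GiB of RAM for each child job.
shares_field = cpu_shares(2)
ram_field = ram_mib(8)

print(shares_field)  # 2048
print(ram_field)     # 8192
```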