Terminology
Workload: A Python, R, or Container Script run, or a Notebook or Service deployment.
Memory Request: The amount of memory (RAM) your workload will be given. Civis measures memory requests in megabytes (MB).
CPU Request: The amount of processing capacity your workload will be given. Civis measures CPU requests in millicores (m), where 1000 m = 1 core = 1 CPU (central processing unit).
Resources: Memory and CPU.
Memory Requests & Usage
The memory request that you set for your workload is a hard limit: your workload will be terminated if it attempts to use more than this amount of memory.
This limit will affect how much data your workload can pull into memory (e.g. into a pandas dataframe) and process at once.
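One common way to stay under a memory request is to process data in chunks instead of loading it all at once. The following is a minimal sketch, assuming the data is in CSV form and using the `chunksize` option of `pandas.read_csv`. The column name `amount` and the in-memory sample data are hypothetical stand-ins for a large file:

```python
import io

import pandas as pd

# Hypothetical sample data; in practice this would be a large file on disk.
csv_data = io.StringIO("amount\n" + "\n".join(str(i) for i in range(1, 101)))

total = 0
# With chunksize set, read_csv yields DataFrames of at most 25 rows each,
# so only one chunk needs to be held in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=25):
    total += int(chunk["amount"].sum())

print(f"Total: {total}")
```

Whether chunking is practical depends on the computation: sums and counts combine easily across chunks, while operations that need the full dataset at once (such as a global sort) do not.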
If your workload is failing with exit code 137, try increasing its memory request, even if the memory usage logs indicate that usage stayed below the current limit. Memory usage may have spiked so quickly that the workload was terminated before Civis could collect resource metrics about it.
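For context, exit codes above 128 conventionally mean that a process was killed by a signal, with the signal number equal to the exit code minus 128. Under that convention, 137 corresponds to signal 9 (SIGKILL), which is typically what a process receives when it is killed for exceeding a memory limit. The sketch below only illustrates the arithmetic; the helper name `describe_exit_code` is hypothetical, and the signal name lookup assumes a Unix-like system:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit code using the convention 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return "exited normally" if code == 0 else f"exited with error code {code}"


print(describe_exit_code(137))
```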
CPU Requests & Usage
The CPU value that you set for your workload is a soft request: your workload will not be terminated if it attempts to use more than this amount of CPU. If nothing else is using the CPU, the operating system may allow your workload to use more than its original request. However, if other workloads are using the CPU, your workload will not be allowed to exceed its request and may run more slowly than it otherwise would.
CPU requests affect the speed at which your workload will run, but they are not the only factor. For your workload to use more than 1000 m (1 core) of CPU, it will need to use some form of parallel processing. This leads to the following guidelines:
For workloads without parallel processing, set CPU <= 1000.
For workloads with parallel processing, set CPU > 1000.
If you have not explicitly implemented parallel processing, your workload is probably not using it. Whether and how to implement parallel processing depends on the details of what you are trying to do. Some resources to get started:
Python Resources
- Civis Python Client parallel computation features
- Numpy/Scipy parallel features
- Dask
R Resources
- Civis R Client concurrency & parallel computing features
- The future package
- The doParallel package
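As a rough illustration of the guidelines above, the sketch below uses only the Python standard library rather than any of the listed packages. It sizes a process pool from a CPU request, assuming as a rule of thumb one worker process per whole core requested. The helper names and the 2000 m request are hypothetical, and `math.factorial` stands in for a real CPU-bound task:

```python
import math
from concurrent.futures import ProcessPoolExecutor


def workers_for_cpu_request(millicores: int) -> int:
    """Return the number of whole cores covered by a CPU request (at least 1)."""
    return max(1, millicores // 1000)


def parallel_factorials(numbers, cpu_request_millicores):
    """Compute factorials in separate processes so they can run on separate cores."""
    workers = workers_for_cpu_request(cpu_request_millicores)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() distributes the inputs across the worker processes
        # and returns the results in input order.
        return list(pool.map(math.factorial, numbers))


if __name__ == "__main__":
    # Hypothetical CPU request of 2000 m (2 cores), giving 2 worker processes.
    print(parallel_factorials([5, 10, 15], cpu_request_millicores=2000))
```

With a request of 1000 m or less, the helper returns a single worker, which matches the guideline that workloads without parallel processing gain nothing from a larger CPU request.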
Max Resource Usage Metrics
While workloads are running, Civis collects resource usage metrics about them every 15 seconds. When a workload finishes, that is, when a Python, R, or Container Script completes or when a Notebook or Service is terminated, Civis queries the collected metrics to determine the maximum memory and CPU usage of the workload over its lifetime.
Notes:
- Since resource usage metrics from workloads are collected every 15 seconds, they may not reflect brief spikes in usage.
- Metrics are only available up to 30 days in the past, so max resource usage for long-lived services will cover only up to the last 30 days.
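To see why periodic sampling can miss a spike, consider the small simulation below. The usage pattern is entirely made up (a steady 500 MB with a 3-second spike to 1900 MB), and it assumes for simplicity that samples are taken at exact 15-second marks:

```python
SAMPLE_INTERVAL_S = 15  # collection interval described above


def memory_at(second: int) -> int:
    """Hypothetical memory usage in MB: steady 500 MB with a 3-second spike."""
    return 1900 if 20 <= second < 23 else 500


duration_s = 60
# Maximum over every second of the run.
true_max = max(memory_at(t) for t in range(duration_s))
# Maximum over only the sampled seconds (0, 15, 30, 45).
sampled_max = max(memory_at(t) for t in range(0, duration_s, SAMPLE_INTERVAL_S))

print(f"True max: {true_max} MB, sampled max: {sampled_max} MB")
```

In this made-up case the spike falls entirely between two samples, so the reported maximum understates the real peak.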
How to View
Maximum memory and CPU usage statistics are written to the run or deployment logs when the workload finishes. They are also available in the API under the GET and LIST endpoints for each workload type, as well as the general GET deployments endpoint, via the maxMemoryUsage and maxCpuUsage attributes.
Resource usage metrics are not available for running workloads at this time.
Historical Resource Usage
If you want to get an idea of your workload’s typical resource usage over time, you can use the Civis API. Resource usage statistics are available under the GET and LIST endpoints for the various workload types.
Example of fetching resource usage metrics for a python3 script, using the Civis Python Client:
import civis

# Assumes your Civis API key is available to the client (e.g. via the CIVIS_API_KEY environment variable)
client = civis.APIClient()
job_id = <job_id>  # replace with the ID of your Python script

# See resource usage for the most recent run
run_id = client.scripts.get_python3(job_id).last_run.id
run_data = client.scripts.get_python3_runs(job_id, run_id)
print(f'Max memory usage: {run_data.max_memory_usage} MB')
print(f'Max CPU usage: {run_data.max_cpu_usage} millicores')

# Print resource usage stats for the last X runs
runs = client.scripts.list_python3_runs(job_id, limit=<X>)
for run in runs:
    print(f'Run {run.id} used {run.max_memory_usage} MB of memory and {run.max_cpu_usage} m of CPU.')
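Once you have historical values, one possible way to use them is to size a memory request from the highest observed usage plus some headroom. The sketch below is an illustration only, not a Civis recommendation: the helper name, the 20% headroom, and the sample values are hypothetical, and it assumes that runs without collected metrics report no value (None):

```python
import math


def suggest_memory_request(max_usages_mb, headroom_pct=20):
    """Suggest a memory request (MB) from past max usage plus headroom.

    Runs with no recorded usage (None) are skipped.
    Returns None when no run has usage data.
    """
    observed = [u for u in max_usages_mb if u is not None]
    if not observed:
        return None
    return math.ceil(max(observed) * (100 + headroom_pct) / 100)


# Hypothetical max_memory_usage values (MB) from recent runs.
history = [812.5, None, 1040.0, 960.25]
print(suggest_memory_request(history))
```

Because metrics are sampled periodically, the recorded maximum may understate brief spikes, so a workload with bursty memory usage may need more headroom than this.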