The Compute Usage Overview provides the user with a summary of compute resource utilization in their organization. Compute resources refer to the EC2 instances that run Jobs, Notebooks, and Services in Platform. This page will help you answer questions like:
- How many compute hours has my organization used over a given month?
- How many compute resources are available to my organization?
- How many resources are currently available?
- How many resources are currently in use?
- What does resource usage look like over time?
Key Terms
- EC2 - Amazon Elastic Compute Cloud, the cloud compute used to run everything on Platform, including Jobs, Notebooks, and Services.
- Workload - A Job, Notebook, or Service which runs on EC2.
- Compute Instance - The type of EC2 instance used to run a Job, Notebook, or Service (a Workload). Each EC2 instance type has CPU, memory, and storage limits which constrain the number of workloads that can run on one instance. If users are running enough workloads in Platform, they will need multiple instances to run everything in parallel. Note: a single workload cannot run on multiple instances, so the maximum CPU and memory of an instance dictate the maximum CPU and memory of a Job, Notebook, or Service.
- Partition - A collection of compute instances dedicated to running a particular workload type. See Platform Compute Partitions.
- Compute Hours - The number of EC2 hours used by an organization across all partitions and instance types. Instances that run in parallel will count as separate hours. For example, if an organization is using 10 instances in parallel for 100 hours, they will use a total of 1,000 compute hours.
- Compute Instance Maximum - the maximum number of EC2 instances an organization can use in parallel. This is specified for each partition and instance type.
- Jobs - Refers to Compute Jobs, that is, Python, R, and Container scripts, as well as script-based Imports/Exports and most Custom scripts.
- Note that this category does not include ‘Classic’ Jobs, such as SQL and Javascript scripts and non-script based Imports/Exports.
- Script-based Imports/Exports - An easy way to check if an import or export uses compute resources is to check for /scripts/ in the URL. For example, Salesforce Imports/Exports use compute resources, but Google Sheets Imports/Exports do not.
- Notebooks - Interactive coding environments in Python or R.
- Services - User-defined applications, for example an R Shiny application.
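The Compute Hours definition above can be illustrated with a minimal sketch. The function name `compute_hours` is hypothetical and is not part of Platform; it only reproduces the arithmetic from the example, where instances running in parallel each contribute their own hours.

```python
def compute_hours(hours_per_instance):
    """Sum the hours each instance was on.

    Instances that run in parallel count separately, so ten instances
    running for the same 100 hours contribute 100 hours each.
    """
    return sum(hours_per_instance)


# Ten instances running in parallel for 100 hours each.
print(compute_hours([100] * 10))  # 1000
```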
Compute Usage Summary
The Compute Usage summary is the landing page when you navigate to the Platform Usage Overview:
Monthly Compute Hours - This displays the cumulative number of compute hours used by your organization for the current month across all partitions and instance types. See “Key Terms” above for more details. This total resets on the first of each month and may lag by up to 24 hours. Information about previous months' usage can be viewed by clicking on “Hours Used”.
Partitions - Resources for each of your organization’s partitions are shown beneath a partition header, for example “Jobs Compute Instances”. Note that partitions are independent of one another: if one partition is maxed out, new work for that partition will queue until space is available, but other partitions will be unaffected.
Instance Types - For each partition, an organization can have multiple instance types. Users select the instance type they want to use via the settings of an individual Job, Notebook, or Service. Each box displayed on the landing page represents a different partition and instance type. Some organizations may have only one instance type available.
Max Instances - For each instance type, users can see the maximum number of instances that their organization can run in parallel.
Status Bar - The status bar in each box displays the current utilization of the specific partition and instance type. For example, if the status bar shows 70% usage for a specific instance type in the Jobs partition, then 70% of that instance type's capacity is currently being used to run Jobs and only 30% is available to run additional Jobs. However, if this 30% is spread out across multiple instances, no single instance may have enough free capacity for an additional Job. In that case, new Jobs will queue until currently running Jobs either complete or are canceled. This status bar is dynamic and shows what is currently running on the cluster (it is not cumulative). This data can be manually refreshed by clicking the “Refresh” button at the top of the page.
To see more details about a specific instance and partition type, including which workloads are running and who they belong to, click on the instance type.
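The point in the Status Bar description about free capacity being spread across instances can be sketched as follows. This is a simplified illustration, not Platform's scheduling logic: it looks at memory only, and the instance size `INSTANCE_MEMORY_MB` and both function names are hypothetical.

```python
INSTANCE_MEMORY_MB = 16_000  # hypothetical instance size, for illustration only


def utilization(used_mb_per_instance, capacity_mb=INSTANCE_MEMORY_MB):
    """Overall share of memory in use across all instances (what a status bar would show)."""
    total_capacity = capacity_mb * len(used_mb_per_instance)
    return sum(used_mb_per_instance) / total_capacity


def fits_on_some_instance(request_mb, used_mb_per_instance, capacity_mb=INSTANCE_MEMORY_MB):
    """A workload cannot span instances, so it needs one instance with enough free memory."""
    return any(capacity_mb - used >= request_mb for used in used_mb_per_instance)


# Ten instances, each 70% full: 30% of total capacity is free,
# but no single instance has more than 4,800 MB available.
used = [11_200] * 10
print(utilization(used))                   # 0.7
print(fits_on_some_instance(8_000, used))  # False: the Job would queue
print(fits_on_some_instance(4_000, used))  # True
```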
Monthly Compute Hours
The Monthly Compute Hours page displays the last 12 months of compute hours used by your organization. This page serves to help you answer questions such as:
- What is our average monthly compute usage?
- Do we have enough compute instances contracted?
Where to Find It
- On the Compute page of Platform Usage Overview, click on “Hours Used”.
- This will take you to the Monthly Compute Hours page:
Current Monthly Usage: This shows your organization’s total compute usage for the current month and updates daily. This total reflects your usage of compute workloads (as described in Key Terms). When you kick off a workload, Platform requests an EC2 node with capacity to run the task from those available to your organization. “Compute usage hours” reflect hours for which a node is “on”.
Usage Table: The table shows total compute usage for each of the last 12 months. If your organization has access to more than one instance type, totals reflect usage in normalized hours across all instance types.
Normalization: Civis instance types have different hourly rates. To display a meaningful total of usage for a given month, we normalize all hours with respect to the cost of an hour of m4.xlarge instance usage. We then aggregate all instance types' hours in that consistent unit (“m4.xlarge hours”). For example, if you used 2 hours on an m4.2xlarge instance, which costs twice as much as an m4.xlarge instance, that would add 4 “m4.xlarge hours” to your total monthly compute. For more information on how Civis calculates usage, please contact your Account Manager.
Not Available: It can take a little while for data to become available on the first of the month. In this scenario, compute data may show as “Not Available” when looking at the Monthly Compute Hours page. If it does not update by the next day or you consistently see this on a day other than the first of the month, reach out to Support@CivisAnalytics.com.
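The normalization described above can be sketched in a few lines. The `RELATIVE_COST` table and the function name `normalized_hours` are hypothetical; the only ratio taken from this article is that an m4.2xlarge hour costs twice as much as an m4.xlarge hour. Ratios for other instance types are not stated here, so none are assumed.

```python
# Cost of one hour on each instance type, relative to one m4.xlarge hour.
# Only the 2x ratio for m4.2xlarge comes from the example in this article.
RELATIVE_COST = {
    "m4.xlarge": 1,
    "m4.2xlarge": 2,
}


def normalized_hours(hours_by_instance_type):
    """Convert raw hours per instance type into total "m4.xlarge hours"."""
    return sum(
        hours * RELATIVE_COST[instance_type]
        for instance_type, hours in hours_by_instance_type.items()
    )


print(normalized_hours({"m4.2xlarge": 2}))                  # 4
print(normalized_hours({"m4.xlarge": 3, "m4.2xlarge": 2}))  # 7
```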
Instance Detail Page
The Instance Detail Page provides compute usage information for a specific instance type.
Where to Find It
- Click on one of the instance type boxes. For instance, the m4.xlarge instance type within the “Notebooks/Services” partition:
Current Activity
This page shows a breakdown of workloads currently running on the selected partition and instance type combination.
Partition and Instance Type - The title of the page shows which partition and instance type you are currently viewing.
Memory and CPU Status Bars - Each instance type has specific memory and CPU limits. These status bars show the current utilization of memory and CPU across all available instances for the selected partition and instance type. Memory and CPU utilization are not necessarily correlated: the memory across all instances of a specific instance type may be maxed out while CPU is still available (and vice versa). If either of these two resources is maxed out, the entire instance type is maxed out. These status bars are updated every time the page is refreshed. Users can manually refresh the status bars by clicking “Refresh” in the top right corner.
Running - The total number of workloads currently running across all instances for the specific partition and instance type. This information is updated every time the page is refreshed or can be manually refreshed by clicking “Refresh” located in the top right corner.
Pending - The total number of queued workloads for the selected partition and instance type. Jobs, Notebooks, and Services that are queued will show a “dedicating resources” log message in Platform and will not be able to run until resources become available. This information is updated every time the page is refreshed or can be manually refreshed by clicking “Refresh” located in the top right corner.
Top User Activity - Table that gives insight into the users who are consuming the most resources across the organization. This information can be manually refreshed by clicking “Refresh” located in the top right corner.
- Running Jobs - Total number of jobs currently running by a specific user.
- Queued Jobs - Total number of jobs queued by a specific user.
- Memory Used - Total memory used by a specific user, compared to the amount of memory available for the partition and instance type at max capacity.
- CPU Used - Total CPU used by a specific user, compared to the amount of CPU available for the partition and instance type at max capacity.
Active Workloads - Table that identifies all active workloads utilizing resources in the given partition and instance type. Certain fields (e.g. workload name, ID) may be hidden from users who don’t have permission to access the workload. Users can cancel any active workload that they own or that has been shared with them as an editor or manager.
- Name - Platform workload's name.
- ID - Platform workload's ID.
- Type - Platform workload’s type (e.g. Python/R/Container script, Notebook, Service).
- User - User who is running the workload.
- Requested CPU (M) - CPU requested by the workload.
- Requested Memory (MB) - Memory requested by the workload.
- State - Current status of the workload (e.g. running, canceling).
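The relationship between the Active Workloads table, the Top User Activity table, and the memory and CPU status bars can be sketched as below. This is an illustration under stated assumptions, not how Platform computes these values: it assumes per-user totals are sums of the requested CPU and memory of each workload, and the function names, field names, and capacity figures are all hypothetical.

```python
from collections import defaultdict


def top_user_activity(workloads, capacity_cpu_m, capacity_memory_mb):
    """Aggregate requested CPU and memory per user as a share of capacity."""
    totals = defaultdict(lambda: {"cpu_m": 0, "memory_mb": 0})
    for w in workloads:
        totals[w["user"]]["cpu_m"] += w["cpu_m"]
        totals[w["user"]]["memory_mb"] += w["memory_mb"]
    return {
        user: {
            "cpu_share": t["cpu_m"] / capacity_cpu_m,
            "memory_share": t["memory_mb"] / capacity_memory_mb,
        }
        for user, t in totals.items()
    }


def is_maxed_out(workloads, capacity_cpu_m, capacity_memory_mb):
    """The instance type is maxed out if either CPU or memory is exhausted."""
    cpu = sum(w["cpu_m"] for w in workloads)
    memory = sum(w["memory_mb"] for w in workloads)
    return cpu >= capacity_cpu_m or memory >= capacity_memory_mb


# Hypothetical active workloads and capacity at max instances.
workloads = [
    {"user": "alice", "cpu_m": 1000, "memory_mb": 4000},
    {"user": "alice", "cpu_m": 500, "memory_mb": 8000},
    {"user": "bob", "cpu_m": 2000, "memory_mb": 4000},
]
shares = top_user_activity(workloads, capacity_cpu_m=8000, capacity_memory_mb=16000)
print(shares["alice"])  # {'cpu_share': 0.1875, 'memory_share': 0.75}
print(is_maxed_out(workloads, 8000, 16000))  # True: memory is full, CPU is not
```

In this example CPU is well under half used, yet new workloads would still queue because memory is exhausted.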
Over Time
These graphs show both CPU and memory usage over time. Users can toggle between last day and last week to look at the data with different levels of granularity. Last Day is the past 24 hours (rolling) and Last Week is the past 7 days (rolling).
Reading the Graph
- Resource Requested - the total amount of resources requested by all workloads on that partition and instance type at a certain period in time. Resource requested is the memory or CPU set on each workload under Settings. These numbers constrain the scheduling of new workloads.
- Resource Used - the actual amount of resources used by the workloads running on that partition and instance type at a certain period in time.
- Resource Capacity - the total resource amount available for a particular partition and instance type at that period in time. This capacity can scale up based on Platform autoscaling logic if an organization does not have their maximum number of instances set to always on.
Interpreting the Graph
- If the Resource Used line is much lower than the Resource Requested line, users may be over-provisioning their workloads. To be more efficient, users can reduce the amount of memory or CPU that Jobs, Notebooks, or Services request, leaving more resources for other workloads.
- If the Resource Requested line is near or above the Resource Capacity line, the organization is at its maximum capacity. New workloads will be queued until running Jobs have finished or other Notebooks and Services are turned off.
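The two rules of thumb above can be expressed as a small check on a single point in time. The thresholds are assumptions chosen for illustration: this article does not define how much lower counts as “much lower” or how close counts as “near”, and the function name `interpret` is hypothetical.

```python
def interpret(requested, used, capacity,
              over_provision_ratio=0.5, near_capacity_ratio=0.9):
    """Return findings for one sample of Resource Requested, Used, and Capacity.

    over_provision_ratio: flag when used / requested falls below this (assumed value).
    near_capacity_ratio: flag when requested / capacity reaches this (assumed value).
    """
    findings = []
    if requested > 0 and used / requested < over_provision_ratio:
        findings.append("over-provisioned")
    if capacity > 0 and requested / capacity >= near_capacity_ratio:
        findings.append("at capacity")
    return findings


# Memory in MB for three hypothetical samples.
print(interpret(requested=8000, used=2000, capacity=16000))    # ['over-provisioned']
print(interpret(requested=15000, used=14000, capacity=16000))  # ['at capacity']
print(interpret(requested=16000, used=4000, capacity=16000))   # ['over-provisioned', 'at capacity']
```

The third sample is the case worth acting on first: workloads are queuing because requests fill the capacity, even though most of the requested memory is not actually being used.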