User Guide

Welcome to the Metrics Operator user guide! If you come here, we are assuming you have a cluster with the Metrics Operator installed and are interested to submit your own custom resource to create a MetricSet, or that someone has already done it for you. If you are a developer wanting to work on new functionality or features, see our Developer Guides instead.



Our “MetricSet” is mirroring the design of a JobSet, which can combine multiple different things (i.e., metrics) into a cohesive unit. When you create a MetricSet using this operator, we assume that you are primarily interested in measuring an application performance, collecting storage metrics, or using a custom metric provided by the operator that has less stringent requirements.

Logic Flow of Metrics Operator

Given the above assumption, the logic flow of the operator works as follows:

  1. You write a metrics.yaml file that optionally includes an application OR storage description or neither for a custom metric. Typically, you’d provide an application for performance metrics, and storage for IO/filesystem metrics, and neither for a custom metric. In simpler terms, we have three types of MetricSet - Application Metric Sets, Storage Metric Sets, and Standalone Metric Sets.

  2. You also include a list of metrics. Each metric you choose is associated with a type (internal to the operator) that can match to an Application, Storage, or Standalone Metric set. Don’t worry, this is checked for you, and you can use our TBA metrics registry to find metrics of interest.

  3. The operator will create a JobSet for your metrics set. The structure of this JobSet depends on the type (see more below under metrics). Generally:

  • Application metrics create a JobSet with each metric as a sidecar container sharing the process namespace to monitor (they can be given volumes if needed)

  • Storage metrics deploy the metrics as containers and give them access to the volume

  • Standalone metrics can do any custom design needed, and do not require application or storage (but can be provided storage volumes)

Generally, you’ll be defining an application container with one or more metrics to assess performance, or a storage solution with the same, but metrics to assess IO. There are several modes of operation, depending on your choice of metrics.


Quick Install

This works best for production Kubernetes clusters, and comes down to first installing JobSet (not yet part of Kubernetes)

kind create cluster
kubectl apply --server-side -f$VERSION/manifests.yaml

and then downloading the latest Metrics Operator yaml config, and applying it.

kubectl apply -f

Note that from the repository, this config is generated with:

$ make build-config

and then saved to the main branch where you retrieve it from.

Helm Install

We optionally provide an install with helm, which you can do either from the charts in the repository:

$ git clone
$ cd metrics-operator
$ helm install ./chart

Or directly from GitHub packages (an OCI registry):

# helm prior to v3.8.0
$ helm pull oci://

And install!

$ helm install chart-0.1.0.tgz
NAME: metrics-operator
LAST DEPLOYED: Fri Mar 24 18:36:18 2023
NAMESPACE: default
STATUS: deployed

Getting Started

Let’s first review how this works.

  1. We provide metrics here to assess performance, storage, networking, and other custom cases (called standalone).

  2. You can choose one or more metrics to run alongside your application or storage (volumes) and measure.

  3. The metric output is printed in pod logs with a standard packaging (e.g., sections and headers) to distinguish output sections.

  4. We provide a Python module metricsoperator that can help you run an experiment, applying the metrics.yaml and then retrieving and parsing logs.

For the last step, this is important because every metric tool is a special snowflake, outputting some custom format that is hard to parse and then plot. By providing a parser paired with each metric, we hope to provide an easy means to do this so you can go from data collection to results more quickly. Now let’s review a suggested set of steps for you as a new user! You can:

  1. First choose one or more metrics, request a metric be added, or start with a pre-created examples. Often if you want to measure an application or storage or “other” (e.g., networking) we already have a metrics.yaml and associated parser suited for your needs.

  2. Run the metric directly from the metrics.yaml, or use the Python module metricsoperator to run and collect output.

  3. Plot the results, and you’re done!

For step 2, you can always store output logs and then parse them later if desired. For a quick start, you can likely explore our examples directory, which has both examples we use in testing along with full Python examples that will use the associated metrics.yaml files, but also submit and parse the output. Once you are comfortable with the basics, you can browse our available metrics and either design your own, or request a metric be added. Our goals are to make these easy to deploy with minimal complexity for you, so we are happy to help. We also encourage you to share examples and experiments that you put together here for others to use.


For all metric types, the following applies:

  1. You can create more than one pod (scale the metric) as you see fit.

  2. There is always a headless service provided for metrics within the JobSet to make use of.

  3. The definition of metrics in your metrics.yaml file is consistent across types.

  4. Each metric type in the list can take a rate, completions, and custom options.

For another overview of these designs, please see the developer docs.

Application Metric Set

An application metric set includes one or more metrics for measuring application performance. We take two strategies:

  • Share the process namespace, giving access of the metric container to the process space of the application

  • Share a volume on the filesystem, allowing content from the metrics container to be used in the application container

Let’s walk through an example. In the image below, you want to run one or more custom metrics to measure performance for your application of choice.


You’ll do this by writing a metrics.yaml file (left panel) that defines the application of interest, which in this case in LAMMPS. This will be handed to the metrics operator (middle panel) that will validate your MetricSet and prepare to deploy, and the result is a JobSet (right panel) that includes a Job with one or more containers alongside your application. Let’s look at this process in more detail. Here is what the metrics.yaml file might look like. Note that the image above defines two metrics, but the YAML file below only shows a list of one.

kind: MetricSet
  labels: metricset metricset-sample
  name: metricset-sample
    command: mpirun lmp -v x 1 -v y 1 -v z 1 -in in.reaxc.hns -nocite
    - name: perf-sysstat

It was a design choice that using an application container in this context requires no changes to the container itself. You simply need to know what the entrypoint command is, and this will allow the metric sidecar containers to monitor it. In our case, for our command we are issuing mpirun, and that’s what we want to monitor. Thus, the image and command attributes are the only two required for a basic application setup. For the next section, “metrics” we’ve found an application metric (so it can be used for an Application Metric Set) that we like called perf-sysstat, and we add it to the list. We could easily have added more, because one application run can be monitored by several tools, but we will keep it simple for this example. Next, let’s submit this to the metrics operator.

$ kubectl apply -f metrics.yaml

When the operator receives the custom resource, it will do some obvious validation, like did you provide application metrics for an application? Did you provide all the required fields? Did you only provide a definition for one metric type? Any errors will result in not creating the MetricSet, and an error in the operator logs. Given that you’ve provided a valid custom resource YAML and one or more application metrics that the operator knows, it will then select your metrics from the set it has internally defined (middle panel). This panel shows that the operator knows about several application (green), storage (yellow), and standalone (red) metrics, and it’s going to combine them into a JobSet that includes your application container to allow each metric to assess performance.


Focusing on the third panel, the way this works is that we create a JobSet with a single replicated job with multiple containers. One container is for your application, and the others are sidecar containers that are running the metrics. Again, because of this design you don’t need to customize or tweak your application container! By way of the shared process namespace and knowing the command you’ve executed, the sidecar containers can easily “peek in” to your application container to monitor the process and save metrics. For this application metric set design, all containers should complete to determine success of the JobSet, and we currently rely on pod logs to get output, however we hope to have a more solid solution for this in the future.

Storage Metric

A storage metric set includes one or more metrics for assessing one or more volumes

If you are interested in measuring the goodness of different kinds of volumes, you might be interested in creating a storage metric set! The design is similar to an application metrics set, however instead of an application of interest, you provide one or more storage volumes of interest. Here is a small example that assumes a host volume:

kind: MetricSet
  labels: metricset metricset-sample
  name: metricset-sample
      # This is the path on the host (e.g., inside kind container)
      hostPath: /tmp/workflow

      # This is the path in the container
      path: /workflow

    - name: io-sysstat
      rate: 10
      completions: 2

In the above, we want to use the storage metric called “io-sysstat” to assess a host volume at /tmp/workflow that is mounted to /workflow in the container. Since a volume could last forever (hypothetically) we ask for two completions 10 seconds apart each. This means we will get data for two timepoints from the metric, and after that, the assessment will be complete. We can also look at this visually:


In the above, we are providing storage metrics (the image has two despite the yaml above showing one) that the operator knows about, along with a storage volume that we want to test. The operator will prepare a JobSet with one replicated job and several containers, where one container is created per storage metric, and the volume bound to each.


In simple terms, a storage metric set will use the volume of interest that you request, and run the tool there. Read/write is important here - e.g., if the metric needs to write to the volume, a read only volume won’t work. Setting up storage is complex, so it’s typically up for you to create the PVC and then the operator will create the volume for it. Keep in mind that you should honor RWX (read write many) vs just RW (read write) depending on the design you choose. Also note that by default, we only create one pod, but if appropriate you can scale up to more.

Standalone Metric

A custom, standalone metric that doesn’t abide by any rules!

The standalone metric is the most interesting of the set, as it doesn’t have a strict requirement for a storage or application definition. We currently have a few flavors of standalone metrics that include:

  • applications that are timed (e.g., LAMMPS)

  • networking tools (e.g., OSU benchmarks and netmark)

By definition, it is “standalone” because it’s going to create a custom JobSet setup for a metric of interest. Because we cannot be certain of how to combine different jobs within this JobSet, we currently only allow one standalone metric to be defined at once. This means that in the diagram below, you see online one standalone metric in the metrics.yaml


As an example, we can look at a standalone metric to run a tool called netmark.

kind: MetricSet
  labels: metricset metricset-sample
  name: metricset-sample
  # Number of indexed jobs to run netmark on
  pods: 4
   - name: network-netmark

     # Custom options for netmark
     # see pkg/metrics/network/netmark.go
       tasks: 4

This is a standalone metric because it creates a JobSet with not one replicated job, but two! There is a launcher container to issue an mpirun command, and one or more worker containers that interact via MPI. This is a simple example, but any design for a JobSet could work here, and hence why the metric is standalone. However, it’s neat that the interface presented to you is consistent - it’s simply a matter of asking for the metric that is known to the operator to be a standalone. The image below also demonstrates that this standalone metric (along with storage or application) can be scaled to more than one pod, if appropriate.


For more detail about this design, see the developer docs.

Containers Available

All containers are provided under The latest tag is the current main branch, a “bleeding edge” version, and we will provide releases when the operator is more stable.

Last update: Sep 17, 2023