Automatic cleanup of completed and idling workloads

Learn how to configure automatic cleanup of completed and idle workloads created by Kaptain components or the Kaptain SDK.

Prerequisites

  • A provisioned Konvoy cluster running Konvoy v1.7.0 or above.

Automatically clean up idle Notebooks

Kaptain Notebooks are the primary interface for end users to interact with the platform. A notebook is a long-running application deployed as a Kubernetes StatefulSet with an attached volume for persisting your working directory. Although notebooks are not supposed to be garbage collected, in some cases they reserve significant amounts of cluster resources to run local (in-notebook) training. Once the training is complete, these resources stay reserved by the idle notebook and are not available to other workloads until they are released.

The Notebook Controller provides functionality called “notebook culling” which can scale down idle notebooks. Scaling down notebooks frees up the resources allocated to those notebooks and makes them available to other workloads.

Jupyter Notebook exposes an endpoint that reports the time of the last activity within a notebook. If the time elapsed since the last activity exceeds the configured limit, the controller scales the underlying StatefulSet to 0 replicas.
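The culling decision described above can be sketched as a comparison between a notebook's idle time and the configured limit. This is a minimal illustration only, not the Notebook Controller's implementation; the function name desired_replicas, the idle_limit parameter, and the 10-minute limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def desired_replicas(last_activity: datetime, now: datetime, idle_limit: timedelta) -> int:
    """Return the replica count a culling controller would want for a notebook.

    Hypothetical helper: 0 once the notebook has been idle for longer than
    idle_limit, otherwise 1.
    """
    idle_for = now - last_activity
    return 0 if idle_for > idle_limit else 1


now = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
limit = timedelta(minutes=10)  # assumed limit, for illustration only

# A notebook that was active 3 minutes ago keeps its replica.
print(desired_replicas(now - timedelta(minutes=3), now, limit))
# A notebook that has been idle for 12 minutes is scaled down.
print(desired_replicas(now - timedelta(minutes=12), now, limit))
```

In this sketch the StatefulSet is scaled down only when the idle time strictly exceeds the limit; how the real controller treats the boundary is not specified here.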

When a notebook is up and running, it is displayed as active in the UI and has one replica in the StatefulSet:

notebook-running.png

kubectl get notebooks.kubeflow.org && kubectl get statefulsets

NAME       AGE
notebook   9m36s
NAME       READY   AGE
notebook   1/1     9m37s

After the notebook has idled longer than the specified culling time, it is scaled down:

notebook-stopped.png

kubectl get notebooks.kubeflow.org && kubectl get statefulsets

NAME       AGE
notebook   12m
NAME       READY   AGE
notebook   0/0     13m

You can resume the notebook from the UI later, and the corresponding StatefulSet is scaled back to 1 replica if there are sufficient resources on the cluster. The workspace volume is automatically attached to the resumed notebook.

The notebook culling feature is disabled by default. To enable it, set the notebookEnableCulling parameter to true during installation, or update an existing Kaptain instance using the following command:

kubectl kudo update --instance kaptain --namespace kubeflow -p notebookEnableCulling=true

See the Configuration Reference for additional parameters for this functionality.

Automatic cleanup of completed Pipeline Runs (Workflows)

Overview

Kubeflow Pipelines rely on Argo Workflows for running workloads. Starting with Kaptain 1.1, Kubeflow Pipelines schedule the workflows in the user namespace, providing better multi-tenant isolation and workload locality. Once all the steps in the pipeline are complete, the Pods corresponding to the pipeline terminate, but the Argo Workflow custom resources (workflows.argoproj.io) remain in the namespace:

kubectl get workflows.argoproj.io

NAME                                          STATUS      AGE
data-passing-btwn-componefjdf8-1-3068851699   Running     17s
dsl-control-structures-rugqkrh-1-2276733026   Succeeded   111s
dsl-control-structures-rugqkrh-2-2259955407   Succeeded   51s
end-to-end-mnist-pipeline-mnrr6

Each step of the pipeline is implemented using a Pod. Pipeline pods are not deleted as long as the workflow that created them is present. Without cleanup, your namespace can become filled with completed pods:

kubectl get pods -l workflows.argoproj.io/workflow=dsl-control-structures-rugqkrh-1-2276733026

NAME                                                     READY   STATUS      RESTARTS   AGE
dsl-control-structures-rugqkrh-1-2276733026-2018045073   0/2     Completed   0          4m11s
dsl-control-structures-rugqkrh-1-2276733026-2405487652   0/2     Completed   0          3m40s
dsl-control-structures-rugqkrh-1-2276733026-3461867059   0/2     Completed   0          3m51s
dsl-control-structures-rugqkrh-1-2276733026-4042755208   0/2     Completed   0          4m1s

Using Python DSL for setting Pipeline TTL

Kubeflow Pipelines provide a Python Domain Specific Language (DSL) that allows you to specify a time-to-live (TTL) for the submitted Pipeline. Here is an excerpt from the Pipeline tutorial:

@dsl.pipeline(
    name="End-to-End MNIST Pipeline",
    description="A sample pipeline to demonstrate multi-step model training, evaluation, export, and serving",
)
def mnist_pipeline(
    input_bucket: str = "tutorial",
    ...
):
    train_and_serve(
        input_bucket=input_bucket,
        ...
    )
    ...
    # TTL for the workflow to persist after completion (1 hour)
    dsl.get_pipeline_conf().set_ttl_seconds_after_finished(60 * 60)

This setting populates the ttlSecondsAfterFinished property in the Argo Workflow definition. The property defines how long the workflow persists after it finishes before the dedicated controller cleans it up.
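To make the TTL arithmetic concrete, the following sketch converts a duration to the number of seconds passed to set_ttl_seconds_after_finished and computes the earliest moment a finished workflow becomes eligible for cleanup. The helper names ttl_seconds and cleanup_time are hypothetical and are not part of the Kubeflow Pipelines SDK or Argo; the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def ttl_seconds(ttl: timedelta) -> int:
    """Convert a duration into whole seconds, the unit the DSL setting expects."""
    return int(ttl.total_seconds())


def cleanup_time(finished_at: Optional[datetime], ttl_after_finished: int) -> Optional[datetime]:
    """Earliest time a workflow becomes eligible for cleanup.

    Hypothetical helper: a workflow that has not finished yet (finished_at is
    None) has no cleanup time, because the TTL only starts counting at completion.
    """
    if finished_at is None:
        return None
    return finished_at + timedelta(seconds=ttl_after_finished)


one_hour = ttl_seconds(timedelta(hours=1))  # same value as 60 * 60 in the excerpt
finished = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)  # illustrative timestamp
print(one_hour)
print(cleanup_time(finished, one_hour))
print(cleanup_time(None, one_hour))
```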

Setting global TTL for completed Pipelines

Kaptain has a global configuration property that sets the default TTL for all created Pipelines; Workflow objects are deleted after the specified amount of time. The default value for this property is 24 hours. This property only affects the Pipeline API Server; the Argo Workflow controller does not read it. Notebook users can still set a TTL via the DSL, but they cannot specify a longer interval than the global property: the Pipeline component always uses the smaller of the DSL value and the global value. Because workflow objects can be useful for debugging, we recommend choosing a conservative value for the global property.
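The rule that the smaller of the two TTL values wins can be sketched as follows. This is an illustration of the rule as stated above, not the Pipeline API Server's code; the function name effective_ttl and the constant name are hypothetical, and the fallback to the global value when no DSL TTL is set is an assumption based on the global property being described as the default.

```python
from typing import Optional

# Default stated in the text: 24 hours, expressed in seconds.
DEFAULT_GLOBAL_TTL_SECONDS = 24 * 60 * 60


def effective_ttl(dsl_ttl: Optional[int], global_ttl: int = DEFAULT_GLOBAL_TTL_SECONDS) -> int:
    """Return the TTL (in seconds) that applies to a pipeline run.

    Hypothetical helper: when the DSL sets no TTL, the global value applies
    (assumption); otherwise the smaller of the two values is used.
    """
    if dsl_ttl is None:
        return global_ttl
    return min(dsl_ttl, global_ttl)


print(effective_ttl(60 * 60))           # DSL asks for 1 hour, below the global limit
print(effective_ttl(7 * 24 * 60 * 60))  # DSL asks for 7 days, capped by the global value
print(effective_ttl(None))              # no DSL value, the global default applies
```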

NOTE: The metadata that belongs to the Pipeline run remains available in the Pipeline UI after the Argo Workflow custom resource is deleted. However, the Pod information and logs for individual Pipeline tasks (steps) are not available after the workflow is deleted, because they are retrieved directly from the workflow custom resource.

To set the default TTL for all Pipelines, install or update the Kaptain instance with the following parameter:

kubectl kudo update --instance kaptain --namespace kubeflow -p workflowsTTLSecondsAfterFinish="<ttl seconds>"

Automatic cleanup for resources created by KFServing

Overview

KFServing serves models over HTTP(S) using the Knative Serving component. When a model is deployed to serving, KFServing creates a set of Knative resources such as Service, Route, and Revision.

There is always one Knative Service per model deployment. However, the number of Revisions can grow over time because every new deployment (a new model version with a new image name) has its own Revision.

When a new Revision is deployed, the deployment associated with the older Revision is scaled to zero replicas, but neither the Revision nor its deployment is deleted. Over time, the number of Revisions and associated deployments can grow significantly. To avoid this overhead, we recommend garbage collecting outdated Revisions.

For example:

$> kubectl get revisions
NAME                          CONFIG NAME              K8S SERVICE NAME               GENERATION   READY
dev-mnist-predictor-c5kzr   dev-mnist-predictor   dev-mnist-predictor-c5kzr                1       True
dev-mnist-predictor-d6tdr   dev-mnist-predictor   dev-mnist-predictor-d6tdr                2       True
dev-mnist-predictor-tqzqw   dev-mnist-predictor   dev-mnist-predictor-tqzqw                3       True

$> kubectl get deployments
NAME                                                  READY   UP-TO-DATE   AVAILABLE   AGE
dev-mnist-predictor-c5kzr-deployment                   0/0        0            0       33m
dev-mnist-predictor-d6tdr-deployment                   0/0        0            0       18m
dev-mnist-predictor-tqzqw-deployment                   1/1        1            1       5m53s

Configure Knative addon cleanup

KFServing itself doesn’t provide controls for garbage collection of stale Revisions. However, the underlying Knative addon that ships with Kaptain has a set of parameters that control it:

Parameter | Default | Description
minNonActiveRevisions | 20 | Minimum number of non-active revisions to retain. If the number of revisions for a service is less than the provided value, the cleanup is not triggered, regardless of the other settings.
retainSinceCreateTime | 48h | Duration that must pass after a revision is created before it is considered for cleanup. The revision must also be non-active to be considered.
retainSinceLastActiveTime | 15h | Duration that must pass after a revision was last active before it is considered for cleanup. An active revision is one that has service network traffic routed to it.
maxNonActiveRevisions | 1000 | Maximum number of non-active revisions to retain. If the maximum is reached, the oldest non-active revision is deleted, regardless of the other settings.
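How these four parameters interact can be sketched in a few lines. The code below is one reading of the parameter descriptions above, not Knative's actual garbage collector; the Revision and GCSettings classes, the revisions_to_delete function, and the sample revisions are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Revision:
    name: str
    created: datetime
    last_active: datetime
    active: bool  # True if service traffic is currently routed to it


@dataclass
class GCSettings:
    # Defaults taken from the parameter list above.
    min_non_active: int = 20
    max_non_active: int = 1000
    retain_since_create: timedelta = timedelta(hours=48)
    retain_since_last_active: timedelta = timedelta(hours=15)


def revisions_to_delete(revisions: List[Revision], now: datetime, cfg: GCSettings) -> List[str]:
    """Return names of revisions this sketch would garbage collect, oldest first."""
    # Only non-active revisions are candidates; consider the oldest first.
    non_active = sorted((r for r in revisions if not r.active), key=lambda r: r.created)
    # Number of revisions that may go without dropping below the minimum.
    spare = len(non_active) - cfg.min_non_active
    if spare <= 0:
        return []
    doomed: List[str] = []
    for rev in non_active:
        if len(doomed) >= spare:
            break
        # A revision is stale only when both retention windows have passed.
        stale = (now - rev.created > cfg.retain_since_create
                 and now - rev.last_active > cfg.retain_since_last_active)
        if stale:
            doomed.append(rev.name)
    # Enforce the maximum regardless of the retention windows.
    remaining = [r for r in non_active if r.name not in doomed]
    excess = max(len(remaining) - cfg.max_non_active, 0)
    doomed.extend(r.name for r in remaining[:excess])
    return doomed


now = datetime(2021, 6, 10, 12, 0, tzinfo=timezone.utc)
revs = [
    Revision("rev-1", now - timedelta(hours=72), now - timedelta(hours=50), False),
    Revision("rev-2", now - timedelta(hours=60), now - timedelta(hours=20), False),
    Revision("rev-3", now - timedelta(hours=1), now, True),
]
short = timedelta(seconds=1)
print(revisions_to_delete(revs, now, GCSettings()))                         # defaults
print(revisions_to_delete(revs, now, GCSettings(0, 1000, short, short)))    # keep latest only
print(revisions_to_delete(revs, now, GCSettings(1, 1000, short, short)))    # keep one non-active
```

With the default minimum of 20, a service with only a handful of revisions is never cleaned up in this sketch, which is why lowering minNonActiveRevisions matters for services that are deployed infrequently.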

Update Knative addon configuration

To specify or update the Knative addon configuration, edit the corresponding section of the cluster.yaml file and specify the values for the garbage collection settings:

- configRepository: https://github.com/mesosphere/kubeaddons-kaptain
  configVersion: stable-1.20-1.3.0
  addonsList:
    - name: knative
      enabled: true
      values: |
        serving:
          gc:
            retainSinceCreateTime: "48h"
            retainSinceLastActiveTime: "15h"
            minNonActiveRevisions: "20"
            maxNonActiveRevisions: "1000"

After updating the settings, run konvoy deploy addons to apply the changes.

NOTE: The Knative Controller automatically reloads configuration changes when the addon is updated. However, this doesn't trigger an automatic revision cleanup. The cleanup will be performed upon the next service version deployment.

Example configurations

If you only need to keep the latest revision of each model, the following settings can be used:

- configRepository: https://github.com/mesosphere/kubeaddons-kaptain
  configVersion: stable-1.20-1.3.0
  addonsList:
    - name: knative
      enabled: true
      values: |
        serving:
          gc:
            minNonActiveRevisions: "0"
            retainSinceCreateTime: "1s"
            retainSinceLastActiveTime: "1s"

Example configuration that retains the last ten non-active revisions:

- configRepository: https://github.com/mesosphere/kubeaddons-kaptain
  configVersion: stable-1.20-1.3.0
  addonsList:
    - name: knative
      enabled: true
      values: |
        serving:
          gc:
            minNonActiveRevisions: "10"
            retainSinceCreateTime: "1s"
            retainSinceLastActiveTime: "1s"