Kaptain SDK Troubleshooting Guide
Kubernetes resources created by Kaptain SDK
This section covers the resources created by the Kaptain Python SDK when building, training, tuning, and serving.
Building
The following Kubernetes resources are created when building the model's Docker image:
- Secret with Docker credentials for Docker registry authorization.
- Secret with S3/Minio credentials.
- Secret with a Docker registry certificate for secure communications with a Docker registry.
- job.batch that builds the image with the model training code and dependencies.
All the resources listed above are removed upon successful build completion or notebook cell interruption.
Training
The following Kubernetes resources are created during model training:
- Resources created by the build step if image rebuilding is needed.
- Either a tfjob.kubeflow.org or pytorchjob.kubeflow.org for running distributed training.
By default, training jobs are kept after completion for troubleshooting purposes. They are removed after completion only if the force_cleanup parameter of the Model.train() method or the KAPTAIN_SDK_FORCE_CLEANUP environment variable is set to True.
All the resources listed above are removed upon notebook cell interruption.
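The cleanup rule for completed training jobs can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the SDK's actual implementation: the helper name cleanup_requested is hypothetical, and treating the environment variable as a case-insensitive "true" string is an assumption.

```python
import os
from typing import Mapping, Optional

# Name of the environment variable mentioned in this guide.
FORCE_CLEANUP_ENV = "KAPTAIN_SDK_FORCE_CLEANUP"


def cleanup_requested(
    force_cleanup: Optional[bool] = None,
    environ: Mapping[str, str] = os.environ,
) -> bool:
    """Return True if a completed training job should be removed.

    Hypothetical helper: cleanup is requested when either the
    force_cleanup argument or the environment variable is true.
    """
    # Assumption: the variable is compared case-insensitively to "true".
    env_flag = environ.get(FORCE_CLEANUP_ENV, "").strip().lower() == "true"
    return bool(force_cleanup) or env_flag
```

With neither setting present the function returns False, which corresponds to the default behavior of keeping the finished job around for inspection.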
Hyperparameter Tuning
The following Kubernetes resources are created during model tuning:
- Resources created by the build step if image rebuilding is needed.
- experiment.kubeflow.org for orchestrating the tuning of either tfjobs.kubeflow.org or pytorchjobs.kubeflow.org.
If the delete_experiment flag is set to True in the Model.tune() function, the experiment.kubeflow.org will be cleaned up on successful completion of the tuning step. All the created resources listed above are removed upon notebook cell interruption.
Serving
Serving of machine learning models is implemented by KFServing, the component responsible for model serving over HTTP(S), which relies on Knative Serving. When a model is deployed to serving, KFServing creates a set of Knative resources such as Service, Route, and Revision. There is always one Knative Service per model deployment. However, the number of Revisions can grow over time because every new deployment, for example a new model version with a new image name, gets its own Revision. When a new Revision is deployed, the deployment associated with the older Revision is scaled to zero replicas and kept.
The following Kubernetes resources are created during model deployment:
- Secret with S3/Minio credentials to access the MinIO bucket with a stored model.
- Knative resources: service.serving.knative.dev, route.serving.knative.dev and revision.serving.knative.dev.
- Inference service serving.kubeflow.org/InferenceService.
The Secret with S3/Minio credentials will be removed on successful completion or cell interruption.
All the created resources listed above are removed upon notebook cell interruption.
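When troubleshooting, it can help to check which of the resources described in this section still exist in a namespace. The following is a minimal sketch that only assembles a kubectl get command for each step. The step names, the RESOURCES_BY_STEP mapping, the kubectl_get_command helper, and the "demo" namespace are illustrative assumptions and not part of the Kaptain SDK. The resource names are the plural forms of the kinds listed above.

```python
from typing import Dict, List

# Resource types per SDK step, taken from the lists in this section.
# The step keys ("build", "train", "tune", "serve") are illustrative.
RESOURCES_BY_STEP: Dict[str, List[str]] = {
    "build": ["secrets", "jobs.batch"],
    "train": ["tfjobs.kubeflow.org", "pytorchjobs.kubeflow.org"],
    "tune": [
        "experiments.kubeflow.org",
        "tfjobs.kubeflow.org",
        "pytorchjobs.kubeflow.org",
    ],
    "serve": [
        "secrets",
        "inferenceservices.serving.kubeflow.org",
        "services.serving.knative.dev",
        "routes.serving.knative.dev",
        "revisions.serving.knative.dev",
    ],
}


def kubectl_get_command(step: str, namespace: str) -> List[str]:
    """Build the argument list for listing one step's resources."""
    resources = RESOURCES_BY_STEP[step]  # raises KeyError for unknown steps
    # kubectl accepts several resource types separated by commas.
    return ["kubectl", "get", ",".join(resources), "--namespace", namespace]


if __name__ == "__main__":
    # Print the command for each step; "demo" is a placeholder namespace.
    for step_name in RESOURCES_BY_STEP:
        print(" ".join(kubectl_get_command(step_name, "demo")))
```

The returned list can be passed to subprocess.run on a machine where kubectl is configured for the cluster. Listing revisions.serving.knative.dev is a simple way to see how many older Revisions have accumulated for a deployment.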