The tensorboard controller doesn’t provide a way to configure resource quotas for tensorboard deployments. So if you have resourcequotas setup for your cluster, then tensorboard deployments will fail with the below error :
4m56s Warning FailedCreate replicaset/sample-v0214-55f5485c5b
Error creating: pods "sample-v0214-55f5485c5b-mb2lz" is forbidden: failed quota: kf-resource-quota: must specify cpu for: tensorboard; memory for: tensorboard
One workaround is to dynamically patch the deployment requests received by the kube-api server using an admission controller.
Quick overview of an admission controller phases :
sequenceDiagram participant User participant APIServer as Kubernetes API Server participant Auth as Authentication + Authorization participant Mutate as Mutating Webhook(s) participant ValidatePolicies as Validating Admission Policies participant ValidateWebhooks as Validating Webhook(s) User->>APIServer: Request (e.g., create a pod) APIServer->>Auth: Authenticate and authorize Auth-->>APIServer: Result loop For each mutating webhook APIServer->>Mutate: Invoke webhook Mutate-->>APIServer: Modify or reject object end loop For each validating policy APIServer->>ValidatePolicies: Invoke policy ValidatePolicies-->>APIServer: Reject if needed end par All validating webhooks APIServer->>ValidateWebhooks: Invoke webhook ValidateWebhooks-->>APIServer: Reject if needed end APIServer-->>User: Response (success or error)
We can use a policy engine named kyverno which basically runs as a dynamic admission controller.
It receives validating and mutating admission webhook HTTP callbacks from kube api server and applies the matching policies to return the results that enforce admission policies or reject requests before those are registered with the cluster.
Following is the cluster wide policy we can use for patching tensorboard deployments :
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: inject-tensorboard-defaults
annotations:
policies.kyverno.io/title: TensorBoard Default Resources
policies.kyverno.io/category: Kubeflow
policies.kyverno.io/severity: medium
spec:
validationFailureAction: enforce
background: true
rules:
- name: inject-default-resources
match:
any:
- resources:
kinds:
- tensorboard.kubeflow.org/v1alpha1/Tensorboard
namespaces:
- sample-ns
preconditions:
any:
- key: "{{ request.operation }}"
operator: In
value: ["CREATE", "UPDATE"]
mutate:
patchStrategicMerge:
spec:
resources:
requests:
+(cpu): "500m"
+(memory): "1Gi"
limits:
+(cpu): "1"
+(memory): "2Gi"
Make sure to add the full tensorboard deployment name(API Version + Kind) under resources/kinds.
Kyverno basically listens for new tensboard deployments and applies this policy which injects cpu and memory requests if all the conditions in match
argument are succesfully validated.
Check the logs of kyverno admission contoller whether it was able to succesfully patch the requests or not.
If you see such similar logs :
INF github.com/kyverno/kyverno/pkg/auth/auth.go:83 >
disallowed operation evaluationError= gvr="tensorboard.kubeflow.org/v1alpha1, Resource=tensorboards" kind=Tensorboard logger=auth namespace= reason= v=0 verb=get
then it means the request is blocked due to lack of permission and we need to update the kyverno clusterole :
kubectl patch clusterrole kyverno:admission-controller:core --type=json -p='[
{
"op": "add",
"path": "/rules/-",
"value": {
"apiGroups": ["tensorboard.kubeflow.org"],
"resources": ["tensorboards"],
"verbs": ["get", "list", "watch", "update", "patch"]
}
}
]'
and then restart the admission controller deployment.
Describe the tensorboard pod to check whether you see the resource requests/limits or not.