Troubleshooting

Common issues, debugging steps, and solutions for the CSI driver.

Overview

This guide walks through status checks, common failure modes, debug commands, performance tuning, monitoring, and recovery procedures for the rclone CSI driver.

Check Driver Status

Verify Driver Pods

# Check controller pods
kubectl get pods -n veloxpack -l app=csi-rclone-controller

# Check node pods
kubectl get pods -n veloxpack -l app=csi-rclone-node

# Check pod status
kubectl describe pod -n veloxpack -l app=csi-rclone-controller
kubectl describe pod -n veloxpack -l app=csi-rclone-node
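
To spot unhealthy pods at a glance, the pod listing can be filtered down to the ones that are not fully ready. The following is a minimal sketch, assuming the namespace and labels used in this guide and the default `kubectl get pods` column order (NAME, READY, STATUS); the function name `not_ready_pods` is hypothetical.

```shell
# Print driver pods whose STATUS is not Running or whose READY count is short.
# Assumes the default column order of "kubectl get pods": NAME READY STATUS ...
not_ready_pods() {
  # $1: label selector, e.g. app=csi-rclone-node
  kubectl get pods -n veloxpack -l "$1" --no-headers |
    awk '{ split($2, r, "/"); if ($3 != "Running" || r[1] != r[2]) print $1, $2, $3 }'
}

# Usage (not run here):
#   not_ready_pods app=csi-rclone-controller
#   not_ready_pods app=csi-rclone-node
```

An empty result means every matching pod reports Running with all containers ready.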

Check CSIDriver Resource

# Check CSIDriver
kubectl get csidriver rclone.csi.veloxpack.io

# Get detailed information
kubectl describe csidriver rclone.csi.veloxpack.io

Verify Driver Functionality

# Check that the plugin binary runs in a node pod
# (kubectl exec takes a pod name or TYPE/NAME, not a -l selector)
kubectl exec -n veloxpack daemonset/csi-rclone-node -- /rcloneplugin --help

# Check driver version information (printed at startup, so search the full log)
kubectl logs -n veloxpack -l app=csi-rclone-node --tail=-1 | grep -A 10 "DRIVER INFORMATION"
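
When a check has to run on one particular node, it helps to resolve the name of the node pod scheduled there first, since `kubectl exec` targets a single pod. This sketch assumes the namespace and label used in this guide; `node_pod_on` and `worker-1` are hypothetical names.

```shell
# Print the name of the driver's node pod scheduled on a given Kubernetes node.
node_pod_on() {
  # $1: Kubernetes node name
  kubectl get pods -n veloxpack -l app=csi-rclone-node \
    --field-selector "spec.nodeName=$1" \
    -o jsonpath='{.items[0].metadata.name}'
}

# Usage (not run here; "worker-1" is a placeholder node name):
#   kubectl exec -n veloxpack "$(node_pod_on worker-1)" -- ls -la /dev/fuse
```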

Common Issues

1. Driver Pods Not Starting

Symptoms:

  • Pods stuck in Pending or CrashLoopBackOff
  • Driver not responding to CSI calls

Causes:

  • FUSE not installed on nodes
  • Insufficient permissions
  • Resource constraints
  • Image pull issues

Solutions:

# Check node capabilities
kubectl describe node <node-name>

# Check if FUSE is available (kubectl exec needs a pod or TYPE/NAME, not -l)
kubectl exec -n veloxpack daemonset/csi-rclone-node -- ls /dev/fuse

# Check resource limits
kubectl describe pod -n veloxpack -l app=csi-rclone-controller

# Check image pull
kubectl describe pod -n veloxpack -l app=csi-rclone-controller | grep -i image

2. Volume Mount Failures

Symptoms:

  • PVC stuck in Pending
  • Mount operations failing
  • Pods can't access mounted volumes

Causes:

  • Invalid rclone configuration
  • Network connectivity issues
  • Authentication failures
  • Permission errors

Solutions:

# Check PVC events
kubectl describe pvc <pvc-name>

# Check pod events
kubectl describe pod <pod-name>

# Check driver logs
kubectl logs -n veloxpack -l app=csi-rclone-node --tail=100

# Verify secret contents
kubectl get secret <secret-name> -o yaml
kubectl get secret <secret-name> -o jsonpath='{.data.configData}' | base64 -d
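
The `describe` output mixes events with a lot of other detail. As a rough sketch, the warning events for a single PVC can be pulled out directly with event field selectors; the function name `pvc_warnings` is hypothetical.

```shell
# Show only Warning events that reference a given PVC, oldest first.
pvc_warnings() {
  # $1: PVC name, $2: namespace the PVC lives in
  kubectl get events -n "$2" \
    --field-selector "involvedObject.kind=PersistentVolumeClaim,involvedObject.name=$1,type=Warning" \
    --sort-by=.lastTimestamp
}

# Usage (not run here):
#   pvc_warnings <pvc-name> <namespace>
```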

3. Authentication Failures

Symptoms:

  • Mount operations fail with authentication errors
  • Driver logs show credential issues

Causes:

  • Invalid credentials in secrets
  • Expired tokens
  • Incorrect configuration format

Solutions:

# Check secret data
kubectl get secret rclone-secret -o jsonpath='{.data.configData}' | base64 -d

# Confirm the plugin binary runs in a node pod
kubectl exec -n veloxpack daemonset/csi-rclone-node -- /rcloneplugin --help

# Test the configuration against the backend (replace the placeholder credentials)
kubectl exec -n veloxpack daemonset/csi-rclone-node -- sh -c 'echo "[s3]
type = s3
provider = AWS
access_key_id = YOUR_KEY
secret_access_key = YOUR_SECRET
region = us-east-1" > /tmp/test.conf && rclone lsd s3:test-bucket --config /tmp/test.conf'
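
An incorrect configuration format can often be caught before any mount is attempted. The sketch below is a loose structural check only, assuming the decoded `configData` is an rclone INI-style config like the one above: it passes when there is at least one `[remote]` section and every section has a `type` key. It does not validate credentials, and `check_rclone_conf` is a hypothetical name.

```shell
# Read an rclone config on stdin. Exit 0 only if it contains at least one
# [section] and every section defines a "type" key.
check_rclone_conf() {
  awk '
    /^\[.+\]/ {
      if (sections > 0 && !has_type) bad = 1   # previous section had no type
      sections++; has_type = 0; next
    }
    /^[ \t]*type[ \t]*=/ { if (sections > 0) has_type = 1 }
    END {
      if (sections > 0 && !has_type) bad = 1   # last section had no type
      exit ((sections == 0 || bad) ? 1 : 0)
    }
  '
}

# Usage (not run here):
#   kubectl get secret rclone-secret -o jsonpath='{.data.configData}' \
#     | base64 -d | check_rclone_conf && echo "config structure looks OK"
```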

4. Performance Issues

Symptoms:

  • Slow file operations
  • High memory usage
  • Timeout errors

Causes:

  • Inadequate VFS cache configuration
  • Network latency
  • Resource constraints

Solutions:

# Check VFS cache settings
kubectl describe pv <pv-name> | grep -i mount

# Monitor resource usage
kubectl top pods -n veloxpack -l app=csi-rclone-node

# Adjust cache settings
# Add to StorageClass mountOptions:
# - vfs-cache-mode=writes
# - vfs-cache-max-size=10G
# - dir-cache-time=30s
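
To confirm which options actually reached a volume, the PV's `spec.mountOptions` can be read directly instead of grepping the `describe` output. This is a small sketch; `pv_mount_options` is a hypothetical helper name.

```shell
# Print the mount options recorded on a PersistentVolume, one per line.
pv_mount_options() {
  # $1: PV name
  kubectl get pv "$1" -o jsonpath='{.spec.mountOptions[*]}' | tr ' ' '\n'
}

# Usage (not run here):
#   pv_mount_options <pv-name> | grep '^vfs-cache'
```

Dynamically provisioned PVs inherit `mountOptions` from their StorageClass at creation time, so a change to the StorageClass only shows up on volumes provisioned afterwards.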

5. Network Connectivity Issues

Symptoms:

  • Timeout errors
  • Connection refused
  • Slow operations

Causes:

  • Network policies blocking access
  • DNS resolution issues
  • Firewall rules

Solutions:

# Test DNS resolution from a driver pod
kubectl exec -n veloxpack daemonset/csi-rclone-node -- nslookup s3.amazonaws.com

# Check network policies in all namespaces
kubectl get networkpolicies -A

# Test from node
kubectl debug node/<node-name> -it --image=busybox -- nslookup s3.amazonaws.com

Debug Commands

Check Driver Logs

# Controller logs
kubectl logs -n veloxpack -l app=csi-rclone-controller --tail=100

# Node logs
kubectl logs -n veloxpack -l app=csi-rclone-node --tail=100

# Follow logs
kubectl logs -n veloxpack -l app=csi-rclone-node -f

# Previous container logs
kubectl logs -n veloxpack -l app=csi-rclone-node --previous
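
Long logs are easier to scan when reduced to warnings and errors. The filter below is a sketch that assumes the driver emits klog-style lines (a severity letter followed by the date, such as `E0102 ...`), which the `--v` and `--logtostderr` flags suggest but this guide does not confirm; `klog_problems` is a hypothetical name.

```shell
# Keep only klog-style Warning, Error, and Fatal lines from stdin.
klog_problems() {
  grep -E '^[WEF][0-9]{4} '
}

# Usage (not run here):
#   kubectl logs -n veloxpack -l app=csi-rclone-node --tail=-1 | klog_problems
```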

Check Mount Points

# List mount points (kubectl exec needs a pod or TYPE/NAME, not -l)
kubectl exec -n veloxpack daemonset/csi-rclone-node -- mount | grep rclone

# Check mount options
kubectl exec -n veloxpack daemonset/csi-rclone-node -- cat /proc/mounts | grep rclone

# Check that the FUSE device exists
kubectl exec -n veloxpack daemonset/csi-rclone-node -- ls -la /dev/fuse

Check Volume Status

# Check PVC status
kubectl get pvc <pvc-name> -o yaml

# Check PV status
kubectl get pv <pv-name> -o yaml

# Check pod volume mounts
kubectl describe pod <pod-name> | grep -A 10 "Volumes:"

Check Events

# All events
kubectl get events --sort-by=.metadata.creationTimestamp

# Events for specific resource
kubectl get events --field-selector involvedObject.name=<resource-name>

# Warning events only, oldest first
kubectl get events --sort-by=.metadata.creationTimestamp --field-selector type=Warning

Enable Debug Logging

Driver Logging

# In controller deployment
args:
  - "--v=5"
  - "--logtostderr=true"
  - "--stderrthreshold=INFO"

# In node daemonset
args:
  - "--v=5"
  - "--logtostderr=true"
  - "--stderrthreshold=INFO"

FUSE Debugging

# Add to StorageClass mountOptions
mountOptions:
  - debug-fuse
  - v=5

Performance Tuning

VFS Cache Configuration

# High performance configuration
mountOptions:
  - vfs-cache-mode=full
  - vfs-cache-max-size=50G
  - vfs-cache-max-age=24h
  - dir-cache-time=5m
  - vfs-read-ahead=1M

Resource Limits

# Controller resources
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "1Gi"
    cpu: "1000m"

# Node resources
resources:
  requests:
    memory: "512Mi"
    cpu: "200m"
  limits:
    memory: "2Gi"
    cpu: "2000m"

Monitoring

Key Metrics to Monitor

  1. Driver Health

    • Pod status and restarts
    • Memory and CPU usage
    • Log error rates

  2. Volume Operations

    • Mount/unmount success rates
    • Operation latency
    • Error rates

  3. Storage Backend

    • API call latency
    • Error rates
    • Throughput
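
For the driver health items, restart counts are available from the pod status without any metrics stack. The following sketch sums container restarts per pod, assuming the namespace and labels used in this guide; `pod_restarts` is a hypothetical name.

```shell
# Print "<pod-name> <total container restarts>" for each matching driver pod.
pod_restarts() {
  # $1: label selector, e.g. app=csi-rclone-node
  kubectl get pods -n veloxpack -l "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}' |
    awk '{ total = 0; for (i = 2; i <= NF; i++) total += $i; print $1, total }'
}

# Usage (not run here):
#   pod_restarts app=csi-rclone-node
```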

Prometheus Metrics

# Add to driver deployment
args:
  - "--metrics-address=:8080"
  - "--metrics-path=/metrics"

Recovery Procedures

Restart Driver Pods

# Restart controller
kubectl rollout restart deployment/csi-rclone-controller -n veloxpack

# Restart node daemonset
kubectl rollout restart daemonset/csi-rclone-node -n veloxpack
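
`kubectl rollout restart` returns as soon as the restart is requested. A sketch that also waits for the rollout to finish, using the workload names from the commands above (`restart_and_wait` and the 120s default timeout are hypothetical choices):

```shell
# Restart a driver workload and block until the rollout completes or times out.
restart_and_wait() {
  # $1: workload, e.g. daemonset/csi-rclone-node
  # $2: optional timeout, defaults to 120s
  kubectl rollout restart "$1" -n veloxpack &&
    kubectl rollout status "$1" -n veloxpack --timeout="${2:-120s}"
}

# Usage (not run here):
#   restart_and_wait deployment/csi-rclone-controller
#   restart_and_wait daemonset/csi-rclone-node 300s
```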

Clean Up Corrupted Mounts

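# Force unmount on a specific node: target the node pod running on that node.
# The glob must be expanded by a shell inside the container.
kubectl exec -n veloxpack <node-pod-name> -- sh -c 'for m in /var/lib/kubelet/pods/*/volumes/kubernetes.io~csi/*/mount; do umount -f "$m"; done'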

# Restart node daemonset
kubectl rollout restart daemonset/csi-rclone-node -n veloxpack

Reset Driver State

# Delete CSIDriver resource
kubectl delete csidriver rclone.csi.veloxpack.io

# Recreate
kubectl apply -f deploy/csi-rclone-driverinfo.yaml

Getting Help

Log Collection

# Collect logs for debugging
# (--tail=-1 returns the full log; with -l the default is only the last 10 lines)
kubectl logs -n veloxpack -l app=csi-rclone-controller --tail=-1 > controller.log
kubectl logs -n veloxpack -l app=csi-rclone-node --tail=-1 > node.log
kubectl get events --sort-by=.metadata.creationTimestamp > events.log
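
When several pods and containers are involved, it can be convenient to gather everything into one directory. This is a minimal sketch assuming the namespace and labels used in this guide; `collect_driver_logs` is a hypothetical name, and unlike the commands above it reads events from the `veloxpack` namespace only.

```shell
# Write full controller and node logs plus namespace events into a directory.
collect_driver_logs() {
  # $1: output directory (created if missing)
  mkdir -p "$1" || return 1
  for app in csi-rclone-controller csi-rclone-node; do
    # --prefix tags each line with its pod and container name
    kubectl logs -n veloxpack -l "app=$app" --all-containers --prefix --tail=-1 \
      > "$1/$app.log" 2>&1
  done
  kubectl get events -n veloxpack --sort-by=.metadata.creationTimestamp \
    > "$1/events.log" 2>&1
}

# Usage (not run here):
#   collect_driver_logs ./csi-rclone-debug
```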

Support Resources
