Moving data
Into the cluster
If you want to send data between the cluster storage and some other storage outside of it (including your local computer), you have several options. Choosing one depends on the volume of your data, the number of files, and where your data is now (and whether that location is reachable from the cluster).
Kubernetes pods have local addresses and are not accessible from outside, so your options are pretty much limited to either accessing the cluster storage itself (S3), or pulling the data into the pod. If you're not using the services provided by the cluster (S3 or Nextcloud), you'll have to mount a persistent storage volume into your pod.
"kubectl cp" command
The most straightforward way to copy data is using the kubectl cp command.
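For example, to copy a file into and out of a running pod (the namespace, pod, and path names below are placeholders):

kubectl cp ./local-file.txt your-namespace/your-pod:/tmp/local-file.txt
kubectl cp your-namespace/your-pod:/results/output.txt ./output.txt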
You should NOT use this method for any large amount of data. The data goes through our API (management) server, which does not have a fast connection, and you'll be affecting cluster performance if you send more than a couple of megabytes through it.
Using S3 object storage provided by the cluster
This is the most scalable way, and you can transfer the largest volume and number of files using it. Refer to our S3 documentation on how to request an account and set up one of the clients, both outside and inside the cluster, to access the data.
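As a rough sketch using the AWS CLI (the endpoint URL and bucket name below are placeholders; the actual values and client setup are covered in the S3 documentation):

aws s3 cp ./my-dataset s3://<your-bucket>/my-dataset --recursive --endpoint-url https://<s3-endpoint>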
Using the Nextcloud instance
Our Nextcloud instance provides a convenient way to sync data from your local machine, but this method is neither very scalable nor fast. Also, you'll still have to copy the data from Nextcloud into the pod to use it. (The page provides a setup example for rclone, which uses the Nextcloud WebDAV interface.)
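For reference, a minimal rclone remote for the Nextcloud WebDAV interface looks roughly like this (the hostname, username, and password are placeholders; see the linked page for the exact settings):

[nextcloud]
type = webdav
url = https://<nextcloud-host>/remote.php/dav/files/<username>/
vendor = nextcloud
user = <username>
pass = <password obscured with "rclone obscure">

With the remote configured, a sync is a single command, e.g. rclone copy ./local-data nextcloud:my-dataset.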
Pulling data from inside the pod
If your data is located in a storage that can be accessed from the outside (any cloud provider, a server, etc.), it might be easier to pull the data into your pod. Depending on the data size, you can either:
- Run an idle pod, kubectl exec into it, and run the copy command manually. You'll have to set up the credentials to access the remote storage inside the pod.
- Run a batch job which will do this for you. In this case the pod needs to have the credentials set up at the time it starts. This works better for large datasets, since you don't have to keep your shell open, and the job will restart automatically if the pod is killed for some reason (see the example job below).
The tools you can use include scp (requires you to set up an SSH key or type the password by hand), rclone (supports many storage backends, and you can copy over the config file you generated locally), wget/curl (for pulling data from HTTP servers), or any other tool that can access your dataset.
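As an example, here's a rough sketch of a batch job that pulls an archive over HTTP into a mounted volume (the image, URL, and claim name are placeholders to adapt):

apiVersion: batch/v1
kind: Job
metadata:
  name: pull-data
spec:
  template:
    spec:
      containers:
        - name: pull
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            # replace the URL with the actual location of your dataset
            - "curl -L -o /data/dataset.tar.gz https://example.com/dataset.tar.gz"
          volumeMounts:
            - mountPath: /data
              name: dest
          resources:
            limits:
              cpu: "1"
              memory: 1G
            requests:
              cpu: "1"
              memory: 1G
      restartPolicy: OnFailure
      volumes:
        - name: dest
          persistentVolumeClaim:
            claimName: <your claim>
  backoffLimit: 2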
Using secrets
If you need to provide credentials to your data puller as a file, the best way is to use a Kubernetes Secret. Create a secret in your namespace from a file:
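# the secret name (my-secret) and key name (secret.key) are examples matching the YAML below
kubectl create secret generic my-secret --from-file=secret.key=/path/to/your/key_file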
Check it's created:
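kubectl get secret my-secret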
Then mount it into your pod as a folder:
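# every key in the secret becomes a file under the mount path (/secrets here is an example)
volumeMounts:
  - mountPath: /secrets
    name: sec
volumes:
  - name: sec
    secret:
      secretName: my-secret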
Or use as a file:
volumeMounts:
  - mountPath: /secrets/secret.key
    subPath: secret.key
    name: sec
volumes:
  - name: sec
    secret:
      secretName: my-secret
Or even use it as an environment variable:
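# the variable name SECRET_KEY is an example; the key matches the file the secret was created from
env:
  - name: SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: my-secret
        key: secret.key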
Inside the cluster
To copy the data from one volume to another (which can be located in a different region), you can use a pod (interactively) or a job (in batch mode).
Here's an example of a job using the gsutil image to copy data in parallel between two mounted PVCs:
apiVersion: batch/v1
kind: Job
metadata:
  name: copy-data
spec:
  template:
    spec:
      containers:
        - image: gitlab-registry.nrp-nautilus.io/prp/gsutil:latest
          command:
            - bash
            - -c
            - "gsutil -m rsync -e -r -P /from /to/"
          imagePullPolicy: Always
          name: backup
          volumeMounts:
            - mountPath: /from
              name: source
            - mountPath: /to
              name: target
          resources:
            limits:
              cpu: "4"
              memory: 4G
            requests:
              cpu: "4"
              memory: 4G
      nodeSelector:
        topology.kubernetes.io/region: us-central
      restartPolicy: Never
      volumes:
        - name: target
          persistentVolumeClaim:
            claimName: <claim to>
        - name: source
          persistentVolumeClaim:
            claimName: <claim from>
            readOnly: true
  backoffLimit: 1