Airbyte allows scaling sync workloads horizontally using Kubernetes. The core components (api server, scheduler, etc) run as deployments while the scheduler launches connector-related pods on different nodes.
If you don't want to configure your own K8s cluster and Airbyte instance, you can use the free, open-source project Plural to bring up a K8s cluster and Airbyte for you. Use this guide to get started.
For local testing we recommend following one of the following setup guides:
- Docker Desktop (Mac)
- Minikube
- NOTE: Start Minikube with at least 4gb RAM with
minikube start --memory=4000
- NOTE: Start Minikube with at least 4gb RAM with
- Kind
For testing on GKE you can create a cluster with the command line or the Cloud Console UI.
For testing on EKS you can install eksctl and run eksctl create cluster
to create an EKS cluster/VPC/subnets/etc. This process should take 10-15 minutes.
For production, Airbyte should function on most clusters v1.19 and above. We have tested support on GKE and EKS. If you run into a problem starting Airbyte, please reach out on the #troubleshooting
channel on our Slack or create an issue on GitHub.
If you do not already have the CLI tool kubectl
installed, please follow these instructions to install.
Configure kubectl
to connect to your cluster by using kubectl use-context my-cluster-name
.
- For GKE
-
Configure
gcloud
withgcloud auth login
. -
On the Google Cloud Console, the cluster page will have a
Connect
button, which will give a command to run locally that looks likegcloud container clusters get-credentials CLUSTER_NAME --zone ZONE_NAME --project PROJECT_NAME
. -
Use
kubectl config get-contexts
to show the contexts available. -
Run
kubectl use-context <gke context>
to access the cluster fromkubectl
.
-
- For EKS
- Configure your AWS CLI to connect to your project.
- Install eksctl
- Run
eksctl utils write-kubeconfig --cluster=<CLUSTER NAME>
to make the context available tokubectl
- Use
kubectl config get-contexts
to show the contexts available. - Run
kubectl use-context <eks context>
to access the cluster withkubectl
.
Both dev
and stable
versions of Airbyte include a stand-alone Minio
deployment. Airbyte publishes logs to this Minio
deployment by default. This means Airbyte comes as a self-contained Kubernetes deployment - no other configuration is required.
So if you just want logs to be sent to the local Minio
deployment, you do not need to change the values of any environment variables from what is currently on master.
Alternatively, if you want logs to be sent to a custom location, Airbyte currently supports logging to Minio
, S3
or GCS
. The following instructions are for users wishing to log to their own Minio
layer, S3
bucket or GCS
bucket.
The provided credentials require both read and write permissions. The logger attempts to create the log bucket if it does not exist.
To write to a custom minio log location, replace the following variables in the .env
file in the kube/overlays/stable
directory:
S3_LOG_BUCKET=<your_minio_bucket_to_write_logs_in>
AWS_ACCESS_KEY_ID=<your_minio_access_key>
AWS_SECRET_ACCESS_KEY=<your_minio_secret_key>
S3_MINIO_ENDPOINT=<endpoint_where_minio_is_deployed_at>
The S3_PATH_STYLE_ACCESS
variable should remain true
. The S3_LOG_BUCKET_REGION
variable should remain empty.
To write to a custom S3 log location, replace the following variables in the .env
file in the kube/overlays/stable
directory:
S3_LOG_BUCKET=<your_s3_bucket_to_write_logs_in>
S3_LOG_BUCKET_REGION=<your_s3_bucket_region>
AWS_ACCESS_KEY_ID=<your_aws_access_key_id>
AWS_SECRET_ACCESS_KEY=<your_aws_secret_access_key>
# Set this to empty.
S3_MINIO_ENDPOINT=
# Set this to empty.
S3_PATH_STYLE_ACCESS=
See here for instructions on creating an S3 bucket and here for instructions on creating AWS credentials.
Create the GCP service account with read/write permission to the GCS log bucket.
1) Base64 encode the GCP json secret.
# The output of this command will be a Base64 string.
$ cat gcp.json | base64
2) Populate the gcs-log-creds secrets with the Base64-encoded credential. This is as simple as taking the encoded credential from the previous step and adding it to the secret-gcs-log-creds,yaml
file.
apiVersion: v1
kind: Secret
metadata:
name: gcs-log-creds
namespace: default
data:
gcp.json: <base64-encoded-string>
3) Replace the following variables in the .env
file in the kube/overlays/stable
directory:
GCS_LOG_BUCKET=<your_GCS_bucket_to_write_logs_in>
# The path the GCS creds are written to. Unless you know what you are doing, use the below default value.
GOOGLE_APPLICATION_CREDENTIALS=/secrets/gcs-log-creds/gcp.json
See here for instruction on creating a GCS bucket and here for instruction on creating GCP credentials.
Run the following commands to launch Airbyte:
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
kubectl apply -k kube/overlays/stable
After 2-5 minutes, kubectl get pods | grep airbyte
should show Running
as the status for all the core Airbyte pods. This may take longer on Kubernetes clusters with slow internet connections.
Run kubectl port-forward svc/airbyte-webapp-svc 8000:80
to allow access to the UI/API.
Now visit http://localhost:8000 in your browser and start moving some data!
- Core container pods
- Instead of launching Airbyte with
kubectl apply -k kube/overlays/stable
, you can run withkubectl apply -k kube/overlays/stable-with-resource-limits
. - The
kube/overlays/stable-with-resource-limits/set-resource-limits.yaml
file can be modified to provide different resource requirements for core pods.
- Instead of launching Airbyte with
- Connector pods
- By default, connector pods launch without resource limits.
- To add resource limits, configure the "Docker Resource Limits" section of the
.env
file in the overlay folder you're using.
- Volume sizes
- You can modify
kube/resources/volume-*
files to specify different volume sizes for the persistent volumes backing Airbyte.
- You can modify
The number of simultaneous jobs (getting specs, checking connections, discovering schemas, and performing syncs) is limited by a few factors. First of all, the SUBMITTER_NUM_THREADS
(set in the .env
file for your Kustimization overlay) provides a global limit on the number of simultaneous jobs that can run across all worker pods.
The number of worker pods can be changed by increasing the number of replicas for the airbyte-worker
deployment. An example of a Kustomization patch that increases this number can be seen in airbyte/kube/overlays/dev-integration-test/kustomization.yaml
and airbyte/kube/overlays/dev-integration-test/parallelize-worker.yaml
. The number of simultaneous jobs on a specific worker pod is also limited by the number of ports exposed by the worker deployment and set by TEMPORAL_WORKER_PORTS
in your .env
file. Without additional ports used to communicate to connector pods, jobs will start to run but will hang until ports become available.
You can also tune environment variables for the max simultaneous job types that can run on the worker pod by setting MAX_SPEC_WORKERS
, MAX_CHECK_WORKERS
, MAX_DISCOVER_WORKERS
, MAX_SYNC_WORKERS
for the worker pod deployment (not in the .env
file). These values can be used if you want to create separate worker deployments for separate types of workers with different resource allocations.
Airbyte writes logs to two directories. App logs, including server and scheduler logs, are written to the app-logging
directory. Job logs are written to the job-logging
directory. Both directories live at the top-level e.g., the app-logging
directory lives at s3://log-bucket/app-logging
etc. These paths can change, so we recommend having a dedicated log bucket, and to not use this bucket for other purposes.
Airbyte publishes logs every minute. This means it is normal to see minute-long log delays. Each publish creates it's own log file, since Cloud Storages do not support append operations. This also mean it is normal to see hundreds of files in your log bucket.
Each log file is named {yyyyMMddHH24mmss}_{podname}_{UUID}
and is not compressed. Users can view logs simply by navigating to the relevant folder and downloading the file for the time period in question.
See the Known Issues section for planned logging improvements.
After Issue #3605 is completed, users will be able to configure custom dbs instead of a simple postgres
container running directly in Kubernetes. This separate instance (preferable on a system like AWS RDS or Google Cloud SQL) should be easier and safer to maintain than Postgres on your cluster.
As we improve our Kubernetes offering, we would like to point out some common pain points. We are working on improving these. Please let us know if there are any other issues blocking your adoption of Airbyte or if you would like to contribute fixes to address any of these issues.
- Some UI operations have higher latency on Kubernetes than Docker-Compose. (#4233)
- Logging to Azure Storage is not supported. (#4200)
- Large log files might take a while to load. (#4201)
- UI does not include configured buckets in the displayed log path. (#4204)
- Logs are not reset when Airbyte is re-deployed. (#4235)
- File sources reading from and file destinations writing to local mounts are not supported on Kubernetes.
We use Kustomize to allow overrides for different environments. Our shared resources are in the kube/resources
directory, and we define overlays for each environment. We recommend creating your own overlay if you want to customize your deployments. This overlay can live in your own VCS.
Example kustomization.yaml
file:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
- https://github.com/airbytehq/airbyte.git/kube/overlays/stable?ref=master
For a specific overlay, you can run kubectl kustomize kube/overlays/stable
to view the manifests that Kustomize will apply to your Kubernetes cluster. This is useful for debugging because it will show the exact resources you are defining.
Check out the Helm Chart Readme
kubectl logs deployments/airbyte-server
to view real-time logs. Logs can also be downloaded as a text file via the Admin tab in the UI.
kubectl logs deployments/airbyte-scheduler
to view real-time logs. Logs can also be downloaded as a text file via the Admin tab in the UI.
Although all logs can be accessed by viewing the scheduler logs, connector container logs may be easier to understand when isolated by accessing from the Airbyte UI or the Airbyte API for a specific job attempt. Connector pods launched by Airbyte will not relay logs directly to Kubernetes logging. You must access these logs through Airbyte.
See Upgrading K8s.
To resize a volume, change the .spec.resources.requests.storage
value. After re-applying, the mount should be extended if that operation is supported for your type of mount. For a production deployment, it's useful to track the usage of volumes to ensure they don't run out of space.
See the documentation for kubectl cp
.
kubectl exec -it airbyte-scheduler-6b5747df5c-bj4fx ls /tmp/workspace/8
kubectl exec -it airbyte-scheduler-6b5747df5c-bj4fx cat /tmp/workspace/8/0/logs.log
Running Airbyte on GKE regional cluster requires enabling persistent regional storage. To do so, enable CSI driver on GKE. After enabling, add storageClassName: standard-rwo
to the volume-configs yaml.
volume-configs.yaml
example:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: airbyte-volume-configs
labels:
airbyte: volume-configs
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 500Mi
storageClassName: standard-rwo
If you run into any problems operating Airbyte on Kubernetes, please reach out on the #issues
channel on our Slack or create an issue on GitHub.