Kubeflow Pipelines: A Step-by-Step Guide
Kubeflow Pipelines is a platform for building, deploying, and managing end-to-end machine learning workflows. It streamlines the process of creating and executing ML pipelines, making it easier for data scientists and engineers to collaborate on model development and deployment. In this tutorial, we will guide you through the process of setting up Kubeflow Pipelines on your local machine using MiniKF and running a simple pipeline in Python.
Prerequisites
- A computer with at least 8GB RAM and 50GB of free disk space
- VirtualBox installed (download from https://www.virtualbox.org/wiki/Downloads)
- The MiniKF Vagrant box (Vagrant downloads it in Step 2; for background on boxes, see https://www.vagrantup.com/docs/boxes)
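The disk-space and tooling prerequisites can be checked with a short script before you start. This is a minimal sketch using only the Python standard library. The function name check_prerequisites is hypothetical, and it assumes the VirtualBox and Vagrant command-line tools are named VBoxManage and vagrant and are on your PATH, which may not hold on every operating system. It does not check RAM, because the standard library has no portable call for that.

```python
import shutil

REQUIRED_FREE_GB = 50  # taken from the prerequisites list
REQUIRED_TOOLS = ("VBoxManage", "vagrant")  # assumed executable names


def check_prerequisites(path="."):
    """Return a dict describing free disk space and which tools are missing."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    missing = [tool for tool in REQUIRED_TOOLS if shutil.which(tool) is None]
    return {
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= REQUIRED_FREE_GB,
        "missing_tools": missing,
    }


if __name__ == "__main__":
    report = check_prerequisites()
    print(f"Free disk space: {report['free_gb']} GB")
    if not report["enough_disk"]:
        print(f"Warning: less than {REQUIRED_FREE_GB} GB free")
    for tool in report["missing_tools"]:
        print(f"Not found on PATH: {tool}")
```

A tool reported as missing may simply not be installed yet; Vagrant, for example, is installed in Step 1.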
Step 1: Install Vagrant
First, you need to install Vagrant on your machine. Follow the installation instructions for your operating system here: https://www.vagrantup.com/docs/installation
Step 2: Set up MiniKF
Now, let’s set up MiniKF (Mini Kubeflow) on your local machine. MiniKF is a lightweight version of Kubeflow that runs on top of VirtualBox using Vagrant. It is perfect for testing and development purposes.
Create a new directory for your MiniKF setup and navigate to it in your terminal:
mkdir minikf
cd minikf
Initialize the MiniKF Vagrant box by running:
vagrant init arrikto/minikf
Start the MiniKF virtual machine:
vagrant up
This process will take some time, as Vagrant downloads the MiniKF box and sets up the virtual machine.
Step 3: Access the Kubeflow Dashboard
After the virtual machine is up and running, you can access the Kubeflow dashboard in your browser at http://10.10.10.10. You will be prompted to log in; use admin as both the username and password.
Step 4: Create a Simple Pipeline in Python
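Now, let’s create a simple pipeline in Python with three placeholder steps that stand in for reading, processing, and outputting data. First, install the Kubeflow Pipelines SDK. The code below uses dsl.ContainerOp, which belongs to the v1 SDK and was removed in v2, so pin a 1.x release:
pip install "kfp<2"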
Create a new Python script (e.g., simple_pipeline.py) and add the following code:
import kfp
import kfp.compiler  # load the compiler submodule explicitly
from kfp import dsl


# Each factory returns a ContainerOp (KFP SDK v1) that runs a shell
# command inside a python:3.7 container.
def read_data_op():
    return dsl.ContainerOp(
        name="Read Data",
        image="python:3.7",
        command=["sh", "-c"],
        arguments=["echo 'Reading data' && sleep 5"],
    )


def process_data_op():
    return dsl.ContainerOp(
        name="Process Data",
        image="python:3.7",
        command=["sh", "-c"],
        arguments=["echo 'Processing data' && sleep 5"],
    )


def output_data_op():
    return dsl.ContainerOp(
        name="Output Data",
        image="python:3.7",
        command=["sh", "-c"],
        arguments=["echo 'Outputting data' && sleep 5"],
    )


@dsl.pipeline(
    name="Simple Pipeline",
    description="A simple pipeline that reads, processes, and outputs data.",
)
def simple_pipeline():
    # .after() enforces ordering between steps that exchange no data.
    read_data = read_data_op()
    process_data = process_data_op().after(read_data)
    output_data = output_data_op().after(process_data)


if __name__ == "__main__":
    # Compile the pipeline function into a package that can be uploaded.
    kfp.compiler.Compiler().compile(simple_pipeline, "simple_pipeline.yaml")
This Python script defines a simple pipeline with three steps: reading data, processing data, and outputting data. Each step is defined as a function that returns a ContainerOp object, which represents a containerized operation in the pipeline. The @dsl.pipeline decorator defines the pipeline, and kfp.compiler.Compiler().compile() compiles it into a YAML file.
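The .after() calls in simple_pipeline() give the pipeline a dependency graph, and each step should run only once its predecessors have finished. The following sketch shows the order that graph implies. It does not use the KFP SDK at all: the step names and the DEPENDENCIES mapping are illustrative, and it relies on graphlib, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each key maps a step to the set of steps that must finish before it,
# mirroring the .after() calls in simple_pipeline().
DEPENDENCIES = {
    "read-data": set(),
    "process-data": {"read-data"},
    "output-data": {"process-data"},
}


def execution_order(dependencies):
    """Return one valid execution order for the given dependency graph."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # prints: read-data -> process-data -> output-data
    print(" -> ".join(execution_order(DEPENDENCIES)))
```

Because the three steps form a single chain, only one order is valid. A pipeline with independent branches would admit several orders, and the backend could run those branches in parallel.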
Step 5: Upload and Run the Pipeline
Now that you have created a simple pipeline in Python, let’s upload and run it on the Kubeflow Pipelines platform.
- Go to the Kubeflow dashboard (http://10.10.10.10) and click on the “Pipelines” tab in the left-hand sidebar.
- Click the “Upload pipeline” button in the upper right corner.
- In the “Upload pipeline” dialog, click “Browse” and select the simple_pipeline.yaml file generated in the previous step.
- Click “Upload” to upload the pipeline to the Kubeflow platform.
- Once the pipeline is uploaded, click on its name to open the pipeline details page.
- Click the “Create run” button to start a new run of the pipeline.
- On the “Create run” page, you can give your run a name and choose a pipeline version. Click “Start” to begin the pipeline run.
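The same upload-and-run flow can also be scripted with the SDK's kfp.Client instead of the UI. The sketch below is hedged. The endpoint http://10.10.10.10/pipeline and the helper names make_run_name and submit_run are assumptions, not something this guide's setup confirms. A MiniKF deployment that requires a login will likely also need authentication details (for example a session cookie) and possibly a namespace, neither of which is shown here.

```python
from datetime import datetime, timezone

PIPELINE_PACKAGE = "simple_pipeline.yaml"  # file compiled in Step 4
KFP_HOST = "http://10.10.10.10/pipeline"   # assumed API endpoint; adjust as needed


def make_run_name(prefix, now=None):
    """Build a readable, timestamped run name from a prefix."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}-{now:%Y%m%d-%H%M%S}"


def submit_run(host=KFP_HOST, package=PIPELINE_PACKAGE):
    """Submit the compiled pipeline package and return the SDK's run result."""
    import kfp  # imported lazily so make_run_name works without the SDK

    client = kfp.Client(host=host)
    return client.create_run_from_pipeline_package(
        package,
        arguments={},  # simple_pipeline() takes no parameters
        run_name=make_run_name("simple-pipeline"),
    )


if __name__ == "__main__":
    result = submit_run()
    print("Submitted run:", result.run_id)
```

Submitting from a script is convenient when you recompile the pipeline often, since it skips the manual upload dialog on each iteration.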
Step 6: Monitor the Pipeline Run
After starting the pipeline run, you will be redirected to the “Run details” page. Here, you can monitor the progress of your pipeline, view the logs for each step, and inspect the output artifacts.
- The pipeline graph will show the status of each step in the pipeline, with different colors indicating success, failure, or in-progress status.
- To view the logs for a specific step, click on the step in the pipeline graph and then click the “Logs” tab in the right-hand pane.
- To view the output artifacts, click on the step in the pipeline graph and then click the “Artifacts” tab in the right-hand pane.
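Watching the graph in the UI amounts to polling the run's status until it reaches a terminal state. A minimal sketch of that loop follows. Here get_status is a hypothetical callable you would supply (for instance, one that queries the Kubeflow Pipelines API), and the set of terminal status names is an assumption rather than a documented contract, so verify it against your deployment.

```python
import time

# Assumed terminal statuses; compared case-insensitively.
TERMINAL_STATUSES = {"succeeded", "failed", "error", "skipped"}


def wait_for_run(get_status, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or the timeout expires."""
    waited = 0.0
    while True:
        status = get_status()
        if status is not None and status.lower() in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"run still in status {status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


if __name__ == "__main__":
    # Simulated run that reaches a terminal status on the third poll.
    statuses = iter(["Pending", "Running", "Succeeded"])
    print(wait_for_run(lambda: next(statuses), poll_seconds=0.1))
```

The sleep parameter is injectable so the loop can be exercised in tests without real delays.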
Congratulations! You have successfully set up Kubeflow Pipelines on your local machine, created a simple pipeline in Python, and executed it using the Kubeflow platform. You can now experiment with more complex pipelines, integrate different components, and optimize your machine learning workflows.
With Kubeflow Pipelines, you can automate your machine learning workflows, making it easier to build, deploy, and manage complex ML models.