Kubernetes | Towards Data Science

Kubernetes — Understanding and Utilizing Probes Effectively

Kiril Vodenicharov — Thu, 06 Mar 2025 03:59:54 +0000

Introduction

Let’s talk about Kubernetes probes and why they matter in your deployments. When managing production-facing containerized applications, even small optimizations can have enormous benefits.

Aiming to reduce deployment times, making your applications better react to scaling events, and managing the running pods healthiness requires fine-tuning your container lifecycle management. This is exactly why proper configuration — and implementation — of Kubernetes probes is vital for any critical deployment. They assist your cluster to make intelligent decisions about traffic routing, restarts, and resource allocation.

Properly configured probes dramatically improve your application reliability, reduce deployment downtime, and handle unexpected errors gracefully. In this article, we’ll explore the three types of probes available in Kubernetes and how utilizing them alongside each other helps configure more resilient systems.

Quick refresher

Understanding exactly what each probe does and some common configuration patterns is essential. Each of them serves a specific purpose in the container lifecycle and when used together, they create a rock-solid framework for maintaining your application availability and performance.

Startup: Optimizing start-up times

Start-up probes are evaluated once when a new pod is spun up because of a scale-up event or a new deployment. It serves as a gatekeeper for the rest of the container checks and fine-tuning it will help your applications better handle increased load or service degradation.

Sample Config:

startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 30
  periodSeconds: 10

Key takeaways:

Keep periodSeconds low, so that the probe fires often, quickly detecting a successful deployment.
Increase failureThreshold to a high enough value to accommodate for the worst-case start-up time.

The Startup probe will check whether your container has started by querying the configured path. It will additionally stop the triggering of the Liveness and Readiness probes until it is successful.

Liveness: Detecting dead containers

Your liveness probes answer a very simple question: “Is this pod still running properly?” If not, K8s will restart it.

Sample Config:

livenessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 10
  failureThreshold: 3

Key takeaways:

Since K8s will completely restart your container and spin up a new one, add a failureThreshold to combat intermittent abnormalities.
Avoid using initialDelaySeconds as it is too restrictive — use a Start-up probe instead.

Be mindful that a failing Liveness probe will bring down your currently running pod and spin up a new one, so avoid making it too aggressive — that’s for the next one.

Readiness: Handling unexpected errors

The readiness probe determines if it should start — or continue — to receive traffic. It is extremely useful in situations where your container lost connection to the database or is otherwise over-utilized and should not receive new requests.

Sample Config:

readinessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 3
  failureThreshold: 1
  timeoutSeconds: 1

Key takeaways:

Since this is your first guard to stopping traffic to unhealthy targets, make the probe aggressive and reduce the periodSeconds .
Keep failureThreshold at a minimum, you want to fail quick.
The timeout period should also be kept at a minimum to handle slower Containers.
Give the readinessProbe ample time to recover by having a longer-running livenessProbe .

Readiness probes ensure that traffic will not reach a container not ready for it and as such it’s one of the most important ones in the stack.

Putting it all together

As you can see, even if all of the probes have their own distinct uses, the best way to improve your application’s resilience strategy is using them alongside each other.

Your startup probe will assist you in scale up scenarios and new deployments, allowing your containers to be quickly brought up. They’re fired only once and also stop the execution of the rest of the probes until they successfully complete.

The liveness probe helps in dealing with dead containers suffering from non-recoverable errors and tells the cluster to bring up a new, fresh pod just for you.

The readiness probe is the one telling K8s when a pod should receive traffic or not. It can be extremely useful dealing with intermittent errors or high resource consumption resulting in slower response times.

Additional configurations

Probes can be further configured to use a command in their checks instead of an HTTP request, as well as giving ample time for the container to safely terminate. While these are useful in more specific scenarios, understanding how you can extend your deployment configuration can be beneficial, so I’d recommend doing some additional reading if your containers handle unique use cases.

Further reading:
Liveness, Readiness, and Startup Probes
Configure Liveness, Readiness and Startup Probes

The post Kubernetes — Understanding and Utilizing Probes Effectively appeared first on Towards Data Science.

Learnings from a Machine Learning Engineer — Part 5: The Training

David Martin — Thu, 13 Feb 2025 21:04:32 +0000

In this fifth part of my series, I will outline the steps for creating a Docker container for training your image classification model, evaluating performance, and preparing for deployment.

AI/ML engineers would prefer to focus on model training and data engineering, but the reality is that we also need to understand the infrastructure and mechanics behind the scenes.

I hope to share some tips, not only to get your training run running, but how to streamline the process in a cost efficient manner on cloud resources such as Kubernetes.

I will reference elements from my previous articles for getting the best model performance, so be sure to check out Part 1 and Part 2 on the data sets, as well as Part 3 and Part 4 on model evaluation.

Here are the learnings that I will share with you:

Infrastructure overview
Building your Docker container
Executing your training run
Deploying your model

Infrastructure overview

First, let me provide a brief description of the setup that I created, specifically around Kubernetes. Your setup may be entirely different, and that is just fine. I simply want to set the stage on the infrastructure so that the rest of the discussion makes sense.

Image management system

This is a server you deploy that provides a user interface to for your subject matter experts to label and evaluate images for the image classification application. The server can run as a pod on your Kubernetes cluster, but you may find that running a dedicated server with faster disk may be better.

Image files are stored in a directory structure like the following, which is self-documenting and easily modified.

Image_Library/
  - cats/
    - image1001.png
  - dogs/
    - image2001.png

Ideally, these files would reside on local server storage (instead of cloud or cluster storage) for better performance. The reason for this will become clear as we see what happens as the image library grows.

Cloud storage

Cloud Storage allows for a virtually limitless and convenient way to share files between systems. In this case, the image library on your management system could access the same files as your Kubernetes cluster or Docker engine.

However, the downside of cloud storage is the latency to open a file. Your image library will have thousands and thousands of images, and the latency to read each file will have a significant impact on your training run time. Longer training runs means more cost for using the expensive GPU processors!

The way that I found to speed things up is to create a tar file of your image library on your management system and copy them to cloud storage. Even better would be to create multiple tar files in parallel, each containing 10,000 to 20,000 images.

This way you only have network latency on a handful of files (which contain thousands, once extracted) and you start your training run much sooner.

Kubernetes or Docker engine

A Kubernetes cluster, with proper configuration, will allow you to dynamically scale up/down nodes, so you can perform your model training on GPU hardware as needed. Kubernetes is a rather heavy setup, and there are other container engines that will work.

The technology options change constantly!

The main idea is that you want to spin up the resources you need — for only as long as you need them — then scale down to reduce your time (and therefore cost) of running expensive GPU resources.

Once your GPU node is started and your Docker container is running, you can extract the tar files above to local storage, such as an emptyDir, on your node. The node typically has high-speed SSD disk, ideal for this type of workload. There is one caveat — the storage capacity on your node must be able to handle your image library.

Assuming we are good, let’s talk about building your Docker container so that you can train your model on your image library.

Building your Docker container

Being able to execute a training run in a consistent manner lends itself perfectly to building a Docker container. You can “pin” the version of libraries so you know exactly how your scripts will run every time. You can version control your containers as well, and revert to a known good image in a pinch. What is really nice about Docker is you can run the container pretty much anywhere.

The tradeoff when running in a container, especially with an Image Classification model, is the speed of file storage. You can attach any number of volumes to your container, but they are usually network attached, so there is latency on each file read. This may not be a problem if you have a small number of files. But when dealing with hundreds of thousands of files like image data, that latency adds up!

This is why using the tar file method outlined above can be beneficial.

Also, keep in mind that Docker containers could be terminated unexpectedly, so you should make sure to store important information outside the container, on cloud storage or a database. I’ll show you how below.

Dockerfile

Knowing that you will need to run on GPU hardware (here I will assume Nvidia), be sure to select the right base image for your Dockerfile, such as nvidia/cuda with the “devel” flavor that will contain the right drivers.

Next, you will add the script files to your container, along with a “batch” script to coordinate the execution. Here is an example Dockerfile, and then I’ll describe what each of the scripts will be doing.

#####   Dockerfile   #####
FROM nvidia/cuda:12.8.0-devel-ubuntu24.04

# Install system software
RUN apt-get -y update && apg-get -y upgrade
RUN apt-get install -y python3-pip python3-dev

# Setup python
WORKDIR /app
COPY requirements.txt
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install -r requirements.txt

# Pythong and batch scripts
COPY ExtractImageLibrary.py .
COPY Training.py .
COPY Evaluation.py .
COPY ScorePerformance.py .
COPY ExportModel.py .
COPY BulkIdentification.py .
COPY BatchControl.sh .

# Allow for interactive shell
CMD tail -f /dev/null

Dockerfiles are declarative, almost like a cookbook for building a small server — you know what you’ll get every time. Python libraries benefit, too, from this declarative approach. Here is a sample requirements.txt file that loads the TensorFlow libraries with CUDA support for GPU acceleration.

#####   requirements.txt   #####
numpy==1.26.3
pandas==2.1.4
scipy==1.11.4
keras==2.15.0
tensorflow[and-cuda]

Extract Image Library script

In Kubernetes, the Docker container can access local, high speed storage on the physical node. This can be achieved via the emptyDir volume type. As mentioned before, this will only work if the local storage on your node can handle the size of your library.

#####   sample 25GB emptyDir volume in Kubernetes   #####
containers:
  - name: training-container
    volumeMounts:
      - name: image-library
        mountPath: /mnt/image-library
volumes:
  - name: image-library
    emptyDir:
      sizeLimit: 25Gi

You would want to have another volumeMount to your cloud storage where you have the tar files. What this looks like will depend on your provider, or if you are using a persistent volume claim, so I won’t go into detail here.

Now you can extract the tar files — ideally in parallel for an added performance boost — to the local mount point.

Training script

As AI/ML engineers, the model training is where we want to spend most of our time.

This is where the magic happens!

With your image library now extracted, we can create our train-validation-test sets, load a pre-trained model or build a new one, fit the model, and save the results.

One key technique that has served me well is to load the most recently trained model as my base. I discuss this in more detail in Part 4 under “Fine tuning”, this results in faster training time and significantly improved model performance.

Be sure to take advantage of the local storage to checkpoint your model during training since the models are quite large and you are paying for the GPU even while it sits idle writing to disk.

This of course raises a concern about what happens if the Docker container dies part-way though the training. The risk is (hopefully) low from a cloud provider, and you may not want an incomplete training anyway. But if that does happen, you will at least want to understand why, and this is where saving the main log file to cloud storage (described below) or to a package like MLflow comes in handy.

Evaluation script

After your training run has completed and you have taken proper precaution on saving your work, it is time to see how well it performed.

Normally this evaluation script will pick up on the model that just finished. But you may decide to point it at a previous model version through an interactive session. This is why have the script as stand-alone.

With it being a separate script, that means it will need to read the completed model from disk — ideally local disk for speed. I like having two separate scripts (training and evaluation), but you might find it better to combine these to avoid reloading the model.

Now that the model is loaded, the evaluation script should generate predictions on every image in the training, validation, test, and benchmark sets. I save the results as a huge matrix with the softmax confidence score for each class label. So, if there are 1,000 classes and 100,000 images, that’s a table with 100 million scores!

I save these results in pickle files that are then used in the score generation next.

Score Performance script

Taking the matrix of scores produced by the evaluation script above, we can now create various metrics of model performance. Again, this process could be combined with the evaluation script above, but my preference is for independent scripts. For example, I might want to regenerate scores on previous training runs. See what works for you.

Here are some of the sklearn functions that produce useful insights like F1, log loss, AUC-ROC, Matthews correlation coefficient.

from sklearn.metrics import average_precision_score, classification_report
from sklearn.metrics import log_loss, matthews_corrcoef, roc_auc_score

Aside from these basic statistical analyses for each dataset (train, validation, test, and benchmark), it is also useful to identify:

Which ground truth labels get the most number of errors?
Which predicted labels get the most number of incorrect guesses?
How many ground-truth-to-predicted label pairs are there? In other words, which classes are easily confused?
What is the accuracy when applying a minimum softmax confidence score threshold?
What is the error rate above that softmax threshold?
For the “difficult” benchmark sets, do you get a sufficiently high score?
For the “out-of-scope” benchmark sets, do you get a sufficiently low score?

As you can see, there are multiple calculations and it’s not easy to come up with a single evaluation to decide if the trained model is good enough to be moved to production.

In fact, for an image classification model, it is helpful to manually review the images that the model got wrong, as well as the ones that got a low softmax confidence score. Actually looking at what the model got wrong, you can get a gut-feel for how well it will perform in the real world.

Check out Part 3 for more in-depth discussion on evaluation and scoring.

Export Model script

All of the heavy lifting is done by this point. Since your Docker container will be shutdown soon, now is the time to copy the model artifacts to cloud storage and prepare them for being put to use.

The example Python code snippet below is more geared to Keras and TensorFlow. This will take the trained model and export it as a saved_model. Later, I will show how this is used by TensorFlow Serving in the Deploy section below.

# Increment current version of model and create new directory
next_version_dir, version_number = create_new_version_folder()

# Copy model artifacts to the new directory
copy_model_artifacts(next_version_dir)

# Create the directory to save the model export
saved_model_dir = os.path.join(next_version_dir, str(version_number))

# Save the model export for use with TensorFlow Serving
tf.keras.backend.set_learning_phase(0)
model = tf.keras.models.load_model(keras_model_file)
tf.saved_model.save(model, export_dir=saved_model_dir)

This script also copies the other training run artifacts such as the model evaluation results, score summaries, and log files generated from model training. Don’t forget about your label map so you can give human readable names to your classes!

Bulk Identification script

Your training run is complete, your model has been scored, and a new version is exported and ready to be served. Now is the time to use this latest model to assist you on trying to identify unlabeled images.

As I described in Part 4, you may have a collection of “unknowns” — really good pictures, but no idea what they are. Let your new model provide a best guess on these and record the results to a file or a database. Now you can create filters based on closest match and by high/low scores. This allows your subject matter experts to leverage these filters to find new image classes, add to existing classes, or to remove images that have very low scores and are no good.

By the way, I put this step inside the GPU container since you may have thousands of “unknown” images to process and the accelerated hardware will make light work of it. However, if you are not in a hurry, you could perform this step on a separate CPU node, and shutdown your GPU node sooner to save cost. This would especially make sense if your “unknowns” folder is on slower cloud storage.

Batch Control script

All of the scripts described above perform a specific task — from extracting your image library, executing model training, performing evaluation and scoring, exporting the model artifacts for deployment, and perhaps even bulk identification.

One script to rule them all

To coordinate the entire show, this batch control script gives you the entry point for your container and an easy way to trigger everything. Be sure to produce a log file in case you need to analyze any failures along the way. Also, be sure to write the log to your cloud storage in case the container dies unexpectedly.

#!/bin/bash
# Main batch control script

# Redirect standard output and standard error to a log file
exec > /cloud_storage/batch-logfile.txt 2>&1

/app/ExtractImageLibrary.py
/app/Training.py
/app/Evaluation.py
/app/ScorePerformance.py
/app/ExportModel.py
/app/BulkIdentification.py

Executing your training run

So, now it’s time to put everything in motion…

Start your engines!

Let’s go through the steps to prepare your image library, fire up your Docker container to train your model, and then examine the results.

Image library ‘tar’ files

Your image management system should now create a tar file backup of your data. Since tar is a single-threaded function, you will get significant speed improvement by creating multiple tar files in parallel, each with a portion of you data.

Now these files can be copied to your shared cloud storage for the next step.

Start your Docker container

All the hard work you put into creating your container (described above) will be put to the test. If you are running Kubernetes, you can create a Job that will execute the BatchControl.sh script.

Inside the Kubernetes Job definition, you can pass environment variables to adjust the execution of your script. For example, the batch size and number of epochs are set here and then pulled into your Python scripts, so you can alter the behavior without changing your code.

#####   sample Job in Kubernetes   #####
containers:
  - name: training-job
    env:
      - name: BATCH_SIZE
        value: 50
      - name: NUM_EPOCHS
        value: 30
    command: ["/app/BatchControl.sh"]

Once the Job is completed, be sure to verify that the GPU node properly scales back down to zero according to your scaling configuration in Kubernetes — you don’t want to be saddled with a huge bill over a simple configuration error.

Manually review results

With the training run complete, you should now have model artifacts saved and can examine the performance. Look through the metrics, such as F1 and log loss, and benchmark accuracy for high softmax confidence scores.

As mentioned earlier, the reports only tell part of the story. It is worth the time and effort to manually review the images that the model got wrong or where it produced a low confidence score.

Don’t forget about the bulk identification. Be sure to leverage these to locate new images to fill out your data set, or to find new classes.

Deploying your model

Once you have reviewed your model performance and are satisfied with the results, it is time to modify your TensorFlow Serving container to put the new model into production.

TensorFlow Serving is available as a Docker container and provides a very quick and convenient way to serve your model. This container can listen and respond to API calls for your model.

Let’s say your new model is version 7, and your Export script (see above) has saved the model in your cloud share as /image_application/models/007. You can start the TensorFlow Serving container with that volume mount. In this example, the shareName points to folder for version 007.

#####   sample TensorFlow pod in Kubernetes   #####
containers:
  - name: tensorflow-serving
    image: bitnami/tensorflow-serving:2.18.0
    ports:
      - containerPort: 8501
    env:
      - name: TENSORFLOW_SERVING_MODEL_NAME
        value: "image_application"
    volumeMounts:
      - name: models-subfolder
        mountPath: "/bitnami/model-data"

volumes:
  - name: models-subfolder
    azureFile:
      shareName: "image_application/models/007"

A subtle note here — the export model script should create a sub-folder to models/007, also named 007, with the saved model export. This may seem a little redundant, but TensorFlow Serving will mount this share folder as /bitnami/model-data and detect the numbered sub-folder inside it for the version to serve. This will allow you to query the API for the model version (good for logging) as well as the identification.

Conclusion

As I mentioned at the start of this article, this setup has worked for my situation. This is certainly not the only way to approach this challenge, and I invite you to customize your own solution.

I wanted to share my hard-fought learnings as I embraced cloud services in Kubernetes, with the desire to keep costs under control. Of course, doing all this while maintaining a high level of model performance is an added challenge, but one that you can achieve.

I know that the information above is mostly an outline of the process, but I hope it provided enough information here so that you can fill in the rest, specific to your own application.

Happy learnings!

The post Learnings from a Machine Learning Engineer — Part 5: The Training appeared first on Towards Data Science.

Maximizing the Utility of Scarce AI Resources: A Kubernetes Approach

Chaim Rand — Tue, 13 Feb 2024 16:06:47 +0000

In the ever-evolving landscape of AI development, nothing rings truer than the old saying (attributed to Heraclitus), "the only constant in life is change". In the case of AI, it seems that change is indeed constant, but the pace of change is forever increasing. Staying relevant in these unique and exciting times amounts to an unprecedented test of the capacity of AI teams to consistently adapt and adjust their development processes. AI development teams that fail to adapt, or are slow to adapt, may quickly become obsolete.

One of the most challenging developments of the past few years in AI development has been the increasing difficulty to attain the hardware required to train AI models. Whether it be due to an ongoing crisis in the global supply chain or a significant increase in the demand for AI chips, getting your hands on the GPUs (or alternative training accelerators) that you need for AI development, has gotten much harder. This is evidenced by the huge wait time for new Gpu orders and by the fact that cloud service providers (CSPs) that once offered virtually infinite capacity of GPU machines, now struggle to keep up with the demand.

The changing times are forcing AI development teams that may have once relied on endless capacity of AI accelerators to adapt to a world with reduced accessibility and, in some cases, higher costs. Development processes that once took for granted the ability to spin up a new GPU machine at will, must be modified to meet the demands of a world of scarce AI resources that are often shared by multiple projects and/or teams. Those that fail to adapt risk annihilation.

In this post we will demonstrate the use of Kubernetes in the orchestration of AI-model training workloads in a world of scarce AI resources. We will start by specifying the goals we wish to achieve. We will then describe why Kubernetes is an appropriate tool for addressing this challenge. Last, we will provide a simple demonstration of how Kubernetes can be used to maximize the use of a scarce AI compute resource. In subsequent posts, we plan to enhance the Kubernetes-based solution and show how to apply it to a cloud-based training environment.

Disclaimers

While this post does not assume prior experience with Kubernetes, some basic familiarity would certainly be helpful. This post should not, in any way, be viewed as a Kubernetes tutorial. To learn about Kubernetes, we refer the reader to the many great online resources on the subject. Here we will discuss just a few properties of Kubernetes as they pertain to the topic of maximizing and prioritizing resource utilization.

There are many alternative tools and techniques to the method we put forth here, each with their own pros and cons. Our intention in this post is purely educational; Please do not view any of the choices we make as an endorsement.

Lastly, the Kubernetes platform remains under constant development, as do many of the frameworks and tools in the field of AI development. Please take into account the possibility that some of the statements, examples, and/or external links in this post may become outdated by the time you read this and be sure to take into account the most up-to-date solutions available before making your own design decisions.

Adapting to Scarce AI Compute Resources

To simplify our discussion, let’s assume that we have a single worker node at our disposal for training our models. This could be a local machine with a GPU or a reserved compute-accelerated instance in the cloud, such as a p5.48xlarge instance in AWS or a TPU node in GCP. In our example below we will refer to this node as "my precious". Typically, we will have spent a lot of money on this machine. We will further assume that we have multiple training workloads all competing for our single compute resource where each workload could take anywhere from a few minutes to a few days. Naturally, we would like to maximize the utility of our compute resource by ensuring that it is in constant use and that the most important jobs get prioritized. What we need is some form of a priority queue _ and an associated priority-based [scheduling](https://en.wikipedia.org/wiki/Scheduling(computing)) algorithm. Let’s try to be a bit more specific about the behaviors that we desire.

Scheduling Requirements

Maximize Utilization: We would like for our resource to be in constant use. Specifically, as soon as it completes a workload, it will promptly (and automatically) start working on a new one.
Queue Pending Workloads: We require the existence of a queue of training workloads that are waiting to be processed by our unique resource. We also require associated APIs for creating and submitting new jobs to the queue, as well as monitoring and managing the state of the queue.
Support Prioritization: We would like each training job to have an associated priority such that workloads with higher priority will be run before workloads with a lower priority.
Preemption: Moreover, in the case that an urgent job is submitted to the queue while our resource is working on a lower priority job, we would like for the running job to be preempted and replaced by the urgent job. The preempted job should be returned to the queue.

One approach to developing a solution that satisfies these requirements could be to take an existing API for submitting jobs to a training resource and wrap it with a customized implementation of a priority queue with the desired properties. At a minimum, this approach would require a data structure for storing a list of pending jobs, a dedicated process for choosing and submitting jobs from the queue to the training resource, and some form of mechanism for identifying when a job has been completed and the resource has become available.

An alternative approach and the one we take in this post, is to leverage an existing solution for priority-based scheduling that fulfils our requirements and align our training development workflow to its use. The default scheduler that comes with Kubernetes is an example of one such solution. In the next sections we will demonstrate how it can be used to address the problem of optimizing the use of scarce AI training resources.

ML Orchestration with Kubernetes

In this section we will get a bit philosophical about the application of Kubernetes to the Orchestration of ML training workloads. If you have no patience for such discussions (totally fair) and want to get straight to the practical examples, please feel free to skip to the next section.

Kubernetes is (another) one of those software/technological solutions that tend to elicit strong reactions in many developers. There are some that swear by it and use it extensively, and others that find it overbearing, clumsy, and unnecessary (e.g., see here for some of the arguments for and against using Kubernetes). As with many other heated debates, it is the author’s opinion that the truth lies somewhere in between – there are situations where Kubernetes provides an ideal framework that can significantly increase productivity, and other situations where its use borders on an insult to the SW development profession. The big question is, where on the spectrum does ML development lie? Is Kubernetes the appropriate framework for training ML models? Although a cursory online search might give the impression that the general consensus is an emphatic "yes", we will make some arguments for why that may not be the case. But first, we need to be clear about what we mean by "ML training orchestration using Kubernetes".

While t[[[here](https://blog.realvarez.com/get-more-out-of-gpus-with-nvidia-multi-instance-gpu/)](https://medium.com/pytorch/a-step-by-step-guide-to-building-a-distributed-spot-based-training-platform-on-aws-using-b54acd06ecb2)](https://aws.amazon.com/blogs/machine-learning/introducing-amazon-sagemaker-operators-for-kubernetes/) are many online resources that address the topic of ML using Kubernetes, it is important to be aware of the fact that they are not always referring to the same mode of use. Some resources (e.g., here) use Kubernetes only for deploying a cluster; once the cluster is up and running they start the training job outside the context of Kubernetes. Others (e.g., here) use Kubernetes to define a pipeline in which a dedicated module starts up a training job (and associated resources) using a completely different system. In contrast to these two examples, many other resources define the training workload as a Kubernetes [Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/) artifact that runs on a Kubernetes Node. However, they too vary greatly in the particular attributes on which they focus. Some (e.g., here) emphasize the auto-scaling properties and others (e.g., here) the Multi-Instance GPU (MIG) support. They also vary greatly in the details of implementation, such as the precise artifact (Job extension) for representing a training job (e.g., ElasticJob, TrainingWorkload, JobSet, VolcanoJob, etc.). In the context of this post, we too will assume that the training workload is defined as a Kubernetes Job. However, in order to simplify the discussion, we will stick with the core Kubernetes objects and leave the discussion of Kubernetes extensions for ML for a future post.

Arguments Against Kubernetes for ML

Here are some arguments that could be made against the use of Kubernetes for training ML models.

Complexity: Even its greatest proponents have to admit that Kubernetes can be hard. Using Kubernetes effectively, requires a high level of expertise, has a steep learning curve, and, realistically speaking, typically requires a dedicated devops team. Designing a training solution based on Kubernetes increases dependencies on dedicated experts and by extension, increases the risk that things could go wrong, and that development could be delayed. Many alternative ML training solutions enable a greater level of developer independence and freedom and entail a reduced risk of bugs in the development process.
Fixed Resource Requirements: One of the most touted properties of Kubernetes is its scalability – its ability to automatically and seamlessly scale its pool of compute resources up and down according to the number of jobs, the number of clients (in the case of a service application), resource capacity, etc. However, one could argue that in the case of an ML training workload, where the number of resources that are required is (usually) fixed throughout training, auto-scaling is unnecessary.
Fixed Instance Type: Due to the fact that Kubernetes orchestrates containerized applications, Kubernetes enables a great deal of flexibility when it comes to the types of machines in its node pool. However, when it comes to ML, we typically require very specific machinery with dedicated accelerators (such as GPUs). Moreover, our workloads are often tuned to run optimally on one very specific instance type.
Monolithic Application Architecture: It is common practice in the development of modern-day applications to break them down into small elements called microservices. Kubernetes is often seen as a key component in this design. ML training applications tend to be quite monolithic in their design and, one could argue, that they do not lend themselves naturally to a microservice architecture.
Resource Overhead: The dedicated processes that are required to run Kubernetes requires some system resources on each of the nodes in its pool. Consequently, it may incur a certain performance penalty on our training jobs. Given the expense of the resources required for training, we may prefer to avoid this.

Granted, we have taken a very one-sided view in the Kubernetes-for-ML debate. Based solely on the arguments above, you might conclude that we would need a darn good reason for choosing Kubernetes as a framework for ML training. It is our opinion that the challenge put forth in this post, i.e., the desire to maximize the utility of scarce AI compute resources, is exactly the type of justification that warrants the use of Kubernetes despite the arguments made above. As we will demonstrate, the default scheduler that is built-in to Kubernetes, combined with its support for priority and preemption makes it a front-runner for fulfilling the requirements stated above.

Toy Example

In this section we will share a brief example that demonstrates the priority scheduling support that is built in to Kubernetes. For the purposes of our demonstration, we will use Minikube (version v1.32.0). Minikube is a tool that enables you to run a Kubernetes cluster in a local environment and is an ideal playground for experimenting with Kubernetes. Please see the official documentation on installing and getting started with Minikube.

Cluster Creation

Let’s start by creating a two-node cluster using the Minikube start command:

minikube start --nodes 2

The result is a local Kubernetes cluster consisting of a master ("control-plane") node named minikube, and a single worker node, named minikube-m02, which will simulate our single AI resource. Let’s apply the label my-precious to identify it as a unique resource type:

kubectl label nodes minikube-m02 node-type=my-precious

We can use the Minikube dashboard to visualize the results. In a separate shell run the command below and open the generated browser link.

minikube dashboard

If you press on the Nodes tab on the left-hand pane, you should see a summary of our cluster’s nodes:

Nodes List in Minikube Dashboard (Captured by Author)

PriorityClass Definitions

Next, we define two PriorityClasses, low-priority and high-priority, as in the priorities.yaml file displayed below. New jobs will receive the low-priority assignment, by default.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 0
globalDefault: true

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false

To apply our new classes to our cluster, we run:

kubectl apply -f priorities.yaml

Create a Job

We define a simple job using a job.yaml file displayed in the code block below. For the purpose of our demonstration, we define a Kubernetes Job that does nothing more than sleep for 100 seconds. We use busybox as its Docker image. In practice, this would be replaced with a training script and an appropriate ML Docker image. We define the job to run on our special instance, my-precious, using the nodeSelector field, and specify the resource requirements so that only a single instance of the job can run on the instance at a time. The priority of the job defaults to low-priority as defined above.

apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  template:
    spec:
      containers:
        - name: test
          image: busybox
          command: # simple sleep command
            - sleep
            - '100'
          resources: # require all available resources
            limits:
              cpu: "2"
            requests:
              cpu: "2"
      nodeSelector: # specify our unique resource
          node-type: my-precious
      restartPolicy: Never

We submit the job with the following command:

kubectl apply -f job.yaml

Create a Queue of Jobs

To demonstrate the manner in which Kubernetes queues jobs for processing, we create three identical copies of the job defined above, named test1, test2, and test3. We group the three jobs in a single file, jobs.yaml, and submit them for processing:

kubectl apply -f jobs.yaml

The image below captures the Workload Status of our cluster in the Minikube dashboard shortly after the submission. You can see that my-precious has begun processing test1, while the other jobs are pending as they wait their turn.

Cluster Workload Status (Captured by Author)

Once test1 is completed, processing of test2 begins:

Cluster Workload Status – Automated Scheduling (Captured by Author)

So long as no other jobs with higher priority are submitted, our jobs would continue to be processed one at a time until they are all completed.

Job Preemption

We now demonstrate Kubernetes’ built-in support for job preemption by showing what happens when we submit a fourth job, this time with the high-priority setting:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-p1
spec:
  template:
    spec:
      containers:
        - name: test-p1
          image: busybox
          command:
            - sleep
            - '100'
          resources:
            limits:
              cpu: "2"
            requests:
              cpu: "2"
      restartPolicy: Never
      priorityClassName: high-priority # high priority job
      nodeSelector:
          node-type: my-precious

The impact on the Workload Status is displayed in the image below:

Cluster Workload Status – Preemption (Captured by Author)

The test2 job has been preempted – its processing has been stopped and it has returned to the pending state. In its stead, my-precious has begun processing the higher priority test-p1 job. Only once test-p1 is completed will processing of the lower priority jobs resume. (In the case where the preempted job is a ML training workload, we would program it to resume from the most recent saved model model checkpoint).

The image below displays the Workload Status once all jobs have been completed.

Cluster Workload Status – Completion (Captured by Author)

Kubernetes Extensions

The solution we demonstrated for priority-based scheduling and preemption relied only on core components of Kubernetes. In practice, you may choose to take advantage of enhancements to the basic functionality introduced by extensions such as Kueue and/or dedicated, ML-specific features offered by platforms build on top of Kubernetes, such as Run:AI or Volcano. But keep in mind that to fulfill the basic requirements for maximizing the utility of a scarce AI compute resource all we need is the core Kubernetes.

Summary

The reduced availability of dedicated AI silicon has forced ML teams to adjust their development processes. Unlike in the past, when developers could spin up new AI resources at will, they now face limitations on AI compute capacity. This necessitates the procurement of AI instances through means such as purchasing dedicated units and/or reserving cloud instances. Moreover, developers must come to terms with the likelihood of needing to share these resources with other users and projects. To ensure that the scarce AI compute power is appropriated towards maximum utility, dedicated scheduling algorithms must be defined that minimize idle time and prioritize critical workloads. In this post we have demonstrated how the Kubernetes scheduler can be used to accomplish these goals. As emphasized above, this is just one of many approaches to address the challenge of maximizing the utility of scarce AI resources. Naturally, the approach you choose, and the details of your implementation will depend on the specific needs of your AI development.

The post Maximizing the Utility of Scarce AI Resources: A Kubernetes Approach appeared first on Towards Data Science.

From Data Platform to ML Platform

ming gao — Sun, 22 Oct 2023 16:17:19 +0000

Data/ML has been the most popular topic in our tech landscape. I want to share my understanding of Data/ML Platform and how would those platforms evolve from basic to complex. At last, I try my best to cover MLOps, a principle for managing ML projects.

About who-I-am, here is my LinkedIn.

Starting of the Journey: Online Service + OLTP + OLAP

At the starts, Data infrastructures could be fairly simple. Analytical queries might be sent to the read replica of a online OLTP database or setting up a OLAP database serve as data warehouse.

Here is the infrastructure might look like:

There is nothing wrong with those systems as long as it fulfil business requirements. All systems that fulfil our business need are good systems. If there are simple, it is even better.

At this stage, there are multiple ways of doing data analysis:

Simply submit queries to OLTP database’s replica node. (Not recommended).
Enabling CDC(Change Data Capture) of OLTP databse and ingest those data to OLAP database. Come to the option of ingestion service for CDC logs, you can choose based on the OLAP database you have selected. For example, Flink data streaming with CDC connectors is a way to handle this. Many enterprise services come with their own suggested solution, e.g. Snowpipe for Snowflake. It is also recommended to load data from replica node to preserve the CPU/IO bandwidth of master node for online traffic.

In this stage, ML workloads might be running in your local environment. You can set up a Jupyter notebook locally, and load structured data from OLAP Database, then train your ML model locally.

The potential challenges of this architecture are but not limited to:

It is hard to manage unstructured or semi-structured data with OLAP database.
OLAP might have performance regression when come to massive data processing. (more than TB data required for a single ETL task)
Lack of supporting for various compute engines, e.g. Spark or Presto. Most of compute engine do support connecting to OLAP with JDBC endpoint, but the parallel processing will be badly limited by the IO bottleneck of JDBC endpoint itself.
The cost of storing massive data in OLAP database is high.

You might know the direction to solve this already. Build a Data lake! Bringing in Data lake do not necessary mean you need to completely sunset OLAP Database. It is still common to see company having two systems co-exist for different use-cases.

Data lake: Storage-Compute Separation + Schema on Write

A data lake allows you to persist unstructured and semi-structure data, and performs schema-on-read. It allows you reduce cost by storing large data volume with specialised storage solution and spun up compute cluster based on your demand. It further allows you to manage TB/PB dataset effortlessly by scaling up the compute clusters.

There is how your infrastructure might look like:

This is an oversimplified graph indeed. The actually implementation of a data lake can be much more complicated.

Many cloud provider now have quite established store solution for Data lake, e.g. AWS S3 and Azure ADLS. There is still a lot of tasks need to be done on top of those storage solutions. For example, there should be a Hive metastore to manage your table metadata and a Datahub to provide data visibility. There are also challenging topics like fine-grain permission control in data lake and data lineage analysis(e.g. spline).

To maximum the value and efficiency of your data lake, we should carefully choose file format and average file sizes for each layers of your data lake.

The general tips are:

Avoid small files: Small files are one of major causes for high storage cost and poor performance in Data lake.
Balance between latency, compress ratio and performance: A low latency Data lake table with file format like Hudi might not give you the best compress ratio, and large ORC files with high compress ratio might give your performance nightmare. You might want to choose file format wisely based on the usage pattern of the table, latency requirement and table sizes.

There are some quite established SaaS/PaaS provider like Databricks which provide a decent Data lake(or LakeHouse now) solution. You also can explore ByteHouse to have a unified experience of big data analysis.

On the ML side, team might start exploring well established ML framework like Tensenflow and Pytorch in remote environment. Furthermore, trained ML models could been deployed to production environment for online model inferrence. Both Tensorflow and Pytorch come with serving solution, e.g. TensorFlow Serving and Pytorch Serving.

However, our journey will not stop here. We might have following challenges now:

Lack of realtime metric and features management which are critical for online ML model serving.
Lack of model performance monitoring.

Let’s level up our game further.

Realtime Data/ML Infra: Data River + Data Streaming + Feature Store + Metric Server

It usually a joint effort from multiple departments of companies to build realtime data infra. The initial rationale of building Data River usually is not for data/ML system but allowing micro-services to further scale up by removing synchronised call. Instead, micro-services will gain efficiency by communicating with a message broker like Kafka (at the cost of lower consistency level).

The overall architecture might look like this.

With data available in Data River (e.g. Kafka), we can build data streaming pipeline to process realtime data. Those data can be used directly in online feature store or sync to an metric server like Pinot. Metric server can further process/aggregate those metric point to more useful model performance metrics and business metrics. You can also adopt streaming database like RisingWave which can joining/aggregating streaming data with SQL syntax.

For building data streaming itself, Flink is quite popular. You can also use Flink with CDC connector to extract data from OLTP database and sink data to message brokers and data lake.

There should be a online feature store backed by key-value database like ScyllaDB or AWS Dynamo DB. Online feature store can help you enrich the the request sent to Model Serving service with a feature vector associated with certain reference ID (user-id, product uuid). It can greatly de-couple the dependency between backend service team who build micro-services and ML engineer team who build ML models. It allow ML engineers rollout new ML feature with new ML model independently (Your model serving API signature expose to micro-services will remain the same when you update feature vector).

In the book, Designing Machine Learning System, it has shared about Model Stacking (Jen Wadkin’s medium post about model stacking). It is quite common for people to use model stacking in model serving as well. An orchestrator is required when you want to to stack hererogenous models together, e.g. stacking pytorch and tensorflow model together. You potential can make your orchestrator even more complicated by having a dynamic weightage based on model performance when routing request to different models.

Now we have a complicated system. It looks pretty cool but it carry new challenges:

Debt of the system will soaring high if leave it unmanage.
High cognitive load for ML engineers.

That’s probably when you need to thinking how MLOps can help you.

MLOps: Abstraction, Observability and Scalability

MLOps is never a specific solution. It is more like a set of principles for managing ML system. Different with a typical software project. ML systems are greatly affected by data shifting, and data dependency management is not a easy task. Paper Hidden Technical Debt in Machine Learning Systems has described those challenges in details. Therefore, a MLOps driven ML platform must able to:

Data change monitoring & data quality monitoring.
Manage ML features across offline and online environment.
Reproducible ML Pipeline which fulfil experimental-operational symmetry.
Concise ML pipeline configuration which can abstract away infra details.

This article, MLOps: Continuous delivery and automation pipelines in machine learning, highlighted the importance of experimental-operational symmetric. It also described MLOps automation level from level-0, level-1 to finally level-2. I really like the graph from this doc and will just borrow it to explain what level-1 MLOps looks like.

To scale such MLOps practise in your organisation, you need to provide concise ML pipeline configuration which can abstract infrastructure implementation details away for ML Engineers. By doing this, platform engineers also gain flexibility for upgrading ML platform without causing too much disruption to platform users. You can consider using configuration files like yaml to describe ML pipelines and rely on your ML Pipeline controllers to translate them to actual workload.

So let’s re-organised realtime data/ML infra with following graph to highlight how MLOps shapes our platforms.

To give you a better ideal of what the ML pipelines might look like. Here are the possible abstraction examples for each stage in ML pipeline. The following graph only help your to further understand what is the configuration might look like. It does not represent any actual implementation. It does not cover all aspects required neither.

Kubernetes is a popular solution to orchestrate ML workload(or maybe all workload nowadays). You can use CRDs to provide concise interfaces between users and platforms. In article My thinking of Kubebuilder, I have shared some of my thinking when I build CRD with kubebuilder.

Of course, I didn’t cover many important sub-topics which include but not limited to:

Hyperparameter Optimization
Distributed Training architecture

What Next

You can see Mlops only give a known mission a proper name. It is far from a job done. What I shared is a opinionated strategy for implement ML Ops platform. Even with that, the bar of creating high quality ML product is still high, and the effort of collecting, processing, mining data is still heavy.

Besides those challenges remains, I also want to share the trends in ML landscape I have observed. It is surely not a completed list given how fast this domain evolves.

Serviceless: We have put ML’s value too far behind because the foundation a ML platform is usually a Data platform. It is like forcing users to buy computers to engage on social media platforms when we are already in mobile time. Serviceless data services and data engine are addressing this challenge. Many service providers explore their own serveless solution to lower the bar of adoption, e.g. Databricks, Snowflake , Bytehouse. Companies can start building their ML products after bootstrapping data warehouses, or data lakes, or lakehouses.
AI driven feature engineering: Well, AI can do everything now, can’t it?
MaaS trends: More powerful Model-as-a-Service will pop up. Companies can directly leverage on ML power without even building their own ML service to enjoy a great lifting to their business.

As we all have noticed, ML space evolves so fast. At this very moment, when I am typing this, this article might already been expired. More ideas had already popped up and been translated to reality. Please do let me know what do you think about of ML Ops, or where should I further my learning. Let’s keep up the pace together!

The post From Data Platform to ML Platform appeared first on Towards Data Science.

Deploying Falcon-7B Into Production

Het Trivedi — Fri, 07 Jul 2023 03:58:11 +0000

A step-by-step tutorial

Image by author-Created using Midjourney

Background

By now, we’ve seen the capabilities of ChatGPT and what it has to offer. However, for enterprise use, closed-source models like ChatGPT may pose a risk as enterprises have no control over their data. OpenAI claims that user data is not stored or used for training models, but there is no guarantee that data will not be leaked in some way.

To combat some of the issues related to closed-source models, researchers are rushing to build open-source LLMs that rival models like ChatGPT. With open-source models, enterprises can host the models themselves in a secure cloud environment, mitigating the risk of data leakage. On top of that, you get full transparency in the inner workings of the model, which helps build more trust with AI in general.

With the recent advancements in open-source LLMs, it’s tempting to try out new models and see how they stack up against closed-source models like ChatGPT.

However, running open-source models today poses significant barriers. It’s so much easier to call the ChatGPT API than to figure out how to run an open-source LLM.

In this article, my goal is to break these barriers by showing how you can run open-source models like the Falcon-7B model in the cloud in a production-like setting. We will be able to access these models via an API endpoint similar to ChatGPT.

Challenges

One of the significant challenges to running open-source models is the lack of computing resources. Even a "small" model like the Falcon 7B requires a GPU to run.

To solve this problem, we can leverage the GPUs in the cloud. But this poses another challenge. How do we containerize our LLM? How do we enable GPU support? Enabling GPU support can be tricky because it requires knowledge of CUDA. Working with CUDA can be a pain in the neck because you have to figure out how to install the proper CUDA dependencies and which versions are compatible.

So, to avoid the CUDA death trap, many companies have created solutions to easily containerize models while enabling GPU support. For this blog post, we’ll be using an open-source tool called Truss to help us easily containerize our LLM without much hassle.

Truss allows developers to easily containerize models built using any framework.

Why use Truss?

Truss – https://truss.baseten.co/e2e

Truss has a lot of useful features right out of the box such as:

Turning your Python model into a microservice with a production-ready API endpoint
Freezing dependencies via Docker
Supporting inference on GPUs
Simple pre-processing and post-processing for the model
Easy and secure secrets management

I’ve used Truss before to deploy machine learning models and the process is quite smooth and simple. Truss automatically creates your dockerfile and manages Python dependencies. All we have to do is provide the code for our model.

The main reason we want to use a tool like Truss is that it becomes much easier to deploy our model with GPU support.

Note: I have not received any sponsorship from Baseten to promote their content nor am I associated with them in any way. I’m under no influence in any way from Baseten or Truss to write this article. I simply found their open source project to be cool and useful.

The plan

Here’s what I’ll be covering in this blog post:

Set up Falcon 7B locally using Truss
Run the model locally if you have a GPU(I have an RTX 3080)
Containerize the model and run it using docker
Create a GPU-enabled Kubernetes cluster in Google Cloud to run our model

Don’t worry if you don’t have a GPU for step 2, you will still be able to run the model in the cloud.

Here is the Github repo containing the code if you want to follow along:

GitHub – htrivedi99/falcon-7b-truss

Let’s get started!

Step 1: Falcon 7B local setup using Truss

First, we’ll need to create a project with Python version ≥ 3.8

We’ll be downloading the model from hugging face and packaging it using Truss. Here are the dependencies we’ll need to install:

pip install truss

Inside your Python project create a script called main.py . This is a temporary script we’ll be using to work with truss.

Next, we’ll set up our Truss package by running the following command in the terminal:

truss init falcon_7b_truss

If you’re prompted to create a new truss, press ‘y’. Once that’s complete, you should see a new directory called falcon_7b_truss . Inside that directory, there will be some autogenerated files and folders. There are a couple of things we need to fill out: model.py which is nested under the package model and config.yaml .

├── falcon_7b_truss
│   ├── config.yaml
│   ├── data
│   ├── examples.yaml
│   ├── model
│   │   ├── __init__.py
│   │   └── model.py
│   └── packages
└── main.py

As I mentioned previously, Truss only needs our model’s code, it takes care of all the other things automatically. We’ll be writing the code inside model.py , but it has to be written in a specific format.

Truss expects each model to support at least three functions: __init__ , load , and predict .

__init__ is primarily used for creating class variables
load is where we’ll download the model from hugging face
predict is where we’ll call our model

Here’s the full code for model.py :

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from typing import Dict

MODEL_NAME = "tiiuae/falcon-7b-instruct"
DEFAULT_MAX_LENGTH = 128

class Model:
    def __init__(self, data_dir: str, config: Dict, **kwargs) -> None:
        self._data_dir = data_dir
        self._config = config
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print("THE DEVICE INFERENCE IS RUNNING ON IS: ", self.device)
        self.tokenizer = None
        self.pipeline = None

    def load(self):
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        model_8bit = AutoModelForCausalLM.from_pretrained(
            MODEL_NAME,
            device_map="auto",
            load_in_8bit=True,
            trust_remote_code=True)

        self.pipeline = pipeline(
            "text-generation",
            model=model_8bit,
            tokenizer=self.tokenizer,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            device_map="auto",
        )

    def predict(self, request: Dict) -> Dict:
        with torch.no_grad():
            try:
                prompt = request.pop("prompt")
                data = self.pipeline(
                    prompt,
                    eos_token_id=self.tokenizer.eos_token_id,
                    max_length=DEFAULT_MAX_LENGTH,
                    **request
                )[0]
                return {"data": data}

            except Exception as exc:
                return {"status": "error", "data": None, "message": str(exc)}

What’s happening here:

MODEL_NAME is the model we’ll be using which in our case is the falcon-7b-instruct model
Inside load, we download the model from hugging face in 8bit. The reason we want 8bit is that the model uses significantly less memory on our GPU when quantized.
Also, loading the model in 8-bit is necessary if you want to run the model locally on a GPU with less than 13GB VRAM.
The predict function accepts a JSON request as a parameter and calls the model using self.pipeline . The torch.no_grad tells Pytorch that we are in inference mode, not training mode.

Cool! That’s all we need to setup our model.

Step 2: Running the model locally (Optional)

If you have an Nvidia GPU with more than 8GB of VRAM you will be able to run this model locally.

If not, feel free to move on to the next step.

We’ll need to download some more dependencies to run the model locally. Before you download the dependencies you need to make sure you have CUDA and the right CUDA drivers installed.

Because we’re trying to run the model locally, Truss won’t be able to help us manage the CUDA madness.

pip install transformers
pip install torch
pip install peft
pip install bitsandbytes
pip install einops
pip install scipy

Next, inside main.py the script we created outside of the falcon_7b_truss directory, we need to load our truss.

Here is the code for main.py :

import truss
from pathlib import Path
import requests

tr = truss.load("./falcon_7b_truss")
output = tr.predict({"prompt": "Hi there how are you?"})
print(output)

What’s happening here:

If you recall, the falcon_7b_truss directory was created by truss. We can load that entire package, including the model and dependencies using truss.load
Once, we’ve loaded our package we can simply call the predict method to get the models output

Run main.py to get the output from the model.

This model files are ~15 GB in size so it might take 5–10 minutes to download the model. After running the script you should see an output like this:

{'data': {'generated_text': "Hi there how are you?nI'm doing well. I'm in the middle of a move, so I'm a bit tired. I'm also a bit overwhelmed. I'm not sure how to get started. I'm not sure what I'm doing. I'm not sure if I'm doing it right. I'm not sure if I'm doing it wrong. I'm not sure if I'm doing it at all.nI'm not sure if I'm doing it right. I'm not sure if I'm doing it wrong. I"}}

Step 3: Containerizing the model using docker

Usually, when people containerize a model, they take the model binary and the Python dependencies and wrap it all up using a Flask or Fast API server.

A lot of this is boilerplate and we don’t want to do it ourselves. Truss will take care of it. We’ve already provided the model, Truss will create the server, so the only thing left to do is provide the Python dependencies.

The config.yaml holds the configuration for our model. This is where we can add the dependencies for our model. The config file already comes with most of the things we need, but we’ll need to add a few things.

Here is what you need to add to config.yaml:

apply_library_patches: true
bundled_packages_dir: packages
data_dir: data
description: null
environment_variables: {}
examples_filename: examples.yaml
external_package_dirs: []
input_type: Any
live_reload: false
model_class_filename: model.py
model_class_name: Model
model_framework: custom
model_metadata: {}
model_module_dir: model
model_name: Falcon-7B
model_type: custom
python_version: py39
requirements:
- torch
- peft
- sentencepiece
- accelerate
- bitsandbytes
- einops
- scipy
- git+https://github.com/huggingface/transformers.git
resources:
  use_gpu: true
  cpu: "3"
  memory: 14Gi
secrets: {}
spec_version: '2.0'
system_packages: []

So the main thing we added is the requirements . All of the dependencies listed are required to download and run the model.

The other important thing we added is the resources . The use_gpu: true is essential, because this tells Truss to create a Dockerfile for us with GPU support enabled.

That’s it for the configuration.

Next up, we’ll containerize our model. If you don’t know how to containerize a model using Docker, don’t worry Truss has got you covered.

Inside the main.py file, we’ll tell Truss to package everything together. Here is the code you need:

import truss
from pathlib import Path
import requests

tr = truss.load("./falcon_7b_truss")
command = tr.docker_build_setup(build_dir=Path("./falcon_7b_truss"))
print(command)

What’s happening:

First, we load our falcon_7b_truss
Next, the docker_build_setup function handles all of the complicated stuff like creating the Dockerfile and setting up the Fast API server.
If you take a look at your falcon_7b_truss directory, you’ll see that many more files got generated. We don’t need to worry about how these files work because it will all get managed behind the scene.
At the end of the run, we get a docker command to build our docker image: docker build falcon_7b_truss -t falcon-7b-model:latest

If you want to build the docker image go ahead and run the build command. The image is ~ 9 GB in size , so it might take a while to build. If you don’t want to build it but want to follow along you can use my image: htrivedi05/truss-falcon-7b:latest .

If you’re building the image yourself, you will need to tag it and push it to docker hub so that our containers in the cloud can pull the image. Here are the commands you will need to run once the image is built:

docker tag falcon-7b-model /falcon-7b-model

docker push /falcon-7b-model

Awesome! We are ready to run our model in the cloud!

(Optional steps below for running the image locally with a GPU)

If you have an Nvidia GPU and want to run your containerized model locally with GPU support, you need to ensure docker is configured to use your GPU.

Open up a terminal and run the following commands:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

apt-get update
apt-get install -y nvidia-docker2

sudo systemctl restart docker

Now that your docker has been configured to access your GPU, here is how to run your container:

docker run --gpus all -d -p 8080:8080 falcon-7b-model

Again, it will take a while to download the model. To make sure everything is working correctly you can check the container logs and you should see "THE DEVICE INFERENCE IS RUNNING ON IS: cuda".

You can call the model via an API endpoint like so:

import requests

data = {"prompt": "Hi there, how's it going?"}
res = requests.post("http://127.0.0.1:8080/v1/models/model:predict", json=data)
print(res.json())

Step 4: Deploying the model into production

I’m using the word "production" quite loosely here. We’re going to run our model in kubernetes where it can easily scale and handle variable amounts of traffic.

With that being said, there are a TON of configurations kubernetes has such as network policies, storage, config maps, load balancing, secrets management, etc.

Even though kubernetes is built to "scale" and run "production" workloads, a lot of the production-level configurations you need don’t come out of the box. Covering those advanced kubernetes topics is out of the scope of this article and takes away from what we’re trying to accomplish here. So for this blog post, we’ll create a minimal cluster without all the bells and whistles.

Without further ado, let’s create our cluster!

Prerequisites:

Have a Google Cloud Account with a project
Have the gcloud CLI installed on your machine
Make sure you have enough quota to run a GPU enabled machine. You can check your quotas under IAM & Admin.

Creating our GKE cluster

We’ll be using Google’s kubernetes engine to create and manage our cluster. Ok, time for some IMPORTANT information:

Google’s kubernetes engine is NOT free. Google will not allow us to use a powerful GPU free of charge. With that being said, we are creating a single-node cluster with a less powerful GPU. It should not cost you more than $1–$2 for this experiment.

Here’s the configuration for the kubernetes cluster we’ll be running:

1 node, standard kubernetes cluster (not autopilot)
1 Nvidia T4 GPU
n1-standard-4 machine (4 vCPU, 15 GB memory)
All of this will run on a spot instance

Note: If you’re in another region and don’t have access to the exact same resources, feel free to make modifications.

Steps for creating the cluster:

Head over to google cloud console and search for the service called Kubernetes Engine

Click the CREATE button

Make sure you are creating a standard cluster, not an autopilot cluster. It should say Create a kubernetes cluster at the top.

Cluster basics

Inside the cluster basics tab, we don’t want to change much. Just give your cluster a name. You don’t need to change the zone or control plane.

Click on the default-pool tab and change the number of nodes to 1

Under default-pool click the Nodes tab in the left-hand sidebar

Change the Machine Configuration from General Purpose to GPU
Select the Nvidia T4 as the GPU type and set 1 for the quantity
Enable GPU timeshare (even though we won’t be using this feature)
Set the Max shared clients per GPU to 8
For the Machine Type, select the n1-standard-4 (4 vCPU, 15 GB memory)
Change the Boot disk size to 50
Scroll down to the very bottom and check-mark where it says: Enable nodes on spot VMs

Here is a screen-shot of the estimated price I got for this cluster:

Once you’ve configured the cluster go ahead and create it.

It’ll take a few minutes for Google to set everything up. After your cluster is up and running, we need to connect to it. Open up your terminal and run the following commands:

gcloud config set compute/zone us-central1-c

gcloud container clusters get-credentials gpu-cluster-1

If you used a different zone of cluster name, update those accordingly. To check that we’re connected run this command:

kubectl get nodes

You should see 1 node appear in your terminal. Even though our cluster has a GPU, it’s missing some Nvidia drivers which we’ll have to install. Thankfully, installing them is easy. Run the following command to install the drivers:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Sweet! We are finally ready to deploy our model.

Deploying the model

To deploy our model onto our cluster we need to create a kubernetes deployment. A kubernetes deployment allows us to manage instances of our containerized model. I won’t go too deep into kubernetes or how to write yaml files as it’s out of scope.

You need to create a file called truss-falcon-deployment.yaml . Open that file and paste in the following contents:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: truss-falcon-7b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      component: truss-falcon-7b-layer
  template:
    metadata:
      labels:
        component: truss-falcon-7b-layer
    spec:
      containers:
      - name: truss-falcon-7b-container
        image: /falcon-7b-model:latest
        ports:
          - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: truss-falcon-7b-service
  namespace: default
spec:
  type: ClusterIP
  selector:
    component: truss-falcon-7b-layer
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080

What’s happening:

We are telling kubernetes that we want to create pods with our falcon-7b-model image. Make sure you replace with your actual id. If you didn’t create your own docker image and want to use mine instead, replace it with the following: htrivedi05/truss-falcon-7b:latest
We are enabling GPU access for our container by setting a resource limit nvidia.com/gpu: 1 . This tells kubernetes to request only one GPU for our container
To interact with our model, we need to create a kubernetes service that will run on port 8080

Create the deployment by running the following command in your terminal:

kubectl create -f truss-falcon-deployment.yaml

If you run the command:

kubectl get deployments

You should see something like this:

NAME              READY   UP-TO-DATE   AVAILABLE   AGE
truss-falcon-7b   0/1     1            0           8s

It will take a few minutes for the deployment to change to the ready state. Remember the model has to get downloaded from hugging face each time the container restarts. You can check the progress of your container by running the following command:

kubectl get pods

kubectl logs truss-falcon-7b-8fbb476f4-bggts

Change the pod name accordingly.

There are a few things you want to look for in the logs:

Look for the print statement THE DEVICE INFERENCE IS RUNNING ON IS: cuda. This confirms that our container is properly connected to the GPU.
Next, you should see some print statements regarding the model files being downloaded

Downloading (...)model.bin.index.json: 100%|██████████| 16.9k/16.9k [00:00<00:00, 1.92MB/s]
Downloading (...)l-00001-of-00002.bin: 100%|██████████| 9.95G/9.95G [02:37<00:00, 63.1MB/s]
Downloading (...)l-00002-of-00002.bin: 100%|██████████| 4.48G/4.48G [01:04<00:00, 69.2MB/s]
Downloading shards: 100%|██████████| 2/2 [03:42<00:00, 111.31s/it][01:04<00:00, 71.3MB/s]

Once the model is downloaded and Truss has created the microservice you should see the following output at the end of your logs:

{"asctime": "2023-06-29 21:40:40,646", "levelname": "INFO", "message": "Completed model.load() execution in 330588 ms"}

From this message, we can confirm the model is loaded and ready for inference.

Model inference

We can’t call the model directly, instead, we have to call the model’s service

Run the following command to get the name of your service:

kubectl get svc

Output:

NAME                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
kubernetes                ClusterIP   10.80.0.1            443/TCP    46m
truss-falcon-7b-service   ClusterIP   10.80.1.96           8080/TCP   6m19s

The truss-falcon-7b-service is the one we want to call. To make the service accessible we need to port-forward it using the following command:

kubectl port-forward svc/truss-falcon-7b-service 8080

Output:

Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080

Sweet, our model is available as a REST API endpoint at 127.0.0.1:8080 . Open up any python script such as main.py and run the following code:

import requests

data = {"prompt": "Whats the most interesting thing about a falcon?"}
res = requests.post("http://127.0.0.1:8080/v1/models/model:predict", json=data)
print(res.json())

Output:

 {'data': {'generated_text': 'Whats the most interesting thing about a falcon?nFalcons are known for their incredible speed and agility in the air, as well as their impressive hunting skills. They are also known for their distinctive feathering, which can vary greatly depending on the species.'}}

Whoo-hoo! We have successfully containerized our Falcon 7B model and deployed it as a microservice in production!

Feel free to play with different prompts to see what the model returns.

Winding down the cluster

Once you’ve had your fun messing with Falcon 7B you can delete your deployment by running this command:

kubectl delete -f truss-falcon-deployment.yaml

Next, head over to the kubernetes engine in google cloud and delete the kubernetes cluster.

Note: All images unless otherwise noted are by the author

Conclusion

Running and managing a production-grade model like ChatGPT is not easy. However, with time the tooling will get better for developers to deploy their own model into the cloud.

In this blog post, we touched upon all the things needed to deploy an LLM into production at a basic level. We packaged the model using Truss, containerized it using Docker, and deployed it in the cloud using kubernetes. I know it’s a lot to unpack and it wasn’t the easiest thing to do in the world, but we did it anyway.

I hope you learned something interesting from this blog post. Thanks for reading!

Peace.

The post Deploying Falcon-7B Into Production appeared first on Towards Data Science.

KServe: Highly scalable machine learning deployment with Kubernetes

Lloyd Hamilton — Mon, 29 May 2023 13:59:44 +0000

Image sourced from KServe

In the wake of the release of chatGPT, it is becoming increasingly difficult to avoid technologies that leverage machine learning. From text prediction on your messaging app to facial recognition on your smart door bell, machine learning (ML) can be found in almost every piece of tech we use today.

How machine learning technologies are delivered to consumers is one of the many challenges organisations will have to address during development. The deployment strategies of ML products have a significant impact on the end users of your product. This can mean the difference between Siri on your iPhone or chatGPT in your web browser.

Beyond the sleek user interface and overly assertive chat dialogues of ChatGPT, hides the complex mechanisms required to deploy the large language ML model. ChatGPT is built on a highly scalable framework that is designed to deliver and support the model during its exponential adoption. In reality, the actual ML model will only make up a small proportion of the whole project. Such projects are often cross disciplinary and require expertise in data engineering, data science and software development. Therefore, frameworks that simplifies the model deployment processes are becoming increasingly vital in delivering models to production, helping organisations save time and money.

Without the proper operational framework to support and manage ML models, organisations will often meet bottle necks when attempting to scale the number of ML model in production.

While no single tool has emerged as a clear winner in the highly saturated market of MLOps toolkits, KServe is becoming an increasingly popular tool to help organisations meet scalability requirements of ML models.

Note: I am not affiliated with KServe nor have been sponsored to write this article.

What is KServe?

KServe is a highly scalable machine learning deployment toolkit for Kubernetes. It is an orchestration tool that is built on top of Kubernetes and leverages two other open sourced projects, Knative-Serving and Istio; more on this later.

Image sourced from KServe

KServe significantly simplifies the deployment process of ML Models into a Kubernetes cluster by unifying the deployment into a single resource definition. It makes the machine learning deployment part of any ML project easy to learn and ultimately decreases the barrier to entry. Therefore, models deployed using KServe can be easier to maintain than models deployed using traditional Kubernetes deployment that require a Flask or FastAPI service.

With KServe, there is no requirement to wrap your model inside a FastAPI or Flask app before exposing it through the internet via HTTPS. KServe has built-in functionality that essentially replicates this process but without the overhead of having to maintain API endpoints, configure pod replicas, or configure internal routing networks in Kubernetes. All you have to do is point KServe to your model, and it will handle the rest.

Beyond the simplification of the deployment processes, KServe also offers many features, including canary deployments, inference autoscaling, and request batching. These features will not be discussed as they are out of scope. However, this guide will hopefully set the foundation of understanding required to explore further.

First, let’s talk about the two key technologies, Istio and Knative, that accompany KServe.

Istio

Much of the functionality that KServe brings to the table would be difficult without Istio. Istio is a service mesh that extends your applications deployed in Kubernetes. It is a dedicated infrastructure layer that adds capabilities such as observability, traffic management, and security. For those familiar with Kubernetes, Istio replaces the standard ingress definitions typically found in a Kubernetes cluster.

The complexity of managing traffic and maintaining observability grows as a Kubernetes-based system scales. One of the best features of Istio is its ability to centralize the controls of service-level communications. This gives developers greater control and transparency over communications among services.

With Istio, developers do not need to design applications that can handle traffic authentication or authorization. Ultimately, Istio helps reduce the complexity of deployed apps and allows developers to concentrate on the important components of the apps.

By leveraging the networking features of Istio, KServe can bring features that include canary deployment, inference graphs, and custom transformers.

KNative

Knative, on the other hand, is an open-source enterprise-level solution to build serverless and event-driven applications. Knative is built on top of Istio to bring serverless code execution capabilities that are similarly offered by AWS Lambdas and Azure Functions. Knative is a platform-agnostic solution for running serverless deployments in Kubernetes.

One of the best features of Knative is the scale-to-zero feature. This is a critical component of KServe’s ability to scale up or down ML model deployment and one that maximizes resource utilization and saves on costs.

Should I use KServe?

KServe, like many other tools, is not a one size fits all solution that will fit your organisation’s requirements. It has a high cost of entry due to the fact that some experience in working with Kubernetes is required. If you are just getting started with Kubernetes, there are many resources online and I highly recommend checking out resources such as the DevOps guy on Youtube. Nonetheless, even without a deep understanding of Kubernetes, it is possible to learn to use KServe.

KServe will be ideal in organisations already leveraging Kubernetes where there are existing knowledge in working with Kubernetes. It may also suit organisations looking to move away or complement managed services like SageMaker or Azure Machine Learning in order to have greater control over your model deployment process. The increase in ownership can result in significant cost reductions and increased configurability to meet project specific requirements.

Nevertheless, the right cloud infrastructure decision will depend on a case by case basis as infrastructure requirements will differ across companies.

Pre-requisites

This guide will take you though the steps required to get you set up with KServe. You will be walked through the steps to install KServe and serve your first model.

There are several pre-requisites that will need to be met before proceeding. You will require the following:

Kubectl installation
Helm installation
Kubectx installation (optional)

Kubernetes Cluster

For this tutorial, I recommend experimenting with a Kubernetes cluster using Kind. It is a tool to run a local Kubernetes cluster without needing to spin up cloud resources. In addition, I highly recommend Kubectx as a tool to easily switch between Kubernetes context if you are working across multiple clusters.

However, when running production workload, you will need access to a fully functioning Kubernetes cluster to configure DNS and HTTPS.

Deploy a Kubernetes cluster in Kind with:

kind create cluster --name kserve-demo

Switch to the correct Kubernetes context with:

kubectx kind-kserve-demo

Installation

The following steps will install Istio v1.16, Knative Serving v1.7.2 and KServe v0.10.0. These versions are best suited for this tutorial as Knative v1.8 onwards will require DNS configuration for ingress which adds a layer of complexity that is currently out of scope.

Istio Installation.

# Install istio
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.16.0 TARGET_ARCH=x86_64 sh -
istioctl install --set profile=default -y

Install KNative Serving.

# Install the Knative Serving component
export KNATIVE_VERSION="v1.7.2"
kubectl apply -f https://github.com/knative/serving/releases/download/knative-$KNATIVE_VERSION/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-$KNATIVE_VERSION/serving-core.yaml

# Install istio-controller for knative
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.7.0/net-istio.yaml

Install cert manager. Cert manager is required to manage valid certificate for HTTPs traffic.

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --version v1.11.0 --set installCRDs=true

Create a namespace for model.

kubectl create namespace kserve

Clone the KServe repository.

git clone git@github.com:kserve/kserve.git

Install KServe Cutom Resource Definitions and KServe Runtimes into the model namespace in your cluster.

cd kserve
helm install kserve-crd charts/kserve-crd -n kserve
helm install kserve-resources charts/kserve-resources -n kserve

Great! We now have KServe installed on the cluster. Let’s get deploying!

First Inference Service

In order to ensure that the deployment went smoothly, let us deploy a demo inference service. The source code for the deployment can be found here.

kubectl apply -n kserve -f - <


The yaml resource definition above deploys a test inference service that sources a publicly available model trained using the SciKit-Learn library. KServe supports many different flavours of machine learning libraries. These include MLFlow, PyTorch or XGBoost models; more are added at each release. If none of these out-of-the-box libraries meet your requirements, KServe also support custom predictors.
It is possible to monitor the status of the current deployment by getting the available pods in the namespace.
kubectl get pods -n kserve
Image by author
If you run into issues with the deployment use the following to debug:
kubectl describe pod  -n kserve
We can also check the status of the inference service deployment with:
kubectl get isvc -A
Image by author
If the inference service is marked true, we are ready to perform our first prediction.
Performing a prediction
To perform a prediction, we will need to determine if our Kubernetes cluster is running in an environment that supports external load balancers.
kubectl get svc istio-ingressgateway -n istio-system
Kind Cluster
Clusters deployed using Kind does not support external load balancers therefore you will have an ingress gateway that looks similar to below.
Kind External Load Balancer (Image by author)
In this case we would have to port-forward the istio-ingressgateway which will allow us to access it via localhost.
Port-forward the istio ingress gateway service to port 8080 on localhost with:
kubectl port-forward -n istio-system service/istio-ingressgateway 8080:80
Then set the ingress host and port with:
export INGRESS_HOST=localhost
export INGRESS_PORT=8080
Kubernetes Cluster
If the external IP is valid and does not display , we are able to send an inference request through the internet at the IP address.
Ingress Gateway IP address (Image by author)
Set the ingress host and port with:
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
Perform Inference
Prepare an input request json file for the inference request.
cat < "./iris-input.json"
{
  "instances": [
    [6.8,  2.8,  4.8,  1.4],
    [6.0,  3.4,  4.5,  1.6]
  ]
}
EOF
Then perform an inference with curl:
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -n kserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" -d @./iris-input.json 
The request will be sent to the KServe deployment through the istio-ingress gateway. If everything is in order, we will get a json reply from the inference service with the prediction of [1,1] for each of the instances.

Scaling to Zero
By leveraging the features of Knative, KServe supports scale-to-zero capabilities. This feature effectively manages limited resources across the cluster by scaling pods not in use to zero. Scaling to zero capabilities allow the creation of a reactive system that responds to requests, as opposed to a system that is always up. This will facilitate the deployment of a greater number of models in the cluster than traditional deployment configurations can.
However, note that there is a cold start penalty for pods that have been scaled down. This will vary depending on the size of the image/model and the available cluster resources. A cold start can take 5 minutes if the cluster needs to scale an additional node or 10 seconds if the model is already cached on the node.
Let us modify the existing scikit-learn inference service and enable scale to zero by defining minReplicas: 0.
kubectl apply -n kserve -f - <

By setting minReplicas to 0, this will command Knative to scale the inference service down to zero when there are no HTTP traffic. You will notice that after a period of 30 seconds, the pods for Sklearn-Iris model will be scaled down.
kubectl get pods -n kserve
Sklearn-Iris predictors scales down to zero
To reinitialise the inference service, send a prediction request to the same endpoint.
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -n kserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" -d @./iris-input.json

This will trigger pod initialisation from cold start and return a prediction.
Conclusion
KServe simplifies the process of machine learning deployment and shortens the path to production. When combined with Knative and Istio, KServe has the added bonus of being highly customisable and bring many features that easily rivals those offered in managed cloud solutions.
Of course, migrating the model deployment process in-house has its own innate complexities. However, the increase in platform ownership will confer greater flexibility in meeting project specific requirements. With the right Kubernetes expertise, KServe can be a powerful tool that will allow organisations to easily scale their machine learning deployment across any cloud provider to meet increasing demand.
The post KServe: Highly scalable machine learning deployment with Kubernetes appeared first on Towards Data Science.



Infinitely scalable storage for Kubernetes
Flavien Berwick — Sun, 07 May 2023 12:33:16 +0000
Infinitely Scalable Storage for Kubernetes
Sometimes, you just need storage that works. The luxury of having a Cloud provider storage class is not always possible, and you have to manage it all by yourself. This is the challenge I had to answer for my on-premise client in healthcare.
In this article, you will learn why and how to install Rook Ceph to provide your Kubernetes cluster with an easy-to-use replicated storage class.
We will then deploy a file-sharing app, destroy the node on which it is deployed, and see what happens. Will Ceph make our files accessible again?
Containers to the horizon. Photo by Kelly on Pexels.
Choosing a storage solution
Storage has always been a challenge in Kubernetes as it doesn’t natively provide redundant and distributed storage solutions. With native Kubernetes, you can only attach a hostPath volume for persistent storage.
My client has its own on-premise infrastructure and wanted to make sure none of its data would get lost if one of its servers went down. Most of its apps are monoliths and don’t natively include data replication mechanisms.
So I had to choose from a variety of storage solutions. My client didn’t need ultra-high performances but wanted a stable solution. I came to choose Rook Ceph because :

It is a CNCF-graduated project (guarantee of stability and quality)
Is open source with great documentation and community support
It is easy to deploy and use
Performances are fair (chapter "The benchmarks")

Prepare your cluster
We need a Kubernetes cluster of a minimum of 3 nodes and 1 empty attached disk for each.
I recommend using Scaleway Kapsule to easily instantiate a Kubernetes cluster and attribute unformatted disks. Once the Kubernetes cluster has started, we will create an attached volume (disk) for each node :

Go to "Instances"
Select your node
Click the "Attached volumes" tab
Click "+" (Create volume) and create a new disk

Download your kubeconf file and place it in ~/.kube/config . You should now get access to your cluster with your kubectl CLI.
Install Rook Ceph
1. This blog post has a companion repo on GitHub, let’s clone it to have all resources we need
git clone https://github.com/flavienbwk/ceph-kubernetes
cd ceph-kubernetes
2. Clone the Rook repo and deploy the Rook Ceph operator
git clone --single-branch --branch release-1.11 https://github.com/rook/rook.git
kubectl create -f ./rook/deploy/examples/crds.yaml
kubectl create -f ./rook/deploy/examples/common.yaml
kubectl create -f ./rook/deploy/examples/operator.yaml
3. Create the Ceph cluster
kubectl create -f ./rook/deploy/examples/cluster.yaml -n rook-ceph
Wait several minutes for Ceph to configure the disks. Health should be HEALTH_OK :
kubectl get cephcluster -n rook-ceph
4. Create the Storage classes
Rook Ceph can provide you with two main storage classes. One is RBD and allows you to have a replicated storage in ReadWriteOnce mode. The second one we’ll install is CephFS, which allows you to have replicated storage in ReadWriteMany mode. RBD stands for RADOS Block Device and allows you to have a storage class to provision volumes in your Kubernetes cluster. This only supports ReadWriteOnce volumes (RWO). CephFS acts like a replicated NFS server. This is what will allow us to create volumes in ReadWriteMany mode (RWX).
kubectl create -f ./rook/deploy/examples/csi/rbd/storageclass.yaml -n rook-ceph
kubectl create -f ./rook/deploy/examples/filesystem.yaml -n rook-ceph
kubectl create -f ./rook/deploy/examples/csi/cephfs/storageclass.yaml -n rook-ceph
5. Deploy the Ceph dashboard
kubectl create -f ./rook/deploy/examples/dashboard-external-https.yaml -n rook-ceph
Forward dashboard’s HTTP access :
kubectl port-forward service/rook-ceph-mgr-dashboard -n rook-ceph 8443:8443
Connect with the username admin and the following password :
kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}"
You should get the following page browsing https://localhost:8443
Image by author: Ceph dashboard
Deploying an app
We will deploy a self-hosted file-sharing app (psitransfer) to check if our volumes bind correctly.
1. Deploy the file-sharing app (NodePort 30080)
kubectl create -f ./psitransfer-deployment-rwx.yaml
2. See on which node it is deployed
kubectl get pods -o wide -l app=psitransfer
Retrieve the IP of this node (through the Scaleway interface) and check the app is running at http://nodeip:30080
3. Let’s upload some files
Download the 5MB, 10MB and 20MB files from xcal1.vodafone.co.uk website.
Upload them to our file transfer app. Click the link that appears on the screen.
You should now see the tree files imported. Click on it and keep the link in your browser tab, we’ll use it later.
After uploading around 400MB of files, we can see the replication of data is coherent across disks. We see that the 3 disks are written simultaneously while we upload files. In the following screenshot, usage is 1% for each disk: although I uploaded on the same host, it seems the replication is working as expected with data equally persisted across the 3 disks (OSDs). Disk 2 has a lot of "read" activity as the 2 other disks synchronize data from it.

Ceph’s dashboard should look like this now :

C. Destroy and see
We’re going to stop the node hosting the web app to make sure data was replicated on the other nodes.

See on which node the app is deployed

kubectl get pods -o wide -l app=psitransfer

Power off the node from the Scaleway console

This simulates a power failure on a node. It should become NotReady after several minutes :
$> kubectl get node
NAME                                             STATUS     ROLES    AGE    VERSION
scw-ceph-test-clustr-default-5f02f221c3814b47a   Ready         3d1h   v1.26.2
scw-ceph-test-clustr-default-8929ba466e404a00a   Ready         3d1h   v1.26.2
scw-ceph-test-clustr-default-94ef39ea5b1f4b3e8   NotReady      3d1h   v1.26.2
And Node 3 is unavailable on our Ceph dashboard :

Ceph’s dashboard should now look like this :

3. Reschedule our pod
The scheduled pod node is unavailable. However, our pod still thinks it is active :
$> kubectl get pods -o wide -l app=psitransfer
NAME                                      READY   STATUS    RESTARTS   AGE   IP            NODE
psitransfer-deployment-8448887c9d-mt6wm   1/1     Running   0          19h   100.64.1.19   scw-ceph-test-clustr-default-94ef39ea5b1f4b3e8
Delete it to reschedule it on another node :
kubectl delete pod psitransfer-deployment-8448887c9d-mt6wm
Check the status of the newly-restarted pod. Your app should be available again at the link previously kept.

To avoid having to manually delete the pod to be rescheduled when a node gets "NotReady", scale the number of replicas of your app to at least 3 by default.
You can now restart the previously powered-off node.
When to use rook-ceph-block or rook-cephfs ?
If your applications need better performance and require block storage with RWO access mode, use the rook-ceph-block (RBD) storage class. On the other hand, if your applications need a shared file system with RWX (CephFS) access mode and POSIX compliance, use the rook-cephfs storage class.
If choosing RBD and trying to reschedule a pod while its original node is offline as we did with CephFS, you will get an error from the PVC stating: "Volume is already exclusively attached to one node and can’t be attached to another". In that case, you just need to wait for the PVC to bind back (it took ~6 minutes for my cluster to automatically re-attribute the PVC to my pod, allowing it to start).
Try this behavior following the associated repo chapter.
Final word
You have learned how to install and deploy an app with Ceph. You have even proven that it replicates data. Congrats 
All images, unless otherwise noted, are by the author.
The post Infinitely scalable storage for Kubernetes appeared first on Towards Data Science.



How to Build ML Applications on the AWS Cloud with Kubernetes and oneAPI
Eduardo Alvarez — Fri, 17 Mar 2023 14:37:17 +0000
Image Source
Building and deploying high-performance AI applications can be a challenging task that requires a significant amount of computing resources and expertise. Fortunately, modern technologies such as Kubernetes, Docker, and the Intel AI Analytics Toolkit (AI Kit) make it easier to develop and deploy AI applications optimized for performance and scalability. Moreover, by using cloud services like Amazon Web Services (AWS), developers can further streamline the process and take advantage of the flexible and scalable infrastructure provided by the cloud.
In this article, we will explore how to use Kubernetes, Docker, and the Intel AI Analytics Toolkit to build and deploy AI applications on the AWS cloud. Specifically, we will focus on one of the first Intel Cloud Optimization Modules, which serves as a template with codified Intel accelerations covering various AI workloads. We will also introduce the AWS services that we will use in the process, including Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Container Registry (ECR), Amazon Elastic Compute Cloud (EC2), and Elastic Load Balancer (ELB).
Figure 1. This architecture is designed for AI production scenarios where many discrete models must be trained with low-moderate compute requirements. – Image by Author
The sample application that we will deploy focuses on Loan Default prediction, a common problem in the finance industry. We will use the daal4Py library to accelerate the inference of an XGBoost Classifier, enabling us to achieve high performance while reducing the time required to train and deploy the model.
Figure 2. A simplified API to Intel(R) oneAPI Data Analytics Library that allows for fast usage of the framework suited for Data Scientists or Machine Learning users. Built to help provide an abstraction to Intel(R) oneAPI Data Analytics Library for either direct usage or integration into one’s framework. – Image Source
By the end of this article, readers will have a basic understanding of how to build and deploy performant AI applications on the AWS cloud using Kubernetes, Docker, and the Intel AI Analytics Toolkit. Additionally, they will have a practical example of how to leverage these technologies to accelerate the inference of a loan default prediction model.
You can find all of the source code for this tutorial in our public GitHub Repository.
Get your Development Environment Ready
Install the AWS CLI – The AWS CLI (Command Line Interface) tool is a command-line tool for managing various Amazon Web Services (AWS) resources and services.
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip
unzip awscliv2.zip
sudo ./aws/install
Configure AWS Credentials using aws configure – learn more about setting credentials with aws cli here.
Install eksctl – eksctl is a command-line tool for creating, managing, and operating Kubernetes clusters on EKS.
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
eksctl version
Install aws-iam-configurator – AWS IAM Authenticator is a command-line tool that enables users to authenticate with their Kubernetes clusters on EKS using their AWS IAM credentials.
curl -Lo aws-iam-authenticator https://github.com/kubernetes-sigs/aws-iam-authenticator/releases/download/v0.5.9/aws-iam-authenticator_0.5.9_linux_amd64
chmod +x ./aws-iam-authenticator
mkdir -p $HOME/bin && cp ./aws-iam-authenticator $HOME/bin/aws-iam-authenticator && export PATH=$PATH:$HOME/bin
echo 'export PATH=$PATH:$HOME/bin' >> ~/.bashrc
aws-iam-authenticator help
Install kubectl – Kubectl is a command-line tool for interacting with Kubernetes clusters. It allows users to deploy, inspect, and manage applications and services running on a Kubernetes cluster and perform various administrative tasks such as scaling, updating, and deleting resources.
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
Our Loan Default Prediction Application
The application we will be deploying is based on the Loan Default Risk Prediction AI Reference Kit.
Image Source
We refactored the code from this reference solution to be more modular in support of our three main APIs:

Data processing – This endpoint preprocess data and stores it in a data lake or another structured format. This codebase also handles the expansion of the dataset for benchmarking purposes.
Model Training – This endpoint trains an XGBoost Classifier and converts it to an inference-optimized daal4py format.
Inference – This endpoint receives a payload with raw data and returns the loan default classification of each sample.

The directory tree below outlines the codebase’s various scripts, assets, and configuration files. The majority of the ML application code is in the app/ folder. This folder contains loan_default and utils packages – the loan_default package contains the server-side python modules that support our three main APIs. The server.py script contains the FastAPI endpoint configurations, payload data models, and commands to start a uvicorn server.
├───app/
|   ├───loan_default/
|   |   ├───__init__.py
|   |   ├───data.py
|   |   ├───model.py
|   |   └───predict.py
|   ├───utils/
|   |   ├───__init__.py
|   |   ├───base_model.py
|   |   ├───logger.py
|   |   └───storage.py  
|   ├───logs/
|   ├───server.py
|   └───requirements.txt    
|
├───kubernetes/
|   ├───cluster.yaml
|   ├───deployment.yaml
|   ├───service.yaml
|   └───serviceaccount.yaml
|
├─README.md
├─Dockerfile
├─SECURITY.md
A deep dive into the code base is beyond the scope of this tutorial. However, it is worth pointing out where we leverage the daal4py to improve our inference performance. Inside model.py, you’ll find the "train" method, which handles model training and conversion to daal4py format using the d4p.get_gbt_model_from_xgboost() function.
def train(self):
        # define model
        params = {
            "objective": "binary:logistic",
            "eval_metric": "logloss",
            "nthread": 4,  # flags.num_cpu
            "tree_method": "hist",
            "learning_rate": 0.02,
            "max_depth": 10,
            "min_child_weight": 6,
            "n_jobs": 4,  # flags.num_cpu,
            "verbosity": 0,
            "silent": 1,
        }

        log.info("Training XGBoost model")
        self.clf = xgb.train(params, self.DMatrix, num_boost_round=500)
        self.clf = d4p.get_gbt_model_from_xgboost(self.clf)
In the original reference kit’s performance testing, this simple conversion resulted in an ~4.44x boost in performance (Figure 3).
Figure 3. For batch inference of size 1M, Intel® v1.4.2 offers up to a 1.34x speedup over stock XGBoost v0.81 and with Intel® oneDAL, up to a 4.44x speedup. – Image by Author
Configuring and Launching Elastic Kubernetes Service Clusters
Elastic Kubernetes Service is a fully managed service that makes it easy to deploy, manage, and scale containerized applications using Kubernetes on Amazon Web Services (AWS). It eliminates the need to install, operate, and scale Kubernetes clusters on your own infrastructure.
To launch our EKS cluster, we must first create our cluster configuration file.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: "eks-cluster-loanDefault"
  version: "1.23"
  region: "us-east-1"

managedNodeGroups:
- name: "eks-cluster-loanDefault-mng"
  desiredCapacity: 3
  instanceType: "m6i.large"
We can configure the name and region of our cluster deployment, as well as the version of EKS that we want to run, in our "metadata" section. Most importantly, we can configure basic requirements for we compute resources in the "managedNodeGroups" section:

desiredCapacity – the number of nodes to scale to when your stack is created. In this tutorial, we will set this to 3.
instanceType – the instance type for your nodes. This tutorial uses an m6i.large instance, a 3rd Generation Xeon (2vCPU and 8GiB). Once openly available, we recommend trying out the r7iz instance family to take advantage of the Intel Advanced Matrix Extension (AMX) — a dedicated accelerator for deep learning workloads inside of Intel 4th Generation Xeon CPUs.

We execute eksctl create cluster -f cluster.yaml to create the Cloud Formation stack and provision all relevant resources. With the current configurations, this process should take 10 to 15 minutes. You should see a log similar to Figure 4.
Figure 4. Cloud formation log for EKS cluster provision workflow – Image by Author
You should run a quick test to ensure your cluster has been provisioned properly. Run eksctl get cluster to get the name of your available cluster(s), and eksctl get nodegroup --cluster  to check on your cluster’s node group.
Setting up all of the Kubernetes Application Resources
Let’s dig into launching your Kubernetes application. This process entails creating a namespace, a deployment manifest, and a Kubernetes service. All of these files are available in the tutorial’s codebase.
Before moving on to this part of the tutorial, please:

Create a docker image using the Dockerfile in the application codebase and push it to the Elastic Container Registry on AWS.
Create and configure your kubernetes service account to grant your application proper access to S3 resources.

Image by Author
A Kubernetes namespace is a virtual cluster that divides and isolates resources within a physical cluster. Let’s create a namespace called "loan-default-app"
kubectl create namespace loan-default-app
Now, let’s configure our Kubernetes deployment manifest. A Kubernetes deployment is a Kubernetes resource that allows you to declaratively manage a set of replica pods for a given application, ensuring that the desired number of replicas are running and available at all times while enabling features such as scaling, rolling updates, and rollbacks. It also provides an abstraction layer over the pods, allowing you to define your application’s desired state without worrying about the underlying infrastructure.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "eks-loan-default-app"
  namespace: "loan-default-app"
  labels:
    app: "loan-default"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: "loan-default"
  template:
    metadata:
      labels:
        app: "loan-default"
    spec:
     serviceAccountName: "loan-default-service-account"
     topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: "loan-default"
     containers:
       - name: "loan-app-image"
         image: 
         ports:
           - containerPort: 80
         imagePullPolicy: "Always"
The Kubernetes deployment manifest (deployment.yaml) above defines the following:

kind: Deployment – The type of Kubernetes resource
name: "eks-loan-default-app" – The name of our deployment
namespace: "loan-default-app" – The namespace that this deployment should be assigned to
app: "loan-default" – The name we assign our application
replicas: 3 – the number of desired copies of a pod that should be created and maintained at all times.
serviceAccountName: "loan-default-service-account" – make sure this matches the service account you created earlier.
topologySpreadConstraints: – helps define how pods should be distributed across your cluster. The current configuration will maintain an equal distribution of pods across available nodes.
containers: name/image – where you provide the URI for your application container image and assign the image a name.

Run kubectl apply -f deployment.yaml to create your Kubernetes deployment.
Now let’s configure our Kubernetes service. A Kubernetes service is an abstraction layer that provides a stable IP address and DNS name for a set of pods running the same application, enabling clients to access the application without needing to know the specific IP addresses of individual pods. It also provides a way to load-balance traffic between multiple replicas of the application and can be used to define ingress rules for external access.
apiVersion: v1
kind: Service
metadata:
  name: "loan-default-service"
  namespace: "loan-default-app"

spec:
  ports:
  - port: 8080
    targetPort: 5000
  selector:
    app: "loan-default"
  type: "LoadBalancer"
The Kubernetes service manifest (service.yaml) above defines the following:

kind: Service – the type of Kubernetes resource.
name: "loan-default-service" – The name of our deployment.
namespace: "loan-default-app" – The namespace that this Service should be assigned to.
port: 8080 – The port where the service will listen to.
targetPort: 5000 – The port the service will communicate with on the pods.
app: "loan-default" – The name we assigned to our application
type: "LoadBalancer" – The type of service we selected.

Run kubectl apply -f service.yaml to create your Kubernetes service.
This will automatically launch an Elastic Load Balancer – a cloud service that distributes incoming network traffic across multiple targets, such as EC2 instances, containers, and IP addresses, to improve application availability and fault tolerance. We can use the ELB’s public DNS to make requests to our API endpoints from anywhere in the world.
Here are a few tips before moving on:

Run kubectl get all -n loan-default-app to get a full overview of the Kubernetes resources you have provisioned. You should see your pods, services, and replica groups.
Run kubectl -n loan-default-app describe pod  to get a detailed description of your pod.
If you need to diagnose a specific pod’s behavior, you can start a bash shell inside your pod by running kubectl exec -it  -n loan-default-app -- bash – type exit and hit enter to exit the shell.

Testing our Loan Default Prediction Kubernetes Application
Now that all of our infrastructure is in place, we can set up the data component of our application and test our endpoints.
We will begin by downloading the dataset from Kaggle. The dataset used for this demo is a set of 32581 simulated loans. It has 11 features, including customer and loan characteristics and one label, which is the outcome of the loan. Once we have the .csv file in our working directory, we can create an S3 bucket and upload are Kaggle dataset.
# create S3 Bucket for data
aws s3api create-bucket --bucket loan-default --region us-east-1

# upload dataset
aws s3api put-object --bucket loan-default --key data/credit_risk_dataset.csv --body 

Making HTTP Requests to our API Endpoints
We will be using Curl to make HTTP requests to our server. Curl allows you to send HTTP requests by providing a command-line interface where you can specify the URL, request method, headers, and data. It then handles the low-level details of establishing a connection, sending the request, and receiving the response, making it easy to automate HTTP interactions.
We will start by sending a request to our data processing endpoint. This will create test/train files and save our preprocessing pipeline as a .sav file to S3. The body of the requests requires the following parameters:

bucket: name of S3 bucket
key: path where your raw data is saved in S3
size: total samples you want to process
backend: options include "local" or "s3" – the codebase supports running the entire app locally for debugging purposes. When using the "s3" backend, the "local_path" and "target_path" parameters can be set to "None".

curl -X POST :8080/data -H 'Content-Type: application/json' -d '{"bucket":"loan-default","backend":"s3","key":"data/credit_risk_dataset.csv","target_path":"None","local_path":"None","size":400000}'
You can navigate to your S3 bucket in the AWS console to verify that all files have been properly generated (Figure 5).
Figure 5. S3 bucket with the outputs generated by our /data endpoint – Image by Author
Now we are ready to train our XGBoost Classifier model. We will make a request to our /train endpoint, which trains our model, converts it to daal4py format, and saves it to S3. The body of the requests requires the following parameters:

bucket: name of S3 bucket
data_key: folder path that contains processed data created by our data processing API
model_key: folder where we want to store our trained model
model_name: the name that we want to give our trained model
backend: options include "local" or "s3" – the codebase supports running the entire app locally for debugging purposes. When using the "s3" backend, the "local_model_path" and "local_data_path" parameters can be set to "None."

curl -X POST :8080/train -H 'Content-Type: application/json' -d '{"bucket":"loan-default","backend":"s3","local_model_path":"None","data_key":"data","model_key":"model","model_name":"model.joblib","local_data_path":"None"}'
You can navigate to your S3 bucket in the AWS console to verify that your model file has been created (Figure 6).
Figure 6. S3 bucket with the outputs generated by our /train endpoint – Image by Author
Now that we have a trained daal4py optimized XGBoost Classifier, we can make inference requests to our API. The /predict endpoint will return a binary classification of True for high default likelihood and False for low default likelihood. The response also includes the probability generated by the classifier. In the codebase, we have set anything above a 50% probability to be labeled as a high default likelihood. This can be adjusted to return more discretized labels like low, medium, and high default likelihood. The body of the requests requires the following parameters:

bucket: name of S3 bucket
model_name: the name of the trained model is S3
data_key: folder path that contains .sav processing pipeline file (should be the same as your processed data folder)
model_key: folder where your trained model was saved in S3
sample: your model inputs as a list of dictionaries
backend: options include "local" or "s3" – the codebase supports running the entire app locally for debugging purposes. When using the "s3" backend, the "local_model_path" and "preprocessor_path" parameters can be set to "None".

curl -X POST :8080/predict -H 'Content-Type: application/json' -d '{"backend":"s3","model_name":"model.joblib","data_key":"data","bucket":"loan-default","model_key":"model","sample":[{"person_age":22,"person_income":59000,"person_home_ownership":"RENT","person_emp_length":123,"loan_intent":"PERSONAL","loan_grade":"D","loan_amnt":35000,"loan_int_rate":16.02,"loan_percent_income":0.59,"cb_person_default_on_file":"Y","cb_person_cred_hist_length":3},{"person_age":22,"person_income":59000,"person_home_ownership":"RENT","person_emp_length":123,"loan_intent":"PERSONAL","loan_grade":"D","loan_amnt":35000,"loan_int_rate":55.02,"loan_percent_income":0.59,"cb_person_default_on_file":"Y","cb_person_cred_hist_length":0}],"local_model_path":"None","preprocessor_path":"None"}'
You can expect a response from the server fairly quickly (Figure 7).
Figure 7. Payload and response from the /predict endpoint – Image by Author
You can find all of the source code for this tutorial in our public GitHub Repository. Feel free to leave a comment or message me on LinkedIn if you have any questions.
Summary and Discussion
In this tutorial, we have demonstrated how to build a Kubernetes application on the AWS cloud based on a high-availability solution architecture. We have highlighted the use of Intel Xeon processors and AI Kit components to improve performance while enabling scale with Kubernetes.
We encourage readers to watch for upcoming workshops and future Intel Cloud Optimization Modules (ICOMs), as leveraging the Intel optimizations in these modules can qualify their applications for an "Accelerated by Intel" badge.
Our goal with ICOMs is to help developers enhance the performance and scalability of their applications with intel software and hardware. With the increasing demand for high-performance cloud applications, it is crucial for developers to stay informed and utilize the latest technologies and tools available to them.
Don’t forget to follow my profile for more articles like this!
The post How to Build ML Applications on the AWS Cloud with Kubernetes and oneAPI appeared first on Towards Data Science.



Expose Kubernetes Volumes Securely Over HTTP: How to Serve PVC on the Internet
Meysam — Wed, 08 Feb 2023 11:54:10 +0000
Create Kubernetes manifests to expose PersistentVolumeClaims
Photo by Uriel Soberanes on Unsplash
Intro
You may have encountered a situation in your daily product development where you needed to get your hands on some persisted files residing in the Kubernetes cluster. One common & safe approach is to do port-forwarding, whether with the help of Kubectl or pure SSH using a bastion host.
In either case, after you’re done with the task, you’d terminate the session, and for every future interaction, you’d go through the same manual process every time.
It might be ideal, security-wise, to keep your environment as sealed as possible, not giving the adversaries any chance & it is a valid reason to keep it like that.
But, if you want long-running exposure to the underlying storage out on the internet, this article is for you.
First Things First: Authentication
As this file server will be exposed publicly to the internet, your first and most important line of defense is the authentication layer. To put that into perspective, a formal definition of authentication is necessary.
Authentication is the act of proving an assertion, such as the identity of a computer system user. [source]
In layperson’s terms, authentication happens when a system user proves he is who he claims to be!
Now that we’ve cleared that let’s dig out some options for integrating authentication into our webserver (further below).

Using Nginx or Apache as a proxy, with the help of htpasswd, an Apache tool that allows storing an encrypted username-password pair in a file, which can later be used to verify a given password.
Ory Oathkeeper as a proxy, with the help of Kratos, another one of Ory’s products, as the identity provider. This is somewhat more complex than the earlier approach, and it takes some learning curve to master the configuration and the provisioning of those two. I will cover another article later about this, so stay tuned!  

Of course, you can add many more to this list, but for the sake of keeping this article short, and honestly, because I don’t know many other solutions, I’ll suffice to the two items above for the moment.
Another point I want to mention here is that since this article is about exposure to the internet, I’m not talking about private network solutions here. Still, you can imagine that will also be one safe option.
Now that we know Ory’s products are not the easiest to provision and that the author is not an authentication expert   let’s keep it simple and go for the first approach.
Create the htpasswd File
htpasswd is quite a simple tool to enforce a Basic authentication mechanism into any platform. It works by receiving a username and a password as input. The result will be a one-way hashed password in a file or standard output that can later be used to verify the user credential. Still, it can not be reversed (de-hashed) to the original password in a reasonable amount of time, at least not in 2023, with our current computing capacity!
To have a simple demonstration, look at the snippet below.

	


This will only create a new file for a user and tries to verify it with both the correct password and the wrong one.
We will use the same in our "Secure File Server," exposed publicly to the internet.
Reverse Proxy
Unless you want to handle the authentication on the file server layer (I know I won’t), you’ll use a reverse proxy to sit right in front, receiving every traffic and failing all those with the wrong credentials. You may even add another restricting measure, including but not limited to rate-limiting, logging, instrumentation, reporting, etc.
Apache & Nginx can both work with a htpasswd generated file to verify the credential. You can see the links below for more information on each:

https://httpd.apache.org/docs/2.4/howto/auth.html
https://docs.nginx.com/nginx/admin-guide/security-controls/configuring-http-basic-authentication/

I’m sure other web servers are fine, doing the same stuff as the one mentioned here.
In this article, I’m going to go with Nginx & since this will be hosted in Kubernetes, it will be a docker container of Nginx. This allows me to mount any number of config files into the /etc/nginx/conf.d directory, and the Nginx web server process picks it up.
So, if I can mount any config file in the directory, I can write a config file in a Kubernetes ConfigMap, and mount that into the container’s desired directory. This is both powerful and quite flexible.
This is the configuration I’m about to mount into the Nginx container.

	


The entry namedproxy_passyou see in the configuration file points to the file server that will expose the file system’s directory using the HTTP protocol. More on this in the next section.
Photo by Carl Barcelo on Unsplash
File Server Over HTTP
There are just a few to mention among the many other static web servers.

Python module: http.server
Npm: serve
Nginx static serving

Of course, this list can grow more, but we’re trying to keep it short and informative. 
In this article, I’ll be using Python’s builtin module: http.server. It has got a straightforward and intuitive interface and makes usage straightforward.
The way you can serve static content with it is as below:
ADDRESS=0.0.0.0 PORT=8000 DIRECTORY=/tmp
python -m http.server -b $ADDRESS -d $DIRECTORY $PORT
This works very well, especially since you don’t need to do a lot of magic to make it work.
Having this web server running and accessible from the Nginx container means you can mount your PersistentVolumeClaims in the static web server & place the Nginx described above right in front of it to gate for unauthenticated access to your precious data in the Kubernetes cluster.
Mount Kubernetes ConfigMap as Volume
Before we wrap it all up into one unified manifest, one last critical piece of information is used in this approach and needs a little explanation. But if you’re already a master on how to mount a ConfigMap as a Volume to a container in Kubernetes, feel free to skip this section.
To mount a Kubernetes ConfigMap as a Volume, you use the projection in the volumes’ section of the container definition like below [source]:

	


Right at the same level as containers , there is a volumes defined, which can accept a couple of volume types, one being ConfigMap. This allows for defining some script and passing that as volume to the running container.
After creating the manifest above, here’s what the logs will show from the container.
kubectl logs job/demo -c demo
total 0      
drwxr-xr-x    2 root     root          45 Feb  4 06:58 ..2023_02_04_06_58_34.2149416564
lrwxrwxrwx    1 root     root          32 Feb  4 06:58 ..data -> ..2023_02_04_06_58_34.2149416564
lrwxrwxrwx    1 root     root          21 Feb  4 06:58 favorite-color -> ..data/favorite-color
lrwxrwxrwx    1 root     root          16 Feb  4 06:58 names.txt -> ..data/names.txt
red
Alex
Jane
Sam
Wrapping it All Together
Now that we’ve seen all the information piece by piece, it’s time to put them together to serve as a unified manifest, which will be used for one sole purpose: An HTTP serving file server.

	


The second file, an Ingress resource, is optional but still included since this article was about publicly exposing an HTTP static webserver. You will only get internet exposure if you create the Ingress.
To get another safety measure in place, you can assign a UUID-generated value to your subdomain to avoid having only the username & password as your ONLY security measure. It can be something like this:
415cb00c-5310-4877-8e28-34b05cadc99d.example.com
Otherwise, you’re only as safe as your username & password, and if your brand gives away the username you have assigned, then you’re only as secure as your password, and that’s different from the position you’d like to put yourself in!
Also, remember that you will need HTTPS. You never want to have some random fellow eavesdrop on your connection, watching you transmit precious customer data over the internet.
Photo by Haley Phelps on Unsplash
Conclusion
Since this is Kubernetes we’re talking about, there is no tying to any cloud providers, and you simply apply this practice anywhere with a Kubernetes cluster.
This will effectively mean that you can securely expose your dynamic provisioned persistent volumes to the internet!
Take the "secure" part with a grain of salt since this is not the safest bet, but you can still protect yourself if you assign a random string to the subdomain of your Ingress. That way, the attacker will have to find the URL out of many combinations on the internet, effectively taking many years. By which time, we may have all been gone by then!
Have an excellent rest of the day. Stay tuned, and take care!

Acknowledgment
I hope you find this helpful article. Here’s a list of some of my previous work you might enjoy.

How to Set Up Ingress Controller in AWS EKS
What is HAProxy & how to get the most out of it?
How to Write Your Own GitHub Action
12-Factor App For Dummies
Stop Committing Configurations to your Source Code


References

http://nginx.org/en/docs/
https://kubernetes.io/docs/
The post Expose Kubernetes Volumes Securely Over HTTP: How to Serve PVC on the Internet appeared first on Towards Data Science.