TensorFlow Serving – Machine Learning Like It’s Nobody’s Business

Serving up Models

Welcome to part 2 of my series on Deep Learning 101! If you missed part 1 on how we generated the model that we’ll be using, I highly recommend you return to that tutorial and go through the steps of understanding how Deep Learning works, exploring the Colab notebook, and generating the model.

To bring you up to speed, by now you should have generated a model from the input and output data and downloaded it to your desktop as c2f_model.zip.

While we have a trained model and can test data through a Colab notebook, that doesn’t help us if we want to expose the model to another system over the Internet. A notebook is good for running reports and exploring your own data, but the true power of the Internet is web services and APIs! I recall that when I started working with Jupyter and Colab notebooks, I was thinking, “This is cool for my own exploration, but how the heck is this useful in my mobile and web applications?” The Internet runs on systems talking to other systems, and a notebook that can generate some predictions, while cool, is limiting.

Enter TensorFlow Serving! The Serving project from TensorFlow allows us to serve models using a REST API. If you are wondering how you can leverage all the work you put into building your models to apply to your existing applications, this is what you have been looking for. Here’s where Serving comes into practice.

While TensorFlow Serving provides out-of-the-box integration with TensorFlow models, it can serve other types of models and data as well.

As a short primer, I’ll break down the key concepts here to give some context. You can find more details on the project’s architecture page.


Servables

Servables are the underlying objects that clients use to perform computation (for example, a lookup or inference).

The size and granularity of Servables are flexible. A single Servable might include anything from a single shard of a lookup table to a single model to a tuple of inference models. Servables can be of any type and interface, enabling flexibility and future improvements such as:

  • Streaming results
  • Experimental APIs
  • Asynchronous modes of operation

Servables do not manage their own lifecycle.

Typical servables include the following:

  • TensorFlow SavedModelBundle (tensorflow::Session)
  • Lookup table for embedding or vocabulary lookups


Models

TensorFlow Serving represents a model as one or more servables. A machine-learned model may include one or more algorithms (including learned weights) and lookup or embedding tables.

You can represent a composite model as either of the following:

  • Multiple independent servables
  • Single composite servable

A servable may also correspond to a fraction of a model. For example, a large lookup table could be sharded across many TensorFlow Serving instances.


Loaders

Loaders manage a servable’s lifecycle. The Loader API enables common infrastructure that is independent of the specific learning algorithms, data, or product use cases involved. Specifically, Loaders standardize the APIs for loading and unloading a servable.


Sources

Sources are plugin modules that find and provide servables.

Aspired Versions

Aspired versions represent the set of servable versions that should be loaded and ready.


Managers

Managers handle the full lifecycle of servables, including:

  • Loading servables
  • Serving servables
  • Unloading servables
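
To make the vocabulary above a little more concrete, here is a toy Python sketch of how these pieces relate. It is only an illustration and not the TensorFlow Serving API: `ToyManager`, `set_aspired_versions`, and `load_half_plus_two` are hypothetical names, and a "servable" here is just a Python function.

```python
from typing import Callable, Dict, Iterable, Optional

Servable = Callable[[float], float]


def load_half_plus_two(version: int) -> Servable:
    """Hypothetical loader: returns a servable computing 0.5 * x + 2."""
    return lambda x: 0.5 * x + 2


class ToyManager:
    """Toy stand-in for a Manager; not the real TensorFlow Serving class."""

    def __init__(self, loader: Callable[[int], Servable]) -> None:
        self.loader = loader
        self.loaded: Dict[int, Servable] = {}

    def set_aspired_versions(self, versions: Iterable[int]) -> None:
        """Reconcile loaded versions with the set that should be ready."""
        aspired = set(versions)
        for version in set(self.loaded) - aspired:
            del self.loaded[version]  # unload versions no longer aspired
        for version in aspired - set(self.loaded):
            self.loaded[version] = self.loader(version)  # load new versions

    def serve(self, x: float, version: Optional[int] = None) -> float:
        """Run a loaded servable; default to the highest loaded version."""
        if version is None:
            version = max(self.loaded)
        return self.loaded[version](x)


manager = ToyManager(load_half_plus_two)
manager.set_aspired_versions([1])  # a Source "finds" version 1
print(manager.serve(2.0))          # 0.5 * 2.0 + 2
manager.set_aspired_versions([2])  # version 2 replaces version 1
print(sorted(manager.loaded))
```

The point of the sketch is the division of labour: something announces which versions should be ready, the loader knows how to bring one version up, and the manager reconciles the two and answers requests.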

TensorFlow Serving RESTful API

While there are a number of ways I could show TensorFlow Serving being used within my client’s architecture and DevOps pipelines, the most straightforward is to talk about the RESTful API. At its core, this is a JSON request and response. Perfect! This is how most systems interact with each other. You could easily imagine a request coming into your system asking for a predicted future state (say, the time to failure of a piece of equipment, based on its diagnostics data) and the model returning that prediction.

Sequence Diagrams

[Figure: TensorFlow Serving Sequence Diagram 1]

In this first use case, we feed our application data from an IoT device, store the data in a data store, and use a data pipeline to train the model and store it in a location where TensorFlow Serving can access it when needed.

[Figure: TensorFlow Serving Sequence Diagram 2]

We could then use a different application to request a prediction. The client could be anything from a mobile device to a desktop or a cloud service. It calls into our web application, which talks to our TF Serving process and returns the data to the client, all through standard APIs using JSON.
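
As a rough sketch of the web application’s role in this second flow, the function below translates a client request into a model request and back. The client-facing field names (`celsius`, `fahrenheit`) and the `fake_predict` stand-in are hypothetical, and the one-value-per-row shape assumes a single-input model like the one queried with curl later in this post.

```python
from typing import Callable, Dict, List


def handle_conversion_request(
    client_request: Dict[str, List[float]],
    predict_fn: Callable[[dict], dict],
) -> Dict[str, List[float]]:
    """Translate a client request into a model request and back."""
    # One single-feature row per Celsius value, e.g. [100] -> [[100]].
    payload = {"instances": [[value] for value in client_request["celsius"]]}
    model_response = predict_fn(payload)
    # Flatten [[211.3]] back into [211.3] for the client.
    fahrenheit = [row[0] for row in model_response["predictions"]]
    return {"fahrenheit": fahrenheit}


def fake_predict(payload: dict) -> dict:
    """Stand-in for the HTTP call to TF Serving; uses the exact formula."""
    rows = payload["instances"]
    return {"predictions": [[row[0] * 9 / 5 + 32] for row in rows]}


print(handle_conversion_request({"celsius": [100, 0]}, fake_predict))
```

Passing the prediction call in as `predict_fn` keeps the translation logic testable without a running server; in a real deployment that argument would be a function that posts the payload to TF Serving.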

In a future blog post I will go in-depth on the ingestion process, but for now, let’s assume that we have a client that would like to discover a Fahrenheit value based on a Celsius input.

For this exercise, let’s use our current model for predicting Fahrenheit based on Celsius values over a RESTful interface.

Setting up TensorFlow Serving

Getting TensorFlow Serving up and running is super easy and quick. From the README, follow the steps outlined below. In this short exercise, we will:

  1. Download the TensorFlow Serving Docker image
  2. Clone the repo that has a few sample models. We’ll be using a demo model called “Half Plus Two”, which generates 0.5 * x + 2 for the values of x we provide for prediction.
  3. Start the container
  4. Query the model with some numbers using the curl command and ensure the values we get back are the output of the formula above.

So, open up a terminal and try executing the following commands. Before you start, make sure you have Docker installed.

# Download the TensorFlow Serving Docker image
docker pull tensorflow/serving

# Clone the TensorFlow Serving repository
git clone https://github.com/tensorflow/serving

# Location of demo models
TESTDATA="$(pwd)/serving/tensorflow_serving/servables/tensorflow/testdata"

# Start TensorFlow Serving container and open the REST API port
docker run -t --rm -p 8501:8501 \
-v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" \
-e MODEL_NAME=half_plus_two \
tensorflow/serving &

# Query the model using the predict API
curl -d '{"instances": [1.0, 2.0, 5.0]}' \
-X POST http://localhost:8501/v1/models/half_plus_two:predict

# Returns => { "predictions": [2.5, 3.0, 4.5] }
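
If you would rather call the API from code than from curl, the same request can be sketched in Python with only the standard library. The helper names (`build_predict_request`, `half_plus_two`) are mine, not part of TensorFlow Serving; the URL, port, and JSON body mirror the curl command above.

```python
import json
import urllib.request


def build_predict_request(model_name, instances, host="localhost", port=8501):
    """Build the same POST request that the curl command sends."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def half_plus_two(x):
    """Local reference for the demo model's formula, 0.5 * x + 2."""
    return 0.5 * x + 2


request = build_predict_request("half_plus_two", [1.0, 2.0, 5.0])
print(request.full_url)
print([half_plus_two(x) for x in [1.0, 2.0, 5.0]])

# With the container running, send the request like this:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

The local `half_plus_two` function gives you the values to compare against what the server returns.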

Cool! That was easy! We now have a REST API that accepts input values and returns predictions.

Let’s build on the Part 1 blog post, where we used Machine Learning to predict Celsius to Fahrenheit conversions, and see how that works here. If you’ll recall, one of the final steps was to download a .zip file of our model from Colab. Let’s take this .zip file and put it in the directory with the other test data (it really can live anywhere you want), and then we’ll unzip it and feed this model into our Docker instance.

First, stop the running container. In the console where you started it, list the running containers by executing:

docker container ls

Then take the CONTAINER ID from the output and pass it to docker container stop. For example, for me:

docker container stop 11dfc29b8b84

Once the container is stopped, let’s move c2f_model.zip into the TESTDATA directory and unzip it there. For example, from the folder where you downloaded the file:

mv c2f_model.zip "$TESTDATA"
cd "$TESTDATA"
unzip c2f_model.zip

We are almost ready to start the Docker container, but first, because TensorFlow Serving handles multiple versions of a model, it requires that the model files live inside a folder named with a version number. The number itself doesn’t matter, as long as the folder name is numeric and sits directly under the folder we mount into the container (c2f_model in the command below).

To make things easy, create a folder called 0001 inside c2f_model and move your model files into it.
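
If you prefer to script this step, here is a minimal sketch. It assumes the zip unpacks into a c2f_model folder that directly holds the SavedModel files (saved_model.pb and a variables folder), and that the 0001 folder goes inside c2f_model so the path mounted into the container contains it. The function name `move_into_version_dir` is hypothetical, and the demo runs against a throwaway directory rather than your real model.

```python
import shutil
import tempfile
from pathlib import Path


def move_into_version_dir(model_dir: Path, version: str = "0001") -> Path:
    """Move everything in model_dir into model_dir/<version>."""
    if not version.isdigit():
        raise ValueError("version directory name must be numeric")
    version_dir = model_dir / version
    version_dir.mkdir(exist_ok=True)
    for entry in list(model_dir.iterdir()):
        if entry != version_dir:
            shutil.move(str(entry), str(version_dir / entry.name))
    return version_dir


# Demonstrate on a throwaway directory that mimics an unzipped model.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "c2f_model"
    (model_dir / "variables").mkdir(parents=True)
    (model_dir / "saved_model.pb").write_bytes(b"")
    version_dir = move_into_version_dir(model_dir)
    print(sorted(p.name for p in version_dir.iterdir()))
```

To use it on the real model, you would call `move_into_version_dir` with the path to your unzipped c2f_model folder.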

Now we start up our Docker container again. From the same terminal session (so that $TESTDATA is still set), run:

docker run -t --rm -p 8501:8501 \
-v "$TESTDATA/c2f_model:/models/c2f" \
-e MODEL_NAME=c2f \
tensorflow/serving &

With a little luck (and a sprinkle of skill) you’ll see the container and service start and report:

Building single TensorFlow model file config:  model_name: c2f model_base_path: /models/c2f

Excellent! Let’s test it out!
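
Before asking for a prediction, one optional sanity check is to ask the server whether the model has finished loading. The sketch below assumes the REST API’s model status endpoint (a GET on /v1/models/c2f) and a response containing a `model_version_status` list with a `state` field, which is my understanding of that API rather than something shown in this post. The sample document is hand-written for illustration, and the helper names are hypothetical.

```python
import json
import urllib.request


def is_model_available(status: dict) -> bool:
    """Return True if any reported model version is in the AVAILABLE state."""
    versions = status.get("model_version_status", [])
    return any(v.get("state") == "AVAILABLE" for v in versions)


def fetch_model_status(model_name, host="localhost", port=8501):
    """GET the status document for a model (needs the container running)."""
    url = f"http://{host}:{port}/v1/models/{model_name}"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())


# Example status document, hand-written for illustration.
sample = {"model_version_status": [{"version": "1", "state": "AVAILABLE"}]}
print(is_model_available(sample))

# With the container running:
# print(is_model_available(fetch_model_status("c2f")))
```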

Let’s see what it says when we ask for a prediction of what 100 degrees Celsius is by executing the following command:

curl -d '{"instances": [[100]]}' \
-X POST http://localhost:8501/v1/models/c2f:predict

The result should be something similar to:

{ "predictions": [[ 211.277405]] }

Voila! Just like we saw within the Colab notebook, 100 degrees Celsius is predicted to be close to 212 degrees Fahrenheit.
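
To put a number on “close”, here is a small sketch that compares the served prediction with the exact conversion formula. The helper names are hypothetical, and `extract_values` accepts either a `predictions` or an `outputs` key, since the response key depends on the request format used.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Exact conversion, used as the ground truth."""
    return celsius * 9 / 5 + 32


def extract_values(response: dict) -> list:
    """Flatten a predict response, accepting either result key."""
    rows = response.get("predictions", response.get("outputs"))
    return [row[0] for row in rows]


# The value quoted above for an input of 100 degrees Celsius.
response = {"predictions": [[211.277405]]}
predicted = extract_values(response)[0]
exact = celsius_to_fahrenheit(100)
error = abs(predicted - exact)
print(f"predicted={predicted}, exact={exact}, error={error:.3f}")
```

An error of well under one degree is what we would expect from a small model that learned the relationship from examples instead of being given the formula.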

If you would prefer a step-by-step video walkthrough, I have created a video tutorial of the steps here.


Let’s review what we have done in this exercise.

  1. Set up a TensorFlow Serving server using Docker
  2. Queried one of the pre-trained example models via the Serving REST API
  3. Loaded our model from part 1 of this Machine Learning series, which predicts Fahrenheit from Celsius
  4. Showed how TensorFlow Serving can be used to serve the model we created

I hope you have seen the value that TensorFlow Serving can bring to your projects. Being able to use an API-based architecture will help in applying the Separation of Concerns design principle to your projects.

In a future post, I will take this model and apply it to an edge computing device. For example, what if we wanted to avoid having to make calls to the cloud and instead make predictions directly on an Android phone?

TensorFlow can do just that with TensorFlow Lite. That’s coming up next!