Introduction
In this article we will show how to run a machine learning task on the cloud. More specifically, we will show how to solve a classification problem using the Google Cloud Machine Learning solution. The core tools used to build the solution are the Python language and the TensorFlow open-source library. Google Cloud Machine Learning services allow you to perform the following tasks:
- ML Engine Job service: train and evaluate your model using advanced hardware configurations (GPU, multiple GPUs, TPU).
- ML Engine Model service: host and serve your model on the cloud. A model served on the cloud becomes scalable and benefits from the dynamic resource allocation that the cloud platform provides for optimum performance.
Both tasks are optional: it is possible to train your model on the cloud and then host it on your own server or on any other device supported by TensorFlow (e.g. mobile devices, Raspberry Pi, etc.). Alternatively, it is possible to store an already trained model in the cloud and use the ML Engine Model service to run it. This is done by creating a Model object on the cloud and linking it to the directory in which you have stored your model.
Storage on Google Cloud is organized in "buckets". Buckets are referenced using the "gs://" prefix followed by the bucket name and object path (e.g. gs://my_bucket/data). As we will see later, replacing normal file access with bucket file access is one of the modifications you may have to apply to your code to make it cloud compatible.
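As an illustration, local file access can be made bucket compatible by delegating gs:// paths to TensorFlow's file_io module, which understands bucket addresses (the helper name open_file is our own, not part of any API):

```python
def open_file(path, mode='r'):
    """Open a local file or a gs:// bucket object transparently."""
    if path.startswith('gs://'):
        # Reading from a bucket requires TensorFlow's file_io module
        from tensorflow.python.lib.io import file_io
        return file_io.FileIO(path, mode)
    # Local paths keep using the builtin open()
    return open(path, mode)
```

The same pattern applies anywhere your code reads or writes files: input data, checkpoints and exported models all live in the bucket when the job runs on ML Engine.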
Compatibility Check List
In what follows, we assume that you have a training task that is written in Python and runs correctly on your local computer. This checklist points out what you have to change in your code to make it cloud compatible:
- Parameters: in general, any Python package or routine can run on the cloud without the need for specific parameters. Still, as we will show later, there is one special parameter, named '--job-dir', that is passed to your Python main routine.
- Versioning: when a Python routine is submitted as a Job, or a trained model is submitted as a Model, you will most likely have to tell the cloud which versions of TensorFlow and Python you want to use. To set the TensorFlow version, use the --runtime-version option of the gcloud utility. To set the Python version, create a configuration file config.yaml that indicates the Python version. An example of such a file:
trainingInput:
  pythonVersion: "3.5"
- Packaging: when your Python task runs as a Job on the cloud, the first step the cloud performs is installing the Python packages used by your code and mentioned in its import clauses. Unfortunately, not all packages are compatible with ML Engine virtual machines. Sometimes it is necessary to load a package using special directives. For example, to be able to use the matplotlib package you have to select a non-interactive backend right after the import:
import matplotlib
matplotlib.use('AGG')
Another constraint is that the current directory is undefined when you are running on ML Engine. For example, if your package name is 'my_package' with the files train.py and eval.py, then you cannot import files by file name directly. Instead, reference the imported file using an explicit package name or a relative import:
Wrong: import eval
Correct: import my_package.eval OR from . import eval
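To make the parameters item of the checklist concrete, the '--job-dir' argument can be accepted in your main routine with argparse. A minimal sketch (the --train-files and --eval-files flags mirror the training command used later in this article; everything else is an assumption):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # ML Engine appends --job-dir to the user arguments of every job
    parser.add_argument('--job-dir', required=True,
                        help='Bucket path where the job writes its output')
    parser.add_argument('--train-files', nargs='+')
    parser.add_argument('--eval-files', nargs='+')
    return parser.parse_args(argv)

args = parse_args(['--job-dir', 'gs://income_emo/census_dist_1'])
print(args.job_dir)  # gs://income_emo/census_dist_1
```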
Model Definition
In our example we will show how to train a model for predicting income level, whether greater or less than 50K per year, according to different criteria. Each row of the input data, including the person's income level, consists of the following features:
- 'age'
- 'work class'
- 'education'
- 'education period'
- 'marital status'
- 'occupation'
- 'relationship'
- 'race'
- 'gender'
- 'capital gain'
- 'capital loss'
- 'work hours per week'
- 'native country'
- 'income level'
The proposed solution is to use the TensorFlow estimator DNNLinearCombinedClassifier. This estimator combines a deep neural network with a linear classifier to produce the predicted income level.
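A minimal sketch of building such an estimator (the feature columns shown and the hidden-layer sizes are illustrative assumptions, not the exact Cloud Samples code):

```python
import tensorflow as tf

# Wide (linear) part: sparse categorical features
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    'gender', ['Female', 'Male'])

# Deep (DNN) part: dense numeric features, plus dense encodings
# of the categorical ones
age = tf.feature_column.numeric_column('age')

estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=[gender],
    dnn_feature_columns=[age, tf.feature_column.indicator_column(gender)],
    dnn_hidden_units=[100, 50],
    model_dir='output')
```

The real sample wires all thirteen features into the wide and deep parts; the split between the two decides which patterns are memorized linearly and which are generalized by the network.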
Example: Training on the Cloud
This example uses the code available at Cloud Samples. After downloading the code, make sure that the Google Cloud utilities are installed on your local machine and configured to use your Google Cloud account and current project.
Next, go to the sample directory:
cd cloudml-samples-master/census/estimator
Download the data:
mkdir data
gsutil -m cp gs://cloudml-public/census/data/* data/
set BUCKET_NAME=income_emo
Create the bucket on the cloud (your credentials should already have been saved on the local machine):
set REGION=europe-west1
gsutil mb -l %REGION% gs://%BUCKET_NAME%
Copy the data to the bucket:
gsutil cp -r data gs://%BUCKET_NAME%/data
set TRAIN_DATA=gs://%BUCKET_NAME%/data/adult.data.csv
set EVAL_DATA=gs://%BUCKET_NAME%/data/adult.test.csv
gsutil cp ../test.json gs://%BUCKET_NAME%/data/test.json
set TEST_JSON=gs://%BUCKET_NAME%/data/test.json
set JOB_NAME=census_dist_1
set OUTPUT_PATH=gs://%BUCKET_NAME%/%JOB_NAME%
Submit the job using Python 3.5 (specified in the configuration file config.yaml) and TensorFlow version 1.4. The --scale-tier STANDARD_1 option requests a predefined distributed-training configuration of one master plus several workers and parameter servers:
gcloud ml-engine jobs submit training %JOB_NAME% --job-dir %OUTPUT_PATH% --config config.yaml --runtime-version 1.4 --module-name trainer.task --package-path trainer/ --region %REGION% --scale-tier STANDARD_1 -- --train-files %TRAIN_DATA% --eval-files %EVAL_DATA% --train-steps 1000 --eval-steps 100 --verbosity DEBUG
Stream training output to your local machine:
gcloud ml-engine jobs stream-logs %JOB_NAME%
At the end of the training you should be able to read the evaluation accuracy, which equals:
Saving dict for global step 1005: accuracy = 0.81025, accuracy_baseline = 0.76325, auc = 0.862617, auc_precision_recall = 0.648715, average_loss = 0.518367, global_step = 1005, label/mean = 0.23675, loss = 20.7347, prediction/mean = 0.265439
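As a sanity check on these numbers, accuracy_baseline is simply the accuracy obtained by always predicting the majority class, i.e. one minus label/mean:

```python
label_mean = 0.23675          # label/mean: fraction of '>50K' rows
baseline = 1.0 - label_mean   # always predict the majority class '<=50K'
print(baseline)               # 0.76325, matching accuracy_baseline
```

The trained accuracy of 0.81025 should therefore be judged against this 0.76325 baseline rather than against 0.5.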
Once the job has terminated successfully, you can check the training output:
gsutil ls -r %OUTPUT_PATH%
The output should be similar to the following:
TensorFlow checkpoint(s): these files represent the standard TensorFlow way of saving training states and results. They can be used to visualize the model graph using TensorBoard and/or to restart training at a later stage. It is good practice to save these files for later examination. Still, the checkpoint file format is not compatible with the ML Engine Model service, so you do not need these files to host your model on the cloud.
gs://income_emo/census_dist_1/:
gs://income_emo/census_dist_1/
gs://income_emo/census_dist_1/checkpoint
gs://income_emo/census_dist_1/events.out.tfevents.1524258793.cmle-training-master-179227214c-0-bxr6f
gs://income_emo/census_dist_1/graph.pbtxt
gs://income_emo/census_dist_1/model.ckpt-1005.data-00000-of-00003
gs://income_emo/census_dist_1/model.ckpt-1005.data-00001-of-00003
gs://income_emo/census_dist_1/model.ckpt-1005.data-00002-of-00003
gs://income_emo/census_dist_1/model.ckpt-1005.index
gs://income_emo/census_dist_1/model.ckpt-1005.meta
gs://income_emo/census_dist_1/model.ckpt-310.data-00000-of-00003
gs://income_emo/census_dist_1/model.ckpt-310.data-00001-of-00003
gs://income_emo/census_dist_1/model.ckpt-310.data-00002-of-00003
gs://income_emo/census_dist_1/model.ckpt-310.index
gs://income_emo/census_dist_1/model.ckpt-310.meta
gs://income_emo/census_dist_1/eval_census-eval/:
gs://income_emo/census_dist_1/eval_census-eval/
gs://income_emo/census_dist_1/eval_census-eval/events.out.tfevents.1524259016.cmle-training-master-179227214c-0-bxr6f
The export directory contains the actual, complete and final model data that can be used by the ML Engine Model service. This is the SavedModel file format, which contains all the information necessary to serve your model on the cloud. The main difference between the checkpoint format and the SavedModel format is that SavedModel contains MetaGraphDef metadata that specifies how input is formatted and presented to the model and how output is returned. Also, a SavedModel can contain any extra files required by your model (e.g. lookup tables, etc.).
gs://income_emo/census_dist_1/export/:
gs://income_emo/census_dist_1/export/
gs://income_emo/census_dist_1/export/census/:
gs://income_emo/census_dist_1/export/census/
gs://income_emo/census_dist_1/export/census/1524259030/:
gs://income_emo/census_dist_1/export/census/1524259030/
gs://income_emo/census_dist_1/export/census/1524259030/saved_model.pb
gs://income_emo/census_dist_1/export/census/1524259030/variables/:
gs://income_emo/census_dist_1/export/census/1524259030/variables/
gs://income_emo/census_dist_1/export/census/1524259030/variables/variables.data-00000-of-00001
gs://income_emo/census_dist_1/export/census/1524259030/variables/variables.index
This is the Python training task packaged in .tar.gz format. It is NOT used in serving the model on the cloud.
gs://income_emo/census_dist_1/packages/:
gs://income_emo/census_dist_1/packages/856741b0cb35099072134cabf18bac769e6dc0aa302ec1105a7a645e53851210/:
gs://income_emo/census_dist_1/packages/856741b0cb35099072134cabf18bac769e6dc0aa302ec1105a7a645e53851210/trainer-0.0.0.tar.gz
You can download the model to your local machine:
mkdir final_model
gsutil cp -r %OUTPUT_PATH% ./final_model
You can also examine the graph using TensorBoard:
tensorboard --logdir=./final_model
Now that we have trained the model, we are ready to create the ML Engine Model:
set MODEL_NAME=census
gcloud ml-engine models create %MODEL_NAME% --regions=%REGION%
The previous command creates an empty Model object. To "fill" it with an actual model, we need to create a version that points to the storage location from which the model can be loaded. As mentioned before, this corresponds to the SavedModel export directory:
set MODEL_BINARIES=gs://%BUCKET_NAME%/census_dist_1/export/census/1524259030/
gcloud ml-engine versions create v1 --model %MODEL_NAME% --origin %MODEL_BINARIES% --runtime-version 1.4
From now on, the ML Engine Model is active and operational. In production, a model is often developed in multiple versions, and it is possible to create multiple versions under the same model name. In the previous command we created version v1. At later stages we can retrain the model, using the same or a different data set, and create version v2, and so on. Each new export is saved under a different timestamp directory inside the "export" directory, following the pattern ./export/[model_name]/[time_stamp]/. In our example the timestamp is 1524259030.
Next, let us try to test the online model:
gcloud ml-engine predict --model %MODEL_NAME% --version v1 --json-instances ../test.json
We should get an output similar to:
LOGISTIC       [0.06436885893344879]
LOGITS         [-2.676591396331787]
PROBABILITIES  [0.9356311559677124, 0.06436885893344879]
Even though the returned output contains multiple fields, it actually reflects a single prediction. For this response, 0.0643 is the predicted probability; in our model it is defined as the logistic regression probability of the positive class (income greater than 50K). LOGITS is the log-odds, defined as ln(p/(1-p)): -2.676 = ln(0.0643/(1-0.0643)). Finally, 0.935 is simply the complementary probability, equal to 1 - 0.0643.
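These relations are easy to verify numerically, using the values copied from the response above:

```python
import math

p = 0.06436885893344879           # LOGISTIC: probability of '>50K'
logits = math.log(p / (1.0 - p))  # log-odds ln(p / (1 - p))
probabilities = [1.0 - p, p]      # ['<=50K', '>50K'], sums to 1

print(logits)                     # approximately -2.6766
```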
Online testing
We have hosted an instance of the model on our server. You can test the developed model online using this interface.