# Training Deep Neural Networks on Google Tensor Processing Units (TPUs)

As part of my research work at Stanford, I have been training Object Detection Deep Neural Networks. Training these networks is extremely compute-intensive, and while we have a variety of powerful compute options at Stanford, I was looking for a faster way to train these networks. This article is a detailed guide to training the popular RetinaNet object detection network on TPUs.

# Google's Tensor Processing Units (TPUs)

After finding a Tensorflow Tensor Processing Unit (TPU) enabled version of the network I was training, I reached out to Google through their TensorFlow Research Cloud program, asking for access to TPUs via Google Cloud. Google quickly responded and graciously allowed us the use of several TPUv2 compute units.

The TPUv2 is composed of 8 processors, each with 8GB of High Bandwidth Memory (HBM), and is quoted at 180 teraflops in total, or 22.5 TFLOPs per processor.

We were later granted access to several TPUv3s, which are also composed of 8 processors, but each with 16GB of High Bandwidth Memory (HBM), quoted at a total of 420 teraflops, or 52.5 TFLOPs each!

To put these specs in perspective, the Stanford DGX-1 (from NVIDIA) has 8x P100 GPUs, each with 16GB of memory, and is quoted at a total of 85 teraflops single precision, or about 10.6 TFLOPs per P100 GPU.
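The per-processor numbers above follow directly from dividing each device's quoted total throughput by its 8 processors. As a quick sanity check:

```shell
# Quoted total TFLOPs divided across the 8 processors in each device:
awk 'BEGIN { printf "TPUv2: %.1f TFLOPs/processor\n", 180/8 }'   # 22.5
awk 'BEGIN { printf "TPUv3: %.1f TFLOPs/processor\n", 420/8 }'   # 52.5
awk 'BEGIN { printf "P100:  %.3f TFLOPs/GPU\n",       85/8 }'    # 10.625, ~10.6
```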

Moreover, utilizing all 8 GPUs in parallel can be challenging, as most networks are not designed to handle this type of hardware parallelism.

Google's TPU repository contains a RetinaNet configuration, and there is also the Tensorflow Models repo, which contains a similar config. For some reason, I had some trouble with the configuration in the TPU repository; specifically, I had trouble with the evaluation. I'm sure it works fine, but since I was working off the Models repository before RetinaNet was added to the TPU repo, I'll be working off the Models repo in my examples below.

# TPU Usage Mental Model

It took a bit to wrap my head around what a "Cloud TPU" actually is and how one would use it. It turns out a "Cloud TPU" is the combination of a virtual machine (VM) and a TPU hardware accelerator, both of which get booted when a TPU is created using the ctpu CLI tool or the Google Cloud Web Interface. When working on the TensorFlow Research Cloud, the TPU and its associated VM are both free, for up to the amount of time you have been allocated the TPU resources. From now on, I'll refer to the TPU and the VM that controls it (and has an associated IP, etc.) as just the TPU.

The TPU vs. VM point is slightly confused by the fact that the ctpu tool will automatically create another VM you can use to interact with the TPU. Since you cannot directly SSH into the TPU, you'll want to use one VM of reasonable size to interact with all your TPUs. This VM that you use to interact with your TPUs is not free.

Also, for your TPUs to read and write data to disk, you'll need to use Google Cloud Storage. Though this resource is inexpensive, it is not free.
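Everything the TPU reads or writes lives under a single Cloud Storage bucket. As a sketch of the layout used later in this guide (the bucket name is a placeholder; substitute your own regional bucket):

```shell
# Placeholder bucket name -- substitute your own.
export GCP_BUCKET="pick-a-bucket-name"

# Dataset and training-output paths, as used later in this guide:
DATA_URI="gs://${GCP_BUCKET}/datasets/coco/tf2017"
MODEL_DIR="gs://${GCP_BUCKET}/train/retinanet"
echo "${DATA_URI}"
echo "${MODEL_DIR}"
```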

When creating a new Google Cloud account, you'll be granted $300 in credit, which should be plenty if you keep the above in mind. Non-TPU VMs and Google Cloud Storage are inexpensive, but not free. Limit your use of these resources as much as possible: turn off the VM when not in use and run only one if possible. Limit duplicate data stored on Google Cloud Storage and minimise transfers between off-Google-services and on-Google-services networks.

# Training RetinaNet

Now, let's train RetinaNet!

## Setup

On a new Ubuntu machine, run:

```
sudo apt-get -y install python-pil python-lxml python-tk
```

Python

We'll use pipenv to isolate this project from the global python environment:

```
pip install --user --upgrade pipenv
pipenv install tensorflow-gpu
```

Get the TensorFlow model repository. cd to your source directory and run:

```
export TF_ROOT=$(pwd)
git clone https://github.com/tensorflow/models
```
## Build

Cocoapi

```
git clone https://github.com/cocodataset/cocoapi.git
pushd cocoapi/PythonAPI
pipenv run python setup.py build_ext --inplace
make
cp -r pycocotools ${TF_ROOT}/models/research/
popd
```

Protobuf

```
pushd models/research
wget -O protobuf.zip https://github.com/google/protobuf/releases/download/v3.0.0/protoc-3.0.0-linux-x86_64.zip
unzip protobuf.zip
./bin/protoc object_detection/protos/*.proto --python_out=.
popd
```

### Environment

Set path exports in your pipenv .env file:

```
echo 'PYTHONPATH=${PYTHONPATH}:${PWD}:${TF_ROOT}/models/research:${TF_ROOT}/models/research/slim' | tee .env
echo "LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64/" | tee -a .env
```

Note that if paths in your environment change, you will need to re-generate the .env file.

## Cuda

You'll need cuda 9 and cudnn 7. Unzip the cudnn 7 tar and copy its contents into your local cuda dir:

```
tar xzf cudnn-9.0-linux-x64-v7.4.2.24.tgz
sudo cp -r cuda/* /usr/local/cuda-9.0/
rm -rf cuda
```

## Test

Test your setup with:

```
pipenv run python models/research/object_detection/builders/model_builder_test.py
```

## Data Prep

Download the COCO dataset if you don't have it. cd into your data directory first:

```
# Helper function to download and unpack a .zip file.
function download_and_unzip() {
  local BASE_URL=${1}
  local FILENAME=${2}
  if [ ! -f ${FILENAME} ]; then
    echo "Downloading ${FILENAME} to $(pwd)"
    wget -nd -c "${BASE_URL}/${FILENAME}"
  else
    echo "Skipping download of ${FILENAME}"
  fi
  echo "Unzipping ${FILENAME}"
  ${UNZIP} ${FILENAME}
}

export TF_DATA=$(pwd)

# Download the images.
BASE_IMAGE_URL="http://images.cocodataset.org/zips"
TRAIN_IMAGE_FILE="train2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${TRAIN_IMAGE_FILE}
TRAIN_IMAGE_DIR="${TF_DATA}/train2017"

VAL_IMAGE_FILE="val2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${VAL_IMAGE_FILE}
VAL_IMAGE_DIR="${TF_DATA}/val2017"

TEST_IMAGE_FILE="test2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${TEST_IMAGE_FILE}
TEST_IMAGE_DIR="${TF_DATA}/test2017"

# Download the annotations.
BASE_INSTANCES_URL="http://images.cocodataset.org/annotations"
INSTANCES_FILE="annotations_trainval2017.zip"
download_and_unzip ${BASE_INSTANCES_URL} ${INSTANCES_FILE}

TRAIN_OBJ_ANNOTATIONS_FILE="${TF_DATA}/annotations/instances_train2017.json"
VAL_OBJ_ANNOTATIONS_FILE="${TF_DATA}/annotations/instances_val2017.json"

TRAIN_CAPTION_ANNOTATIONS_FILE="${TF_DATA}/annotations/captions_train2017.json"
VAL_CAPTION_ANNOTATIONS_FILE="${TF_DATA}/annotations/captions_val2017.json"

BASE_IMAGE_INFO_URL="http://images.cocodataset.org/annotations"
IMAGE_INFO_FILE="image_info_test2017.zip"
download_and_unzip ${BASE_IMAGE_INFO_URL} ${IMAGE_INFO_FILE}

TESTDEV_ANNOTATIONS_FILE="${TF_DATA}/annotations/image_info_test-dev2017.json"
```

Get the checkpoint. cd into your data directory:

```
cd ${TF_DATA}
mkdir checkpoints && pushd checkpoints
# Download the pretrained checkpoint if you don't already have it
# (URL from the TensorFlow detection model zoo; verify against the zoo listing):
wget http://download.tensorflow.org/models/object_detection/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
tar xzf ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
popd
```


Make the 2017 (aka. trainval35k / minival2014) splits for COCO:

```
pipenv run python models/research/object_detection/dataset_tools/create_coco_tf_record.py \
  --train_image_dir="${TF_DATA}/train2017" \
  --val_image_dir="${TF_DATA}/val2017" \
  --test_image_dir="${TF_DATA}/test2017" \
  --train_annotations_file="${TF_DATA}/annotations/instances_train2017.json" \
  --val_annotations_file="${TF_DATA}/annotations/instances_val2017.json" \
  --testdev_annotations_file="${TF_DATA}/annotations/image_info_test2017.json" \
  --output_dir="${TF_DATA}/tf2017/"
```

COCO labels are in models/research/object_detection/data/mscoco_label_map.pbtxt. Copy them into your data directory with something like:

```
cp ${TF_ROOT}/models/research/object_detection/data/mscoco_label_map.pbtxt ${TF_DATA}/
```

## Google Cloud

In your environment, set:

```
export GCP_PROJECT="some-project-name-1337"
export GCP_BUCKET="pick-a-bucket-name"
```

Then:

- Install gcloud and run gcloud auth login to log in
- Run gcloud config set project some-project-name-1337 to set the default project
- Make sure the TPU and ML APIs have been enabled
- Make sure there is a regional cloud storage bucket (central-1, where the TPUs will run) named pick-a-bucket-name

Get your service account (the response should be in the format ..."tpuServiceAccount": "[email protected]"):

```
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://ml.googleapis.com/v1/projects/${GCP_PROJECT}:getConfig
```

and add it to the environment:

```
export [email protected]
```

Grant the account permission:

```
gcloud projects add-iam-policy-binding $GCP_PROJECT \
  --member serviceAccount:$GCP_TPU_ACCOUNT --role roles/ml.serviceAgent
```

Set your default region/zone to the location with free TPUs:

```
gcloud config set compute/zone us-central1-f
```

A note from Google on free TPUs:

```
Activating Allocations:
5 v2-8 TPU(s) in zone us-central1-f
100 preemptible v2-8 TPU(s) in zone us-central1-f

IMPORTANT: This free 30-day trial is only available for Cloud TPUs you create
in the zones listed above. To avoid charges, please be sure to create your
Cloud TPUs in the appropriate zone.
```

Verify you have the right zone:

```
gcloud config list
```

Create a new service account so python can access GCP resources:

- Open https://console.cloud.google.com/apis/credentials/serviceaccountkey?project=some-project-name-1337&folder&organizationId
- Set the type to editor
- Hit create, copy the downloaded file into the project's credentials folder, and export it in the shell:

```
mv ~/Downloads/my-creds-2395ytqh3.json credentials/service.json
export GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/credentials/service.json
```


Copy the coco dataset to the bucket:

```
# data is in ${TF_DATA}/tf2017
gsutil -m cp -r ${TF_DATA}/tf2017 gs://${GCP_BUCKET}/datasets/coco/tf2017
```

Copy labels:

```
gsutil cp ${TF_DATA}/mscoco_label_map.pbtxt gs://${GCP_BUCKET}/datasets/coco/labels/mscoco_label_map.pbtxt
```

Copy the checkpoint:

```
# data is in ${TF_DATA}/checkpoints/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/
gsutil cp ${TF_DATA}/checkpoints/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/model.ckpt.* gs://${GCP_BUCKET}/datasets/coco/checkpoints/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/
```

## Running on TPU

### Setup

This only needs to be done once per project. Grant the TPU service account bucket permission: open https://console.cloud.google.com/iam-admin/iam?project=some-project-name-1337 and add the TPU service account to:

- Storage Admin
- Viewer

### On each Run

Bring up a TPU (v3):

```
ctpu up --zone us-central1-a --name=tpu-1--v3 --tpu-size=v3-8 --tpu-only
```

Or v2:

```
ctpu up --zone us-central1-f --name tpu-1--v2 --tpu-size=v2-8 --tpu-only
```

(or leave off --tpu-only if you also need a VM). Try to run only one VM, though, and as many TPUs as you need: we're billed for VMs, but up to 5 TPUs and 100 preemptible TPUs are free.

This will print a TPU name; paste it below. If keys are created, you'll see a message like:

```
Your identification has been saved in $HOME/.ssh/google_compute_engine.
Your public key has been saved in $HOME/.ssh/google_compute_engine.pub.
```

Connect to the TPU:

```
# From the project root, connect with port forwarding. This is only possible
# if you've left the --tpu-only flag off the ctpu up command.
gcloud compute ssh tpu-1--v2 -- -L 6006:localhost:6006
```

Set up the TPU. On the VM you've created, run:

```
wget https://gist.githubusercontent.com/nathantsoi/8a422a19d08335f52dc49657058da251/raw/83e2a374224723715ebae529ab52668ac5d71a9c/setup_tpu.sh
chmod +x setup_tpu.sh
./setup_tpu.sh
```

You'll want to edit ~/.bashrc and set GCP_BUCKET to your bucket name.

Local training on the TPU (once ssh'd in):

```
# Start a tmux session, to keep the session running after disconnecting, then:
pushd ${RUNNER}
export JOB_NAME=retinanet
export MODEL_DIR=gs://${GCP_BUCKET}/train/${JOB_NAME}
python models/research/object_detection/model_tpu_main.py \
  --gcp_project=some-project-name-1337 \
  --tpu_name=tpu1 \
  --tpu_zone=us-central1-f \
  --model_dir=${MODEL_DIR} \
  --pipeline_config_path=models/research/object_detection/samples/configs/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config \
  --alsologtostderr
```

Run eval on a local GPU box. TPU inference isn't supported yet, so you won't be able to run the eval on a TPU:

```
export JOB_NAME=retinanet_baseline
export MODEL_DIR=gs://${GCP_BUCKET}/train/${JOB_NAME}
python models/research/object_detection/model_main.py \
  --model_dir=${MODEL_DIR} \
  --pipeline_config_path=models/research/object_detection/samples/configs/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config \
  --checkpoint_dir=${MODEL_DIR} \
  --alsologtostderr
```