Recently I wanted to run some jobs. I'm a huge advocate of using Docker, so naturally I was going to build a Docker image to run my Python scripts, then schedule that job to run once in a while. Doing so on AWS is pretty easy using Lambda and Step Functions; however, since this wasn't a paid gig, I couldn't get someone to foot the bill. Enter Google Cloud!
Google Cloud Platform (GCP) is, in a way, the newer kid on the block. AWS has a long history as a cloud platform and excellent customer support, whereas Google customer service is a bit like Bigfoot: you've heard of it, some people say they've seen it, but it doesn't really exist... BUT Google is still an amazing technology company; they release early, then improve things until they rock (i.e. Android). Best of all, they offer $300 in free credits. So I decided to go with Google. How bad can it be?
In this post I'll talk about how I set up Google Cloud to work for me, in a rather cool way. It took lots of blood, sweat and tears, but I got it working. I schedule a job once in a while, spin up a cluster of instances, run the job, then shut it down! Not only is that cool (ya, I'm a geek), it's also quite cost-effective.
I will outline what I did, and even try to share my code with you.
Here goes:
Step 1 – Build a Docker image and push it to the Google Cloud private registry
The first step is the easiest and most trivial. It is pretty much the same as on AWS.
Create a build docker image
Let's start with creating a build image. GitLab CI allows you to use your own image as your build machine, which is cool. If you're using a different CI, I leave it to you to adjust this to your own system.
FROM docker:latest
RUN apk add --no-cache python py2-pip curl bash
RUN curl -sSL https://sdk.cloud.google.com | bash
ENV PATH $PATH:/root/google-cloud-sdk/bin
RUN pip install docker-compose
This is the Dockerfile for the build machine. It starts from the official docker image, installs Python and pip, and installs the gcloud SDK.
Then I push this build image to Docker Hub. If you haven't done this before, you need to:
1) Sign up to Docker Hub at https://hub.docker.com and remember your username.
2) In the build machine folder, run docker build . -t <your-username>/build-machine
3) run:
$ docker login
$ docker push <your-username>/build-machine:latest
Create a GCP service account
You have to create a service account, give it access to the registry, then export the key file as JSON. This is a very simple step. If you're unsure how to do it, just click through IAM & Admin in the console: create a service account, assign it a role, and export the key. Very easy.
Customize the CI script to push to the private registry
Once this is all done and you have your build machine, we can work on your CI script. I will show you how to do this on GitLab CI, but you can adapt this to your own environment. First, create a build environment variable called CLOUDSDK_JSON and paste the contents of the JSON key you created in the previous step as its value. Then add the following .gitlab-ci.yml file to your project.
image: <your-username>/build-machine

services:
  - docker:dind

stages:
  - build
  - test
  - deploy

before_script:
  - apk add --no-cache python py2-pip
  - pip install --no-cache-dir docker-compose
  - docker version
  - docker-compose version
  - gcloud version

build_image:
  stage: build
  except:
    - develop
    - master
  script:
    - docker build -t <job-image-name>:latest .

deploy:
  stage: deploy
  only:
    - develop
    - master
  script:
    - docker build -t <job-image-name>:latest .
    - echo $CLOUDSDK_JSON > key.json
    - gcloud auth activate-service-account <service-account-name> --key-file=key.json
    - docker tag <job-image-name>:latest $PRIVATE_REGISTRY/<job-image-name>:latest
    - gcloud docker -- push $PRIVATE_REGISTRY/<job-image-name>:latest
    - gcloud auth revoke
Please adjust <job-image-name> to your job's Docker image name, <service-account-name> to the service account name you created, and the build image to the one you pushed to Docker Hub. You will also need a $PRIVATE_REGISTRY CI variable pointing at your private registry (typically gcr.io/<your-project-id>). This YAML file is directed at a Python job, but you can change it to any other language.
I have 3 stages: build, test and deploy.
I build and test on all branches, but only deploy on develop and master. GitLab CI has a quirk: each stage can run on a different machine, so the image from my build stage isn't kept around for the deploy stage, which forced me to rebuild it in the deploy stage.
Once this is done, your CI system should be pushing your image to your Google private registry. Well done!
Step 2 – Running Jobs in a Temp Cluster
Here comes the tricky part. Since jobs only need to run every so often, and only for a limited period, they would ideally run as Google Cloud Functions. However, those are limited in execution time and can only be written in JavaScript (AWS supports multiple languages with Lambda, plus state machines with Step Functions). And since I didn't want to pay for a cluster running full-time, I had to develop my own way to run jobs.
Kubernetes Services
Controlling jobs in a cluster, and the cluster itself, can be achieved using Kubernetes. This is one part of GCP that really shines: it lets you define services, jobs, and pods (collections of containers), and run them.
To do this, I wrote a KubernetesService class in python that will:
– Spin up / create a cluster.
– Launch docker containers on the cluster.
– Once jobs finish, shutdown the cluster.
import datetime
import logging

import kubernetes
from googleapiclient.discovery import build
from kubernetes.client.rest import ApiException

class KubernetesService():
    def __init__(self, namespace='default'):
        self.api_instance = kubernetes.client.BatchV1Api()
        service = build('container', 'v1')
        self.nodes = service.projects().zones().clusters().nodePools()
        self.namespace = namespace
This is the class and its constructor. The full code for this class has more configuration and env variables, as it is part of the App Engine cron project. I will include the repo if you want full details on how to achieve this.
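For reference, here is a hedged sketch of the kind of module-level constants the methods below rely on. The names match the snippets, but the values are made up; in the real project they come from the app's configuration:
# Hypothetical values for illustration only; set these to match your own project.
PROJECT_ID = 'my-gcp-project'      # your GCP project id
ZONE = 'europe-west1-b'            # the zone your cluster lives in
CLUSTER_ID = 'jobs-cluster'        # the GKE cluster name
NODE_POOL_ID = 'default-pool'      # the node pool that gets resized
With those in place, the next method resizes the cluster's node pool: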
    def setClusterSize(self, newSize):
        logging.info("resizing cluster {} to {}".format(CLUSTER_ID, newSize))
        self.nodes.setSize(projectId=PROJECT_ID, zone=ZONE,
                           clusterId=CLUSTER_ID, nodePoolId=NODE_POOL_ID,
                           body={"nodeCount": newSize}).execute()
This function controls the cluster size. It can spin the cluster up before jobs need to run, then shut it down after:
    def kubernetes_job(self, containers_info, job_name='default_job', shutdown_on_finish=True):
        # Scale the Kubernetes cluster to 3 nodes
        self.setClusterSize(3)
        timestampped_job_name = "{}-{:%Y-%m-%d-%H-%M-%S}".format(job_name, datetime.datetime.now())
        # Adding the containers to a pod definition
        pod = kubernetes.client.V1PodSpec()
        pod.containers = self.create_containers(containers_info)
        pod.name = "p-{}".format(timestampped_job_name)
        pod.restart_policy = 'OnFailure'
        # Adding the pod to a Job template
        template = kubernetes.client.V1PodTemplateSpec()
        template_metadata = kubernetes.client.V1ObjectMeta()
        template_metadata.name = "tpl-{}".format(timestampped_job_name)
        template.metadata = template_metadata
        template.spec = pod
        # Adding the Job template to the Job spec
        spec = kubernetes.client.V1JobSpec()
        spec.template = template
        # Adding the final Job spec to the top-level Job object
        body = kubernetes.client.V1Job()
        body.api_version = "batch/v1"
        body.kind = "Job"
        metadata = kubernetes.client.V1ObjectMeta()
        metadata.name = timestampped_job_name
        body.metadata = metadata
        body.spec = spec
        try:
            # Creating the job
            api_response = self.api_instance.create_namespaced_job(self.namespace, body)
            logging.info('job creation result: {}'.format(api_response))
        except ApiException as e:
            print("Exception when calling BatchV1Api->create_namespaced_job: %s\n" % e)
The kubernetes_job function creates containers via create_containers (an additional helper that builds container objects with env variables). Containers are then part of a pod, that pod is part of a job template, and the template is part of a job spec. You can read more about it in the Kubernetes docs.
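The create_containers helper itself isn't shown in this post (the real one is in the repo), but here is a minimal sketch, assuming it simply maps each containers_info dict (image, name, env_vars) onto a Kubernetes container object:
    def create_containers(self, containers_info):
        # Sketch only: turn each job description dict into a V1Container
        containers = []
        for info in containers_info:
            env = [kubernetes.client.V1EnvVar(name=var['name'], value=var['value'])
                   for var in info.get('env_vars', [])]
            containers.append(kubernetes.client.V1Container(
                name=info['name'], image=info['image'], env=env))
        return containers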
    def shutdown_cluster_on_jobs_complete(self):
        api_response = self.api_instance.list_namespaced_job(self.namespace)
        if next((item for item in api_response.items if item.status.succeeded != 1), None) is None:
            logging.info("no running jobs found, shutting down cluster")
            self.setClusterSize(0)
        else:
            logging.info("found running jobs, keeping cluster up")
If you don't want the code to sit and wait for the jobs, you can poll for completion instead, and that is what shutdown_cluster_on_jobs_complete is for. It shuts down the cluster once there are no running jobs.
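To make the flow concrete, here is a usage sketch; the image name and env var below are made up for illustration:
service = KubernetesService()
containers_info = [{
    "image": "gcr.io/my-project/my-job:latest",              # hypothetical image
    "name": "my-job",
    "env_vars": [{"name": "TARGET", "value": "production"}]  # hypothetical env var
}]
# Scales the node pool up and submits the job to the cluster
service.kubernetes_job(containers_info, job_name="my-job", shutdown_on_finish=False)
# Later (e.g. from a cron-triggered endpoint): scale back down once nothing is running
service.shutdown_cluster_on_jobs_complete()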
This class successfully handles all the job scheduling and execution.
It's part of an App Engine app (however, it can be used independently).
Next, we need this script to be scheduled or triggered.
And that is the job of our cron scheduler task.
Cron scheduler App Engine service
Sadly, Google doesn't give you an easy way to run code in the cloud; you actually have to write more code to run code (silly, right?).
The concept is that App Engine provides you with a cron web scheduler that calls your own app's endpoints at given intervals.
First you add a cron.yaml to your project and configure which endpoints to hit and at what intervals:
cron:
- description: task to kick off all updates
  url: /events/run-jobs
  schedule: every 2 hours
- description: task to shutdown jobs when finished
  url: /events/shutdown-jobs
  schedule: every 5 minutes
Then we can add handlers to kick off the jobs and to shut them down.
class RunJobsHandler(webapp2.RequestHandler):
    def get(self):
        try:
            logging.info("running jobs")
            jobs_list = Settings.get("JOBS_LIST").split()
            for job_name in jobs_list:
                job_name = job_name.replace("_", "-")  # Kubernetes names cannot contain underscores
                logging.info('about to publish job {}'.format(job_name))
                containers_info = [
                    {
                        "image": Settings.get("IMAGE_NAME"),
                        "name": job_name,
                        "env_vars": [
                            {"name": "SOME_ENV_VAR", "value": "some_value"}  # example hard-coded env var
                        ]
                    }
                ]
                job_env_vars = Settings.get('JOB_ENV_VARS').split()
                for env_var in job_env_vars:
                    logging.info('adding container var {}'.format(env_var))
                    containers_info[0]['env_vars'].append({
                        "name": env_var,
                        "value": Settings.get(env_var)
                    })
                # kuberService is a module-level KubernetesService() instance
                kuberService.kubernetes_job(containers_info, job_name, False)
            self.response.status = 204
        except Exception as e:
            logging.exception(e)
            self.response.status = 500
            self.response.write("error running jobs, check logs for more details.")
        else:
            self.response.write("jobs published successfully")
Last, we want to add a Settings class to load env-like variables from the Datastore:
import os
from google.appengine.ext import ndb

if os.getenv('SERVER_SOFTWARE', '').startswith('Google App Engine/'):
    PROD = True
else:
    PROD = False

class Settings(ndb.Model):
    name = ndb.StringProperty()
    value = ndb.StringProperty()

    @staticmethod
    def get(name):
        NOT_SET_VALUE = "NOT SET"
        retval = Settings.query(Settings.name == name).get()
        if not retval:
            retval = Settings()
            retval.name = name
            retval.value = NOT_SET_VALUE
            retval.put()
        if retval.value == NOT_SET_VALUE:
            raise Exception(('Setting %s not found in the database. A placeholder ' +
                             'record has been created. Go to the Developers Console for your app ' +
                             'in App Engine, look up the Settings record with name=%s and enter ' +
                             'its value in that record\'s value field.') % (name, name))
        return retval.value
Note that most of the app depends on the Datastore. Sadly, Google doesn't give you an easy way to set environment variables, but you can store env-like variables in the Datastore.
That is what the Settings class is for.
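The Settings records have to exist in the Datastore before the handlers can read them. Here is a hedged example of seeding one, say from the App Engine interactive console; the job names are hypothetical:
# Create or update the JOBS_LIST setting that RunJobsHandler reads
setting = Settings.query(Settings.name == "JOBS_LIST").get() or Settings(name="JOBS_LIST")
setting.value = "job_one job_two"   # space separated, since the handler calls .split()
setting.put()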
Then we just bind the route handler:
import webapp2

app = webapp2.WSGIApplication([('/events/run-jobs', RunJobsHandler)],
                              debug=True)
This should allow our app to spin up a cluster, launch the containers, and then shut the cluster down. In my code I also added a handler for the shutdown.
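That shutdown handler isn't shown above (it's in the repo), but a minimal sketch of what it could look like, reusing shutdown_cluster_on_jobs_complete, would be:
class ShutdownJobsHandler(webapp2.RequestHandler):
    def get(self):
        try:
            # Poll the running jobs and scale the cluster to zero if everything finished
            kuberService.shutdown_cluster_on_jobs_complete()
            self.response.status = 204
        except Exception as e:
            logging.exception(e)
            self.response.status = 500
            self.response.write("error shutting down jobs, check logs for more details.")
Remember to register it alongside the other route, e.g. ('/events/shutdown-jobs', ShutdownJobsHandler), so the cron.yaml entry above has something to hit.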
Then make sure you have gcloud installed (here is how), deploy the App Engine app using the gcloud app deploy command (here is how), and you should be good to go.
While my example runs the same Docker image and just performs different operations with different env variables, you can easily adjust this code to suit whatever needs you might have.
Here is the full git repo:
Hope you find it useful!