Introduction

As described in ACE-374, when scaling up, ACE might end up with lots of "customers" or tenants that, for example, share a single shop. In such cases, when you update the shop, you need to iterate over all customers and update their target and deployment repositories. This is just one example of a batch job that might need to be executed.

What we need is a generic batch job support process that allows us to queue such jobs easily. This task is about writing an analysis on how best to implement this.

Analysis and Design

So we need a mechanism to submit jobs to a queue. In a distributed system, this will be a distributed queue. Workers can then take jobs from this queue and execute them.

So what does a job look like?

We can model jobs as OSGi services with some kind of method to execute them. Another option is to model jobs as Gogo shell scripts. Services have the advantage that we can do pretty much anything from code. Scripts have the advantage that they are easy to "run anywhere" and can be created without deploying new code to ACE. Given that we also want to be able to run jobs in a distributed environment, it probably makes the most sense to implement our jobs as scripts.
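
For comparison, a job modeled as a service could be as simple as a whiteboard-style interface with a single execute method. The sketch below is purely illustrative; this Job interface is not an existing ACE API.

// Hypothetical job interface, registered as an OSGi service by whoever owns the job.
public interface Job {

    /**
     * Executes the job. Throwing an exception signals failure, so a worker
     * can decide to retry the job or put it back on the queue.
     */
    void execute() throws Exception;
}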

Looking at a typical script to update a shop, as described in the introduction, it would look somewhat like this:

# create a workspace for this customer's store, target and deployment repositories
w = (cw storeName targetName deploymentName)
# commit the workspace and remove it again
$w commit
rw $w

Queue implementation

Given that jobs are just scripts, we can also implement the queue as a shell command. It might make sense to also provide a service in case we want to submit jobs via code.

We need at least a command to push a job onto the queue and one to pull a job from the queue.
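
As a rough sketch (the JobQueue name and the push/pull command names are placeholders, not existing ACE commands), assuming jobs are kept as plain script strings, such a queue could be a single service that doubles as a pair of shell commands:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical queue service; registering it with the Gogo service properties
// osgi.command.scope=job and osgi.command.function={"push","pull"} also makes
// it available from the shell as job:push and job:pull.
public class JobQueue {
    private final BlockingQueue<String> m_jobs = new LinkedBlockingQueue<>();

    /** Pushes a job (a Gogo script) onto the queue. */
    public void push(String script) {
        m_jobs.offer(script);
    }

    /** Pulls the next job from the queue, or returns null if the queue is empty. */
    public String pull() {
        return m_jobs.poll();
    }
}

Registered that way, the same instance can be injected into code and used from the shell, which covers both ways of submitting jobs.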

If we later want to use a distributed model, we can make the queue a "remote service" so we can access it from anywhere inside our grid.
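
With the OSGi Remote Services specification that could boil down to exporting the same service with the standard service.exported.interfaces property. The sketch below assumes the hypothetical JobQueue from above and a Remote Services implementation running in the framework; discovery and transport are left entirely to that implementation.

import java.util.Hashtable;
import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;

// Sketch: export the queue as a remote service so workers elsewhere in the
// grid can reach it.
public class Activator implements BundleActivator {
    @Override
    public void start(BundleContext context) {
        Hashtable<String, Object> props = new Hashtable<>();
        props.put("service.exported.interfaces", "*");
        context.registerService(JobQueue.class, new JobQueue(), props);
    }

    @Override
    public void stop(BundleContext context) {
        // the registration is cleaned up automatically when the bundle stops
    }
}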

Another thing to consider is to persist the queue so it survives restarts.
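
A very naive way to do that (purely illustrative; the spool directory and file naming are made up, and re-reading the spool on startup is not shown) would be to write every submitted script to disk:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Naive sketch: spool every submitted script to a file so it survives a restart.
public class SpoolingJobQueue extends JobQueue {
    private final Path m_spoolDir = Paths.get("store", "jobs");

    @Override
    public void push(String script) {
        super.push(script); // keep the in-memory queue as the working copy
        try {
            Files.createDirectories(m_spoolDir);
            Files.write(m_spoolDir.resolve(System.nanoTime() + ".gosh"),
                script.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new RuntimeException("Failed to persist job", e);
        }
    }
}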

When we've pulled a job from the queue, we need to execute it. That requires a command similar to the existing "sh" or "gosh" commands, except that those read the script from disk while we have it in memory. When executing the script it probably makes sense to detect whether it failed; if it did, we could retry it or put it back in the queue for later execution.
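
A worker could use the Gogo CommandProcessor API that the shell itself is built on. The sketch below assumes the hypothetical JobQueue from above and uses the simplest possible failure policy: put the script back on the queue.

import java.io.InputStream;
import java.io.PrintStream;
import org.apache.felix.service.command.CommandProcessor;
import org.apache.felix.service.command.CommandSession;

// Sketch of a worker that pulls scripts from the queue and executes them
// in-memory through Gogo, re-queueing a job when it fails.
public class JobWorker {
    private final CommandProcessor m_processor; // injected, e.g. via the Dependency Manager
    private final JobQueue m_queue;              // injected

    public JobWorker(CommandProcessor processor, JobQueue queue) {
        m_processor = processor;
        m_queue = queue;
    }

    public void runOnce(InputStream in, PrintStream out, PrintStream err) {
        String script = m_queue.pull();
        if (script == null) {
            return; // nothing queued right now
        }
        CommandSession session = m_processor.createSession(in, out, err);
        try {
            session.execute(script);
        } catch (Exception e) {
            // simplest retry policy: put the job back for later execution
            m_queue.push(script);
        } finally {
            session.close();
        }
    }
}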

Thoughts

Not necessarily part of this analysis because there are no direct requirements for it, but:

  • Once we have such a mechanism, we can also add features like a "cron"-like command that executes a specific script at a certain point in time.
  • Instead of a queue we could also consider using a (tuple/java)space, where workers can freely fetch jobs from the space.
  • In future distributed scenarios, maybe not all workers/clients are the same (in terms of speed, commands they support, etc.) so we might want to add a simple matching or scheduling algorithm to ensure we execute the jobs in the most efficient way.