[Salesforce] Scheduled batch jobs and durability

I have a process that runs scheduled Apex quite regularly as a batch job manager. This process runs on a one-time schedule, and looks into a batch job table (custom object). If it finds work to do, it kicks off a batch job.

All my batch jobs re-schedule the main batch manager to run a short time in the future after they complete. Similarly, if the main batch manager finds no jobs to do, it schedules itself to run again a short time in the future. Since these jobs are serialized in that way, there's only ever a maximum of one batch job running at a time, and only ever a maximum of one scheduled apex job.

This works great, except that very occasionally (seems to be once every few months), the scheduled job "disappears" and has to be rescheduled manually.

My batch jobs maintain thorough logs of their activity, and have very conservative exception handling – other than LimitExceptions (which I am 100% certain I am not throwing), all Exceptions are caught and logged, and the normal execution flow is then followed.

I'm not sure why the job sometimes dies, but I'm fairly certain it is not code-related. I don't have enough data yet, but it appears that the times this has happened in the past coincide with major upgrade windows from Salesforce. The last time I saw it was on NA14 on June 16 at exactly 04:00 UTC. That wasn't the stated major update window (which was listed as 06:00 UTC on June 15) but is suspiciously similar to the update times cited by Salesforce for that weekend.

  • Has anyone seen similar behaviour, where in-process batch jobs and/or scheduled Apex jobs get summarily killed during upgrade windows?
  • Has anyone addressed this issue, and if so, what was the solution? My only thought is to add a second regularly scheduled job that runs, say, once per day at a time not associated with maintenance windows, purely as a sentinel process to make sure the batch job manager is properly scheduled and/or running. However, that will cost a very valuable open scheduled job slot, which I'd prefer to avoid. I suppose I could also add similar sentinel code into the UI of my app, but there's no guarantee that users will hit any part of the UI every day.

Any help appreciated!

Best Answer

Rather than scheduling a one-time job, schedule a recurring job.

Schedule the job to run every hour. In the finish phase of your job, cancel this hourly schedule and replace it with a new hourly schedule whose first execution falls a short period (let's say 5 minutes) after the job finishes.
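As a rough illustration of the rescheduling step, the sketch below builds the hourly cron expression for a schedule whose first firing lands a few minutes after the job finishes. It is written in plain Java so it can be run off-platform; the class and method names (`HourlyCron`, `forMinute`, `afterFinish`) are hypothetical. It assumes the Salesforce cron field order of seconds, minutes, hours, day of month, month, day of week. In Apex, the resulting string would be passed to `System.schedule` after cancelling the previous schedule with `System.abortJob`.

```java
import java.time.LocalTime;

class HourlyCron {
    /** Cron expression that fires once an hour at the given minute (seconds fixed at 0). */
    public static String forMinute(int minute) {
        if (minute < 0 || minute > 59) {
            throw new IllegalArgumentException("minute must be 0-59, got " + minute);
        }
        // Fields: seconds minutes hours day-of-month month day-of-week
        return "0 " + minute + " * * * ?";
    }

    /** Hourly expression whose minute is delayMinutes after finishTime. */
    public static String afterFinish(LocalTime finishTime, int delayMinutes) {
        // plusMinutes wraps past the hour (and past midnight), so 12:58 + 5 gives minute 3.
        return forMinute(finishTime.plusMinutes(delayMinutes).getMinute());
    }

    public static void main(String[] args) {
        // A job finishing at 13:35 with a 5 minute delay.
        System.out.println(afterFinish(LocalTime.of(13, 35), 5)); // prints 0 40 * * * ?
    }
}
```

Only the minute field changes between reschedules. Keep the delay at a minute or more: if the chosen minute has already passed when the schedule is created, the first firing will be almost an hour away.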

This works in a very similar way to using a "one off" schedule (as per your existing implementation) - in both of these implementations the job is rescheduled in the finish phase, but by using a recurring schedule you have the added benefit that if for any reason the job does not execute, the platform will attempt to run it again an hour later, and every hour until it succeeds.

Note that we don't know why the job may fail to execute - but we're assuming that it relates to platform maintenance. Chaining one-off scheduled jobs together relies on the successful start and completion of each job for the integrity of the chain, whereas using a recurring scheduled job provides "auto-resume" behaviour regardless of the successful start / completion of an individual job.

Example process flow:

(1) at 12:00 we schedule a job to run every hour, at 5 minutes past the hour: 12:05, 13:05, 14:05, etc.

(2) at 12:05 the batch manager job is started according to the hourly schedule, and this checks your custom batch job object records to see if there is any work currently running or waiting.

It finds that there are no jobs running but there is a job waiting: "Foo". The batch manager therefore starts the batch process for Foo.

(3) at 13:05 the batch manager job is started according to the hourly schedule.

On this occasion it finds that job Foo is in progress, so it exits without taking any action.

(4) at 13:35 job Foo finishes.

In the finish phase, the existing hourly scheduled job is cancelled and a new hourly job is scheduled, this time to run at 40 minutes past the hour: 13:40, 14:40, 15:40, etc.

(5) at 13:40 the batch manager job is due to start according to the hourly schedule, but it fails to run (we assume because of platform maintenance).

(6) at 14:40 the batch manager job is started according to the hourly schedule.

It finds that there are no jobs running but there is a job waiting: "Bar". The batch manager therefore starts the batch process for Bar.

etc.
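The difference between the two approaches in steps (5) and (6) can be shown with a toy model. It is plain Java, not Apex, and the names (`ScheduleSimulation`, `oneOffChain`, `recurring`) are hypothetical. Times are minutes since midnight, so 13:40 is 820. The model assumes, as the answer does, that a dropped firing leaves the recurring schedule itself intact. It ignores the minute shift that happens when a batch job finishes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ScheduleSimulation {
    /** One-off chain: each run schedules the next, so a dropped run ends the chain. */
    public static List<Integer> oneOffChain(int first, int interval, int end, Set<Integer> dropped) {
        List<Integer> ran = new ArrayList<>();
        for (int t = first; t <= end; t += interval) {
            if (dropped.contains(t)) {
                break; // the run never happened, so nothing rescheduled the next one
            }
            ran.add(t);
        }
        return ran;
    }

    /** Recurring schedule: a dropped firing is skipped, later firings still happen. */
    public static List<Integer> recurring(int first, int interval, int end, Set<Integer> dropped) {
        List<Integer> ran = new ArrayList<>();
        for (int t = first; t <= end; t += interval) {
            if (!dropped.contains(t)) {
                ran.add(t);
            }
        }
        return ran;
    }

    public static void main(String[] args) {
        Set<Integer> dropped = Set.of(820); // the 13:40 firing is lost
        System.out.println(oneOffChain(820, 60, 940, dropped)); // prints []
        System.out.println(recurring(820, 60, 940, dropped));   // prints [880, 940]
    }
}
```

With the 13:40 firing lost, the one-off chain never runs again without manual intervention, while the recurring schedule resumes at 14:40 (880) and 15:40 (940).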