Rarely, batch jobs do fail to execute with no error message. One mode that causes this is when a SOQL query cannot complete within the time limits: for example, a particularly non-selective query run during the batch `execute()` method will cause the rest of the job to fail to execute. That's not an "abort", but an outright failure to execute.
The really kooky details are:

- the job does not appear as though `System.abortJob` took place,
- the remaining `execute()` invocations are discarded,
- the `finish()` method does run.
If you are able to raise a case with Salesforce, they may be able to analyze it to the extent of:

"After processing N batches, one query in Class at line 46 has timed out as it was running for more than 2 minutes, and the job was aborted there."
On one occasion (API 34.0), they were also able to confirm that there is some erroneous behaviour around the UI display of the AsyncApexJob, which should show Failed but doesn't.
Perhaps, by exceeding the heap size with `Database.Stateful`, you have incurred a similar failure mode, where the whole job has to be trashed rather than continuing, because of memory limits?
A Queueable chain is probably going to be the best solution pattern for you here, understanding that (as you say) there is not a particularly fluent Salesforce idiom for this kind of processing.
The nice thing about the Queueable chain is that it can essentially parallelize all of the different invocations of this functionality, even though each one individually runs serially. Hence, you won't run into issues with callout limits due to data volume, since you spin off each sequence of callouts into a separate Queueable. You can chain multiple different Queueable classes to get each callout and final processing completed.
Basically, the way the Queueable chain works is that each "step" is one Queueable class, which completes its work and, when finished, enqueues the next "step", which could be a different class. When you get to the last callout in the sequence, that Queueable can then enqueue your final, post-callout processing. There's no polling or `sleep()` functionality required: each job just kicks off the next as it finishes, and you don't monitor the chain externally.
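A minimal sketch of such a chain, assuming a sequence of endpoints to call; `CalloutChainStep` and `FinalProcessingJob` are illustrative names, not classes from your org:

```apex
public class CalloutChainStep implements Queueable, Database.AllowsCallouts {
    private Integer stepNumber;
    private List<String> endpoints; // the sequence of callouts to make

    public CalloutChainStep(Integer stepNumber, List<String> endpoints) {
        this.stepNumber = stepNumber;
        this.endpoints = endpoints;
    }

    public void execute(QueueableContext ctx) {
        HttpRequest req = new HttpRequest();
        req.setEndpoint(endpoints[stepNumber]);
        req.setMethod('GET');
        HttpResponse res = new Http().send(req);
        // ... handle this step's response here ...

        if (stepNumber + 1 < endpoints.size()) {
            // Kick off the next step in the chain; no polling or sleep() needed.
            System.enqueueJob(new CalloutChainStep(stepNumber + 1, endpoints));
        } else {
            // Last callout complete: enqueue the final, post-callout processing.
            System.enqueueJob(new FinalProcessingJob());
        }
    }
}
```

Each transaction makes one callout and ends, so callout and CPU limits reset between steps; the chain terminates itself once the last endpoint has been called.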
Yes, it's far from ideal, but it's probably the best, and definitely the most stable, option relative to trying to simulate a `sleep()` call or mucking around with the Tooling API.
If it's most critical that the callouts be run in parallel, you could stick with the Queueable chain, but use a pattern like the one Dan Appleman develops in his Advanced Apex book. Basically, you'd serialize the request to a custom object and kick off N queueable chains, each of which has the job of running one of your N callouts on the data stored in that custom object (however many records there happen to be across your org). Those chains would all run in parallel.
Each Queueable, as it processes each custom object, would write its results back to the object in a different custom field. You could use a workflow rule or Process Builder to kick off the final processing job, contingent on the condition that all N custom fields are populated, meaning that all N jobs have completed.
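The same completion condition could also be checked in a trigger rather than a workflow rule or Process Builder. A hedged sketch, where `Callout_Work__c` and its `Result_N__c` fields are assumed names (not from the book or your org), shown for N = 3:

```apex
// Fires the final processing once all N result fields are populated,
// meaning all N parallel Queueable chains have written their results back.
trigger CalloutWorkComplete on Callout_Work__c (after update) {
    for (Callout_Work__c work : Trigger.new) {
        if (work.Result_1__c != null
                && work.Result_2__c != null
                && work.Result_3__c != null) {
            // All N jobs are done for this record; kick off final processing.
            System.enqueueJob(new FinalProcessingJob());
        }
    }
}
```

Because each chain writes to its own field, the last writer (whichever chain finishes last) is the one whose update satisfies the condition and launches the final job.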
This is all assuming that you aren't working within a Visualforce page, since you mention doing this on the backend. If you are working in Visualforce, the Continuation object might just suit you.
Best Answer
I'm thinking I can use the Org cache to store how many instances of the job were executed; then, in the `finish` method, I can create a platform event that invokes a trigger to update the number of jobs executing. When that cache value reaches 0 in the trigger, I can execute the Apex.
Edit: This still isn't a 100% perfect solution, because the triggers subscribed to the platform event could run in parallel and Platform Cache does not have mutexes for reading/writing. Instead of using Platform Cache, I may need to use a concrete object queried with `FOR UPDATE` to get a platform-level mutex on the record, so the number of parallel jobs is updated in a thread-safe manner.
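A hedged sketch of that `FOR UPDATE` idea, run from the platform event trigger; `Job_Counter__c`, `Running_Jobs__c`, and `FinalProcessingJob` are assumed names:

```apex
// A single counter record acts as the mutex: FOR UPDATE locks the row,
// so concurrent trigger transactions decrement it one at a time.
Job_Counter__c counter = [
    SELECT Id, Running_Jobs__c
    FROM Job_Counter__c
    WHERE Name = 'BatchFanOut'
    LIMIT 1
    FOR UPDATE // other transactions block here until this one commits
];

counter.Running_Jobs__c -= 1;
update counter;

if (counter.Running_Jobs__c == 0) {
    // All parallel jobs have finished; run the final Apex.
    System.enqueueJob(new FinalProcessingJob());
}
```

The row lock is held until the transaction commits, so exactly one transaction can observe the counter hitting zero, avoiding the race that Platform Cache can't prevent.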