[SalesForce] Only query Id field in start() to speed up Apex Batch

I am looking for ways to speed up a complex Batch job which takes days to run in a specific customer org.

This Batch is snapshoting records of arbitrary objects into denormalized (as a single records contains fields of multiple related objects) "backup" object.

public Database.QueryLocator start(Database.BatchableContext context) {
    String complexSoql = 'SELECT a,b,c,..z (SELECT a,b,c,..z FROM Childs__r...), (SELECT...';
    return Database.getQueryLocator(complexSoql);
}

public void execute(Database.BatchableContext context, List<SObject> scope) {
    List<Snapshot__c> snapshots = new List<Snapshot__c>();

    for(SObject sourceRecord : scope ) {
         snapshots.add( createSnapshot(sourceRecord) );
    }

    insert snapshots;
...

Currently in the start() we query all of the source fields including those of related objects. This can sometimes be dozens up to hundreds of fields. Starting the batch currently can take more than 10 minutes and each Batch job (a single call of execute()) yields a big 15 second delay on the Server describes here in more detail.

enter image description here

We ran into many hard to reproduce or fix errors. Some of them documented here:

Now the idea came up if one could speed up things by dramatically simplifying the start() query. Select only the Id field and get rid of all the sub-selects. Instead we query the full objects per execute().

public Database.QueryLocator start(Database.BatchableContext context) {
    String simpleSoql = 'SELECT Id FROM...";
    return Database.getQueryLocator(simpleSoql);
}

public void execute(Database.BatchableContext context, List<SObject> scope) {
    String soql = 'SELECT many fields, (SELECT many fields FROM...)... WHERE Id IN :scope';
    for(SObject sourceRecord : Database.query(soql) ) {
...

Would this speed things up and why? Are there any other best practices to make Batch classes run faster?

Best Answer

In the start method, a database cursor is created. You can think of a database cursor as a temporary table that provides consistent results across the life of the cursor. While it is true that this implies staleness, it also implies a proper snapshot-- all of the data as it was at a specific point in time. Moving that query to execute will cause you to have possibly inconsistent data.

For example, let's say that John Doe worked at Contoso, and now works for Acme. Let's also suppose that the query was alphabetical on account name in ascending order. We'll also say, for sake of argument, this change was made to John Doe during the snapshot process (between the start time and the time it got to his company). With the entire query in the start method, the snapshot would show John Doe only once, working at Contoso, which is out of date, but would be captured on the next snapshot. Conversely, live queries would miss John Doe if Acme was queried before he was updated and Contoso was queried afterwards. Conversely, if he was moving from Acme to Contoso, and was moved after Acme and before Contoso, he would appear in the snapshot twice.

That said, moving the query to execute would speed up the initial query, because it takes less time to execute many small queries than it does to create a larger cursor. Just be aware that it will be more likely to miss data or duplicate data in some scenarios. If records can't move, then this is less of a concern, although data will still be skewed.

Related Topic