[SalesForce] Compare 2 Lists and Remove Duplicates

I have an application which pulls through a set list of Placement__c records. The user is able to:

Use select boxes to select certain records.
Specify a start date
Specify an end date

Those that are selected should then create a Timecard__c record.

What I want to do is ensure that no duplicates are created based on the start date and end date and the Contact lookup in question!

At the moment I have the following:

List<ts2__Placement__c> selectedPlacements = new List<ts2__Placement__c>();

for (pPlacement pPla : getPlacements()) {
  if (pPla.selected == true) {
    selectedPlacements.add(pPla.pla);
  }
}

if (!selectedPlacements.isEmpty()) {
  for (ts2__Placement__c pla : selectedPlacements) {
    System.debug(pla);
  }
}

Now I know I could run a query, something like this:

if (!selectedPlacements.isEmpty()) {
  for (ts2__Placement__c pla : selectedPlacements) {
    Integer checkExists = [SELECT Count() FROM Timecards__c 
                            WHERE Contractor__c = :pla.ts2__Employee__c AND
                                  Week_start_date__c = :startDate AND
                                  Week_end_date__c = :endDate];

      if (checkExists == 0) {
        // Do stuff!
      }
  }
}

But I know I'll quickly hit the SOQL governor limit if I do this.

What would be the most efficient way of checking if a record already exists and removing that record from the list so no duplicates are created?

Best Answer

My first thought is to turn to a pair of Set<Timecard__c>.

Start by gathering the set of Ids representing the contractors from the selected Placement__c records
create Timecard__c records (only in memory, no DML insert at this stage), setting Contractor__c, Week_Start_Date__c, and Week_End_Date__c. Put these records into a Set<Timecard__c>
Run a query for the Contact records matching those contractor Ids, and include a parent-child subquery for all related Timecard__c records
- Make sure the subquery only pulls in the same fields as you set when creating the initial set of Timecard__c records
Iterate over the Contacts, and put all of the related Timecard__c records into a different set
Finally, set1.removeAll(set2)

The idea is that the hash function underlying the Set type should produce identical results if you pass in different object instances with exactly the same fields populated, and exactly the same values for those fields.

That allows you to use remove() or removeAll() to do the heavy lifting.

In testing this myself, the general idea works but a query will also pull the Id and RecordTypeId, which means you'd need to iterate over the related Timecard__c records and explicitly create new Timecards that omit the Id and RecordTypeId rather than just being able to directly addAll() on the second set from the list of related Timecard__c records on the Contact.

code would look like this:

// startDate and endDate defined elsewhere
List<ts2__Placement__c> selectedPlacements = new List<ts2__Placement__c>();
Set<Id> contactIds = new Set<Id>();
Set<Timecard__c> timecards = new Set<Timecard__c>();
Set<Timecard__c> existingTCs = new Set<Timecard__c>();

for (pPlacement pPla : getPlacements()) {
  if (pPla.selected == true) {
    selectedPlacements.add(pPla.pla);
    contactIds.add(pPla.ts2__Employee__c);
    timecards.add(
      new Timecard__c(
        Contractor__c = pPla.ts2__Employee__c,
        Week_Start_Date__c = startDate,
        Week_End_Date__c = endDate
      )
    );
  }
}

if (!selectedPlacements.isEmpty()) {
  for (Contact con :[SELECT Id, (SELECT Contractor__c, Week_Start_Date__c, Week_End_Date__c FROM Timecards__r) FROM Contact WHERE Id IN :contactIds]) {
    for(Timecard__c tc :con.Timecards__r){
      existingTCs.add(new Timecard__c(
      Contractor__c = tc.ts2__Employee__c,
      Week_Start_Date__c = tc.Week_Start_Date__c,
      Week_End_Date__c = tc.Week_End_Date__c);
    }
  }
}

timecards.removeAll(existingTCs);

insert timecards;

B-tree (binary tree)

A b-tree reduces the time complexity by allowing you to perform a binary search on the source data, reducing the initial complexity from 10,000 rows to search to 13 rows. It works because the tree structure stores data in a balanced way. Example:

                      12pm
                    /      \
                 6am        6pm
                /   \      /   \
             3am     9am  3pm   9pm

The problem, of course, is building a b-tree. I recommend a good article on Wikipedia or some programming website. In essence, though, you have nodes that look like this:

public class BNode {
    public DateTime nodeValue;
    public BNode lessThan, greaterThan;
}

To find a specific date/time coordinate, you examine the root node; if the value is an equal match, you have a result, otherwise if the search value is less than the node value, follow lessThan, otherwise follow greaterThan. In this way, you're performing a "binary search" which can run through 10,000 records in 14 iterations (a b-tree holds a power of two at each level, and 2^14 is 16384 values, more than enough to hold 10,000 records).

Benefits: Can store as granular time as possible. Detractors: B-trees are "advanced" and require planning to get efficient algorithms. Requires significant recursion to build initial list and final list, so initialization is slowest of the three solutions, and list building is equally slow.

Optimized Lists

Assuming you have a list of records sorted by time intervals, either ascending or descending, you can also perform a binary search without the B-tree, which is probably easier to conceptualize. Again, you start at record 5,000, compare the current value to that value. If it is less than, move to 2,500, if it is greater than, move to 7,500. You've probably done this without knowing it if you played "guess that number", where the "host" picks a number between 1 to 10. Most people would guess like this: 5 (more), 7 (less), 6 (correct), or 5 (less), 3 (less), 2 (less), 1 (correct). In other words, closing the gap by half each time, you can arrive at the correct answer in no more than 4 moves.

Benefits: Lists have a native sort function, so you spend less time writing code figuring that part out. Granular time frames are not problematic. No duplicates, final list loads fast. Detractors: Because you're trying to binary search on non-binary-sized data, you have to be careful not to round incorrectly, or you can miss spots in the time line. Also, the size of the array may change with each booking, since a single booking may consume multiple time slots.

Maps

This concept can work as well. First, you start with a map:

// TimeEntry is assumed to be a class
Map<DateTime, TimeEntry> entries = new Map<DateTime, TimeEntry>();

From here, you can first load all of the time entries into the map by time. This is a linear process, and the map can easily hold 10,000 values. This would have a relatively fast load time, and each result can be found in exactly one iteration; searches are in linear time.

Benefits: Simple code, linear searching. No sorting is required at all, and each element can be found instantly. Detractors: You would need to either store the time entries for each minute or second if you wanted that level of granularity, or remember to round to the correct granularity, because you can't depend on the order of the list, and you can't find approximate keys, only exact keys. Also, if you store the same entry multiple times (to represent spans of time), you may need a second map to correctly remove all the entries. This can make deduplicating the final list a nightmare as well.

Next Steps

Once you've picked a method to use, you can use the following pseudo-code:

Load the available time slots
Load the current bookings
For each booking
    Replace the time slot(s) covered with the appropriate booking.
End for

Alternatives?

You could avoid this entire mess by adding a single lookup field to the slot object; when a booking is created or modified, it would assign itself to that lookup field by trigger for each matching slot. This means you only need two queries, interleave the results, and output them. This assumes that slots will cover unique time periods and that no more than one booking can consume a time slot. More complex scenarios are also possible with some ingenuity (e.g. many-to-many connections to detect conflicting bookings).

Lock Step (Alternate algorithm)

Query both lists
Sort by ascending time
While both lists are not empty
    if the first booking in the list starts after the end of the slot
        add the slot to the output list
    end if
    remove the slot from the source list
    if the first slot in the list occurs after the end of the booking
        add the booking to the output list, remove from source list
    end if
End While

Hints

Don't use a loop-loop construct. Use maps, learn b-trees, or learn binary searching through arrays. There are probably millions of articles on how to do each on the Internet, but your approach should meet your skill level. Don't take on b-trees straight off unless you feel comfortable with them.

Edit

I just realized this entire answer kind of fails to emphasis the main point. Don't look for which slots can fit between which bookings. Instead, create a combined list of slots and bookings, and each time a booking covers a slot, remove the slot from the list, and put the booking in its place. This way, you're actually just building the results directly.