[SalesForce] find duplicate records based on multiple fields

I have a large amount of duplicate records (25.000) in an even larger table (250.000 records). I've done the detection using the reporting function. The criteria I used to detect the duplicates were 3 fields (unique account identifier, unique user identifier and date). Based on a list of criteria, I would like to either merge or delete these records. I've got a handle on the logic associated with the merging and the deleting but what I'm having trouble with is defining the duplicate detection process.

I'm looking at either creating a custom object to take care of the detection or dumping the records into an SQL database and taking care of it offline. Any thoughts on which one of the two is the best approach?

Edit:

After reading the first comment, I think the more appropriate question would be "How do I approach the problem in SOQL".

The object I'm working with is the Event object and I'm really only looking at the standard fields, AccountId, OwnerId and ActivityDate. From what I've found online, there's no easy way to find records that have the same value for these three fields. I think it could be accomplished by nesting SOQL queries but I'd like to use a different approach if at all possible.

Best Answer

This could be base of your query for duplicates:

SELECT COUNT(Id), AccountId, OwnerId, ActivityDate
FROM Event
GROUP BY AccountId, OwnerId, ActivityDate
ORDER BY COUNT(Id) DESC
LIMIT 10

Don't run it yet! It's very likely to timeout and I can imagine it having serious performance issues.

This version is bit better (and it returns the earliest Event that matches the conditions):

SELECT MIN(Id) oldest, COUNT(Id), AccountId accId, OwnerId userId, ActivityDate d
FROM Event
WHERE AccountId != null AND ActivityDate = THIS_MONTH
GROUP BY AccountId, OwnerId, ActivityDate
HAVING COUNT(Id) > 1
ORDER BY COUNT(Id) DESC
LIMIT 10

Keep tweaking it till you're happy it can run in reasonable time. Make the WHERE clause selective, maybe increase the HAVING... The MIN(Id) is an idea for the "winner" record. Maybe you'll decide that newest one should win. Maybe you'll build something that asks the user to decide...

Anyway. Let's say that "earliest" Id will be a winner (come to thing about it, we probably shouldn't count in Id's being sequential...).


So you have List<AggregateResult> which you can use to make the detailed query for all Events:

// Id oldestEvent = (Id) results[0].get('oldest');
Id accId = (Id) results[0].get('accId'), userId = (Id) results[0].get('userId');
DateTime actDate = (DateTime) results[0].get('d');

Use these to build SELECT ... FROM Event WHERE AccountId = :accId AND ... (optionally AND Id != oldestEvent if you want to skip my "winner". Or sort them by CreatedDate / ActivityDate DESC if you want to be sure the record you want is first in the result set.

The standard merge statement works only on Accounts, Contacts and Leads. I think you'll have to delete them with your own code, deciding which values to keep in the winner, which should just go... Luckily we can't make lookups to Events so at least you don't have to worry about relationships...