I have a large amount of duplicate records (25.000) in an even larger table (250.000 records). I've done the detection using the reporting function. The criteria I used to detect the duplicates were 3 fields (unique account identifier, unique user identifier and date). Based on a list of criteria, I would like to either merge or delete these records. I've got a handle on the logic associated with the merging and the deleting but what I'm having trouble with is defining the duplicate detection process.
I'm looking at either creating a custom object to take care of the detection or dumping the records into an SQL database and taking care of it offline. Any thoughts on which one of the two is the best approach?
Edit:
After reading the first comment, I think the more appropriate question would be "How do I approach the problem in SOQL".
The object I'm working with is the Event object and I'm really only looking at the standard fields, AccountId, OwnerId and ActivityDate. From what I've found online, there's no easy way to find records that have the same value for these three fields. I think it could be accomplished by nesting SOQL queries but I'd like to use a different approach if at all possible.
Best Answer
This could be base of your query for duplicates:
Don't run it yet! It's very likely to timeout and I can imagine it having serious performance issues.
This version is bit better (and it returns the earliest Event that matches the conditions):
Keep tweaking it till you're happy it can run in reasonable time. Make the WHERE clause selective, maybe increase the HAVING... The
MIN(Id)
is an idea for the "winner" record. Maybe you'll decide that newest one should win. Maybe you'll build something that asks the user to decide...Anyway. Let's say that "earliest" Id will be a winner(come to thing about it, we probably shouldn't count in Id's being sequential...).So you have
List<AggregateResult>
which you can use to make the detailed query for all Events:Use these to build
SELECT ... FROM Event WHERE AccountId = :accId AND ...
(optionallyAND Id != oldestEvent
if you want to skip my "winner". Or sort them byCreatedDate / ActivityDate DESC
if you want to be sure the record you want is first in the result set.The standard merge statement works only on Accounts, Contacts and Leads. I think you'll have to delete them with your own code, deciding which values to keep in the winner, which should just go... Luckily we can't make lookups to Events so at least you don't have to worry about relationships...