apex – How to Parse Large CSV and Handle Newline Issues

I have a csv with about 24,000 lines that I need to parse in apex. I tried splitting the string by newline ('\n'), but got a limit error:

System.LimitException: Regex too complicated

I tried looping through and getting the left of the string for each newline:

pos = contents.Indexof('\n');
String this_line = contents.left(pos);

I know this will get me into limits issues, but I want to see if I can even start with the parsing.

For this method, the above snippet only finds the first newline.

In order to make sure I wasn't having dos/unix errors with newlines, I changed the csv to have the following string as line delimiter: ~~~. This did not make a difference.

Sample Content:

id,email_address,optin_status,optin_status_last_updated~~~3,082790@test.com,manually_uploaded,""~~~5,0@test.com,bounced,2011-10-31 18:10:40~~~7,101k@test.com,manually_uploaded,""~~~9,1254@test.com,bounced,2012-05-24 14:05:02~~~11,134@test.com,manually_uploaded,""~~~13,1613@test.com,manually_uploaded,""~~~15,123na@test.com,unsubscribed,2012-12-03 15:02:56~~~17,123tst@test.com,manually_uploaded,""~~~19,123test@test.com,manually_uploaded,""~~~21,12@test.com,unsubscribed,2012-11-28 22:56:58~~~23,12ser@test.com,manually_uploaded,""~~~25,152t@test.com,manually_uploaded,""~~~27,16e@test.com,manually_uploaded,""~~~29,17@test.com,manually_uploaded,""~~~31,180@test.com,manually_uploaded,""~~~33,1vz@test.com,manually_uploaded,""~~~35,1a@test.com,manually_uploaded,""~~~37,1ov@test.com,manually_uploaded,""~~~39,1d@1d.com,bounced,2012-02-10 00:32:17~~~41,1gabrman@test.com,unsubscribed,2012-01-11 20:36:35~~~43,1libbypollak@gmail.com,bounced,2012-01-23 18:02:27~~~45,1on1@aish.com,manually_uploaded,""~~~47,1stmusketeer@gmail.com,manually_uploaded,""~~~49,1tman@test.com,manually_uploaded,""~~~51,21vin@test.com,unsubscribed,2012-02-05 04:11:28~~~53,22im@test.com,manually_uploaded,""~~~

All emails have been changed to ensure privacy.

Code:

Integer count = 0;
    Integer pos = 100;
    while(count < 1000 && pos > 0){
        pos = contents.Indexof('~~~');
        String this_line = contents.left(pos);
        lines.add(this_line);
        contents = contents.remove(this_line);
        System.debug('the current pos: ' + pos + ' and the new position will be: ' 
            + contents.IndexOf('~~~'));
        count = count + 1;
    }

Debug Log:

21:22:25.134 (1134910000)|USER_DEBUG|[57]|DEBUG|the current pos: 55 and the new position will be: 0
21:22:25.135 (1135010000)|USER_DEBUG|[57]|DEBUG|the current pos: 0 and the new position will be: 0
21:22:25.135 (1135081000)|USER_DEBUG|[61]|DEBUG|the number of lines: 2

Does anyone have any ideas here?

Thank you!

Best Answer

Best String Methods

I've been wondering about your use of String.remove in your above code. This method is described as follows. Thus if you have identical lines in your CSV this will for sure not be what your expecting, which i assume was just to remove the last line read.

Removes all occurrences of the specified substring and returns the String result.

You should consider using String.substring to chop your way along the string (though reviewing the ever growing array of String methods, suggests maybe String.substringAfter might also be worth a try).

Example Line Reader

I recently used the following Iterator implementation (useful for Batch Apex integration) to parse some Apex code this should work for your use case. I've modified the LINE_SEPARATOR variable to based on a CSV file I've used in the past (from Excel), though there is no standard specification on this you will have to stipulate and/or make it a configuration option.

LineReader lineReader = new LineReader(csvFile);
List<String> newLines = new List<String>();
while(lineReader.hasNext())
{
    // Read line
    String line = lineReader.next();
}

/**
 * Utility class to iterate over lines
 **/
public class LineReader 
    implements Iterator<string>, Iterable<string>
{       
    private String LINE_SEPARATOR = '\r\n';

    private String textData;

    public LineReader(String textData)
    {       
        this.textData = textData;
    }

    public Boolean hasNext()
    {
        return textData.length() > 0 ? true : false;
    }   

    public String next()
    {
        String row = null;
        Integer endPos = textData.indexOf(LINE_SEPARATOR);
        if(endPos == -1)
        {
            row = textData;
            textData = '';
        }
        else
        {
            row = textData.subString(0, endPos);
            textData = textData.subString(endPos + LINE_SEPARATOR.length(), textData.length());             
        }
        return row;
    }

    public Iterator<String> Iterator()
    {
        return this;   
    }           
} 

Batch Apex Consideration

Quite a large CSV file to parse, in the statement governor free world, this type of processing stands a better chance. Though depending on what your doing with the parsed data you may still run into issues. There is an approach described in more detail here which uses Batch Apex you might want to look at (not the global keyword is not needed these days).

Related Topic