[Salesforce] How to validate UTF-8 in regex

I have basic validation rules set up for name fields:

NOT(REGEX(FirstName, "^[A-Za-z\\. '-]+$"))

The goal is to allow only letters, periods, spaces, hyphens and apostrophes in the name field. The problem is that this rejects accented characters (graphemes). I've tried some simplified ideas based on a regex tutorial and the Java Docs Salesforce links to, but they do not work:

  1. NOT( REGEX( FirstName , "\\P{M}\\p{M}") )
  2. NOT( REGEX( FirstName , "\\p{Alpha}") )
  3. NOT( REGEX( FirstName , "\\X") )

Has anybody else run into this problem? How do you validate names with accent marks?

Update: After further testing I'm making some progress:
The validation rule REGEX(LastName, "(?>\\P{M}\\p{M}*)") successfully flags "é" as a match. Unfortunately, \P{M} matches any character that is not a combining mark, so pretty much any character is a match, and I want to exclude numerals and most punctuation.

Best Answer

This might need some refinement, but my understanding is that \p{L} will match "a single code point in the category 'letter'".

I tested the following as Anonymous Apex and got the Matches debug message.

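// Sample name containing an accented letter.
String FirstName = 'Fredé';

// \p{L} matches any Unicode letter; the class also allows period, space, apostrophe and hyphen.
Pattern regexPattern = Pattern.compile('^[\\p{L}\\. \'-]+$');
Matcher regexMatcher = regexPattern.matcher(FirstName);

// matches() succeeds only if the entire string fits the pattern.
if (!regexMatcher.matches()) {
    System.debug(LoggingLevel.Warn, 'No Matches');
} else {
    System.debug(LoggingLevel.Debug, 'Matches');
}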

According to the Regex Tutorial: Unicode Character Properties, you will probably need to add \p{M}* after \p{L} so that any combining diacritics following a letter are matched as well:

To match a letter including any diacritics, use \p{L}\p{M}*. This last regex will always match à, regardless of how it is encoded.
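
A quantified \p{M}* cannot sit inside a character class, so one way to combine the tutorial's advice with the pattern tested above is an alternation: a letter plus its marks, or one of the allowed punctuation characters. The following is a rough Anonymous Apex sketch of that idea, not something I have run against every input; the names namePattern, combiningAcute and samples are made up for illustration, and it assumes Apex's Pattern class treats \p{L} and \p{M} the way Java's regex engine does.

```apex
// Sketch only: namePattern, combiningAcute and samples are illustrative names.
// A letter followed by any combining marks, or one allowed punctuation character.
Pattern namePattern = Pattern.compile('^(?:\\p{L}\\p{M}*|[. \'-])+$');

// U+0301 COMBINING ACUTE ACCENT, built from its code point (769).
String combiningAcute = String.fromCharArray(new List<Integer>{ 769 });

List<String> samples = new List<String>{
    'Fredé',                  // accented letter as typed (usually one precomposed code point)
    'Frede' + combiningAcute, // decomposed form: e followed by the combining accent
    'O\'Neil-Smith Jr.',      // apostrophe, hyphen, space and period
    'Fred3'                   // contains a digit, which the pattern does not allow
};

for (String sample : samples) {
    // matches() requires the whole string to fit the pattern.
    Boolean ok = namePattern.matcher(sample).matches();
    System.debug(LoggingLevel.DEBUG, sample + ' -> ' + String.valueOf(ok));
}
```

If the engine behaves like Java's, the first three samples should be accepted and 'Fred3' rejected. Carried back to the validation rule from the question, the same pattern would presumably look like the line below, with the backslashes doubled as in the original rule; I have only tried the Apex form, so treat it as an untested assumption:

NOT(REGEX(FirstName, "^(?:\\p{L}\\p{M}*|[. '-])+$"))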
