InDesign's GREP supports three classes of wildcards: standard wildcards, POSIX classes, and Unicode properties.
With the addition of this third class of wildcards it is now possible to do more refined searches more easily. InDesign implements the Unicode properties as defined by Boost and documented in J. Friedl, Mastering Regular Expressions, O'Reilly, 2006, pp. 122 and 123.
The addition of the new class of wildcards means that InDesign's GREP now supports 102 wildcards: 16 standard, 12 POSIX, and 74 Unicode (counting negated classes separately for dramatic effect; if you don't like drama, there are still 60 classes). They are distinguished by their form: \w, [[:alpha:]], and \p{L*}, respectively. With such a plethora of symbols, the question arises of which character class matches which characters.
To check how InDesign's GREP classes and properties capture characters from selected Unicode ranges, I use an InDesign document as a template and a script that prints selected Unicode ranges in that document. The script then lets you select a GREP class and highlight its targets.
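The idea behind the mapper can be sketched outside InDesign in plain JavaScript. This is only an illustration, not the script in the download: the helper name charsMatching is hypothetical, and it relies on JavaScript's own regular-expression engine, whose property names (for instance \p{Ll}) need not match InDesign's Boost-based spelling or coverage.

```javascript
// Sketch only: uses JavaScript's regex engine, not InDesign's GREP.
// Returns the characters in the range [start, end] matched by a class regex.
function charsMatching(start, end, classRegex) {
  var hits = [];
  for (var cp = start; cp <= end; cp++) {
    // Skip surrogate code points; they are not characters on their own.
    if (cp >= 0xD800 && cp <= 0xDFFF) continue;
    var ch = String.fromCodePoint(cp);
    if (classRegex.test(ch)) hits.push(ch);
  }
  return hits;
}

// \p{Ll} is JavaScript's short name for the lowercase-letter property;
// the "u" flag is required for \p{...} to be recognised.
var lower = charsMatching(0x0000, 0x007F, /^\p{Ll}$/u);
console.log(lower.join("")); // abcdefghijklmnopqrstuvwxyz
```

Running the same loop over other ranges and other classes gives a quick, if approximate, picture of what each class covers; InDesign's own results may differ.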
For more information and a tutorial, see GREP in InDesign.
The script and the document work in CS4 and later. Download the ZIP file (download links at the bottom of this page), and extract its files, including the grep_mapper.jsx script, the grep_mapper.txt configuration file and the IDML template, into your script folder.
Don't open the IDML file. Run the grep_mapper.jsx script. It shows this dialog:
The dialog shows the Unicode ranges listed in grep_mapper.txt (later on I'll show how you can change this list to add or remove Unicode ranges). Select ranges in the usual way: click to select a single line, and use Ctrl/Cmd+click and Shift+click to add to an existing selection. Finally, press OK or Enter/Return to create the selected Unicode ranges in the template.
Suppose you selected all Latin Unicode ranges, the first eight in the list. Press OK and the script opens the IDML file and prints the selected ranges in columns: a four- or five-digit Unicode number followed by the corresponding character. A square indicates that the character is not available in the selected font. (The template uses Cambria; see Some notes on the template, below, for details and changing the font.)
The script then displays a panel with the three GREP classes. To highlight the scope of a certain class or wildcard, select it in the tree view. For example, to display in the InDesign document all characters matched by, say, \p{Lowercase_letter}, expand the Unicode properties group, then expand the Letter group, then click Lowercase_letter:
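The layout of each printed entry, a four- or five-digit number followed by the character, can be sketched as follows. The helper names formatEntry and listRange and the tab separator are assumptions made for the illustration; the actual script may build its columns differently.

```javascript
// Hypothetical helper: format one code point as "HHHH<tab>character".
function formatEntry(cp) {
  var hex = cp.toString(16).toUpperCase();
  // Pad to at least four digits; plane-1 values are already five digits long.
  while (hex.length < 4) hex = "0" + hex;
  return hex + "\t" + String.fromCodePoint(cp);
}

// Hypothetical helper: one formatted entry per code point in [start, end].
function listRange(start, end) {
  var lines = [];
  for (var cp = start; cp <= end; cp++) lines.push(formatEntry(cp));
  return lines;
}

console.log(listRange(0x0041, 0x0043).join("\n")); // 0041 A, 0042 B, 0043 C (tab-separated)
console.log(formatEntry(0x1D400));                 // a five-digit entry from plane 1
```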
When you click a node, the GREP expression is printed in the panel's frame. To see what is matched by the selected GREP expression, double-click the node:
To display the scope of another class, just double-click the node in the panel.
The configuration file can be changed to add or remove Unicode ranges. The first few lines of the configuration file look like this:
C0 controls and basic Latin (0x0000-0x007F) /Latin
C1 controls and Latin-1 supplement (0x0080-0x00FF) /Latin-1
Latin extended A (0x0100-0x017F) /Latin-A
Latin extended B (0x0180-0x024F) /Latin-B
Latin extended C (0x2C60-0x2C7F) /Latin-C
Latin extended D (0xA720-0xA78C) /Latin-D
Latin extended D (0xA7FB-0xA7FF) /Latin-D
Latin extended additional (0x1E00-0x1EFF) /Latin add.
Combining diacritical marks (0x0300-0x036F) /Comb dia
Combining diacritical marks supplement (0x1DC0-0x1DFF) /Comb d/supp
Combining diacritical marks for symbols (0x20D0-0x20FF) /Comb d/sym
Combining half marks (0xFE20-0xFE2F) /Comb half
IPA extensions (0x0250-0x02AF) /IPA
Phonetic extensions + supplement (0x1D00-0x1DBF) /IPA ext
Each entry in the configuration file consists of three parts:
1. The name of the range, e.g. "C1 controls and Latin-1 supplement". This is displayed in the script's dialog. You can use any wording you like.
2. The Unicode range, e.g. (0x0080-0x00FF). This too is shown in the dialog. You must use the notation 0x0000 for the Unicode values (or 0x00000 for plane 1 and higher planes), and the range must be wrapped in parentheses. The script uses these Unicode numbers to print the range.
3. A label that is printed as a column header in the template. The script expects this at the end of each line, after a forward slash. You can use any text you like, but the shorter the better.
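To make the three-part format concrete, here is a minimal sketch of how such a line could be split. It is not the parser used in grep_mapper.jsx; the function name parseEntry and the regular expression are assumptions based only on the description above.

```javascript
// Assumed line shape: <name> (0xHHHH-0xHHHH) /<label>
// Four or five hex digits are accepted, following the notation described above.
var ENTRY = /^(.*?)\s*\((0x[0-9A-Fa-f]{4,5})-(0x[0-9A-Fa-f]{4,5})\)\s*\/(.*)$/;

// Hypothetical helper: split one configuration line into its three parts.
function parseEntry(line) {
  var m = ENTRY.exec(line);
  if (!m) return null; // the line does not follow the expected format
  return {
    name: m[1],                          // shown in the dialog
    start: parseInt(m[2], 16),           // first code point of the range
    end: parseInt(m[3], 16),             // last code point of the range
    label: m[4].replace(/\s+$/, "")      // column header, trailing space removed
  };
}

var entry = parseEntry("C1 controls and Latin-1 supplement (0x0080-0x00FF) /Latin-1");
console.log(entry.name, entry.start, entry.end, entry.label);
// C1 controls and Latin-1 supplement 128 255 Latin-1
```

Because the label is taken from the first slash after the closing parenthesis, slashes elsewhere in the name or in the label itself do not confuse the split.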
For example, you could combine the first and second, and the third and fourth lines as follows:
C0/1 controls, basic Latin and Latin-1 (0x0000-0x00FF) /Latin 0-1
Latin extended A/B (0x0100-0x024F) /Latin-A/B
The template is an IDML file that can be used in CS4 and later. It uses the font Cambria, which is supplied with most modern OSs, and which has a very large character set. There is a companion font Cambria Math. Alternatives are Everson Mono, a very large Unicode font (25 Euro shareware), which is especially good for languages, Junicode (a free font) and Code2000 (US$ 5 shareware).
All required files are put together in this zip file: grep_mapper.zip
28 Nov. 2019: Fixed a bug in the display of some plane-1 characters.
30 June 2019: Updated the text and the screenshots; script and data file haven't changed.
28 February 2015: Updated the text with a note on two more GREP classes (\K and \R). This has no influence on the scripts, which are therefore unchanged. Added a screenshot of the script's panel in InDesign CC because it looks different from CS6 and earlier. The panel's functionality hasn't changed.
10 July 2014: Added a note about two wildcards introduced in CS6: \h (all horizontal space) and \v (all vertical space and break characters). The scripts haven't changed.
29 June 2013: Rewrote the script and the above text; added support for plane 1 and higher; dropped CS3 support.
Sept. 2009: first posted.