URLs: Breaking (Badly)
This article appears in Issue 50 of InDesign Magazine.
With the help of GREP styles, you can solve the problem of bad breaks in long URLs.
Dividing URLs over a line seems simple (never hyphenate a URL), but the next question is where to break long URLs. The Chicago Manual of Style, 16th Edition, now recommends breaking before rather than after a slash. But not everyone agrees with or follows this style. Luckily, we can use GREP styles to find URLs and override InDesign’s paragraph composer.
The first task is to prevent hyphenation. John Gruber’s liberal, accurate regex pattern for matching URLs is a great GREP pattern for finding nearly any URL, and then it’s simply a matter of assigning No Language (found in Advanced Character Formats) to that (or any) URL.
But this GREP won’t work for the second task of allowing long URLs to break, since it selects the entire URL (applying No Break here could cause instant overset text). Instead, we need to use a second GREP expression to selectively apply No Break within a URL:
([\w-]+:/{1,3})|(\<[^\s$&+/:;=?@#]+\.[^\s$&+/:;=?@#]+?\>)
Now in English: Group 1 finds protocols like https:// or telnet:// and even x-yojimboitem:// by matching a word that may contain hyphens followed by a colon and one to three forward slashes. And Group 2 finds hostnames like www.adobe.com or indesignmag.com by matching two strings of non-space and non-reserved characters that are separated by a period and omitting any trailing punctuation.
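To see which parts of a URL this expression would protect with No Break, here is a minimal Python sketch. It is only an approximation, not InDesign's own engine: InDesign's word-start and word-end anchors are replaced with Python's \b, and the function name no_break_spans and the sample URLs are made up for illustration.

```python
import re

# Python approximation of the No Break GREP expression.
# Assumption: InDesign's \< and \> (start/end of word) are swapped for \b,
# since Python's re module has no direct equivalent.
NO_BREAK = re.compile(
    r"([\w-]+:/{1,3})"                              # Group 1: protocol
    r"|(\b[^\s$&+/:;=?@#]+\.[^\s$&+/:;=?@#]+?\b)"   # Group 2: dotted name
)

def no_break_spans(text):
    """Return the substrings that would receive No Break (hypothetical helper)."""
    return [m.group(0) for m in NO_BREAK.finditer(text)]

print(no_break_spans("https://www.adobe.com/products/indesign"))
# prints ['https://', 'www.adobe.com']
print(no_break_spans("x-yojimboitem://abc"))
# prints ['x-yojimboitem://']
```

Everything outside those spans stays breakable, which is what lets the composer wrap a long URL without splitting the protocol or the hostname.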
CMS-style breakup is more complex, since InDesign breaks after punctuation, not before. While this can’t be fully automated, it can be accomplished much faster using GREP Find/Change. First, Find:
([$&+/:;=?@#](\<[^\s$&+/:;=?@#]+))(?=[$&+/:;=?@#.])
The first part matches any reserved URL character followed by a string of non-space and non-reserved URL characters, and the second part is a positive lookahead for any reserved URL character or a period, which is checked but not included in the found text.
After determining that you’re within a URL (don’t use Change All… bad things could happen), use Change to insert a leading Discretionary Line Break before the found text: ~k$0.
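The Find and Change steps can be rehearsed outside InDesign with a rough Python sketch. As assumptions: \b stands in for InDesign's word-start anchor, a visible | stands in for the Discretionary Line Break, and the helper names candidates and change_one are hypothetical. Like the manual workflow, it changes one chosen match at a time rather than all of them.

```python
import re

# Python approximation of the Find expression.
# Assumption: InDesign's \< (start of word) is replaced with \b.
FIND = re.compile(r"([$&+/:;=?@#](\b[^\s$&+/:;=?@#]+))(?=[$&+/:;=?@#.])")

MARK = "|"  # visible stand-in for the Discretionary Line Break (~k)

def candidates(url):
    """List every span that Find would stop on, in order."""
    return [m.group(0) for m in FIND.finditer(url)]

def change_one(url, n):
    """Insert the break marker before the n-th found span only (like ~k$0)."""
    m = list(FIND.finditer(url))[n]
    return url[:m.start()] + MARK + url[m.start():]

url = "https://www.adobe.com/products/indesign.html"
print(candidates(url))
# prints ['/www.adobe.com', '/products', '/indesign']
print(change_one(url, 1))
# prints https://www.adobe.com|/products/indesign.html
```

The first candidate sits inside the protocol's double slash, which is a good illustration of why each match deserves a look before you click Change.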
This article was last modified on February 16, 2022
This article was first published on April 24, 2017
