
Find and Remove Repeated Words Using GREP


How often do we come across repeated (duplicate) words in a document? We may type them inadvertently ourselves in InDesign, or bring in a document that already contains them. Is there a reliable way to get rid of these repeated words in InDesign? Let us look at an example and a couple of ways to find them.

Here is an example. The genuine repeated words are underlined in red. The words, or parts of words, that should not be treated as repeats are underlined in green.

One option is the Dynamic Spelling feature, which does help to a certain extent: it underlines repeated words in the chosen color.

InDesign Dynamic Spelling preference for detecting repeated words

The limitation of Dynamic Spelling is that it highlights only words that are in the dictionary of the chosen language. Repeated words that are not in the dictionary do not get underlined, and Check Spelling does not help us remove them. Even when the repeated words are found, it is a long and tedious process to locate each instance of a repeated word in the document and change it.

Let us see if we can do this using GREP Find Change:

We may start writing the GREP code like this:

Find what: \b(\w+)\b \1

Change to: $1

When we run this query, we see that it does find the valid repeated words, and we may be tempted to go with it. On closer analysis, though, it also finds words that are not in fact repeated, and it misses some that are. In the first sentence of the example shown above, the query correctly finds the repeated ‘the’ (underlined in red), but in the second sentence it also flags ‘the’ followed by the ‘the’ in ‘thesis’ (underlined in green), because nothing in the pattern requires the second occurrence to end at a word boundary. In the last sentence, the query fails to find a word repeated across two separate lines, because it allows only a single space between the two occurrences.
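Both problems can be reproduced outside InDesign. The following is a minimal sketch in Python, assuming its `re` module treats `\b`, `\w` and the back reference the same way InDesign's GREP does for plain text like this. The function name and the sample sentences are made up for illustration; they are not the text from the example above.

```python
import re

# First-draft query: a word, one literal space, then the same characters again.
FIRST_DRAFT = re.compile(r"\b(\w+)\b \1")

def remove_repeats_first_draft(text):
    # Python writes the "Change to: $1" back reference as \1.
    return FIRST_DRAFT.sub(r"\1", text)

# Genuine repeat: fixed as intended.
print(remove_repeats_first_draft("read the the book"))       # → read the book

# False positive: "the" plus the "the" at the start of "thesis".
print(remove_repeats_first_draft("defend the thesis"))       # → defend thesis

# Missed repeat: the two words are separated by a line break, not a space,
# so the text comes back unchanged.
print(remove_repeats_first_draft("end of line\nline two"))
```

The second result shows why this draft is risky with Change All: it silently deletes a real word.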

So let us refine our GREP Find query to:

Find what: \b(\w+)\s+\1\b

Change to: $1

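Here’s what it means:
Look for a word boundary (\b)…
followed by any word character one or more times (\w+), captured in parentheses so that it can be referred to later…
followed by any white space one or more times (\s+), which also covers a line break between the two words…
followed by the back reference to the previously found result (\1)…
followed by a word boundary (\b).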

The last word boundary is important to include as it ensures that we find only complete words that are repeated and not just parts of subsequent words.

In our example, this Find/Change query finds every genuine repeated word and nothing else. We can now click Change All to remove the repeated words in the document.
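As a rough check outside InDesign, here is a minimal sketch of the refined query in Python, assuming its `re` module handles `\b`, `\w`, `\s` and the back reference the way InDesign's GREP does for simple text. The function name and sample sentences are hypothetical.

```python
import re

# Refined query: word, one or more whitespace characters, the same word,
# and a closing word boundary.
REFINED = re.compile(r"\b(\w+)\s+\1\b")

def remove_repeats(text):
    # Python writes the "Change to: $1" back reference as \1.
    return REFINED.sub(r"\1", text)

# Genuine repeat: still fixed.
print(remove_repeats("read the the book"))      # → read the book

# "the thesis" is left alone: the closing \b cannot match inside "thesis".
print(remove_repeats("defend the thesis"))      # → defend the thesis

# A repeat split across a line break is now caught, because \s+ matches it.
print(remove_repeats("end of line\nline two"))  # → end of line two
```

Note that the replacement keeps only the captured word, so whatever whitespace sat between the two copies, including a line break, is removed along with the duplicate.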

Hope you find this GREP Find Change query useful in your workflows. Do try it out and suggest any further tweaks to make it even better.

  • Clever, very clever!

    Since you’re a talented GREP/ID search guru, I’ve got another puzzle for you. I have a scientific book with characters in the text that are not available in the body text font and are hence replaced with an X inside a colored block. To fix each one I can apply a character style that uses the STIX font to supply those characters, but I’ve been unable to find a search option to look for missing glyphs.

    I know that there is a way that doesn’t use S&R. I can use the Preflight panel to find them as missing glyphs and manually apply the STIX character style to each. Since there are only 24 of them, that’s doable. But it does seem that ID should have a way to search for various formatting errors, much like there’s a clever way to look for repeated words.

    Any ideas?

    • Olaf Walther says:

      Hi Michael,

      you can use GREP to filter the unicode blocks:

      First, find the Unicode values of the characters you need to replace. You can do this by selecting a single character in InDesign and looking for the “Unicode: …” line in your “Information” panel.
      In this Wikipedia article you may find whole ranges you want to replace: https://en.wikipedia.org/wiki/Unicode_block

      Once you have found your block, or just single characters, you can use the following GREP in your Paragraph Style: [\x{2E80}-\x{FE4F}]+
      This example searches for CJK characters; replace “2E80” and “FE4F” with the first and last characters of your range.
      You can now apply a new character style, which uses the right font, to this pattern.

    • Jamie McKee says:

      Michael-

      Would Peter Kahrel’s “Manage Missing Glyphs” script work for your situation?:
      https://www.kahrel.plus.com/indesign/missing_glyphs.html

      • Thanks, Jamie, that script is just what I need. The other suggestion, to search for Unicode blocks, would work but it wouldn’t save that much time since there are a number of them.

        I do wish Adobe would add the STIX math/science font to their collection. It’s open source, so there’d be little cost involved. It’s possible to add it yourself to ID. I have done that. But having it available direct from Adobe would be easier for some and would ensure that it gets updated regularly. Currently, many of Adobe’s best-known fonts don’t even include a full Greek character set much less the quirky math ones, making fonts like STIX even more necessary.

        —-
        The mission of the Scientific and Technical Information Exchange (STIX) font creation project is the preparation of a comprehensive set of fonts that serve the scientific and engineering community in the process from manuscript creation through final publication, both in electronic and print formats. Toward this purpose, the STIX fonts will be made available, under royalty-free license, to anyone, including publishers, software developers, scientists, students, and the general public.

        https://www.stixfonts.org
        —-

  • Jean-Claude Tremblay says:

    I like to modify this query to:

    \b([-\w]+)\s+\1\b

    So it will catch repeated words containing a hyphen, like Jean-Claude Jean-Claude.

    When I know I’m not searching for words with numbers, and I want to prevent it from matching across a return, I use this:

    \b([-\u\l]+)\h+\1\b

    And finally, when I want to catch multiple repeats of a word (two or more), I use this:

    \b([-\u\l]+)\h+(\1\b\h?)+(?=\h)

  • Chris says:

    On replacing repeated words, beware of ‘Change All’ – I think that that may cause a problem.
    I spoke about it with a colleague, but I wish he had had a better example.
    If you’re trying to make a phone with a piece of string, a can can help.

  • Elizabeth French says:

    I tried the script and found it caught a ‘1’ of a code at the end of a line and a ‘1’ of a point at the beginning of the following line. Both are needed in the text. Are you able to add something to ignore a line return?

  • Derek Pell says:

    Excellent!
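The variants and cautions raised in the comments above can also be tried outside InDesign. This is only a sketch in Python, and it rests on some assumptions: Python's `re` module has no `\h`, `\u` or `\l`, so `[^\S\r\n]` (whitespace other than a line break) and `[^\W\d_]` (a letter) are used as approximate stand-ins. All names and sample strings are hypothetical. The `find_repeats` helper only lists what would be changed, so that legitimate repeats can be reviewed before anything is removed.

```python
import re

# Approximate stand-ins for InDesign metacharacters (assumptions, see above).
H_SPACE = r"[^\S\r\n]+"        # roughly \h+ : whitespace, but not a line break
LETTERS = r"(?:-|[^\W\d_])+"   # roughly [-\u\l]+ : hyphens and letters, no digits

# Hyphen-aware variant, same syntax in Python.
HYPHEN_AWARE = re.compile(r"\b([-\w]+)\s+\1\b")

# Letters only, and both copies must be on the same line.
SAME_LINE_LETTERS = re.compile(r"\b(" + LETTERS + r")" + H_SPACE + r"\1\b")

def find_repeats(text, pattern=SAME_LINE_LETTERS):
    """List the repeated words a pattern would change, without changing them."""
    return [m.group(1) for m in pattern.finditer(text)]

# Hyphenated repeat is collapsed.
print(HYPHEN_AWARE.sub(r"\1", "Jean-Claude Jean-Claude says"))  # → Jean-Claude says

# A "1" ending one line and a "1" starting the next are left untouched.
print(SAME_LINE_LETTERS.sub(r"\1", "code 1\n1 point") == "code 1\n1 point")  # → True

# Legitimate repeats are still matched, so review them instead of using Change All.
print(find_repeats("I wish he had had a better example."))  # → ['had']
```

In InDesign itself, the equivalent of `find_repeats` is stepping through the matches with Find Next and Change rather than clicking Change All.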
