1. J.D. Ray

    J.D. Ray Member Supporter Contributor

    Joined:
    Oct 15, 2018
    Messages:
    657
    Likes Received:
    668
    Location:
    Oak Harbor, Washington, USA

    Scrivener Searching with Regular Expressions (RegEx)

    Discussion in 'Writing Software and Hardware' started by J.D. Ray, Mar 24, 2020.

    Recently I was excited to discover that Scrivener supports searching with what are called "regular expressions", or regex by technically-minded folks. If you're a Unix user, you may be familiar with the tools sed and awk, which use regex to manipulate text. I wanted to pass along a bit of introduction to regex for those who might find this feature useful, though I'm certainly no expert.

    To find this capability, pull down the option menu on the Search function and look for the RegEx option:
    [Screenshot: the Search option menu showing the RegEx option]

    If you Google for "regex" you'll find a ton of resources describing exactly what it is, but suffice it to say that it is a compact way to describe very detailed search patterns. There's a pretty detailed tutorial here, though it's somewhat aimed at programmers. Regular expressions use special characters such as slashes, backslashes, common symbols, and letters to create micro-programs that can do all manner of text manipulation, in this case searching. For example, if I know that I alternate between grey and gray in my document, and want to find all the instances of either, my search term would look like this:

    gr[ea]y

    The brackets create a set of characters, and regex accepts any one character from the set at that position, so this single pattern matches both grey and gray. Sure, you could use regular search and enter each word separately, but when your sets get large, regex quickly becomes useful, if not exactly friendly.
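
    To see the bracket set in action outside Scrivener, here is a minimal sketch using Python's re module. That choice is an assumption for illustration (Scrivener's own regex engine may differ in details), and the sample sentence is made up.

```python
import re

# Invented sample text; "green" is included to show what does NOT match.
text = "The grey cat sat under a gray sky, next to a green door."

# [ea] matches exactly one character, which must be either e or a.
matches = re.findall(r"gr[ea]y", text)
print(matches)
```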

    About the most powerful character in a regular expression is the dot or period. It represents "any single character" (in most regex engines, anything except a line break). In the search term above, I could have said:

    gr.y

    That would have returned grey, gray, grgy, gr4y, gr%y, and more if they existed. In this simple search, that might have been the best way to go. Regular expressions can be used in some environments to both search and replace, so you might imagine how the dot might get someone into a lot of trouble.
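
    A rough illustration of how greedy the dot is, and why it is risky in a search-and-replace. This is a sketch in Python's re module (an assumption; Scrivener itself may behave differently), run on invented text.

```python
import re

# Invented sample text containing real words and junk strings.
text = "grey gray grgy gr4y gr%y gry groovy"

# The dot stands for any one character, so all five four-character
# "gr?y" strings match; "gry" and "groovy" do not fit the pattern.
dot_matches = re.findall(r"gr.y", text)
print(dot_matches)

# In a replace, the dot rewrites things you may not have meant to touch.
replaced = re.sub(r"gr.y", "gray", "a gr4y box")
print(replaced)
```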

    "Classes" of characters can be found by using "escape sequences" like \d for "any digit" and \s for "whitespace character", which includes spaces, tabs, and line endings. The class identified by \w is "any word character", which covers letters, digits, and the underscore, so roughly the sets [a-z], [A-Z], and [0-9] plus _ (but not the hyphen). Regex considers 1999 to be a word. Note my use of ranges there; the sets [0-9] and [0123456789] are functionally the same.

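    The character classes can be tried out with a short sketch in Python's re module (an assumption for illustration; Scrivener's engine may differ in details). In Python, \w covers letters, digits, and the underscore, so a hyphenated word splits in two. The sample text is invented.

```python
import re

# Invented sample text with digits, a hyphen, and a tab character.
text = "In 1999, well-known\tauthors met."

digits = re.findall(r"\d", text)       # each digit, one at a time
ranged = re.findall(r"[0-9]", text)    # the same thing written as a range
spaces = re.findall(r"\s", text)       # spaces and the tab
words = re.findall(r"\w+", text)       # runs of word characters

print(digits, spaces, words)
```
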
    I mentioned before that the period character is used to represent any single character. If you want to search for periods, you have to "escape" the period by putting a backslash in front of it like this: \. Using the character class indicator for "any word character" along with the escaped period, you can search for \w\. and find "any word character that is followed by a period", which will primarily get you ends of sentences, including sentences that end with quote marks (e.g. Marko said, "This doesn't look right."); the search will find the t character followed by the period. The same search will find abbreviations such as Mr. and Mrs.
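
    Here is a small sketch of the escaped period, again in Python's re module as an assumed stand-in for Scrivener's search. The sample text is invented.

```python
import re

# Invented sample text: an abbreviation, a plain sentence end,
# and a sentence that ends inside quote marks.
text = 'Mr. Ray paused. Marko said, "This doesn\'t look right."'

# \w is one word character; \. is a literal period, not "any character".
hits = re.findall(r"\w\.", text)
print(hits)
```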

    If you commonly use numerals where you should be spelling out words (in prose, most one- and two-digit numbers should be spelled out), search using \d to highlight every digit, then fix the ones you should. Or you can say \b\d{2}\b to find all instances of two-digit numbers (see below for what \b means). Note the curly braces, which set how many times the thing right before them must repeat, in this case the digit indicator.
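
    The two-digit search can be checked with a quick sketch in Python's re module (an assumption; Scrivener may differ in details), using an invented sentence.

```python
import re

# Invented sample text with one-, two-, and four-digit numbers.
text = "She waited 12 days, walked 7 miles, and paid 1999 dollars for 40 books."

# \d{2} means exactly two digits; the \b on each side keeps the
# pattern from matching two digits inside a longer number like 1999.
two_digit = re.findall(r"\b\d{2}\b", text)
print(two_digit)
```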

    The ^ symbol marks "beginning of line" and $ "end of line". So if I want to find anywhere I start a paragraph with "because" I would enter ^because (Scrivener doesn't seem to care about case with its regex). If I wanted to find everywhere I ended a paragraph with that same word, I would enter because\.$ as the search term, which will find the word "because" followed by a period and an end-of-line character, but not paragraphs that end in "because:", since that might be a valid lead-in to a list of bullets.
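
    A sketch of the line anchors in Python's re module, which is an assumption for illustration. Python needs two explicit flags to imitate what is described here: MULTILINE so ^ and $ work on every line, and IGNORECASE to ignore case. The paragraphs are invented.

```python
import re

# Invented sample text; each line stands in for one paragraph.
text = (
    "Because it rained, we stayed in.\n"
    "We stayed in because.\n"
    "We left early, because:\n"
    "The reason was simple."
)

flags = re.MULTILINE | re.IGNORECASE

starts = re.findall(r"^because", text, flags)   # paragraphs that begin with it
ends = re.findall(r"because\.$", text, flags)   # paragraphs that end with "because."

print(starts, ends)
```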

    Using the word boundary character \b, as I did above with the number example, combined with a pattern described by thin[kg], we enter \bthin[kg] and find every word that starts with "think" or "thing" (including "thinking" and "things") but not "something" (because there's no word boundary ahead of the "-thing" in "something").
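
    To compare the search with and without the word boundary, here is a sketch in Python's re module (an assumption; Scrivener's engine may differ), run on an invented sentence.

```python
import re

# Invented sample text with "thing" both alone and buried in longer words.
text = "I think something about that thing is off; nothing else."

# With \b, the match must begin at the start of a word.
bounded = re.findall(r"\bthin[kg]", text)

# Without \b, the "thing" inside "something" and "nothing" matches too.
unbounded = re.findall(r"thin[kg]", text)

print(bounded, unbounded)
```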

    The | symbol creates an "or" operator, so you can search for "this" or "that" with this|that. It's most useful inside groups, such that (cat|dog) food would find all instances of "cat food" and "dog food", whereas cat|dog food would find instances of "cat" and of "dog food" independently. This is a bit confusing, because one might expect the search to find all instances of the letters in cat or dog. Remember, though, that [] surround a character class, which matches a single character. Here we've used () to surround groups. So if you want to search for colors, enter (red|orange|yellow|green|blue|indigo|violet) or whatever other color words you might have used in your documents. If you want to search for instances of letters, search for [xyz] to find all instances of those three letters, wherever they appear in words. Escape sequences count as single characters, so [\txz] would search for any instance of the tab character, the letter x, or the letter z.
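
    The difference between the grouped and ungrouped "or" shows up clearly in a short sketch. It uses Python's re module as an assumed stand-in for Scrivener's search, with invented text. finditer is used so the whole match is reported rather than just the group.

```python
import re

# Invented sample text.
text = "We bought cat food and dog food, but the cat ignored it."

# Grouped: "cat" or "dog", followed by " food".
grouped = [m.group(0) for m in re.finditer(r"(cat|dog) food", text)]

# Ungrouped: either "cat" on its own, or the whole phrase "dog food".
ungrouped = [m.group(0) for m in re.finditer(r"cat|dog food", text)]

# A bracket set matches single characters; \t inside it is the tab.
singles = re.findall(r"[\txz]", "a\tbox of zinc")

print(grouped, ungrouped, singles)
```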

    I've been using the site linked above as a reference for writing this post, learning things along the way. As I said, I'm no regex expert, so can't fully explain how this works (I'm exhausted at this point reading the various details of backreferences, grouping, capturing, and the rest), but I ran across this tidbit that seems useful for writers. According to the regular-expressions.info site, the following search string will find any repeated words:

    \b(\w+)\s+\1\b

    Sure enough, it works. My WIP included a reference to Bora Bora, and that search term found it. If you want to make sure you don't have any instances of "the the" or "is is" in your work, use that to find them, but don't ask me how it does it.
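
    For the curious, my reading of how the pattern works: (\w+) captures one word, \s+ matches the whitespace after it, and \1 is a "backreference" that matches only the exact text the group just captured. The \b at each end keeps it to whole words. Here is a sketch in Python's re module (an assumption; Scrivener's engine may differ in details) on an invented sentence.

```python
import re

# Invented sample text with one legitimate repeat and two mistakes.
text = "We sailed to Bora Bora, and the the weather is is fine."

# \1 refers back to whatever the first (...) group matched.
pattern = re.compile(r"\b(\w+)\s+\1\b")

repeats = [m.group(0) for m in pattern.finditer(text)]
print(repeats)
```

    As the Bora Bora hit shows, the search can't tell a legitimate repeat from a typo, so each result still needs a human eye.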

    My final item is the "lookahead", written (?=...), which checks what comes next without including it in the match, so q(?=u) matches the q in question but not in Iraq because the q in Iraq isn't followed by a u.
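
    A quick sketch of the lookahead in Python's re module (an assumption for illustration; Scrivener may differ), on an invented phrase. Note that the u is checked but is not part of what gets matched.

```python
import re

# Invented sample text: one q followed by u, one q at the end of a word.
text = "A question about Iraq"

# (?=u) requires a u to come next but leaves it out of the match.
found = list(re.finditer(r"q(?=u)", text))

for m in found:
    print(m.group(0), "at position", m.start())
```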

    That's all I've got. Regex is a powerful tool that is used broadly by programmers to do massively complex searching through large blocks of text (millions of words at a time). Having it in your toolkit gives you one more way to find things quickly and accurately. Scrivener's support of it is a boon.

    Cheers.

    J.D.
     
    Lifeline and Wreybies like this.
  2. Wreybies

    Wreybies Thrice Retired Supporter Contributor

    Joined:
    May 1, 2008
    Messages:
    23,826
    Likes Received:
    20,818
    Location:
    El Tembloroso Caribe
    Wow. I've used RegEx in the past in Calibre but I also did not know that Scrivener supports it. Excellent.
     
    J.D. Ray likes this.
