7.7 Text search across multiple lines
=====================================
This section uses 'N' and 'D' commands to search for consecutive words
spanning multiple lines. ⇒Multiline techniques.
These examples deal with finding doubled occurrences of words in a
document.
Finding doubled words on a single line is easy with GNU 'grep', and
just as easy with GNU 'sed':
$ cat two-cities-dup1.txt
It was the best of times,
it was the worst of times,
it was the the age of wisdom,
it was the age of foolishness,
$ grep -E '\b(\w+)\s+\1\b' two-cities-dup1.txt
it was the the age of wisdom,
$ grep -n -E '\b(\w+)\s+\1\b' two-cities-dup1.txt
3:it was the the age of wisdom,
$ sed -En '/\b(\w+)\s+\1\b/p' two-cities-dup1.txt
it was the the age of wisdom,
$ sed -En '/\b(\w+)\s+\1\b/{=;p}' two-cities-dup1.txt
3
it was the the age of wisdom,
* The regular expression '\b\w+\s+' matches a word boundary
('\b'), followed by one or more word characters ('\w+'), followed
by whitespace ('\s+'). ⇒regexp extensions.
* Adding parentheses around the '(\w+)' expression creates a
subexpression. The regular expression pattern '(PATTERN)\s+\1'
defines a subexpression (in the parentheses) followed by a
back-reference, separated by whitespace. A successful match means
the text matched by PATTERN appears twice in succession. ⇒
Back-references and Subexpressions.
* The word-boundary expression ('\b') at both ends ensures partial
words are not matched (e.g. 'the then' is not a desired match).
* The '-E' option enables extended regular expression syntax,
removing the need to add backslashes before the parentheses.
⇒ERE syntax.
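The effect of the '\b' anchors can be checked with a small experiment.
This is a minimal sketch assuming GNU 'sed'; the file name
'boundary-demo.txt' and its two sample lines are made up for
illustration and are not part of the examples above.

```shell
# Hypothetical sample input: one false positive, one real doubled word.
printf '%s\n' 'see the then clause' 'see the the clause' > boundary-demo.txt

# Without '\b', 'the then' matches: the back-reference '\1' ("the")
# is satisfied by the first three letters of "then".
no_boundary=$(sed -En '/(\w+)\s+\1/p' boundary-demo.txt)

# With '\b' on both ends, only whole repeated words match.
with_boundary=$(sed -En '/\b(\w+)\s+\1\b/p' boundary-demo.txt)

echo "without boundary anchors:"
echo "$no_boundary"
echo "with boundary anchors:"
echo "$with_boundary"
```

The unanchored expression reports both lines, while the anchored one
reports only 'see the the clause'.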
When the doubled word spans two lines, the above regular expression
will not find it, as 'grep' and 'sed' operate line-by-line.
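The limitation is easy to reproduce. The sketch below, assuming GNU
'sed', recreates the sample file 'two-cities-dup2.txt' used in this
section so that it stands alone, then runs the single-line expression
against it; the variable name 'matches' is only illustrative.

```shell
# Recreate the sample whose doubled word is split across two lines.
cat > two-cities-dup2.txt <<'EOF'
It was the best of times, it was the
worst of times, it was the
the age of wisdom,
it was the age of foolishness,
EOF

# Each line is tested on its own, so the "the" ending line 2 and the
# "the" starting line 3 are never seen as adjacent words.
matches=$(sed -En '/\b(\w+)\s+\1\b/p' two-cities-dup2.txt)
echo "matches found: ${matches:-none}"
```

No line is printed, even though the text contains a doubled 'the'.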
By using 'N' and 'D' commands, 'sed' can apply regular expressions on
multiple lines (that is, multiple lines are stored in the pattern space,
and the regular expression works on it):
$ cat two-cities-dup2.txt
It was the best of times, it was the
worst of times, it was the
the age of wisdom,
it was the age of foolishness,
$ sed -En '{N; /\b(\w+)\s+\1\b/{=;p} ; D}' two-cities-dup2.txt
3
worst of times, it was the
the age of wisdom,
* The 'N' command appends the next line to the pattern space (thus
ensuring it contains two consecutive lines in every cycle).
* The regular expression uses '\s+' as the word separator, which
matches both spaces and newlines.
* If the regular expression matches, '=' prints the number of the last
line read (3 here, the second line of the pair) and the entire
pattern space is printed with 'p'. No lines are printed by default
due to the '-n' option.
* The 'D' command removes the first line from the pattern space (up until the
first newline), readying it for the next cycle.
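To see the two-line window that 'N' and 'D' maintain, the pattern space
can be dumped on every cycle with the 'l' command. This is a sketch
assuming GNU 'sed' (the numeric argument in 'l 0', which disables line
wrapping, is a GNU extension); the sample file is recreated so the
snippet stands alone, and the variable name 'trace' is illustrative.

```shell
# Recreate the sample file from this section.
cat > two-cities-dup2.txt <<'EOF'
It was the best of times, it was the
worst of times, it was the
the age of wisdom,
it was the age of foolishness,
EOF

# N appends the next line, 'l 0' prints the pattern space unambiguously
# (embedded newline shown as \n, end of pattern space shown as $),
# and D drops the first line before the next cycle.
trace=$(sed -n 'N; l 0; D' two-cities-dup2.txt)
echo "$trace"
```

Each output line shows one pair of consecutive input lines joined by
'\n', so four input lines yield three overlapping pairs. The pair made
of lines 2 and 3 contains 'the\nthe', which is what '\s+' matches in
the doubled-word expression.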
See the GNU 'coreutils' manual for an alternative solution using 'tr
-s' and 'uniq' at
<https://gnu.org/s/coreutils/manual/html_node/Squeezing-and-deleting.html>.
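A sketch of that alternative follows. It assumes the general shape of
the 'tr'/'uniq' pipeline (one word per line, lower-cased, then adjacent
duplicates reported) and is not quoted from the coreutils manual; the
variable name 'doubled' is illustrative.

```shell
# Recreate the sample file from this section.
cat > two-cities-dup2.txt <<'EOF'
It was the best of times, it was the
worst of times, it was the
the age of wisdom,
it was the age of foolishness,
EOF

# 1. Turn punctuation and blanks into newlines, squeezing repeats, so
#    every word sits on its own line regardless of original line breaks.
# 2. Fold to lower case so "The the" would also count as a repeat.
# 3. 'uniq -d' prints one copy of each line that is repeated in a row.
doubled=$(tr -s '[:punct:][:blank:]' '[\n*]' < two-cities-dup2.txt \
  | tr '[:upper:]' '[:lower:]' \
  | uniq -d)
echo "$doubled"
```

For the sample file this reports 'the'. Unlike the 'sed' version, it
names the doubled word but does not show where in the file it occurs.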