coreutils: Character sets

 
 9.1.1 Specifying sets of characters
 -----------------------------------
 
 The format of the SET1 and SET2 arguments resembles the format of
 regular expressions; however, they are not regular expressions, only
 lists of characters.  Most characters simply represent themselves in
 these strings, but the strings can contain the shorthands listed below,
 for convenience.  Some of them can be used in only one of SET1 or
 SET2, as noted below.
 
 Backslash escapes
 
      The following backslash escape sequences are recognized:
 
      ‘\a’
           Control-G.
      ‘\b’
           Control-H.
      ‘\f’
           Control-L.
      ‘\n’
           Control-J.
      ‘\r’
           Control-M.
      ‘\t’
           Control-I.
      ‘\v’
           Control-K.
      ‘\OOO’
           The 8-bit character with the value given by OOO, which is 1 to
           3 octal digits.  Because ‘\400’ exceeds the largest 8-bit
           value (‘\377’), it is interpreted as the two-byte sequence
           ‘\040’ ‘0’.
      ‘\\’
           A backslash.
 
      A backslash followed by a character not listed above is
      interpreted as that character.  The backslash also removes any
      special significance the character would otherwise have, so it is
      useful for escaping ‘[’, ‘]’, ‘*’, and ‘-’.
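
      As a minimal sketch of the escapes above, assuming GNU ‘tr’ and an
      ASCII-compatible locale (the variable names are only illustrative):

```shell
# Translate each space to a newline using the '\n' escape.
lines=$(printf 'one two three' | tr ' ' '\n')
printf '%s\n' "$lines"

# '\101' is the octal value of 'A' in ASCII; replace every 'x' with it.
octal=$(printf 'xyx' | tr 'x' '\101')
printf '%s\n' "$octal"      # AyA

# The backslash removes the special meaning of '-', so SET1 here is
# the three literal characters 'a', '-' and 'z', not a range.
literal=$(printf 'a-m-z' | tr -d 'a\-z')
printf '%s\n' "$literal"    # m
```

      The sets are single-quoted so that the shell passes the backslashes
      through to ‘tr’ unchanged.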
 
 Ranges
 
      The notation ‘M-N’ expands to all of the characters from M through
      N, in ascending order.  M should collate before N; if it doesn’t,
      an error results.  As an example, ‘0-9’ is the same as
      ‘0123456789’.
 
      GNU ‘tr’ does not support the System V syntax that uses square
      brackets to enclose ranges.  Translations specified in that format
      sometimes work as expected, since the brackets are often
      transliterated to themselves.  However, they should be avoided
      because they sometimes behave unexpectedly.  For example, ‘tr -d
      '[0-9]'’ deletes brackets as well as digits.
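
      The difference between a plain range and the bracketed System V
      form can be seen in a short sketch, assuming GNU ‘tr’ (the input
      string and variable names are only illustrative):

```shell
# '0-9' expands to '0123456789', so only the digits are deleted.
plain=$(printf 'a[1]b2\n' | tr -d '0-9')
printf '%s\n' "$plain"      # a[]b

# In '[0-9]' the brackets are ordinary members of the set,
# so they are deleted along with the digits.
sysv=$(printf 'a[1]b2\n' | tr -d '[0-9]')
printf '%s\n' "$sysv"       # ab
```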
 
      Many historically common and even accepted uses of ranges are not
      portable.  For example, on EBCDIC hosts, using the ‘A-Z’ range will
      not do what most would expect, because ‘A’ through ‘Z’ are not
      contiguous as they are in ASCII.  If you can rely on a
      POSIX-compliant version of ‘tr’, then the best way to work around
      this is to use character classes (see below).  Otherwise, it is
      most portable (and most ugly) to enumerate the members of the
      ranges.
 
 Repeated characters
 
      The notation ‘[C*N]’ in SET2 expands to N copies of character C.
      Thus, ‘[y*6]’ is the same as ‘yyyyyy’.  The notation ‘[C*]’ in SET2
      expands to as many copies of C as are needed to make SET2 as long
      as SET1.  If N begins with ‘0’, it is interpreted in octal,
      otherwise in decimal.
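
      The repeat notations might be exercised as follows; this is a
      sketch assuming GNU ‘tr’, with illustrative inputs and variable
      names:

```shell
# '[x*3]' expands to 'xxx', so 'a', 'b' and 'c' all map to 'x'.
fixed=$(printf 'abcd\n' | tr 'abc' '[x*3]')
printf '%s\n' "$fixed"      # xxxd

# '[#*]' pads SET2 with '#' until it is as long as SET1 (ten digits).
padded=$(printf 'tel 555 0100\n' | tr '0-9' '[#*]')
printf '%s\n' "$padded"     # tel ### ####

# A count with a leading '0' is octal: '010' is eight copies of 'y',
# so 'a' through 'h' map to 'y' and 'i' maps to 'z'.
octal_count=$(printf 'abcdefghi\n' | tr 'a-i' '[y*010]z')
printf '%s\n' "$octal_count"    # yyyyyyyyz
```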
 
 Character classes
 
      The notation ‘[:CLASS:]’ expands to all of the characters in the
      (predefined) class CLASS.  The characters expand in no particular
      order, except for the ‘upper’ and ‘lower’ classes, which expand in
      ascending order.  When the ‘--delete’ (‘-d’) and
      ‘--squeeze-repeats’ (‘-s’) options are both given, any character
      class can be used in SET2.  Otherwise, only the character classes
      ‘lower’ and ‘upper’ are accepted in SET2, and then only if the
      corresponding character class (‘upper’ and ‘lower’, respectively)
      is specified in the same relative position in SET1.  Doing this
      specifies case conversion.  The class names are given below; an
      error results when an invalid class name is given.
 
      ‘alnum’
           Letters and digits.
      ‘alpha’
           Letters.
      ‘blank’
           Horizontal whitespace.
      ‘cntrl’
           Control characters.
      ‘digit’
           Digits.
      ‘graph’
           Printable characters, not including space.
      ‘lower’
           Lowercase letters.
      ‘print’
           Printable characters, including space.
      ‘punct’
           Punctuation characters.
      ‘space’
           Horizontal or vertical whitespace.
      ‘upper’
           Uppercase letters.
      ‘xdigit’
           Hexadecimal digits.
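
      A few typical uses of character classes, sketched under the
      assumption of GNU ‘tr’ in the C or another ASCII-compatible locale
      (inputs and variable names are only illustrative):

```shell
# Case conversion: 'lower' in SET1 paired with 'upper' in SET2.
upper=$(printf 'Hello, World 42\n' | tr '[:lower:]' '[:upper:]')
printf '%s\n' "$upper"      # HELLO, WORLD 42

# Any class may appear in SET1, for example when deleting.
no_digits=$(printf 'room 42b\n' | tr -d '[:digit:]')
printf '%s\n' "$no_digits"  # room b

# With both -d and -s, SET2 may be any class: delete the digits,
# then squeeze each run of blanks to a single character.
tidy=$(printf 'a1  b22   c\n' | tr -ds '[:digit:]' '[:blank:]')
printf '%s\n' "$tidy"       # a b c
```

      The class names are quoted so that the shell does not treat the
      square brackets as a filename pattern.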
 
 Equivalence classes
 
      The syntax ‘[=C=]’ expands to all of the characters that are
      equivalent to C, in no particular order.  Equivalence classes are a
      relatively recent invention intended to support non-English
      alphabets.  But there seems to be no standard way to define them or
      determine their contents.  Therefore, they are not fully
      implemented in GNU ‘tr’; each character’s equivalence class
      consists only of that character, which is of no particular use.
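
      A brief sketch of this behavior, assuming GNU ‘tr’ (the variable
      name is only illustrative): because the class contains only the
      character itself, ‘[=e=]’ in SET1 acts like a plain ‘e’.

```shell
# '[=e=]' expands to just 'e', so only that character is translated.
equiv=$(printf 'hello\n' | tr '[=e=]' 'E')
printf '%s\n' "$equiv"      # hEllo
```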