coreutils: Special handling of file extensions

 
 30.3.3 Special handling of file extensions
 ------------------------------------------
 
 GNU coreutils’ version sort algorithm implements specialized handling of
 file extensions (or strings that look like file names with extensions).
 
    This nuanced implementation enables slightly more natural ordering of
 files.
 
    The additional rules are:
 
   1. A suffix (i.e., a file extension) is defined as: a dot, followed by
      a letter or tilde, followed by zero or more letters, digits, or
      tildes (possibly repeated more than once), until the end of the
      string (technically, matching the regular expression
      ‘(\.[A-Za-z~][A-Za-z0-9~]*)*’).
 
   2. If the strings contain suffixes, the suffixes are temporarily
      removed, and the strings are compared without them (using the
      algorithm described in Version-sort ordering rules, above).
 
   3. If the suffix-less strings are identical, the suffix is restored
      and the entire strings are compared.
 
   4. If the non-suffixed strings differ, the result is returned and the
      suffix is effectively ignored.
 
    Examples for rule 1:
 
    • ‘hello-8.txt’: the suffix is ‘.txt’
 
    • ‘hello-8.2.txt’: the suffix is ‘.txt’ (‘.2’ is not included
      because the dot is not followed by a letter)
 
    • ‘hello-8.0.12.tar.gz’: the suffix is ‘.tar.gz’ (‘.0.12’ is not
      included)
 
    • ‘hello-8.2’: no suffix (suffix is an empty string)
 
    • ‘hello.foobar65’: the suffix is ‘.foobar65’
 
    • ‘gcc-c++-10.8.12-0.7rc2.fc9.tar.bz2’: the suffix is ‘.fc9.tar.bz2’
      (‘.7rc2’ is not included as it begins with a digit)
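 
    As a rough illustration of rule 1, the following Python sketch
 extracts the suffix with the regular expression given above, anchored
 at the end of the string.  The names ‘SUFFIX_RE’ and ‘split_suffix’ are
 hypothetical and are not part of coreutils; the real implementation is
 written in C and may handle further special cases.

```python
import re

# Regular expression from rule 1, anchored at the end of the string.
# \Z is used rather than $ so that a trailing newline is not skipped.
SUFFIX_RE = re.compile(r"(?:\.[A-Za-z~][A-Za-z0-9~]*)*\Z")

def split_suffix(name):
    """Return (stem, suffix); suffix is '' when the name has none."""
    m = SUFFIX_RE.search(name)  # always matches, possibly the empty string
    return name[:m.start()], name[m.start():]

for name in ("hello-8.txt", "hello-8.2.txt", "hello-8.0.12.tar.gz",
             "hello-8.2", "hello.foobar65",
             "gcc-c++-10.8.12-0.7rc2.fc9.tar.bz2"):
    print(name, "->", split_suffix(name))
```

    Because the search returns the leftmost match that reaches the end
 of the string, the longest run of valid dot-groups is taken as the
 suffix.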
 
    Examples for rule 2:
 
    • Comparing ‘hello-8.txt’ to ‘hello-8.2.12.txt’, the ‘.txt’ suffix is
      temporarily removed from both strings.
 
    • Comparing ‘foo-10.3.tar.gz’ to ‘foo-10.tar.xz’, the suffixes
      ‘.tar.gz’ and ‘.tar.xz’ are temporarily removed from the strings.
 
    Example for rule 3:
 
    • Comparing ‘hello.foobar65’ to ‘hello.foobar4’, the suffixes
      (‘.foobar65’ and ‘.foobar4’) are temporarily removed.  The
      remaining strings are identical (‘hello’).  The suffixes are then
      restored, and the entire strings are compared (‘hello.foobar4’
      comes first).
 
    Examples for rule 4:
 
    • When comparing the strings ‘hello-8.2.txt’ and ‘hello-8.10.txt’,
      the suffixes (‘.txt’) are temporarily removed.  The remaining
      strings (‘hello-8.2’ and ‘hello-8.10’) are compared as previously
      described (‘hello-8.2’ comes first).  (In this case the suffix
      removal algorithm does not have a noticeable effect on the
      resulting order.)
 
    How does the suffix-removal algorithm affect ordering results?
 
    Consider the comparison of ‘hello-8.txt’ and ‘hello-8.2.txt’.
 
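    Without the suffix-removal algorithm, the strings are broken down
 into the following parts (the rule numbers shown refer to the general
 version-sort ordering rules, not to the suffix rules listed above):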
 
      hello-  vs  hello-  (rule 2, all non-digit characters)
      8       vs  8       (rule 3, all digit characters)
      .txt    vs  .       (rule 2)
      empty   vs  2
      empty   vs  .txt
 
    The comparison of the third parts (‘.txt’ vs ‘.’) will determine
 that the shorter string comes first, resulting in ‘hello-8.2.txt’
 appearing first.
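 
    The breakdown into parts can be reproduced with a small Python
 sketch that splits a string into alternating runs of non-digit and
 digit characters.  The helper name ‘split_parts’ is an assumption made
 for illustration; it shows only the decomposition, not the comparison
 itself.

```python
import re

def split_parts(s):
    """Split s into alternating runs of non-digit and digit characters."""
    return re.findall(r"[^0-9]+|[0-9]+", s)

# Whole names, as compared without suffix removal.
print(split_parts("hello-8.txt"))    # ['hello-', '8', '.txt']
print(split_parts("hello-8.2.txt"))  # ['hello-', '8', '.', '2', '.txt']

# The same names once the '.txt' suffix has been removed.
print(split_parts("hello-8"))        # ['hello-', '8']
print(split_parts("hello-8.2"))      # ['hello-', '8', '.', '2']
```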
 
    Indeed this is the order in which Debian’s ‘dpkg’ compares the
 strings.
 
    A more natural result is that ‘hello-8.txt’ should come before
 ‘hello-8.2.txt’, and this is where the suffix-removal comes into play:
 
    The suffixes (‘.txt’) are removed, and the remaining strings are
 broken down into the following parts:
 
      hello-  vs  hello-  (rule 2, all non-digit characters)
      8       vs  8       (rule 3, all digit characters)
      empty   vs  .       (rule 2)
      empty   vs  2
 
    As empty strings sort before non-empty strings, ‘hello-8’ sorts
 first, and so ‘hello-8.txt’ is listed before ‘hello-8.2.txt’.
 
    A real-world example would be listing files such as:
 ‘gcc_10.fc9.tar.gz’ and ‘gcc_10.8.12.7rc2.fc9.tar.bz2’: Debian’s
 algorithm would list ‘gcc_10.8.12.7rc2.fc9.tar.bz2’ first, while ‘ls -v’
 will list ‘gcc_10.fc9.tar.gz’ first.
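 
    The effect can be sketched in Python under some stated assumptions.
 Here ‘base_cmp’ is a simplified, ASCII-only stand-in for the general
 version-sort comparison (tilde first, letters before other characters,
 digit runs compared numerically), and ‘version_cmp’ adds the
 suffix-removal step on top of it.  All names are hypothetical; this is
 not the coreutils or dpkg code, and special cases handled by the real
 implementations are omitted.

```python
import re
from functools import cmp_to_key
from itertools import zip_longest

SUFFIX_RE = re.compile(r"(?:\.[A-Za-z~][A-Za-z0-9~]*)*\Z")

def strip_suffix(name):
    """Drop the trailing suffix as defined by the suffix regular expression."""
    return name[:SUFFIX_RE.search(name).start()]

def char_rank(c):
    """Rank of a non-digit character; None marks the end of a run."""
    if c is None:
        return 0
    if c == "~":
        return -1            # tilde sorts before everything, even the end
    if c.isalpha():
        return ord(c)        # letters sort before other characters
    return ord(c) + 256

def base_cmp(a, b):
    """Simplified comparison of alternating non-digit and digit runs."""
    runs_a = re.findall(r"([^0-9]*)([0-9]*)", a)
    runs_b = re.findall(r"([^0-9]*)([0-9]*)", b)
    for (ta, da), (tb, db) in zip_longest(runs_a, runs_b, fillvalue=("", "")):
        for ca, cb in zip_longest(ta, tb):
            if char_rank(ca) != char_rank(cb):
                return char_rank(ca) - char_rank(cb)
        if int(da or 0) != int(db or 0):   # an empty digit run counts as zero
            return int(da or 0) - int(db or 0)
    return 0

def version_cmp(a, b):
    """Compare without suffixes first; fall back to the whole strings."""
    diff = base_cmp(strip_suffix(a), strip_suffix(b))
    return diff if diff else base_cmp(a, b)

names = ["gcc_10.8.12.7rc2.fc9.tar.bz2", "gcc_10.fc9.tar.gz"]
print(sorted(names, key=cmp_to_key(base_cmp)))     # no suffix removal
print(sorted(names, key=cmp_to_key(version_cmp)))  # with suffix removal
```

    In this sketch the first call lists ‘gcc_10.8.12.7rc2.fc9.tar.bz2’
 first, and the second lists ‘gcc_10.fc9.tar.gz’ first.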
 
    These priorities make sense for ‘ls -v’: Versioned files will be
 listed in a more natural order.
 
    For ‘sort -V’ these priorities might seem arbitrary.  However,
 because the sorting code is shared between the ‘ls’ and ‘sort’
 programs, the ordering rules are the same.