Filename Regular Expression

Checks that a string is valid as a file or folder name on Windows (NTFS), Mac (HFS+) and most Linux distros, and as part of a URI without encoding.

/^(?!.{256,})(?!(aux|clock\$|con|nul|prn|com[1-9]|lpt[1-9])(?:$|\.))[^ ][ \.\w-$()+=[\];#@~,&']+[^\. ]$/i
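As a minimal sketch of how the pattern might be used, the snippet below drops it into a JavaScript regex literal unchanged. The names FILENAME_PATTERN and isPortableFilename are hypothetical, chosen here for illustration, and the sketch assumes an engine that accepts the pattern without the u flag (as browsers and Node do).

```javascript
// The pattern from above, unchanged, as a JavaScript regex literal.
const FILENAME_PATTERN = /^(?!.{256,})(?!(aux|clock\$|con|nul|prn|com[1-9]|lpt[1-9])(?:$|\.))[^ ][ \.\w-$()+=[\];#@~,&']+[^\. ]$/i;

// Hypothetical helper: true when the whole string matches the pattern.
function isPortableFilename(name) {
  return FILENAME_PATTERN.test(name);
}

console.log(isPortableFilename("report-2024 (final).txt")); // true
console.log(isPortableFilename("con.txt"));                 // false (reserved name)
console.log(isPortableFilename("notes."));                  // false (trailing dot)
```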

Rationale

Questions such as “which characters are valid in filenames” and “what’s valid in a URL” often yield unsatisfying answers because they leave out one or more implied requirements. Often the implied requirement is that characters are not encoded. For example, many Unicode characters are valid in URIs provided they are encoded, such as the pound sign (£) written as %C2%A3, but it’s probably intended that these be excluded.

Sometimes the answer simply fails to account for the quirks of Windows, including the fact that reserved names like con are still invalid if they have an extension, such as con.txt or even con.foo.bar. Furthermore, the rules that apply exclusively to Windows can be confusing because NTFS allows what a Linux user would call a “dotfile” (e.g. .gitignore), yet Windows Explorer will not let you name a file as such unless it has an extension (e.g. .gitignore.txt). Interestingly, cmd has no such qualms.

However the question is phrased, what’s really important is whether you can use a given string as a file/folder name with confidence that it will not cause problems on common systems and web servers and so reduce the file’s portability. This pattern attempts to provide said confidence.

Pattern properties

  • Allows basic Latin alphanumeric characters of mixed case
  • Denies characters outside basic Latin (such as the French é or the Polish ę)
  • Allows special characters commonly found on QWERTY keyboards, however:
    • Denies special characters disallowed in filenames on Windows, Mac and common Linux distros
    • Denies special characters disallowed in URIs
  • Limits length to 255 characters
  • Denies leading or trailing spaces
  • Allows files starting with a dot
  • Denies trailing dots, therefore also denies . and ..
  • Denies strings reserved by Windows, such as con, with any (or no) extension
  • Does not consider whether the string requires quoting in a terminal
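
The properties listed above can be spot-checked with a small table of cases. This is only a sketch in JavaScript: the sample names and the constant FILENAME_PATTERN are illustrative assumptions, not part of the original pattern, and each disallowed character is deliberately placed in the middle of the name.

```javascript
// The pattern under test, copied verbatim from the top of the article.
const FILENAME_PATTERN = /^(?!.{256,})(?!(aux|clock\$|con|nul|prn|com[1-9]|lpt[1-9])(?:$|\.))[^ ][ \.\w-$()+=[\];#@~,&']+[^\. ]$/i;

// Illustrative samples: [candidate name, expected result].
const cases = [
  ["Report_2024.txt", true],   // basic Latin alphanumerics, mixed case
  ["My File (v2).txt", true],  // common special characters and inner spaces
  ["résumé.pdf", false],       // accented letter
  ["what?.txt", false],        // ? is disallowed on Windows and in URIs
  ["a:b.txt", false],          // : is disallowed on Windows
  ["a".repeat(255), true],     // 255 characters is the maximum
  ["a".repeat(256), false],    // 256 characters is too long
  [" lead.txt", false],        // leading space
  ["trail.txt ", false],       // trailing space
  [".gitignore", true],        // dotfile
  ["notes.", false],           // trailing dot
  ["..", false],               // parent directory
  ["con", false],              // reserved name, no extension
  ["Con.txt", false],          // reserved name, any case, with extension
  ["LPT1.foo.bar", false],     // reserved name with several extensions
  ["cons", true],              // merely starts with a reserved name
];

// Collect every case whose actual result differs from the expectation.
const failures = cases.filter(
  ([name, expected]) => FILENAME_PATTERN.test(name) !== expected
);

console.log(failures.length === 0 ? "all cases behave as listed" : failures);
```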

Explanation of the pattern

  • / is the opening delimiter, repeated at the end before the flags (explained later)
  • ^ means the start of the string which, coupled with $ at the end, means that the match will fail even if part of the string is valid, i.e. the entire string being tested must pass
  • (?!.{256,}) is a negative lookahead meaning that the whole string must be fewer than 256 characters (not 256 chars or more)
  • (?!(aux|clock\$|con|nul|prn|com[1-9]|lpt[1-9])(?:$|\.)) is another negative lookahead for Windows’ reserved filenames and prevents them being standalone names or being used with any extension, e.g. con, con.txt and con.foo.bar are disallowed, but cons is fine. In other words, the string cannot be a reserved name followed by the end of the string, nor a reserved name followed by a dot
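  • [^ ] is the first part of the actual (non-lookahead) pattern and means that the first character cannot be a space; as a negated class, it accepts any other character
  • [ \.\w-$()+=[\];#@~,&']+ means one or more of any of the enclosed characters (\w meaning alphanumerics and underscores)
  • [^\. ]$ means that the final character cannot be a dot or a space; it too accepts any other character, and together these three parts require a name of at least three characters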
  • /i is the closing delimiter and a flag making the pattern case-insensitive (so con etc. don’t have to be duplicated for uppercase)
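
To make the breakdown above concrete, here is a rough sketch that rebuilds the pattern from its parts in JavaScript, one string per bullet, using the standard String.raw tag and the RegExp constructor. The names parts and assembled are hypothetical, and the sample inputs are only illustrative.

```javascript
// One entry per piece of the explanation, kept as raw strings so that
// the backslashes reach the regex engine untouched.
const parts = [
  String.raw`^`,                                   // start of the string
  String.raw`(?!.{256,})`,                         // not 256 characters or more
  String.raw`(?!(aux|clock\$|con|nul|prn|com[1-9]|lpt[1-9])(?:$|\.))`, // not a reserved name
  String.raw`[^ ]`,                                // first character is not a space
  String.raw`[ \.\w-$()+=[\];#@~,&']+`,            // one or more allowed characters
  String.raw`[^\. ]`,                              // last character is not a dot or space
  String.raw`$`,                                   // end of the string
];

// The "i" flag makes the reserved-name check case-insensitive.
const assembled = new RegExp(parts.join(""), "i");

console.log(assembled.test("report.txt"));  // true
console.log(assembled.test("CON.foo.bar")); // false (reserved name, upper case)
console.log(assembled.test(" x.txt"));      // false (leading space)
```

Splitting the pattern this way is purely a readability choice; the assembled expression is meant to behave the same as the single-line literal.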
