RFC-0105: Regular expression libraries

RFC-0105: Regular expression libraries
StatusAccepted
Areas
  • General
Description

Restrict the set of regular expression libraries that may be used in Fuchsia.

Issues
Gerrit change
Authors
Reviewers
Date submitted (year-month-day)2021-04-26
Date reviewed (year-month-day)2021-06-16

Summary

This RFC proposes to ban all regular expression libraries that do not explicitly guarantee linear time complexity relative to the size of the string to be matched. The officially-sanctioned libraries are RE2 (C++), regex (Rust), and regexp (Go).

Motivation

It's well-known that pure regular expressions can be matched in O(mn) time where n is the length of the input string and m is the size of the regular expression's NFA after constructs like a{10}b{10} have been expanded out. However, many libraries include extensions such as backreferences which make this problem worst-case exponential. Other libraries use a backtracking implementation that is worst-case exponential even when the pattern does not use backreferences. This turns regular expressions into a vector for denial of service (DoS) attacks: if a malicious user controls the pattern or the string to match, then the user may be able to trigger an exponential search that exhausts CPU and causes a DoS. See Davis et al, Russ Cox, V8, and Wikipedia for further discussion.

As of April 2021, Fuchsia uses regular expression libraries that may be vulnerable to denial of service attacks. Below is a discussion for each of Fuchsia's supported languages:

C/C++

As of April 2021, there are 38 uses of <regex>, a library which does not make any guarantees about time or space complexity. By default, <regex> uses the same regular expression language as ECMA-262, which is based on PCRE and supports backreferences (see Hron Martin, Section 3.5.5).

As of April 2021, there are 6 uses of <regex.h>, a library which supports backreferences and uses a backtracking implementation (see Hron Martin, Section 3.5.7).

Google's RE2 does not support backreferences and guarantees linear time matching relative to the size of the input string. Further, RE2 has a configurable bound on memory usage which prevents the m term in O(mn) from causing DoS. As of April 2021, this library is imported into Fuchsia's //third_party but there are no users. Hence, it appears that Fuchsia's C and C++ code may be vulnerable to regular expression DoS.

Rust

Rust's regex crate does not support backreferences, it guarantees O(mn) time matching, it bounds the NFA as does RE2, and therefore it is not vulnerable to DoS. As of April 2021, this is the only regular expression library used directly by Rust programs within Fuchsia.

Rust's regex_automata create allows creating arbitrary DFAs and has worst-case exponential time and space usage. As of April 2021, this crate is not used directly by Fuchsia programs, however it is used indirectly via other crates, such as bstr, which uses regex_automata in a restricted way such that it can guarantee worst-case linear time and constant space.

Go

Go's standard regexp package is based on RE2, and therefore is not vulnerable to DoS. As of April 2021, this is the only regular expression library used by Go programs within Fuchsia.

Dart

Dart's standard RegExp library is built on JavaScript's RegExp library, therefore it may be vulnerable to DoS.

Design

To avoid DoS attacks, all Fuchsia code MUST use one of the following libraries for regular expression parsing and matching:

  • C++ code MUST use RE2
  • Rust code MUST use the regex crate
  • Go code MUST use the standard regexp package

These sanctioned libraries are chosen because they all guarantee linear time matching. It's not sufficient to simply ban features such as backtracking, because even pure regular expressions can have pathological behavior in libraries that do not guarantee worst-case linear time matching.

We place no restrictions on Dart code. Dart's built-in RegExp class uses a backtracking implementation which supports backreferences and does not guarantee linear-time matching. We considered banning backreferences in Dart programs, but decided this is infeasible for Dart programs that receive regular expressions as user input: such programs would be required to parse and validate regular expressions received from the user, and currently, no widely-used libraries support this. Further, Dart's RegExp implementation is based on V8's Irregexp, which can suffer exponential blowup even on patterns that do not contain backreferences. Since Dart cannot be used for core system services, DoS is less of a concern.

Exceptions to the above rules MAY be granted. The process for acquiring exceptions is described below.

Implementation

We will create a roller for the RE2 library, following the pattern established for //third_party/googletest. A //third_party/re2/OWNERS file will be created to specify who is responsible for maintaining this roller. The initial OWNER will be tombergan@google.com. After this roller is deployed and //third_party/re2 has been updated to the latest version, existing C++ code will be ported from <regex> and <regex.h> to RE2. No changes are needed to Go, Rust, and Dart programs, as those are already compliant with this RFC.

After that migration is complete, two kinds of rules will be added:

  1. A presubmit rule based on clang-tidy will be added to ban <regex> and <regex.h> in C++ code.

  2. A visibility rule will be added to prevent Fuchsia programs from importing Rust's regex-automata crate. The existing usage of this crate in bstr is granted an exception because bstr guarantees worst-case linear time and constant space.

Exceptions

Exceptions MAY be granted if the following conditions are met:

  1. There is a compelling need, such as a need for compatibility with a third-party library or tool.

  2. The newly imported library guarantees linear time behavior, OR, DoS is provably not an issue for the code in question. For example, if the regular expressions are used in a host tool or only as part of a developer workflow that is not available in production builds, then DoS is at worst a developer annoyance, not a security concern, and an exception MAY be granted.

Exceptions are documented in allowlists. Each allowlist MUST include a comment linking to this RFC so that readers have appropriate context. To apply for an exception, you MUST upload a CL to modify the appropriate allowlist. The CL MUST give a detailed rationale for the exception and the CL MUST be approved by a member of FEC. There are two kinds of allowlists:

  1. Our clang-tidy-based presubmit rule will have an allowlist in a TBD location in the Fuchsia source tree. The allowlist's location will be documented in the presubmit rule's error message so it can be discovered by running the normal presubmits.

  2. Third-party library exceptions are controlled by an ordinary visibility allowlist in the BUILD.gn file that defines the library.

New regular expression libraries cannot be imported to //third_party without approval from a member of FEC. Once imported, the library MUST contain a visibility allowlist along with a comment linking to this RFC.

Performance

Since this proposal bans libraries with worst-case exponential time, this will have a positive impact on performance.

Code size

fxrev.dev/532721 analyzes expected changes in code size for ARM64 builds. Since there are no changes to Go and Rust programs, there will be no changes in code size for those programs. For C++ programs, the code size increase depends on whether RE2 is linked statically or dynamically. Currently, Fuchsia host tools are linked statically, while target programs are linked either statically or dynamically depending on code size impacts. With static linking, C++ programs will increase in size by at least 180KB. With dynamic linking, the RE2 shared library costs about 350 KB. This cost is paid one time per system image: C++ programs that used <regex.h> will see approximately zero change in code size after switching to RE2, while C++ programs that used <regex> should get smaller after switched to RE2 because std::regex makes heavy use of templates, while RE2 does not.

Ergonomics

This proposal will lead to a more unified regular expression language across the Fuchsia platform. For example, current C++ tools and services may support backreferences via <regex>, while Rust and Go tools and do not. After this proposal is implemented, no system tools or services will support backreferences (except for tooling written in Dart).

Further, RE2 and Go's regexp support exactly the same regular expression syntax, while Rust's regex is based on RE2 and supports a superset of RE2's syntax. This standardizes Fuchsia regular expressions on RE2's syntax. Rust's regex adds a few minor features, such as character class interection and subtraction, and the x flag, which are not available in RE2.

Backwards compatibility

I have manually inspected all existing uses of <regex> and <regex.h>. In the majority of those cases, the regular expression pattern is a compile-time constant that will work as-is (or with a slight syntax change) in RE2. There are no backwards compatibility concerns in those cases.

I found just three programs which allow the user to specify a regular expression: the debugger, fidlcat, and perftest. To ensure those tools will not break after this change, I will add the OWNERs of those tools as reviewers of this RFC.

Security considerations

This RFC will reduce the likelihood that regular expressions can be used as a vector for DoS attacks.

Privacy considerations

None.

Testing

Existing tests will be run. There are no new features to test.

Documentation

As explained above, the presubmit and visibility rules will link to this RFC so that developers know why their imports were rejected. Reviewers of new third-party code will need to be made aware of this RFC so they know to reject imports of unapproved regular expression libraries. This includes C++ and C libraries that include <regex> or <regex.h> directly.

Drawbacks, alternatives, and unknowns

See "Backwards Compatibility". Other C++, Rust, or Go libraries may provide equivalent guarantees. We chose RE2, regex, and regexp (respectively) because they are widely used and already imported into Fuchsia's //third_party.

There is a slight risk that future versions of these libraries might introduce non-linear-time features. Since these libraries advertise linear-time matching as a core feature, we think this is unlikely, but if it happens, we will need to select new libraries.

Prior art and references

See references linked from the motivation section, particularly the articles by Russ Cox, which give a history and motivation for Google's development of RE2.