Oakland.pm

Reviews

Review of the article "Five Habits for Successful Regular Expressions"

author of article: Tony Stubblebine

reviewed by: George Woolley


Short Review and Recommendation

:) :) :) :) :) of 5

If you use regular expressions much, reading "Five Habits for Successful Regular Expressions" will likely enable you to use them more effectively.

Look at the list of five habits in the section below this one. Unless you are sure you understand the implications of these brief suggestions and already have the habit of following them, I recommend reading this article.

It's a short article and Tony Stubblebine makes his points clearly and concisely.

George Woolley of Oakland.pm

Miscellaneous

The Five Habits

  • 1. Use Whitespace and Comments
  • 2. Write Tests
  • 3. Group the Alternation Operator
  • 4. Use Lazy Quantifiers
  • 5. Use Available Delimiters

Some References

  • "Effective Perl Programming" by Joseph N. Hall with Randal L. Schwartz (1998). Published by Addison-Wesley.
  • "Mastering Regular Expressions", 2nd Edition by Jeffrey E. F. Friedl (2002). Published by O'Reilly.
  • "The Perl Cookbook" by Tom Christiansen and Nathan Torkington (2003). Published by O'Reilly.
  • The Perldoc Regular Expression FAQ. You can see this by typing perldoc perlfaq6 at a Unix/Linux command line.

Notes:

  • "Effective Perl Programming" is a book which contains a chapter on "Regular Expressions". That chapter contains some material specifically relevant to this article.
  • "Mastering Regular Expressions" is a book devoted to regexes. IMO the title is right on. If you use regexes a lot, get this book -- if you don't have it already.
  • "The Perl Cookbook" contains a chapter on "Pattern Matching". Within that chapter are sections directly relating to two of the habits.
  • The Perldoc Regular Expression FAQ contains a Q/A "How can I hope to use regular expressions without creating illegible and unmaintainable code?" which is directly relevant to this article.
  • You'll find a wealth of other info about regexes in these references. I've learned things from all 4 references.

Some Abbreviations

The following are abbreviations for "regular expression"

  • regex
  • regexp

There are also variants of these involving caps, e.g. RegExp.

IMHO is an abbreviation of "in my humble opinion". IMO is a variant on this that is less frequently used.

This review talks a lot about regexes. The whole review is IMO.

Note:

  • OK, likely you knew that.


March Meeting Announcement

March Oakland.pm Meeting

  • when: Tue. Mar. 9 at 7:30-9:30pm. (We meet 2nd Tuesdays.)
  • where: Joshua Wait's place, 1903 Virginia Street Apt. 3, Berkeley, CA 94709
  • directions: see links on home page
  • what: introductions, giveaways, and a talk by Tony Stubblebine, "Regular Expression Best Practices"
  • who: open to anyone interested.
  • how much: no fee for our meetings.

Blurb for Tony's Talk

Regular expressions are broken and obtuse. They're hard to write. They are even harder to read, especially if you are not the original programmer. Tony Stubblebine, author of "Regular Expression Pocket Reference," will explain how to decrease development time while increasing reliability and readability.

Some Questions To Ponder

This section contains questions to encourage discussion among Oakland.pm members during the time before Tony Stubblebine's March presentation. The questions are:

  • Which of the 5 habits Tony describes do you have?
  • Are there any other regex habits that have served you well or which you wish you had?
  • What's an annoying problem you had with a regex?
  • What question would you like to hear Tony's answer to?
  • What documents have been most useful to you in understanding regular expressions?

My Answers

  • I have habits 2, 3 and 5. Re habit 2, I'm fairly good about testing regexes. Re habit 3, I always group alternatives. Re habit 5, I often use alternate delimiters.
  • I've found being clear on the intent of the regex to be very useful.
  • More than once, I've been very confused by the error messages resulting from leaving out the m before the first delimiter (when the m was needed).
  • I'd like to ask Tony how he would characterize rules in Perl 6.
  • Initially, I found "Learning Perl" quite helpful. Later, I found "Mastering Regular Expressions" to be exceptional. "Effective Perl Programming" was also useful.



















Longer Review

Contents

Note:

  • I'll be commenting on all 5 of the author's suggestions.

About This Review

Warning: The review below in this column is quite long, especially considering it's a review of a short article. For a shorter review, see my review at the top of the left column.

Or you could just go read the Tony Stubblebine article.

Intent: My intent in writing this review is

  • to bring attention to a valuable article on regexes
  • to encourage discussion of the article and the issues it brings up
  • to aid in preparing myself for the March Oakland.pm meeting at which the author will speak

Contents: I take a look at all five habits the author recommends based on my own experience with regexes. For each habit, I briefly

  • describe the habit
  • say what I typically do
  • make a New Year's resolution regarding the habit
  • include a bad example or examples illustrating what not to do

You may be wondering why I have bad rather than good examples. I'm aware of several reasons:

  • It's easier.
  • It's more fun.
  • Bad examples are generally more engaging and more empowering.

Hey, there are good examples in the article.

The Promise of the Article

The author says that regular expressions are

  • hard to write
  • hard to read
  • hard to maintain
  • often wrong

They are IMO also the best notation that's widely available for manipulating strings. Oh, well!

The author says: if you adopt the five habits he describes, you'll eliminate most of the trial and error involved in regexes. Kool!

What The Stubblebine Article Is Not

The Stubblebine article is short and focused.

The article is not

  • a tutorial on regexes
  • a cookbook of regexes
  • a how-to guide for implementing the five habits (though there are some suggestions along the way)

Some people believe that regexes are often used where there is a preferable alternative. Could be. In any case, this article does not address the issue of when to use a regex and when not to.

About the Reviewer

I come to this review with a number of biases. Below I touch on some of them.

String Manipulation: When I started writing programs way back in the early 60s, one of the first things I noticed was that in the more widely used languages the notation for handling mathematical expressions was reasonably well-developed but the facilities for manipulating strings sucked. I had a strong background in formal logic and in my view

  • mathematics was dependent on logic
  • string manipulation and logic were co-dependent on each other

so it seemed odd that the widely used languages were brain dead regarding strings.

Perl: Up until 1994, I didn't have a favorite language. In 1994, I discovered Perl, and it quickly became my favorite language. Perl is not brain dead regarding strings; rather it has excellent regular expressions beautifully integrated into the language. And it's widely used. I'm a big advocate of Perl.

The author of the article being reviewed gives examples from PHP, Python and Perl. As I said, I've been using Perl since 1994, and with a fair amount of emphasis on regexes. I've played with Python, but haven't used PHP at all.

There was one year (just one, alas) when I had a dream job. I wrote code in nothing but Perl. I even got to teach a class in Perl. I was mostly writing filters to process website log files. I wrote an incredible number of regexes.

Most of my use of regexes has been in the context of some shell or Perl running on some flavor of Linux/Unix. I've never used Perl on any flavor of Windows.

I'm not an expert on Perl or regular expressions.

O'Reilly and Tony Stubblebine: I own lots of O'Reilly books, and I think both O'Reilly and their books rock. And Tony works for O'Reilly. (Lucky him!)

Also, I'm very active in Oakland.pm, and Tony is scheduled to speak to us at our March meeting.

Consequences: Mostly, it's up to you to adjust for my biases. But please note that all the examples are Perl.

1. Use Whitespace and Comments

The Habit: This habit is to use white space and indentation when writing regexes and to comment them too, as you do when writing the rest of your code. You do indent and comment your code. Right?

In Perl 5, implementing this habit requires use of the x modifier. Of course, there's more to it than that.
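
Here's a minimal sketch of what that can look like. The date pattern and the variable names are my own made-up illustration, not something taken from the article. Under the x modifier, unescaped whitespace in the pattern is ignored and # starts a comment that runs to the end of the line.

```perl
use strict;
use warnings;

# Hypothetical example: pick apart an ISO-style date such as 2004-01-09.
my $date_re = qr/
    ^
    (\d{4})     # year
    -
    (\d{2})     # month
    -
    (\d{2})     # day
    $
/x;

my $line = '2004-01-09';
if ( my ( $year, $month, $day ) = $line =~ $date_re ) {
    print "year=$year month=$month day=$day\n";
}
```

One thing to watch: the comments sit inside the pattern, so they must not contain the delimiter (here, a slash).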

What I've Been Doing: I do, indeed,

  • indent my (non-regex) code religiously
  • comment my (non-regex) code
  • and am generally careful about the layout

However, when it comes to regexes I rarely

  • use whitespace to make them more readable
  • use comments to clarify them

The sad truth is that while I do sometimes use the x modifier, it's not my habit to do so. :(

Perhaps partly to compensate for my bad habit I do

  • generally keep my regexes short and simple
  • often find ways to break complex regexes up
  • often test my regexes extensively (See 2.)
  • sometimes make comments on my regexes external to them.

New Year's Resolution: Next year, I'll use whitespace and comments to make even moderately complex regexes clearer.

Bad Example: I once was asked to make modifications to a Perl script. The script contained a humongous regex all on one line that

  • was so long the editor I was using choked on it
  • was really really hard to read

Notes:

  • This really did happen.
  • Now, it seems amusing to me. At the time it was annoying rather than amusing.

Silly Bad Example:

m/ t h e \s r e d ( c a t | d o g ) /

Notes on Silly Example

  • The silly example above illustrates that one can have too much white space.
  • If that's not too much whitespace for you, imagine greater & greater indentation and more & more empty lines until it does.

Notes on Compactness

  • It's possible to have too much whitespace and/or too much commentary.
  • \s is not exactly the same as a space.
  • In general, it's more important to be readable and understandable than it is to be compact.
  • Judgment is involved in determining how much whitespace and commentary to use.
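
The point about \s, and how whitespace behaves under the x modifier, can be checked with a small sketch. The strings are made up for illustration. Each variable ends up 1 if the match succeeds and 0 if it doesn't.

```perl
use strict;
use warnings;

# \s matches any whitespace character, not only a space.
my $tabbed            = "red\tcat";    # made-up string containing a tab
my $s_matches_tab     = $tabbed =~ /red\scat/ ? 1 : 0;
my $space_matches_tab = $tabbed =~ /red cat/  ? 1 : 0;

# Under the x modifier, unescaped whitespace in the pattern is ignored,
# so a space that must be matched has to be written as \s, "\ " or [ ].
my $x_ignores_space = "redcat"  =~ /red cat/x     ? 1 : 0;
my $x_needs_escape  = "red cat" =~ /red cat/x     ? 1 : 0;
my $x_with_class    = "red cat" =~ /red [ ] cat/x ? 1 : 0;

print "$s_matches_tab $space_matches_tab ",
      "$x_ignores_space $x_needs_escape $x_with_class\n";
```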

2. Write Tests

The Habit: This habit involves

  • using both made-up tests and real data for tests
  • including tests of both things that should and should not match
  • incorporating these into a test suite
  • running the test suite after every change

Hm, what's "test suite" mean here? I think it just means a well-conceived collection of test cases that has been captured for later reuse. (This understanding has been confirmed with the author.)

What I've Been Doing: I do different things depending especially on

  • the complexity of the regex
  • the nature of the data it will be applied to

If there is a large amount of data, I may go through a separate phase where I research the data. "Know your data" is a dictum that I take seriously. This phase overlaps with, but isn't the same as, collecting sample test data.

When I was writing log filters, I used Unix shell commands extensively in this research, especially: grep, cut, sort and uniq.

Typically, my test cases would be in two files:

  • one file containing made up examples that tested how I thought the filter was supposed to work
  • another file that contained actual selected real data

I also made sure that my filters (and changes to them) were tested by someone else after I finished my testing. Filters were not released until both I and the external tester were satisfied.

New Year's Resolution: I resolve to keep this habit.

Aside: I've found the module Test::More easy to understand and quite useful. But you may prefer Test::Simple, which I gather is simpler. You can find out more about these modules in the Perldocs. For example, you could type the following at a Unix/Linux command line:

perldoc Test::Simple
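
For what it's worth, here's a small sketch of how a regex test file might look with Test::More. The regex and the file names are hypothetical, chosen only to show one list of things that should match and one of things that should not. like, unlike and done_testing are functions that Test::More provides.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical regex under test: file names ending in an image extension.
my $image_re = qr/ \. ( gif | jpe?g | png ) $ /x;

# Things that should match ...
like( $_, $image_re, "matches $_" )
    for qw( logo.gif photo.jpg photo.jpeg chart.png );

# ... and things that should not.
unlike( $_, $image_re, "rejects $_" )
    for qw( notes.txt gif png.html photo.jpg.bak );

done_testing();
```

Saved in a file and rerun after every change to the regex, something like this is roughly what I understand "test suite" to mean here. Lines pulled from real data could be added to either list.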

Bad Example: Someone modified one of my filters

  • without letting me know
  • without commenting the change and indicating who authored it
  • likely without systematically testing it
  • certainly without going through the usual external testing

The filter malfunctioned because of the changes and considerable effort was wasted in the resulting confusion.

3. Group the Alternation Operator

The Habit: This habit consists of grouping alternatives using parentheses.

What I've Been Doing: I have this habit. I generally do this even if the expression consists of nothing but the alternatives. For example, I likely write something like

m/(gif|jpg|jpeg|png)/

even though the parentheses are unnecessary.

New Year's Resolution: I resolve to keep this habit and combine it with the other habits resulting in regexes that look more like this

m/ ( gif | jpeg | jpg | png ) /x

or perhaps even like this

m/ ( gif     # image file extensions
   | jpeg
   | jpg
   | png
   ) /x

Notes:

  • The examples in this review (and in this section, in particular) are somewhat simplistic.
  • Hopefully, they are, nevertheless, helpful.

Aside: Beware of accidentally creating a null alternative. Debugging this error could be quite annoying. The problem is that the null alternative will always match.

Bad Example:

m/\.gif|png|jpeg|jpg|/

Notes:

  • The regex above would benefit from parentheses.
  • The regex would also benefit from dropping the final |.
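
To see the null alternative in action, here's a quick sketch that compares the bad example with a grouped version. The grouped version assumes the intent was "the name ends in one of these extensions", which is my guess. The file names are made up.

```perl
use strict;
use warnings;

my $bad  = qr/\.gif|png|jpeg|jpg|/;      # the bad example from above
my $good = qr/\.(gif|png|jpeg|jpg)$/;    # assumed intent: extension at end of name

for my $name (qw( logo.gif png_notes.txt README )) {
    printf "%-14s bad=%d good=%d\n", $name,
        ( $name =~ $bad  ? 1 : 0 ),
        ( $name =~ $good ? 1 : 0 );
}
```

The bad regex matches all three names: "png_notes.txt" because the \. belongs only to the first alternative, and "README" because the empty alternative after the final | matches anything.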

4. Use Lazy Quantifiers

The Habit: The habit is to use lazy quantifiers when this will make a regex easier to read.

What I've Been Doing: Bad me. I almost never use lazy quantifiers. I don't have this habit. :(

Perhaps as compensation, I sometimes use the anchors ^ and $ and describe the whole line in cases where it's really not necessary to do that.

New Year's Resolution: I resolve to develop the habit of asking for each regex I write: would it be more readable and/or more accurate using lazy quantifiers?

Bad Example:

s/<.*>//g; # zap html tags

Note:

  • This sometimes zaps more than was presumably intended.
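
Here's a small sketch of the difference, using a made-up line of HTML. The lazy version is only a rough illustration. It is not a real HTML parser either.

```perl
use strict;
use warnings;

my $html = '<b>bold</b> and <i>italic</i> text';    # made-up input

( my $greedy = $html ) =~ s/<.*>//g;     # .* runs on to the LAST > on the line
( my $lazy   = $html ) =~ s/<.*?>//g;    # .*? stops at the FIRST >

print "greedy: '$greedy'\n";    # ' text'
print "lazy:   '$lazy'\n";      # 'bold and italic text'
```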

Silly Bad Example:

s/x{2}?//;

Notes:

  • This matches (and removes) the minimum of exactly two x's.
  • But the minimum of exactly two is (duh) exactly two.
  • So the question mark has no useful effect. It could, however, cause confusion.
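
A quick check of that claim, with a made-up string of four x's. The last two lines are my own addition, to show a case where the question mark does make a difference.

```perl
use strict;
use warnings;

( my $plain        = 'xxxx' ) =~ s/x{2}//;       # removes exactly two x's
( my $silly        = 'xxxx' ) =~ s/x{2}?//;      # the ? changes nothing here
( my $lazy_range   = 'xxxx' ) =~ s/x{2,3}?//;    # here it matters: takes 2, not 3
( my $greedy_range = 'xxxx' ) =~ s/x{2,3}//;     # takes 3

print "$plain $silly $lazy_range $greedy_range\n";    # xx xx xx x
```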

5. Use Available Delimiters

The Habit: This habit is to use alternate delimiters when that will lead to greater readability by reducing the number of escape characters.

What I've Been Doing: I have this habit. Boy does it make my life easier. I don't just count the number of backslashes saved. E.g. I generally don't use the delimiter / with multiple backslashes.

New Year's Resolution: I resolve to keep this habit.

Aside:

When you write a match, I suggest including the m even if it's not required. In the past I've had a bad habit of not doing that. It can lead to puzzling compile error messages if you change delimiters. Try the second bad example.

Bad Examples:

if ( $line =~ /\"(http:\/\/.*\/.*\.html)\"/ ) { $url = $1; }

Notes:

  • The above example would be more readable with an alternative delimiter.
  • OK, that's not the only problem with it.

Or how about the following attempted improvement.

if ( $line =~ #\"(http://.*/.*\.html)\"# ) { $url = $1; }

Note:

  • The above example won't even compile.
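
The reason is that, without the m, Perl treats the first # as the start of a comment, so the rest of the line is never seen as a regex. Here's a sketch of the same match with the m included. The log line is made up, and I've deliberately left the pattern's other problems alone.

```perl
use strict;
use warnings;

my $line = 'GET "http://www.example.com/docs/index.html" 200';    # made-up log line

# With an explicit m, # is a legal delimiter and the slashes need no escaping.
my $url;
if ( $line =~ m#"(http://.*/.*\.html)"# ) { $url = $1; }

# Braces work too, and arguably stand out more.
my $url_braces;
if ( $line =~ m{"(http://.*/.*\.html)"} ) { $url_braces = $1; }

print "$url\n";
```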

Silly Bad Example:

s#^\##\#\##;

Notes:

  • Hopefully, no one would do that.
  • But it works fine.
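
If you want to convince yourself, here's a sketch that runs the silly example next to the same substitution written with the default delimiter. The input string is made up.

```perl
use strict;
use warnings;

( my $silly = '# a comment' ) =~ s#^\##\#\##;    # the silly example
( my $clear = '# a comment' ) =~ s/^#/##/;       # same substitution, default delimiter

print "$silly\n";    # ## a comment
print "$clear\n";    # ## a comment
```

Both turn a leading # into ##. Only one of them can be read at a glance.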

So What?

Part of what I'm trying to say is:

  • It's not just the author who thinks these habits are a good idea.
  • If you don't have these habits already, you're not alone.
  • If I can engage with the ideas in the article and resolve to improve my habits, so can you. Hey, you've already started.

Final Thoughts

As is likely obvious by now, IMO Tony's recommendations are sound.

Practice: It's all very well to resolve to change. But intellectual understanding that something is a good idea is insufficient to bring about change. And we all know that New Year's resolutions are usually for naught. So I am making a point to get some practice in following the five habits. And you?

Intent: Unless you know the intent of a regex, you generally can't say how good it is. For example, the regexes for the following would likely be different even though both would be dealing with phone numbers:

  • determine if a phone number being entered into an HTML form is well-formed
  • identify possible phone numbers in your email messages
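
Here's a rough sketch of how the two intents might lead to different regexes. I'm assuming, purely for illustration, that phone numbers look like 510-555-0123. Real phone number formats vary far more than that, and the sub names are my own.

```perl
use strict;
use warnings;

# Assumption for illustration only: numbers look like 510-555-0123.
my $digits = qr/ \d{3} - \d{3} - \d{4} /x;

# Intent 1: validate a form field. The whole value must be a phone number,
# so the pattern is anchored at both ends.
sub is_well_formed {
    my ($field) = @_;
    return $field =~ / \A $digits \z /x ? 1 : 0;
}

# Intent 2: find candidates in free text. No anchors, but guard against
# matching inside a longer run of digits.
sub find_candidates {
    my ($text) = @_;
    return $text =~ / (?<!\d) ( $digits ) (?!\d) /xg;
}

print is_well_formed('510-555-0123'), "\n";
print is_well_formed('call 510-555-0123'), "\n";
print join( ', ', find_candidates('Call 510-555-0123 or 415-555-0199 today.') ), "\n";
```

The digits part is the same in both. What changes with the intent is everything around it.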

Test: Also IMO the most important habit is testing. If you do that well, you'll likely end up adopting other good habits too in self-defense.

Last Updated: 2004-01-09