Complexity vs. Usefulness—Regular Expressions


design: /di·'zin/

n a deliberate plan for the creation or development of an object. vt: to create something according to plan.
good design: /'gud —/ the product of deliberate forethought and careful understanding of the purpose of a subject, resulting in a subject which significantly improves its utility, allowing it to integrate seamlessly and naturally into the role for which it is intended.
false synonyms: fashion, decor.


Tue May 16, 2006

I love these things called regular expressions. I think they're the coolest. The kinds of things you can do with regular expressions are incredible. I use them everywhere I can, in Perl, JavaScript, and .NET, to get the job done, and I lament their absence in the places where they just aren't there (like XSLT and XPath).

In short, regular expressions are a means of pattern matching. They're like string comparison on steroids. In most computing languages you can say something like

 if(SkyColor == "blue")

or some variation. You can even tweak it so case doesn't matter: "Blue" would also compare true. But most languages give you only "exact matches" to the string: it's either equal to the whole thing character-for-character, or it's not. It gets more complex if you want to compare against a bunch of strings, as in this pseudo-code example:

 if(SkyColor == "blue" or
    SkyColor == "aqua" or 
    SkyColor == "teal" or 
    SkyColor == "cyan")

On the other hand, regular expressions make quick work of this (I'll use the Perl =~ operator in my pseudo-code examples to mean "compare against the regular expression"):

 if(SkyColor =~ /blue|aqua|teal|cyan/i)

For even more flexibility, try this:

 if(SkyColor =~ /((deep|dark|pale|light) )?(blue|aqua|teal|cyan)/i)

Here, we optionally allow an adjective like deep, dark, pale, or light to precede one of the colors, allowing strings like "deep cyan" or "pale teal" to match, as well as unadorned "aqua" or "blue". All in one line. Lots of power, very compact notation. Imagine having to do that without regular expressions, comparing SkyColor to each possible string value.
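As a rough illustration of that comparison, here is how the same check might look in Python with the standard `re` module, next to a regex-free version that spells out every accepted string. The names `is_sky_color`, `is_sky_color_plain`, and `ALL_STRINGS` are made up for this sketch, and `re.fullmatch` is used so that the whole string has to match (see the footnote on ^ and $).

```python
import re

# Optional adjective plus a space, followed by one of the four colors.
SKY_COLOR = re.compile(
    r"((deep|dark|pale|light) )?(blue|aqua|teal|cyan)",
    re.IGNORECASE,
)

def is_sky_color(text: str) -> bool:
    """True when the whole string is a color, optionally preceded by an adjective."""
    return SKY_COLOR.fullmatch(text) is not None

# The regex-free equivalent: enumerate every accepted string up front.
ADJECTIVES = ["", "deep ", "dark ", "pale ", "light "]
COLORS = ["blue", "aqua", "teal", "cyan"]
ALL_STRINGS = {adj + color for adj in ADJECTIVES for color in COLORS}

def is_sky_color_plain(text: str) -> bool:
    """Same check, done as a lookup against all the spelled-out strings."""
    return text.lower() in ALL_STRINGS

if __name__ == "__main__":
    for sample in ["blue", "Deep Cyan", "pale teal", "green", "very blue"]:
        print(sample, is_sky_color(sample), is_sky_color_plain(sample))
```

The one-line pattern stands in for twenty separate string comparisons (five adjective choices, counting "none", times four colors).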

Huh?

Lots of power in that one (not so simple) line. But unless you know what you're looking at, it's just a bunch of strange characters. So, now that I've illustrated an example, let's consider its complexity...

As you can see, despite their incredible capabilities, regular expressions aren't exactly the friendliest kids on the block. I'm just speaking for myself when I say I love regular expressions. Other developers either love them (for their power) or hate them (for their arcane syntax). Here's what I mean; some of the rules for regular expressions are as follows:

Just like strings are enclosed in quotation marks, regular expressions are usually enclosed in pairs of slashes, as in /hi mom/.
Most characters—letters, numbers, punctuation—stand for themselves. If you say x =~ /a/, that's essentially the same as saying x == "a". ¹
Strings of these normal characters must be in the order given, just like ordinary string comparisons. If you say x =~ /ab/, that's just like saying x == "ab".

So far, these regular expression critters don't seem so odd. But wait, there's more. Lots more (as in lots more rules to know.) In particular, several characters have special meaning:

The vertical bar means "either what's to the left of me, or to the right, but not both," so x =~ /a|b/ is the same as saying (x == "a" or x == "b"), but it's not the same as saying x == "ab" nor is it the same as x == "a|b".
Parentheses enclose things to group them together, just like in mathematics. This is useful because...
The question mark means that what is in front of it is optional. The thing that precedes it can appear once, or not at all in order for the expression to match. x =~ /be?ar/ is the same as (x == "bear" or x == "bar") in that the "e" is optional. A question mark after a parenthesized expression means the whole thing is optional.
If you want to include one of these special characters (vertical bar, parentheses, question mark, or even a slash) in your regular expression, you need to put a backslash in front of it, so x =~ /what?/ is the same as (x == "what" or x == "wha"), but not x == "what?"; for that you would need x =~ /what\?/.
To make the entire expression case-insensitive, just follow that trailing slash with the letter i: x =~ /ab/i is the same as (x == "ab" or x == "Ab" or x == "aB" or x == "AB").
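The special-character rules above can be tried out with a short Python sketch. This is only an illustration under a few assumptions: the helper `matches` is a made-up name, Python writes patterns as ordinary strings rather than between slashes, and `re.fullmatch` stands in for the implied ^ and $ so that each pattern has to match the whole string.

```python
import re

def matches(pattern: str, text: str, flags: int = 0) -> bool:
    """True when the pattern matches the whole string (as if wrapped in ^ and $)."""
    return re.fullmatch(pattern, text, flags) is not None

# Vertical bar: one alternative or the other, not the two run together.
print(matches(r"a|b", "a"), matches(r"a|b", "b"), matches(r"a|b", "ab"))

# Question mark: the item in front of it is optional.
print(matches(r"be?ar", "bear"), matches(r"be?ar", "bar"))

# Parentheses: the question mark now applies to the whole group.
print(matches(r"(pale )?blue", "pale blue"), matches(r"(pale )?blue", "blue"))

# Backslash: treat a special character as an ordinary one.
print(matches(r"what?", "wha"), matches(r"what?", "what?"), matches(r"what\?", "what?"))

# The trailing i becomes a flag argument in Python.
print(matches(r"ab", "aB", re.IGNORECASE))
```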

See? All that, just to understand the one-line expression above: eight rules to know instead of just three for simple strings. Complexity.

So the question becomes: is that added complexity necessary? The answer isn't that straightforward. If you're in an environment where knowing regular expressions is a given, as in a Perl application or nearly anything *nix, you might be okay; those are just the rules of the road. But you probably don't want to surface that to the end user, unless you're targeting an audience for whom regular expressions are also a well-known thing (and that's not always synonymous with "all developers"!)

Summing up, a technology as powerful and wide-spread as regular expressions still has barriers to adoption, even in the technical community, so imagine what some "new" technology—some parochial construct you're considering for your application—will face. Is it worth it? You have some questions to answer, and you need to be objective in answering them:

Does the complexity of learning many rules, terms, or metaphors outweigh the benefits achieved? (This one is particularly difficult to keep in perspective, as many designers tend to over-estimate the importance of the thing they're designing.)
Are the rules ones that the user will bring with them from previous experience, and use in other parts of their lives, or are they just acquired for this one little part of their lives?
Is there another way—albeit with perhaps more effort on your part—to more simply solve their problem and allow them to accomplish their tasks?

So consider how important that new invention is before subjecting your users to it.

¹ Almost the same. I should point out that, for simplicity, I've omitted the ^ and $ notations. The description above is neither rigorous nor complete. Really, in order for the examples of regular expressions above to be truly the same as the simple string comparison counterparts, they should start with ^ and end with $; these force the pattern to match the entire string, and not just any fragment within it. Without them, the semantics are "is the expression contained within the string?" ... but let's not further complicate an already complex discussion. To the purists I say with a wink and a nod: just assume that they are present for the purpose of this discussion, and let's be done with that.
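For the curious, the difference the footnote describes can be seen in a few lines of Python. This is just a sketch with a made-up sample sentence; one Python-specific detail it ignores is that `$` also tolerates a single trailing newline.

```python
import re

text = "the sky is blue today"

# Unanchored: succeeds if the pattern occurs anywhere in the string.
contained = re.search(r"blue", text) is not None

# Anchored with ^ and $: succeeds only if the pattern is the entire string.
whole = re.search(r"^blue$", text) is not None
exact = re.search(r"^blue$", "blue") is not None

print(contained, whole, exact)
```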


The Complexity List

Fri May 19, 2006

— Thomas M. Tuerke
One thing every designer should do when designing something is make an exhaustive list of every bit of minutia a user is expected to know in order to use that thing.
Each item in that list should also indicate where that knowledge comes from.
Is it universally common knowledge, learned in every-day life, like a red octagon typically means "stop"? (Though this also points out an issue of universality—stop signs aren't used everywhere in the world—I'm surprised that they're remarkably wide-spread, a few examples being seen in the UK, and in Germany, and as such the red octagon isn't necessarily limited to North America.)
Is it knowledge well-understood in the environment, like the way to transfer information from one application to another in Windows or the Macintosh is by the clipboard (copy and paste) or that on Linux, file names are case sensitive?
Is it knowledge common in a specific domain, trade, or industry?
Is it completely parochial, and can only be learned from the manual or other specific means of training? (Furthermore: how discoverable is that knowledge?)
These categories aren't meant to be rigidly defined buckets, but rather points along a spectrum: how well-understood is a bit of knowledge.
The point of this effort, then, is to measure the complexity of the thing being designed, and this measure is itself a thing of complexity. Naturally, the longer the list, the more complex the thing is overall. But by the same token, the more that the list consists of things that can only be learned by reading the manual, the more opaque it is to the unlearned user, so the more complicated it appears and the smaller the audience of ready users (and the more you pay for printing a thicker manual!)
So the ideal is to have as short a list as possible—and containing as few things as possible that have to be learned anew. The operative phrase here is "as possible" because there's a constraint, a balance that needs to be struck. Simple and useless is no better than complex and useful. Ideally, we're striving for simple and useful....


RE: Weighting the Complexity

Fri May 19, 2006

— Thomas M. Tuerke
Okay, so you've created that list... but as you might guess, not all these items are "created equal". Here are some recommended "complexity" weights to assign to each item in your complexity list.
Universal: 1
Regional/Environmental: 2
Domain-specific: 4
From the manual: 10
Notice the "disproportionate" weight assigned to something stemming from the manual... it's not by accident.
What you should do is assign each item in the complexity list a weight, then add up the weights to determine the complexity of your project. This should give you a sense of where additional design is in order to reduce complexity. Replacing a newly-coined term with something understood in the domain (assuming they mean the same thing) is a good step.
Replacing an operation that is specific to your application with a similar one that's common to the environment may reduce the complexity...
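To make the arithmetic concrete, here is a minimal Python sketch of the tally, using the four weights listed above. Everything else is an assumption made for illustration: the category keys, the `complexity_score` function, and the sample list for an imaginary search box (including where each item is filed).

```python
# Weights as recommended in this entry.
WEIGHTS = {
    "universal": 1,
    "environmental": 2,   # the "Regional/Environmental" category
    "domain": 4,
    "manual": 10,
}

def complexity_score(items: list[tuple[str, str]]) -> int:
    """Sum the weight of each (description, source) pair in a complexity list."""
    return sum(WEIGHTS[source] for _description, source in items)

# Hypothetical complexity list for an imaginary search box.
search_box = [
    ("Typing text into a field", "universal"),
    ("Pasting from the clipboard", "environmental"),
    ("Regular expression syntax", "domain"),
    ("Custom '@tag' filter prefix", "manual"),
    ("Custom '!!' negation marker", "manual"),
]

# Same feature set, but the two invented notations are replaced
# with conventions the audience already knows from its domain.
revised = search_box[:3] + [
    ("Filter by tag via a familiar dropdown", "environmental"),
    ("Minus sign to exclude a term", "domain"),
]

print(complexity_score(search_box), complexity_score(revised))
```

In this made-up case the two from-the-manual items account for 20 of the original 27 points, which is the kind of hot spot the weighting is meant to expose.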

Complexity + Mental Speedbump = Cognitive Overload

Fri Apr 26, 2013

— Thomas M. Tuerke
The spectrum of Complexity has a name: Cognitive Overhead (or, more generally, Cognitive Load), and obviously the goal is to get toward the low end of the range.
This article has some interesting contemporary products, with the Good, Bad, and Indifferent. Worth a read.