Regular Expressions

Regular expressions are an integral part of Perl. They are so successful that other languages often provide "Perl-style regular expression" libraries (e.g. PCRE in C/C++, the preg_...() functions in PHP).

Operators

o Matching: if ($name =~ /regex/) is true if the value of $name matches the regex.

o Not matching: if ($name !~ /regex/) is true when there is no match.

o Substitution: $name =~ s/regex/replacement/ replaces the first match of 'regex' in $name with 'replacement'. Add the modifier 'g' to replace every match (global) and 'i' to match case-insensitively. E.g.: $name =~ s/regex/replacement/gi
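The three operators can be tried out in a short script. This is only a sketch; the sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $name = "John Smith and john doe";   # hypothetical sample value

# Matching: true because $name contains "Smith".
my $matches = ($name =~ /Smith/) ? 1 : 0;

# Not matching: true because $name does not contain "Jones".
my $no_match = ($name !~ /Jones/) ? 1 : 0;

# Without modifiers only the first (case-sensitive) match is replaced.
(my $first_only = $name) =~ s/john/Jane/;

# With 'g' and 'i' every match is replaced, regardless of case.
(my $all = $name) =~ s/john/Jane/gi;

print "$first_only\n";   # John Smith and Jane doe
print "$all\n";          # Jane Smith and Jane doe
```

The form (my $copy = $name) =~ s/.../.../ copies the string first, so the original $name stays unchanged.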

What's in a regex

See 'perldoc perlre' (or 'man perlre') for the full reference. The most important elements are:

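 .      matches any character except a newline
 \d     matches a digit, same as [0-9] for ASCII text; \D matches a non-digit
 \s     matches a whitespace character (space, tab, newline); \S matches a non-whitespace character
 \w     matches a word character (letter, digit or underscore); \W matches a non-word character
 \n     matches a newline, \t matches a tab
 \      before a special character, matches that character literally (so \. is a literal dot)
 ^      beginning of string
 $      end of string (or just before a newline at the end)
 |      or-match (alternation), e.g. cat|dog matches "cat" or "dog"
 [..]   character class, e.g. [a-z] is any lowercase letter, [AZ] is either A or Z
 [^..]  negated character class, e.g. [^a-z] matches any character except lowercase a-z
 (..)   grouping; the matched text is also captured (see $1, $2 below)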
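As an illustration of the elements above, the following sketch runs a few patterns against sample strings and reports whether each one matches. The strings and descriptions are invented for the example.

```perl
use strict;
use warnings;

# Each entry: [ description, string to test, pattern ].
my @tests = (
    [ 'any character between a and b', 'axb',      qr/a.b/         ],
    [ 'literal dot between a and b',   'axb',      qr/a\.b/        ],
    [ 'contains a digit',              'room 42',  qr/\d/          ],
    [ 'starts with an uppercase',      'Perl',     qr/^[A-Z]/      ],
    [ 'ends with a literal dot',       'The end.', qr/\.$/         ],
    [ 'contains a non-lowercase char', 'abc',      qr/[^a-z]/      ],
    [ 'is exactly cat or dog',         'dog',      qr/^(cat|dog)$/ ],
);

my @results;
for my $t (@tests) {
    my ($what, $string, $regex) = @$t;
    my $outcome = ($string =~ $regex) ? 'match' : 'no match';
    push @results, $outcome;
    printf "%-30s %-12s => %s\n", $what, "'$string'", $outcome;
}
```

The second and sixth entries report 'no match'; all others match.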

After an element of the regex you can place a quantifier that says how many times that element must match:

 +      matches 1 or more times
 *      matches 0 or more times
 ?      matches 0 or 1 times
 {a}    matches exactly a times
 {a,}   matches a or more times
 {a,b}  matches at least a and at most b times

For example:

 if ($mystring =~ /\d{10}/) {
    print("Got a match of 10 digits\n");
 }
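Note that an unanchored pattern such as /\d{10}/ succeeds whenever ten consecutive digits occur anywhere in the string, so longer runs of digits match as well. The sketch below, using made-up sample values, contrasts this with an anchored pattern and shows two other quantifiers.

```perl
use strict;
use warnings;

my $long = "12345678901234";   # 14 digits, hypothetical input

# Unanchored: succeeds, because 10 consecutive digits occur somewhere.
my $unanchored = ($long =~ /\d{10}/) ? 1 : 0;

# Anchored with ^ and $: fails, because the string is longer than 10 digits.
my $anchored = ($long =~ /^\d{10}$/) ? 1 : 0;

# '?' makes the preceding character optional: "color" and "colour" both match.
my $optional = (("color" =~ /^colou?r$/) && ("colour" =~ /^colou?r$/)) ? 1 : 0;

# {2,4} bounds the repetition: 2 to 4 lowercase letters, so "abcde" is rejected.
my $bounded = (("abc" =~ /^[a-z]{2,4}$/) && ("abcde" !~ /^[a-z]{2,4}$/)) ? 1 : 0;

print "unanchored=$unanchored anchored=$anchored optional=$optional bounded=$bounded\n";
# prints: unanchored=1 anchored=0 optional=1 bounded=1
```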

Results: $1, $2 etc.

After a successful match, the special variables $1, $2 etc. hold the text captured by the first, second, etc. pair of parentheses. For example:

 my $str = q/The quick brown fox/;
 if ($str =~ /The q(.)i(.)/) {
     print("Matched $1 and $2\n");  # u and c
 }
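Captured groups are typically copied into named variables right after the match, since the next successful match overwrites $1, $2 etc. A small sketch, assuming a date string in the form YYYY-MM-DD (the input and variable names are illustrative):

```perl
use strict;
use warnings;

my $date = "2024-03-15";   # hypothetical input

my ($year, $month, $day);
if ($date =~ /^(\d{4})-(\d{2})-(\d{2})$/) {
    # Copy the captures right away: a later successful match overwrites them.
    ($year, $month, $day) = ($1, $2, $3);
    print "year=$year month=$month day=$day\n";   # year=2024 month=03 day=15
}

# In list context a match returns its captured groups directly.
my ($y, $m, $d) = $date =~ /^(\d{4})-(\d{2})-(\d{2})$/;
print "$d/$m/$y\n";   # 15/03/2024
```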

Exercise

Given the following text in a code snippet:

 my $lorem = <<"END";
 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
 eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
 minim veniam, quis nostrud exercitation ullamco laboris nisi ut
 aliquip ex ea commodo consequat. Duis aute irure dolor in
 reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
 pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
 culpa qui officia deserunt mollit anim id est laborum.
 END

Write a program that counts the words in the text. Within each word, disregard any non-letter characters (so that "amet," is counted as "amet"). Treat every word as lower-case, so that "Duis" and "duis" count as the same word.

After the analysis, print the words that appear more than once, together with the number of times each of them appears.
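As a hint, one possible approach is sketched here on a much shorter, made-up sample text rather than on $lorem. It assumes that words are separated by whitespace and uses a hash to keep the counts; other solutions are equally valid.

```perl
use strict;
use warnings;

# Hypothetical sample text; the exercise itself uses $lorem.
my $text = "The cat saw the dog. The dog, surprised, saw the cat!";

my %count;
for my $word (split /\s+/, $text) {
    $word =~ s/[^a-zA-Z]//g;   # disregard non-letter characters
    $word = lc $word;          # treat the word as lower-case
    next if $word eq '';       # skip tokens that contained no letters
    $count{$word}++;
}

# Report only the words that appear more than once, in alphabetical order.
my @report;
for my $word (sort keys %count) {
    next unless $count{$word} > 1;
    push @report, sprintf("%s: %d", $word, $count{$word});
}
print "$_\n" for @report;
```

For the sample text this prints cat: 2, dog: 2, saw: 2 and the: 4, each on its own line.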