Moto - The Native Web Programming Language

4 Regular Expressions In Moto

Regular expression support is built into the moto language. A regular expression literal is written as a forward slash '/', followed by the regular expression definition, followed by a closing forward slash. For example, /([a-zA-Z]+)@(([a-zA-Z]+[.])*[a-zA-Z]+)/ matches simple email addresses. Regular expressions in moto have the type Regex. New regular expressions may also be constructed dynamically from Strings by calling the Regex constructor.

$declare(Regex emailrx = new Regex("([a-zA-Z]+)@(([a-zA-Z]+[.])*[a-zA-Z]+)"))

The String used to create a regular expression must not contain the opening and closing slashes.

As of moto version 0.12.0, regular expression functionality is implemented by a variant of Ville Laurikari's excellent TNFA matching algorithms.

4.1 Specifying Regular Expressions

Regular expressions are specified in their own mini-language within moto. Standard moto operators either do not apply or work differently when constructing regular expressions. The operators currently available for specifying a regular expression are:

+ one or more of the immediately preceding character or subexpression
* zero or more of the immediately preceding character or subexpression
? zero or one of the immediately preceding character or subexpression
. matches any single character
| matches the expression that comes before the | or the expression that comes after it
\ treats the following character as a literal, except for \n, \r, and \t, which match newline, carriage return, and tab respectively
(...) designates a subexpression. The values of matched subexpressions may be retrieved after a match is made by calling methods on the Match object
[...] designates a set of acceptable characters (character class) e.g. [1234567890] specifies that any digit is accepted.
The - operator may be used inside a character class to specify a range of accepted characters e.g. [0-9A-Za-z] specifies that all digits and letters are accepted.
Starting a character class with a ^ negates the class, matching all characters or ranges not specified within it.
The \ operator also works within character classes the same way it does outside of them.
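
As a small illustration of how these operators combine, the sketch below builds a pattern for a decimal number and tests two strings against it. The pattern and variable names are made up for this example. It assumes that matches() tests the string as a whole, and that if/else and // comments follow moto's C-like embedded block syntax.

```moto
${
   // Hypothetical pattern: one or more digits, optionally followed by
   // a literal period and one or more further digits
   Regex numrx = /[0-9]+(\.[0-9]+)?/;

   // Assumption: matches() is true only if the whole string matches
   if (numrx.matches("3.14"))
      print "'3.14' is a decimal number\n";
   else
      print "'3.14' is not a decimal number\n";

   // The optional group needs at least one digit after the period
   if (numrx.matches("3."))
      print "'3.' is a decimal number\n";
   else
      print "'3.' is not a decimal number\n";
}$
```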

4.2 Searching and Matching

As was noted earlier, there are operators built into moto that test whether a regular expression is found within a String.

A =~ B the regular expression B matches a substring of the String A
A !~ B the regular expression B does not match any substring of the String A

These operators are used for searching and only return true or false according to whether some substring of A matched B.
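
A minimal sketch of the two operators used as conditions follows. The input string and patterns are hypothetical, and the if statement and // comments are assumed to follow moto's C-like embedded block syntax.

```moto
${
   String line = "error: disk full";   // hypothetical input

   // =~ is true when some substring of line matches the pattern
   if (line =~ /error|warning/)
      print "line mentions a problem\n";

   // !~ is true when no substring of line matches the pattern
   if (line !~ /[0-9]+/)
      print "line contains no digits\n";
}$
```

Under the definitions above, both conditions hold for this input: "error" occurs in the line, and the line contains no digits.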

To find out whether a regular expression matches a String, or whether it matches a prefix or suffix of a String, one must use methods of the Regex object.

Method Description

boolean matches(String s) Returns true if the specified string is matched by this regex
boolean matchesPrefix(String s) Returns true if the specified string starts with a prefix that is matched by this regex
boolean matchesSuffix(String s) Returns true if the specified string ends with a suffix that is matched by this regex
boolean matchesSubstring(String s) Returns true if the specified string contains any substring that is matched by this regex. This method returns the same results as using the =~ operator.

These methods are valuable for input validation, e.g. when you ask a user for an email address and want to verify that what the user entered at least has the form of a valid email address.
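
The validation case just described might look like the following sketch. The input value stands in for whatever the user submitted. The example assumes that matches() tests the string as a whole, and that if/else and // comments follow moto's C-like embedded block syntax.

```moto
${
   Regex emailrx = /([a-zA-Z]+)@(([a-zA-Z]+[.])*[a-zA-Z]+)/;

   // Hypothetical value standing in for user input
   String input = "dhakim@webcodex.com";

   if (emailrx.matches(input))
      print "'"+input+"' looks like an email address\n";
   else
      print "'"+input+"' does not look like an email address\n";
}$
```

Using matchesSubstring() or =~ here instead would also accept input that merely contains an address somewhere, which is usually not what a validation check wants.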

4.3 Details of the Match

Often, however, it is not enough to know that an input String matched or contained a regular expression; one is also interested in the details of how the match occurred. This information includes where in the input String the match was found, how long the match was, and what the contents of the matched subexpressions were. It is retrieved via Regex methods which return Match objects.

Method Description

Match match(String s) Returns a Match object for the specified string matched by this regex
Match matchPrefix(String s) Returns a Match object for the specified string matched by this regex, requiring only that a prefix of s matches
Match matchSuffix(String s) Returns a Match object for the longest suffix of s matched by this regex
Match search(String s) Returns a Match object for the leftmost longest match of this regex in s

With a Match object in hand one can call methods on that object to return the relevant details of the match.

Method Description

String subMatch(int i) Returns the substring of the matched string corresponding to the i'th subexpression of the regular expression used. The i'th subexpression is the subexpression started with the i'th opening parenthesis in the regular expression. Passing 0 to this method will return the entire string matched
String preMatch() Returns the substring of the String matched against occurring prior to the start of the match
String postMatch() Returns the substring of the String matched against occurring after the end of the match
int startIndex(int i) Returns the index in the String matched against where the i'th subexpression began. If i is set to 0 this method returns the index in the original String where the match begins
int endIndex(int i) Returns the index in the String matched against where the i'th subexpression ends. If i is set to 0 this method returns the index in the original String where the match ends

In the following example a Match object is used to pick out the pieces of an email address found in a string of text. The subexpressions of interest in this case are the name and the host in the email address.

${
   Regex rx = /([a-zA-Z]+)@(([a-zA-Z]+[.])*[a-zA-Z]+)/;

   String i2 = "email me at dhakim@webcodex.com!";
   Match m = rx.search(i2);
   print "Match was '"+m.subMatch(0)+"'\n";
   print "Name was '"+m.subMatch(1)+"'\n";
   print "Host was '"+m.subMatch(2)+"'\n";
   print "Pre-match was '"+m.preMatch()+"'\n";
   print "Post-match was '"+m.postMatch()+"'\n";
   print "Match was found at <"+m.startIndex(0)+","+m.endIndex(0)+"> in the input string\n";
}$

The output of this example is

Match was 'dhakim@webcodex.com'
Name was 'dhakim'
Host was 'webcodex.com'
Pre-match was 'email me at '
Post-match was '!'
Match was found at <12,31> in the input string

4.4 String Manipulation with Regular Expressions

The most powerful uses of regular expressions are for String manipulation. String manipulation with regard to regular expressions usually refers to two main activities :

Using regular expressions to split a string wherever a matching pattern is found. This is known as tokenizing or splitting a string.
Replacing all instances of a pattern in a String with something else. This is known generally as substitution.

4.4.1 Tokenizing with Regular Expressions

Suppose you have an input string that you want to split by some pattern. Let's take as an example a string containing a numbered list of elements. The way you would do this in moto is by constructing a Tokenizer object from the input string and pattern. The pattern for a number followed by a period would be /[0-9]+\./ . Calling the next() method on the Tokenizer will return the next substring of the input string that does not contain the specified pattern. Calling next() repeatedly will continue doing this until there are no further such substrings, at which point next() will return null.

${
   use "codex.util";

   Regex rx=/[0-9]+\./;
   String s="1. foo 2. bar 3. maka";

   Tokenizer t= new Tokenizer(s,rx);
   String c;

   for(c=t.next();c != null;c=t.next())
      print "got '"+c+"'\n";
}$

The output of this program would be

got ' foo '
got ' bar '
got ' maka'

4.4.2 Pattern Substitution

Regular expression substitution is much more powerful than the 'find and replace' functionality found in most word processors for two reasons:

Patterns are matched as opposed to static strings.
The substitution can make use of submatch information.

The following are some common examples of how substitution by regular expressions is used:

Escaping HTML
Auto-markup of URLs
Emoticonizing
Colorizing source code
Implementing find and replace
Starring out dirty words

Most but not all of these examples can be accomplished by using the built-in function sub(). The sub() function takes three arguments. The first is the input string. The second is the pattern in the string we want to replace. The final argument is a substitution string. Thus if I wanted to replace all instances of the word "foo" with "bar" in my input string I would do that as follows:

$do(sub(input,/foo/,"bar"));

A more useful example would be using sub() in a function to escape reserved HTML characters.

   String st_escapeHTML(String input){
      return sub(sub(sub(input,/&/,"&amp;"),/>/,"&gt;"),/</,"&lt;");
   }

The sub() function is capable of far more complex substitutions, however, because the substitution string can itself refer back to the last match made. This is done by including variables in the substitution string of the form $# where # is the number of the subexpression in the regular expression you wish to substitute. $0 will substitute the entire last match. A concrete example of when to use this functionality is when doing auto-markup. Auto-markup is the process of finding hyperlinkable strings in text and generating the HTML code around those strings to link them.

${ {
       Regex rx = /([a-zA-Z]+)@(([a-zA-Z]+[.])*[a-zA-Z]+)/;
       String substitution = "<a href=\"mailto:$0\">$1 at $2</a>";

       String input = "dhakim@webcodex.com is my email, my email is dhakim@webcodex.com";

       print "'"+input+"'"; print " =>\n\t"; print sub(input,rx,substitution); print "\n";

} }$

This script outputs :

'dhakim@webcodex.com is my email, my email is dhakim@webcodex.com' =>
<a href="mailto:dhakim@webcodex.com">dhakim at webcodex.com</a> is my email, my email is <a href="mailto:dhakim@webcodex.com">dhakim at webcodex.com</a>
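
Submatch variables can also be used to reorder the pieces of a match. The sketch below swaps the two parts of a name written in "last, first" form. The input string is hypothetical, and the // comments are assumed to follow moto's C-like embedded block syntax.

```moto
${
   // Hypothetical input in "last, first" form
   String name = "Hakim, David";

   // $1 is the first parenthesized subexpression, $2 the second
   print sub(name,/([a-zA-Z]+), ([a-zA-Z]+)/,"$2 $1");
   print "\n";
}$
```

By the substitution rules described above, the whole input matches the pattern, so the result should be the two words in reverse order with the comma dropped.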