Regular Expression Matching

Back to the Language Shootout
Back to Doug's Homepage

[NEWS]   [FAQ]   [Methodology]   [Acknowledgements]   [Scorecard]   [Conclusion]  

[cpu minus startup time]
    [sort] [sort]  
Source Code CPU (sec) Mem (KB) Lines Code Log
mlton1.00880765log
bigloo1.94501660log
cmucl3.441090848log
ocaml3.6766832log
icon3.74151226log
mawk3.7958032log
ocamlb4.2280832log
perl4.72134428log
g++5.1876854log
gcc5.33400111log
smlnj5.385212767log
pike5.48242018log
lua5.7757617log
python8.88149227log
rep9.39118031log
java10.80889268log
gawk13.8989232log
ruby14.31145223log
tcl30.36109220log
guile30.69176844log
gforth37.5070087log
erlang58.03299672log
Languages that compile to native code are in Bold Italics.

[Note: Values have been normalized to fall in the range of 0-10 for aesthetic reasons. Original value ranges are included on the X-axis.

Click here for more detailed data and graphs.

[Results last updated: Sun Sep 9 13:20:25 2001 CDT]


About this test

Please Note: this test is due for an overhaul, because of the variety of solutions for this test that aren't really using regular expressions. I'll probably split this into 2 tests, one that does some kind of parsing/pattern matching, and another that calls for NFA regular expressions with capture buffers.

For this test, each program should be implemented in the same way.

The purpose of this test is to extract strings that look like phone numbers from a file and print them in a standard format. For the sake of this test, we aren't interested in I/O performance, so we read the file into an array before starting, then extract the phone numbers from the array N times, and on the last iteration, we print the extracted numbers in the standard format. See the detail page for different values of N.

Each program can assume that no line will exceed 128 characters (including newline).

Input file. Output file.

The telephone number pattern we are trying to match can be described this way:

For the C program I wasn't going to implement my own regular expressions from scratch, I use the Perl Compatible Regular Expressions (PCRE) library.

The C++ program uses Bill Lear's PCRE library for C++.

Markus Mottl helped me use his PCRE library for Ocaml.

The Java program uses a 3rd party, mostly-Perl5-compatible regexp library, called ORO. Apparently this package, once available from oroinc.com (defunct), is now maintained by the Apache Jakarta project.

Bigloo's regular grammar facility is very powerful. I wish all languages offered this feature. I think it shows that while it's nice to be able to do complex pattern matching, it is really more important how easily you can do something with the matched data.

I have been a little sloppy in this test specification, and some languages don't implement the same exact pattern matching. I'll try to fix this soon by adding more test strings to the input to take care of some of the sloppy cases.

Observations

Erlang's regular expression support is minimal, it doesn't support captures (backreferences), and strings are represented as linked lists, so it doesn't do very well on this test.

[NEWS]   [FAQ]   [Methodology]   [Acknowledgements]   [Scorecard]   [Conclusion]  


Back to the Language Shootout
Back to Doug's Homepage
Send me comments or suggestions.