Is this code fast?


#1

This is what I did and I passed. What do you think, guys? Is this code fast?

import re

def anti_vowel(text):
    return re.sub('[AauEeUuOoIi]', "", text)
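
For example, calling it on a string:

>>> anti_vowel("Hello World")
'Hll Wrld'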


#2

It may be. Where did you pick it up from? Can you explain it, please, so we don't have to remove this post?


#3

It's easier to test it than to try to reason about or learn exactly what the regex module does.

I would generate a huge file, 1GB or so (about 1 billion characters, each char takes one byte), of random letters and then run a couple of approaches on it.

For reference, I would use sed which is a standard tool for doing operations like that:

$ command time -v sed 's/[aoeuiAOEUI]//g' testdata.txt > /dev/null

.. I might actually come back and do this later; I have a lecture to attend right now.


#4

I have a feeling it came from here...

At the time I was hoping someone would come out of the woodwork to criticize this method. And here you are, @ionatan.


#5

So what's going to happen is that each program iterates over every character and only prints it if it's not a vowel. In other words, this should take time proportional to the number of characters; we should produce output about as fast as we can read the input.

Okay, so I've got..

A C version, which I imagine will be faster than any of the others.

#include <stdio.h>
#include <ctype.h>

int main() {
    int ch;  /* getchar() returns int; EOF doesn't fit in a char */
    while ((ch = getchar()) != EOF) {
        if (ch == 'a' || ch == 'o' || ch == 'e' ||
            ch == 'u' || ch == 'i' || ch == 'A' ||
            ch == 'O' || ch == 'E' || ch == 'U' || ch == 'I')
            continue;
        putchar(ch);
    }
    return 0;
}

Your Python-regex version:

#!/usr/bin/env python3
import sys
import re

def anti_vowel():
    for line in sys.stdin:
        print(re.sub('[AauEeUuOoIi]', '', line), end='')

anti_vowel()

A non-regex version:

#!/usr/bin/env python3
import sys

def anti_vowel():
    for line in sys.stdin:
        print(''.join([char for char in line
                       if char not in 'aoeuiAOEUI']),
              end='')

anti_vowel()

And finally sed (probably found on any Unix-like machine):

$ sed 's/[aoeuiAOEUI]//g' testdata.txt

A program to generate the test data. This turned out to be extremely slow (about 20 minutes); there are better ways to do it, I'm sure (a sketch of one follows the script), but meh.

import random
import string

num_bytes = 1 << 30 # 1 gig
line_length = 80
num_lines = num_bytes // line_length
letters = string.ascii_letters

for _ in range(num_lines):
    print(''.join(random.choice(letters) for _ in range(line_length)))
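
For what it's worth, here's a sketch of one of those better ways. It assumes Python 3.6+ for random.choices, which builds a whole line in one call instead of looping per character, and writes via sys.stdout.write to skip some of print's overhead. I haven't timed it on the full gigabyte, so treat it as a sketch:

#!/usr/bin/env python3
# Sketch of a faster test-data generator (Python 3.6+ for random.choices).
import random
import string
import sys

num_bytes = 1 << 30  # 1 gig
line_length = 80
num_lines = num_bytes // line_length
letters = string.ascii_letters

write = sys.stdout.write
for _ in range(num_lines):
    # one C-level call per line instead of a Python loop per character
    write(''.join(random.choices(letters, k=line_length)))
    write('\n')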

A bash script to run all the programs and time them:

#!/usr/bin/env bash
gcc -O2 antivowel.c && command time -v ./a.out < testdata.txt > /dev/null
echo  # write a blank line
command time -v sed 's/[aoeuiAOEUI]//g' testdata.txt > /dev/null
echo
command time -v ./antivowelregex.py < testdata.txt > /dev/null
echo
command time -v ./antivowelnoregex.py < testdata.txt > /dev/null

Generate the testdata:

$ python gentestdata.py > testdata.txt

All that remains now is to run it:

$ ./runall.sh

Output (I removed some of it, because only the wall-clock time is relevant):

Command being timed: "./a.out"
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.55

Command being timed: "sed s/[aoeuiAOEUI]//g testdata.txt"
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:38.46

Command being timed: "./antivowelregex.py"
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:34.81

Command being timed: "./antivowelnoregex.py"
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:03.41

#6

So in general terms, a regex is a viable approach, with everything kept in balance, as it were?


#7

On one hand, fewer people will be able to understand it, and more knowledge is required to wield it correctly. On the other, regex is present in many languages and tools and should therefore be part of every well-rounded programmer's toolbox.

I'm not at all good with regex, but I'm constantly exposed to it (searching in files with vim/grep/ack/find, matching directory names when navigating, .gitignore files, ...), and therefore I find myself using it more and more as time goes on.


#8

@ionatan

If you are going to use the re module to work on huge data sets, then using the built-in re.compile will speed the regex up by a hefty amount.

Change to

import re
import sys

reVowel = re.compile('[AEIOUaeiou]')

def anti_vowel():
    for line in sys.stdin:
        print(reVowel.sub('', line), end='')

Using compile for something you are going to call a ton speeds it up a huge amount, or at least it is supposed to.

On your non-regex version, if you get rid of the list comprehension and turn it into a plain generator expression, you should see an improvement there as well, because Python doesn't have to build the intermediate list in memory.

''.join(char for char in line if char not in 'aoeuiAOEUI')

Just have to make sure, when doing it this way, that you are only using string characters, or convert them as needed.
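
Spelled out as a complete script in the same shape as the others (call it antivowelgenerator.py), it would look something like this:

#!/usr/bin/env python3
# Generator-expression variant: same filter as the non-regex version,
# but join consumes the characters lazily instead of building a list first.
import sys

def anti_vowel():
    for line in sys.stdin:
        print(''.join(char for char in line
                      if char not in 'aoeuiAOEUI'),
              end='')

anti_vowel()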


#9

I think the patterns are cached so it doesn't matter.

The machine I'm running this on has 64 GB of RAM and it's a 1 GB file, so no gain there. (Reading the whole file into a list of lines uses 1.6 GB.) I'm also reading one line at a time; it's possible that reading the whole file at once is faster, but a generator wouldn't help, since there would still be a need to store it all in memory. (Generators also come with significant overhead, especially considering that there are many iterations and few operations per iteration.)

This ends up not being any faster:

pattern = re.compile('[AOEUIaoeui]')
re.sub(pattern, '', line)

However, the way you wrote it..

pattern = re.compile('[AOEUIaoeui]')
pattern.sub('', line)

does run a little faster. Very odd.
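
My guess at why: even with a precompiled pattern, re.sub still goes through the module-level dispatch on every call, while pattern.sub calls straight into the compiled object. A timeit sketch to compare the two (numbers will vary by machine):

import re
import timeit

pattern = re.compile('[AOEUIaoeui]')
line = 'the quick brown fox jumps over the lazy dog\n'

# module-level function handed a precompiled pattern
t_module = timeit.timeit(lambda: re.sub(pattern, '', line), number=500000)
# method called directly on the pattern object
t_method = timeit.timeit(lambda: pattern.sub('', line), number=500000)

print('re.sub     :', t_module)
print('pattern.sub:', t_method)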

Command being timed: "./antivowelcompiledregex.py"
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:23.19

Command being timed: "./antivowelgenerator.py"
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:49.88

Reading the whole file all at once seems to have negligible effect.
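
For reference, a minimal sketch of the read-everything-at-once approach:

#!/usr/bin/env python3
# Whole-file variant: one big read, one big substitution.
import re
import sys

pattern = re.compile('[AOEUIaoeui]')
sys.stdout.write(pattern.sub('', sys.stdin.read()))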

What other characters are there?


#11

So I learned this code from the Python documentation, because every lesson I learn from Codecademy is just one option or way of solving a problem. This code seems short and concise. Can someone explain why compile speeds it up? What does it mainly depend on, for this code and for other code?


#12

Quoting the documentation:

Module-Level Functions
You don’t have to create a pattern object and call its methods; the
re module also provides top-level functions called match(),
search(), findall(), sub(), and so forth. These functions
take the same arguments as the corresponding pattern method, with
the RE string added as the first argument, and still return either None or a
match object instance.

>>> print re.match(r'From\s+', 'Fromage amk')
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')  
<_sre.SRE_Match object at 0x...>

Under the hood, these functions simply create a pattern object for you
and call the appropriate method on it. They also store the compiled object in a
cache, so future calls using the same RE are faster.

Should you use these module-level functions, or should you get the
pattern and call its methods yourself? That choice depends on how
frequently the RE will be used, and on your personal coding style. If the RE is
being used at only one point in the code, then the module functions are probably
more convenient. If a program contains a lot of regular expressions, or re-uses
the same ones in several locations, then it might be worthwhile to collect all
the definitions in one place, in a section of code that compiles all the REs
ahead of time. To take an example from the standard library, here’s an extract
from the deprecated xmllib module:

ref = re.compile( ... )
entityref = re.compile( ... )
charref = re.compile( ... )
starttagopen = re.compile( ... )

I generally prefer to work with the compiled object, even for one-time uses, but
few people will be as much of a purist about this as I am.
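
Applied to the function in this thread, the two styles look like this (the names are just illustrative; the behavior is identical either way):

import re

# Module-level style: convenient for one-off use; re caches the
# compiled pattern internally after the first call.
def anti_vowel_module(text):
    return re.sub('[AEIOUaeiou]', '', text)

# Precompiled style: compile once up front, reuse the pattern object
# on every call.
VOWEL_RE = re.compile('[AEIOUaeiou]')

def anti_vowel_compiled(text):
    return VOWEL_RE.sub('', text)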