Having trouble with re, specifically re.finditer()

Hey folks, I wonder if someone who has experience with re (regular expressions) in Python might be able to shed some light on something I really don’t understand.

Here’s the code:

        match_start_index = []
        match_end_index = []
        for phrase in list_of_phrases:
            matches = re.finditer(phrase, text)
            match_start_index += [match.start() for match in matches]
            match_end_index += [match.end() for match in matches]
        print(match_start_index)
        print(match_end_index)

The output looks like this:

[84, 130, 145, 313, 653, 739, 829, 875, 950, 999, 519, 223, 333, 423, 792, 886, 1007, 1060, 1201, 886, 642, 642, 818, 461, 5, 1213]
[]

As you can see, the concatenation does not work how I would like it to, which would be to output a list of end indexes for all of the matches in the text. For some reason, to get the desired output my code must look like this:

        match_start_index = []
        match_end_index = []
        for phrase in list_of_phrases:
            matches = re.finditer(phrase, text)
            match_start_index += [match.start() for match in matches]
            matches = re.finditer(phrase, text)
            match_end_index += [match.end() for match in matches]
        print(match_start_index)
        print(match_end_index)

Which outputs what I want:

[84, 130, 145, 313, 653, 739, 829, 875, 950, 999, 519, 223, 333, 423, 792, 886, 1007, 1060, 1201, 886, 642, 642, 818, 461, 5, 1213]
[87, 133, 148, 316, 656, 742, 832, 878, 953, 1002, 537, 226, 336, 426, 795, 889, 1010, 1063, 1204, 893, 648, 651, 826, 468, 9, 1217]

Yes, that’s right: I have to copy and paste matches = re.finditer(phrase, text) every time I want to perform a single operation on matches!? This seems ridiculous and not DRY at all.
Could anyone share their wisdom on what’s going on here?


It’s because re.finditer returns an exhaustible iterator. That can be good for memory efficiency, since values are only produced on demand (lazy evaluation). If you’re going to use the values from an iterator more than once, it’s usually best to materialise it into a list or tuple so it can be reused. You could create multiple iterators if you had to, but that only makes sense for exceedingly large datasets; evaluating it once and keeping the results in a normal sequence will usually be faster.
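A minimal sketch of the exhaustion behaviour (the sample pattern and text here are made up, not from your code):

```python
import re

# First pass consumes the iterator; the second finds it empty.
matches = re.finditer(r"\wat", "cat sat on the mat")
starts = [m.start() for m in matches]  # first pass works
ends = [m.end() for m in matches]      # [] -- already exhausted

# Materialise once into a list, then reuse it freely.
matches = list(re.finditer(r"\wat", "cat sat on the mat"))
starts = [m.start() for m in matches]
ends = [m.end() for m in matches]      # now both lists line up
```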


Hi, thanks for your response,

Makes sense, I guess. How is one to tell whether an iterator object is exhaustible or not, or are they all?

Is there a way to make conventional iterable objects like dictionaries, lists, tuples, etc. exhaustible to increase memory efficiency? Moreover, can I in any way delete objects/variables after I have used them within a function?

I’m struggling to even see how I would extract the information I want without repeating the line, since I don’t know a way to append the start and end indexes of the matches to an iterable object in a single command/concatenation.

By and large, iterators you create are likely to be exhaustible unless you explicitly use one that isn’t. There’s no golden rule for knowing what is and isn’t an iterator, though (you’ll start to notice what the most likely return type will be), so the best option is to check the docs or check the return type the first time you use it:
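One quick check, using nothing beyond the standard library: an isinstance test against the Iterator ABC tells you whether an object is a one-shot iterator or a reusable container.

```python
import re
from collections.abc import Iterator

matches = re.finditer(r"a", "banana")
print(isinstance(matches, Iterator))    # True: exhaustible, one pass only
print(isinstance([1, 2, 3], Iterator))  # False: a list can be re-iterated
```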
https://docs.python.org/3/index.html
https://docs.python.org/3/library/re.html#re.finditer

It’s always worth asking yourself whether it’s actually worth saving memory in a script that isn’t permanently running; something to consider on a case-by-case basis, at least. For the given example, unless the sequence is large enough to saturate a dangerous proportion of your memory, there’s no reason not to save it if you have to pass through it twice or more. As soon as you reassign the name associated with the sequence, the object it referred to should be scheduled for garbage collection anyway.

As for whether you can, the answer is generally yes to all of those things, but with caveats. Something like a dictionary can only be iterated through while it still exists; note, though, that if you created an iterator from it and then del-ed the dictionary’s name before exhausting the iterator, the iterator itself would still hold a reference to it, so the dictionary wouldn’t actually be freed. You can generate values on the fly with proper tools such as iterators and generator functions (discussed briefly below), but they do limit your options. The del keyword removes a reference to an object, scheduling it for garbage collection once no references remain, but you’ll not often see it used; it’s much easier to let functions and scopes handle memory management instead.
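A small illustration of the del caveat (behaviour shown is CPython’s): del removes the name, but the iterator keeps the underlying dictionary alive.

```python
data = {"a": 1, "b": 2}
items = iter(data.items())

del data             # removes the *name* 'data', not the dict object;
                     # 'items' still holds a reference to it
first = next(items)  # iteration still works: ('a', 1)
```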

For the proper tools for this you’d want to look into the itertools module, generator functions, lambda functions, and maybe building your own iterators. It’s all roughly covered in the functional programming HOWTO guide:
https://docs.python.org/3/howto/functional.html
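For instance, a generator function applied to your original problem might look roughly like this (the function name and sample data are made up):

```python
import re

def match_spans(phrases, text):
    """Lazily yield a (start, end) tuple for every match of every phrase."""
    for phrase in phrases:
        for m in re.finditer(phrase, text):
            yield m.span()

# Values are produced on demand; wrap in list() if you need to reuse them.
spans = list(match_spans(["cat", "mat"], "cat sat on the mat"))
```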

As for functions: local objects created inside a function become eligible for garbage collection when that function returns (unless you return them or store them elsewhere), which helps keep things nice and neat. If you’re talking about the arguments you passed to the function, those aren’t automatically removed, and you’d have to remove the outer references manually.

As with most repeated lines, you could probably wrap it in a function and save yourself a bit of hassle. If that code snippet is itself inside a function, this would actually be a good time to use a nested function, since it’s of no use in the outer scope. Alternatively, there are tools such as itertools.tee which can “copy” iterators.
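A sketch of the itertools.tee route (sample data made up); one caveat is that tee buffers values internally, so fully consuming one copy before touching the other ends up storing everything anyway:

```python
import itertools
import re

matches = re.finditer(r"\wat", "cat sat on the mat")
for_starts, for_ends = itertools.tee(matches)  # two independent iterators

starts = [m.start() for m in for_starts]
ends = [m.end() for m in for_ends]
```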

For that specific problem, could you return both at once, e.g. [match.start(), match.end() for match in matches], or even use [match.span() for match in matches]? Check the docs for the best solution to your problem.
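The span() version of the original loop might look like this (sample pattern and text made up):

```python
import re

# Each element is a (start, end) tuple, one per match.
spans = [m.span() for m in re.finditer(r"\wat", "cat sat on the mat")]
```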


Wow, thank you. This reply really is a goldmine of info; I appreciate you taking the time. I will work my way through the functional programming guide over the coming days.

It sounds like I might be worrying about memory management a little more than I need to in this case, and that it may be best to focus on simply exploring Python’s functionality and finding the fastest, most concise way to meet my objectives until I start handling much bigger quantities of data.

Speaking of nested functions, is there any advantage to this way of working from the perspective of the machine? For example, if I use a nested function inside my function, does the outer function perform better overall, since Python is able to collect the garbage it no longer needs (objects inside the nested function) as soon as it is done processing the inner function, as opposed to waiting until the whole outer function is done before releasing it from memory?
Or is the advantage of nested functions simply to minimise dependency and make code easier to change, more readable, etc.?

Finally, yes, I did try [match.start(), match.end() for match in matches], but unfortunately that raises a SyntaxError! Which seems odd to me.
match.span(), however, worked perfectly. I will be sure to refer to the docs more often, as that could easily have resolved my problem here without resorting to posting in a forum.

Apologies for the delayed reply; I neglected to actually send what I wrote. There’s perhaps a bit too much info here, but if you’re willing to nose into the docs it’s worth it in the long term. Some of the HOWTOs are quite well written and readable, but they’re not required reading or anything.

I suppose there could be occasions when it performs better due to the freed memory, but I doubt it makes a big difference until you’re working with large datasets again, so I’d not worry about it overmuch until you need to. I was thinking of it more in terms of readability, only nesting the function because it has absolutely no use outside the enclosing function; you may also find it useful for saving memory. I’m sure there are numerous excellent guides on optimising memory management in Python should you need them, but I’m afraid no nice links spring to mind. Hunt them down if and when you need them.

Ah, sorry, that’ll be a wee error on my part. You’d need to return a sequence of some sort for multiple values, e.g. a tuple, in which case wrap it in parentheses: [(match.start(), match.end()) for... ]. Using the .span() method seems easier in that case.
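To spell that out with the parentheses in place (sample pattern and text made up):

```python
import re

# Parenthesised tuple inside the comprehension: valid syntax this time.
pairs = [(m.start(), m.end()) for m in re.finditer(r"\wat", "cat sat on the mat")]
# identical output to [m.span() for m in ...]
```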