Regex for unstructured legal text

Dear all,

I am trying to parse text from an unstructured legal document and extract it into a pre-defined, CSV-readable format.

I am working with regular expressions and need some help with the following piece of text:

`import re

text0 = """Guideline 9: Sectoral guideline for retail banks
9.1. For the purpose of these guidelines, retail banking means the provision of banking services
to natural persons and small and medium-sized enterprises. Examples of retail banking
products and services include current accounts, mortgages, savings accounts, consumer and
term loans, and credit lines.
Guideline 10: haha this is a test"""

Section_re = r'(\Guideline+) (\d+:) (.*)'

matches_group1 = re.findall(Section_re, text0, re.IGNORECASE)
print(matches_group1)`

My goal is to find every line that follows the pattern “Guideline XY: Text”, but I receive this error message:

if len(escape) == 2:
    401             if c in ASCIILETTERS:
--> 402                 raise source.error("bad escape %s" % escape, len(escape))
    403             return LITERAL, ord(escape[1])
    404     except ValueError:
error: bad escape \G at position 1

I would be so grateful for any type of help!!
Ps: I work in Jupyter notebook

The error message seems to be pointing to the right spot. I don’t think Python’s regex has “\G” as a valid escape sequence. Check the Regular Expression HOWTO in the Python 3.9.4 documentation for some useful info and alternatives.


Thank you for the answer. I checked the expression with https://regex101.com/ but I will need to find another solution as I keep getting the same error.

Did you make sure to select Python on the left-hand side? There are a number of different regex engines, and the syntax changes quite a bit if you’re using a different flavour.

If you need to test, use re.compile(r"\G") (or on your real regex if you like) to double-check Python can actually process it.
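A minimal sketch of that check, assuming a recent Python 3 where unknown letter escapes are rejected. The helper name `compiles` is made up for illustration; it only wraps `re.compile` and reports whether the pattern is accepted.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error as exc:
        # re.error carries the same message shown in the traceback
        print(f"rejected {pattern!r}: {exc}")
        return False


print(compiles(r"\G"))                         # the bare escape on its own
print(compiles(r"(\Guideline+) (\d+:) (.*)"))  # the pattern from the question
print(compiles(r"(Guideline) (\d+:) (.*)"))    # same pattern without the backslash
```

If the first two calls are rejected and the third is accepted, the backslash in front of “G” is the only problem.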

Running re.compile(r"\G") returns the same error.

On regex101, I did indeed select Python, and the expression works just fine there.

Playing around a bit, this version gives me everything BUT the “Guideline” part: Section_re = r'[A-Z] (\d+\:) (.*)'

That was the point I was trying to make: r"\G" is not valid in the regex engine Python uses (whatever you found online is either a different version or is ignoring that escape sequence).

You’ll have to forgive me here but I’d rather direct you to your own solution than just give you a working example. Is there anything wrong with just “Guideline”? What is the purpose of the backslash here?


Thanks so much for the speedy replies. The simplest version worked!

Section_re = r'(Guideline) (\d+\:) (.*)'
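For anyone landing here later, this is a rough sketch of how the working pattern could feed the CSV output mentioned in the original question. A few things are assumptions rather than part of the thread: the sample text is shortened, the pattern is anchored to the start of a line with `re.MULTILINE`, the colon is moved outside the number group, and the column names are invented.

```python
import csv
import io
import re

# Shortened version of the sample text from the question
text0 = """Guideline 9: Sectoral guideline for retail banks
9.1. For the purpose of these guidelines, retail banking means the provision of banking services
to natural persons and small and medium-sized enterprises.
Guideline 10: haha this is a test"""

# ^ and $ with re.MULTILINE match at line boundaries, so only lines that
# start with "Guideline <number>:" are captured
section_re = re.compile(r"^(Guideline) (\d+): (.*)$", re.IGNORECASE | re.MULTILINE)

rows = section_re.findall(text0)
print(rows)

# Write to an in-memory buffer; swap in open("out.csv", "w", newline="")
# to write a real file. Column names are hypothetical.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["label", "number", "title"])
writer.writerows(rows)
print(buffer.getvalue())
```

The anchoring is optional for this sample, but it avoids matching the word “guideline” when it appears in the middle of a sentence followed by a number and a colon.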