Regular Expressions query


#1

Hi,

I have to read data as records from a plain-text file. I use open("filename").read() to get the data from the file. The format is as given below:

Name 1
Address : Full Address Phone No : 3333333333
Email : email.address Mobile No : 9999999999
NEXT

Name 2
Address : Full Address 
Email : email.address Mobile No : 9999999999
NEXT

Name 2
Address : Full Address Phone No : 3333333333
Email : email.address Mobile No : 9999999999
NEXT

I am using re.findall() to get the details into a string separated with "\t" so that I can paste it into an Excel sheet. The format is as below:

NAME              |ADDRESS               |PHONE NO         |EMAIL            |MOBILE NO

So the data should be framed into a string in the following format:
dataline = "Name1\tFull Address\t3333333333\temail.address\t9999999999\n"

The problem is that some records have a Phone No and some do not. My regex fails to get the entire record when one of the fields is missing, as in the second record in the example above. It works for the first and third records but fails for the second. I would prefer to get a blank entry in the tuple or list when a field is not present. Since the data is huge, a manual check is not possible. I am just curious whether this can be achieved through regex.

Thanks,
–Sarad


#2

What code do you have so far? Then we can see why and where it's failing.


#3

Thank you very much for the interest. Attaching the code I tried.

import re

# Avoided readlines() intentionally, so that I get records instead of lines.
d = open(r"E:\Data\list.txt").read()
rec = d.split("NEXT\n \n")

for r in rec:
    t = re.findall(r"(.*?)\nAddress\s:\s(.*?)\sPhone\sNo\s:\s(.*?)\n.*?\s:\s(.*?)\s.*?No\s:\s(.*?\n)", r, re.S)[0]
    print("\t".join(t))

I was wondering whether the regex can be made to ignore a field that is missing from a record and still return the available data, instead of failing for the whole record?

The desired output from re.findall() for the second record would be:
[('Name 2', 'Full Address ', '', 'email.address', '9999999999\n')]
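One way to get a result of roughly this shape is to wrap the Phone No part in an optional non-capturing group, so that re.findall() returns an empty string for it when it is absent. The following is only a sketch: the names SAMPLE, RECORD and parse_records are made up for illustration, and it assumes every label is followed by exactly " : " and that the phone and mobile numbers contain only digits. Unlike the tuple above, it strips the trailing space and newline from the values.

```python
import re

# Two records in the format shown earlier; the second has no Phone No.
SAMPLE = (
    "Name 1\n"
    "Address : Full Address Phone No : 3333333333\n"
    "Email : email.address Mobile No : 9999999999\n"
    "NEXT\n"
    "\n"
    "Name 2\n"
    "Address : Full Address \n"
    "Email : email.address Mobile No : 9999999999\n"
    "NEXT\n"
)

# Hypothetical pattern: (?: ... )? makes the whole Phone No part optional.
RECORD = re.compile(
    r"^(.+)\n"                           # name line
    r"Address : (.*?) ?"                 # address, matched lazily
    r"(?:Phone No : (\d+))?\n"           # optional phone number
    r"Email : (\S+) Mobile No : (\d+)",  # email and mobile number
    re.MULTILINE,
)


def parse_records(text):
    """Return one (name, address, phone, email, mobile) tuple per record."""
    # findall() yields '' for a group that did not take part in the match.
    return RECORD.findall(text)


rows = parse_records(SAMPLE)
for row in rows:
    print("\t".join(row))
```

Because the pattern is anchored with ^ and re.MULTILINE and does not use re.S, the whole file can be passed in at once without splitting on NEXT first.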

Thanks for the help.
–Sarad


#4

I would never take this approach. I would do the following:

final_result = []

with open('data.txt', 'r') as f:
    temp_result = []
    for line in f:  # iterate over the file line by line
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line != "NEXT":
            # check what's inside the line, append to temp_result
            temp_result.append(line)
        else:
            # end of this data set: push temp_result to final_result
            final_result.append(temp_result)
            # create a new temp_result and handle the next address, and so on
            temp_result = []

This makes your code much more readable and maintainable than a single, very complex regular expression. I think the single regex would lead you down a rabbit hole.
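To make the outline above more concrete, here is a minimal sketch that uses only string methods, no regex. The names LABELS, split_fields, to_row and parse are hypothetical, and the sketch assumes each label is written exactly as in the sample data and followed by " :".

```python
# Column order for the output, after the name.
LABELS = ("Address", "Phone No", "Email", "Mobile No")

SAMPLE = (
    "Name 1\n"
    "Address : Full Address Phone No : 3333333333\n"
    "Email : email.address Mobile No : 9999999999\n"
    "NEXT\n"
    "\n"
    "Name 2\n"
    "Address : Full Address \n"
    "Email : email.address Mobile No : 9999999999\n"
    "NEXT\n"
)


def split_fields(line):
    """Return {label: value} for every known label found in one line."""
    # Positions of the labels that actually occur, sorted left to right.
    found = sorted(
        (line.find(label + " :"), label)
        for label in LABELS
        if label + " :" in line
    )
    result = {}
    for i, (pos, label) in enumerate(found):
        # A value runs up to the next label, or to the end of the line.
        end = found[i + 1][0] if i + 1 < len(found) else len(line)
        result[label] = line[pos + len(label) + 2:end].strip()
    return result


def to_row(lines):
    """Turn the lines of one record into [name, address, phone, email, mobile]."""
    fields = {}
    for line in lines[1:]:
        fields.update(split_fields(line))
    # A missing label simply becomes an empty string.
    return [lines[0]] + [fields.get(label, "") for label in LABELS]


def parse(text):
    """Collect lines until NEXT, then convert them into one row."""
    rows, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "NEXT":
            if current:
                rows.append(to_row(current))
            current = []
        else:
            current.append(line)
    return rows


rows = parse(SAMPLE)
for row in rows:
    print("\t".join(row))
```

A missing Phone No needs no special handling here, because fields.get() falls back to an empty string for any label that was not seen.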

But personally, I would never use Python for this. I would use sed and awk, though I guess these are Linux tools. They are designed to process text files and the columns within them. They could extract all the data very quickly, and even if you do not know the tools yet, they are faster than doing this in Python. Use the right tool for the job.


#5

Oh, OK. Thank you. I was under the impression that regex is a fast and powerful tool, but I agree that it is not at all simple or readable. That said, I actually solved the issue with simple string manipulation. My plan was to explore and master regex patterns. I was really confused by the functions re.search, re.match and re.findall: when should each be used, and how is it best used? So I started playing around with them in every possible scenario. It is more helpful to solve real problems than sample problems, right? This is that kind of exploration: how to do the same thing using regex. I would really appreciate it if you could help me by solving this with a regex pattern, purely out of academic interest. But thank you very much for the support. Please don't feel bad that I am moving away from coding etiquette.
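As a point of reference for the three functions mentioned above, the main differences are where they look and what they return. A minimal sketch, using a line from the sample data (the variable names are only illustrative):

```python
import re

line = "Email : email.address Mobile No : 9999999999"

# re.match() only tries to match at the very start of the string.
at_start = re.match(r"Email", line)          # a match object
not_at_start = re.match(r"Mobile No", line)  # None: the text is not at position 0

# re.search() scans the string and returns the first match found anywhere.
first = re.search(r"Mobile No : (\d+)", line)
mobile = first.group(1) if first else ""

# re.findall() returns every non-overlapping match as a list.
# With one group it is a list of strings, with several groups a list of tuples.
numbers = re.findall(r"\d+", "3333333333 and 9999999999")
pairs = re.findall(
    r"(\w+) No : (\d+)",
    "Phone No : 3333333333 Mobile No : 9999999999",
)

print(at_start.group(), not_at_start, mobile)
print(numbers)
print(pairs)
```

Both re.match() and re.search() return None when nothing matches, so the result should be checked before calling .group(). re.findall() returns an empty list instead.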

Regards,
–Sarad


#6

But there is more than one powerful tool, and certain tools fit specific tasks better.

You can still use regex, but trying to find a single regular expression that is powerful and good enough is not desirable here, I think. It's too complicated, too time-consuming, and difficult to change.
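As an illustration of using regex without one monolithic pattern, each field can get its own small expression that is searched for independently. This is a sketch under assumptions: FIELDS and record_to_line are hypothetical names, the record text has already been separated from its neighbours, and the labels are written exactly as in the sample data.

```python
import re

# One small pattern per field; each can be changed without touching the others.
FIELDS = [
    ("address", re.compile(r"Address : (.*?)(?: Phone No :|\s*$)", re.MULTILINE)),
    ("phone", re.compile(r"Phone No : (\d+)")),
    ("email", re.compile(r"Email : (\S+)")),
    ("mobile", re.compile(r"Mobile No : (\d+)")),
]


def record_to_line(record):
    """Build one tab-separated line from the text of a single record."""
    # Assumes the record is non-empty and its first line is the name.
    values = [record.strip().splitlines()[0]]
    for _label, pattern in FIELDS:
        m = pattern.search(record)
        values.append(m.group(1) if m else "")  # blank when the field is missing
    return "\t".join(values) + "\n"


record = (
    "Name 2\n"
    "Address : Full Address \n"
    "Email : email.address Mobile No : 9999999999\n"
)
dataline = record_to_line(record)
print(repr(dataline))
```

Adding a new field then means adding one entry to FIELDS rather than reworking a long pattern.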

I would invest your time in sed and awk instead.

