Transpose using Python


#1

I have a file containing data in the following format:

abc 123 456
cde 45 32
efg 322 654
abc 445 856
cde 65 21
efg 147 384
abc 815 078
efg 843 286
and so on. How can I transpose it into the following format using Python?

abc 123 456 cde 45 32 efg 322 654
abc 445 856 cde 65 21 efg 147 384
abc 815 078 efg 843 286
Also, if cde is missing after abc, blank spaces should be inserted in its place, since it is a fixed-width file.

with open('abc.txt') as file:
    for lines in file.readlines():
        while lines[:3] == 'abc':
            lines.replace('\n', '')

But it is not giving the desired output.


#2

If cde is missing, is a blank line present, or does it just go straight to efg?


#3

If cde is missing, there is no blank line and the file jumps directly to efg, but the output should still get blank spaces in place of cde. Let's say each line consists of 50 characters; then it should insert 50 blank spaces when cde is missing.


#4

How large is the text file?

Also, is it just "cde" that is missing or could "abc" or "efg" be missing too?

Anyway, here is some code that will do what you wanted. It was only put together quickly (far from the most efficient or best) and will only do exactly what you posted. Also, on very large files it'll eat memory like it's at an all-you-can-eat buffet. You should be able to adapt it to your exact needs though, assuming the file isn't too large.

def line_position(line):
    start = line[:3]
    if start == "abc":
        return 0
    elif start == "cde":
        return 1
    else:
        return 2

def format_file(lines):
    newLine = []
    newFile = []
    for line in lines:
        line = line.strip(chr(10))
        linePos = line_position(line)
        newLen = len(newLine)
        if newLen == 1 and linePos == 2:
            newLine.append(" "*50)
            newLine.append("{}{}".format(line,chr(10)))
            newFile.append(newLine)
            newLine=[]
        elif newLen == 2:
            newLine.append("{}{}".format(line,chr(10)))
            newFile.append(newLine)
            newLine=[]
        else:
            newLine.append(line)
    return newFile

def main():
    with open("Text.txt","r+") as txt:
        lines = txt.readlines()
        newLines = format_file(lines)
        txt.seek(0)
        for l in newLines:
            txt.write(" ".join(l))
        txt.truncate()  # cut off leftover old content once, after all writes

if __name__ == "__main__":
    main()

#5

Thank you, jagking.
Re: "is it just "cde" that is missing or could "abc" or "efg" be missing too" - efg can also go missing sometimes, but abc will always be there.
Secondly:
The code you provided is eating memory, as you mentioned it would, since it is meant for small files.
In my case there are ~3.5 million rows; my bad, I didn't mention this initially.


#6

Do you get a memory error? Is this just a one time run thing?

Because the quickest (read: easiest) fix would be to add a new elif statement to catch there being no efg line. And to massively reduce the memory usage, write each new line to a new txt file as soon as it is complete, rather than storing them all in a list and dumping them in one go.

It will still use a lot of memory, because the .readlines() method loads the whole txt file into memory rather than one line at a time. If need be, we can do it that way.
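
A minimal sketch of that idea, assuming every record starts with an "abc" line. Iterating over the file object yields one line at a time, and each merged record is written as soon as the next "abc" shows up. The names merge_records, in_file and out_file are made up for illustration, and the sketch does not insert any padding for missing lines.

```python
import io

def merge_records(in_file, out_file, start="abc"):
    """Join consecutive lines into one output line per record, streaming."""
    current = []
    for line in in_file:              # only one input line is read at a time
        line = line.rstrip("\n")
        if not line:
            continue                  # skip blank lines
        if line.startswith(start) and current:
            out_file.write(" ".join(current) + "\n")   # record is complete
            current = []
        current.append(line)
    if current:                       # the last record has no following "abc"
        out_file.write(" ".join(current) + "\n")

# Small in-memory demo; real code would pass two open file objects instead.
src = io.StringIO("abc 1\ncde 2\nefg 3\nabc 4\nefg 5\n")
dst = io.StringIO()
merge_records(src, dst)
print(dst.getvalue())
```

Only the record currently being built is held in memory, so the size of the input file should not matter much.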


#7

Yup, I applied one more elif. This is a repetitive thing I will be doing on a weekly basis, and dumping the data directly to a new txt file did improve efficiency.
But do you think there is a more efficient way to implement this? Since I am new to Python, I don't know how to increase efficiency.


#8

Generally, to make things quicker and more efficient in Python, use iterators and local variables, and make as few method calls as possible.
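In terms of raw speed, code inside a function generally runs faster in CPython than the same code at module level, because local variable lookups are cheaper than global ones; method calls on classes add some attribute-lookup overhead on top. Functions and classes are also the better choice for reusability and readability.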
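
As a rough illustration of the "local variables, fewer method lookups" point, here is a small sketch. The function name write_joined and its arguments are made up for this example. The gain is usually small, so it is worth timing on real data before relying on it.

```python
import io

def write_joined(records, out_file):
    join = " ".join          # look the bound methods up once...
    write = out_file.write
    for record in records:   # ...instead of on every iteration
        write(join(record) + "\n")

# In-memory demo with two already-grouped records.
buf = io.StringIO()
write_joined([["abc 1", "cde 2"], ["abc 3", "efg 4"]], buf)
print(buf.getvalue())
```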

In this case, it will be far more efficient to use a for statement and read each line individually, or use .readline() - that is readline(), not readlines().

How long does it take to do the file at the moment? There isn't much point chasing efficiency if you aren't going to save much time. Although changing to reading one line at a time is easy and should give good gains.


#9

def format_file(inFile, outFile,dic):
    appendCount = 0
    newLine = []
    gap = " "*50
    for line in inFile:
        line = line[:-1]
        linePos = dic[line[:3]]
        
        if (appendCount == 0 and linePos == 0) or (appendCount == 1 and linePos == 1):
            newLine.append(line)
            appendCount +=1
            
        elif appendCount == 2 and linePos != 0:
            newLine.append(line)
            print >>outFile," ".join(newLine)
            newLine=[]
            appendCount = 0
            
        elif appendCount == 1 and linePos == 2:
            newLine.append(gap)
            newLine.append(line)
            print >>outFile," ".join(newLine)
            newLine=[]
            appendCount = 0

        elif appendCount == 2 and linePos == 0:
            print >>outFile," ".join(newLine)
            newLine = [line]
            appendCount = 0
        else:
            try:
                print >>outFile,newLine[0]
                newLine = []
            except: pass
            print >>outFile,line
            
def main():
    positions = {"abc":0,
                 "cde":1,
                 "efg":2}
    with open("Text.txt","r") as inTxt, open("NewText.txt","w") as outTxt:
        format_file(inTxt,outTxt,positions)
        
if __name__ == "__main__":
    main()

So the above should catch every possible combo of missing lines (except "abc", which you said will always be there) and act accordingly. It should also be a lot nicer on memory and quicker. It is still far from perfect, but I don't think you'll see massive gains by making many more changes.

I made a txt file with around 6.5 million lines by repeating your example text from the first post. It handled that fine, but my file, even with more lines, is probably smaller than yours, given that you said the gap should be 50 characters.

The biggest gain would come from reworking the if/else statements, I'd imagine. Using a dictionary would probably be best.


#10

Thank you for such a clear explanation :)
Have you checked whether it is printing the last line? In my case it was not.
And what if my sample file contains one more row to transpose, for example:
abc 123 456
cde 45 32
efg 322 654
hij 124 567
abc 445 856
cde 65 21
efg 147 384
hij 127 643
abc....
efg...
hij..

Here hij has also been added. I'm a bit confused about the permutations and combinations I have to handle to accommodate the additional row "hij".

Thanks.


#11

It was printing, yes. But I had multiple versions open when making changes and testing the speed, so I might have accidentally copied the wrong one. I know it didn't print the last line at one point. Looking at the code above, the only reason would be if it ran out of lines when the last if statement hit was the first one. That line would get appended to newLine but not printed. Just add a try statement after the for loop and copy a print statement for newLine into it. You can pass on the except.

Spotted another possible mistake, but I'll double check it when I'm not on a phone. The last print line in the else statement should actually be:

            newLine.append(line)
            appendCount +=1

To accommodate "hij", just add it to the dictionary. But the if statements would have to change to allow for the possibility of a new line.

Question: what are the actual starting characters that are used? If they really are abc, cde, efg etc., then you could just work with the pattern to identify what should happen.


#12

Thankfully I was able to accommodate hij, though I had to redesign the permutations and combinations. One thing I have observed is that this approach is not very scalable: every time a new kind of line is added, the whole code needs to be redesigned.
Regarding the starting point, it is always 'abc'. One more thing I have observed is that when the file format is like:
Abc 123 452
Abc 632 235
Abc......
Abc.....
i.e. there are no intermediate rows, the code skips every second row, though it does append the gap 3 times (one each for cde, efg, hij) after abc.

I am replying via my cell phone, hence I am not able to share the piece of code.
Do you think there is a way to make this more scalable?


#13

Yeah, as I said before, this code was written to work with the exact example given. That said, the easiest way I can think of to make it scalable is to use another txt file. Maybe even make it a full config file, whatever meets your requirements better. But basically, all the possible first-three-letter combos will be put into the file in the correct order. E.g.

abc
cde
efg
etc etc

These will then be read and put into a dictionary. All those if statements then get reduced and rewritten, as they weren't the best anyway. The new way to work out whether a gap should be put in is to store a "next expected position" variable. I.e. if you just appended position 1, the next one should be 2. If it isn't, take the current position (say 4), subtract the expected position, and multiply the gap by the result (plus a space in between each). All that really needs to be handled then is:
1) "abc" coming up early: print the current line and start a new one
2) the end of a line due to finding the last element (although this might just be able to be included in the above).

This way whenever you need to add a new three letter combo you add it to the reference text file and it should deal with it fine.
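
The gap arithmetic described above can be sketched roughly as follows. It assumes the pieces of an output line are joined with a single space and that each missing item is replaced by a 50-character gap. The helper name filler and its parameters are made up for illustration.

```python
def filler(line_pos, expected, gap_width=50):
    """Spaces standing in for the items skipped between expected and line_pos."""
    missing = line_pos - expected
    if missing <= 0:
        return ""            # nothing was skipped, no filler needed
    # Each missing item is gap_width wide, with one separator space
    # between neighbouring missing items.
    return " " * (gap_width * missing + (missing - 1))

print(len(filler(2, 1)))  # 50: one item (position 1) is missing
print(len(filler(3, 1)))  # 101: two items missing, 50 + 1 + 50
```

The returned string would be appended to the current output line as its own element, and only when it is non-empty, so that the join adds the usual single space on either side.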
Dictionary will be something like:

with open("txtfile.txt", "r") as refText:
    positions = dict()
    for i, line in enumerate(refText):
        positions[line[:3]] = i #shouldn't need the slice, however it will prevent issues with accidental spaces etc being added.

Don't have time to put something together at the moment but should do tomorrow evening and if not then Saturday. But the above logic should do it.

The reason the abc, abc, abc case was missing a line was the bit below. As I said in the post above, I thought it was wrong.

        else:
            try:
                print >>outFile,newLine[0]
                newLine = []
            except: pass
            print >>outFile,line # Oops, this bit is wrong

        else:
            try:
                print >>outFile,newLine[0]
                newLine = []
            except: pass
            newLine = [line]  #Should have been
            appendCount = 1 # Like this

#14

This version should work with as many items as you want, so long as you put them into the dictionary in the right order. I haven't tested it as I'm at work. The only thing I can think of that might happen is that an empty line gets put at the start of the file. I just need to test the quickest way to handle it. It may be a case of just deleting it at the end rather than adding overhead for the other 3.5 million lines.

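def format_file(inFile, outFile,dic):
    expected = 0
    gap = " "*50
    newLine = []
    join = " ".join          # local aliases avoid repeated attribute lookups
    write = outFile.write

    for line in inFile:
        sLine = line.rstrip("\n")  # safe even if the last line has no newline
        linePos = dic[line[:3]]
        if linePos == 0:
            # "abc" starts a new record, so write out the previous one first
            if newLine:
                write(join(newLine)+"\n")
            newLine = [sLine]
            expected = 1
        else:
            gaps = linePos - expected
            if gaps != 0:
                # one gap per missing item, plus the spaces join would have added
                newLine.append(gap*gaps+" "*(gaps-1))
                expected = linePos + 1
            else:
                expected += 1
            newLine.append(sLine)
    # write out the final record, which has no following "abc" line
    if newLine:
        write(join(newLine)+"\n")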
            
def main():
    positions = {"abc":0,
                 "cde":1,
                 "efg":2,
                 "hij":3}
    with open("Text.txt","r") as inTxt, open("NewText.txt","w") as outTxt:
        format_file(inTxt,outTxt,positions)
        
if __name__ == "__main__":
    main()

I didn't go for the separate config file because it isn't technically required and you haven't said you wanted it. But it would probably make editing easier if you did, e.g. if you needed to add more three-letter codes or change the gap size, output location or input file location.
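
For what it's worth, a config file along those lines could look something like the sketch below. It uses JSON purely as an assumption; the key names (codes, gap_width, input_path, output_path) and the function load_settings are hypothetical and not something from the code above.

```python
import json

# Hypothetical config layout; none of these key names come from the thread.
CONFIG_TEXT = """
{
    "codes": ["abc", "cde", "efg", "hij"],
    "gap_width": 50,
    "input_path": "Text.txt",
    "output_path": "NewText.txt"
}
"""

def load_settings(text):
    """Turn the config text into the values format_file and main would need."""
    cfg = json.loads(text)
    # The order of the codes in the list defines their position in the output.
    positions = {code: i for i, code in enumerate(cfg["codes"])}
    gap = " " * cfg["gap_width"]
    return positions, gap, cfg["input_path"], cfg["output_path"]

positions, gap, in_path, out_path = load_settings(CONFIG_TEXT)
print(positions)
```

In real use the text would be read from a file rather than a string, and adding a new three-letter code would then only mean adding it to the "codes" list in the right place.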

EDIT: Just tested it and made changes. Works from what I can tell.

