Python: how to count the unique commenters in a JSON text file?


#1

I am working on Python code that counts the number of unique commenters in a chat, using the unique id stored in the "_id" field nested inside the "commenter" field. The JSON looks like this.

JSON sample:

{ 
"_id":"123adfvssw",
"content_type":"video",
"content_id":"12345",
"commenter":{ 
"display_name":"student1",
"name":"student1",
"type":"user",
},
"source":"chat",
"state":"published",
"message":{ 
"body":"Hi",
"fragments":[ 
{ 
"text":"Hi"
}
],
"is_action":false
},
"more_replies":false
}

{ 
"_id":"123adfvssw",
"content_type":"video",
"content_id":"12345",
"commenter":{ 
"display_name":"student2",
"name":"student2",
"type":"user",
},
"source":"chat",
"state":"published",
"message":{ 
"body":"Hey!",
"fragments":[ 
{ 
"text":"Hey"
}
],
"is_action":false
},
"more_replies":false
}

{ 
"_id":"123adfvssw",
"content_type":"video",
"content_id":"12345",
"commenter":{ 
"display_name":"student1",
"name":"student1",
"type":"user",
},
"source":"chat",
"state":"published",
"message":{ 
"body":"How are you?",
"fragments":[ 
{ 
"text":"How are you?"
}
],
"is_action":false
},
"more_replies":false
}

In all, the thread received 3 comments. However, student1 commented more than once, so there are only two unique commenters. My question is: how do I make sure that I count only the unique commenters, using the _id field in the JSON? I can count every commenter field in the file, but not the unique commenters. The code I wrote counts all the commenter fields and prints 3, when the answer should be 2 because student1 commented twice.

Code that prints the number of commenter fields:

import json

files = "/chatinfo.txt"

with open(files) as f:
    commenters = 0
    for line in f:
        # each line of the file holds one JSON object
        jsondata = json.loads(line)
        # counts every comment that has a commenter, duplicates included
        if "commenter" in jsondata:
            commenters += 1

print(commenters)

Output:

3

My attempt at collecting the commenter _id values in a list, so that I can compare them and count only the unique ones:

import json
files = "/chatinfo.txt"
with open(files) as f:
	num_with_field = 0
	for line in f:
		jsondata = json.loads(line)
		dictjson = json.dumps(jsondata)
		if "commenter" in jsondata:
			commenterid = []
			commenterid.append(jsondata["commenter"]["_id"])
			print(commenterid)

Output:
			
['193984934']
['157255102']
['100365638']


____________

However, when I call print(commenterid) after the loop to see what is in the list, I get ['100365638']. Only one of the three values was stored. Can anyone help me fill the list with all three values? It should contain ['193984934', '157255102', '100365638']. I am new to Python.

#2

Shouldn't commenterid be initialized before the loop, not inside it?
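
To illustrate the point, here is a minimal sketch of the difference. The ids are hard-coded stand-ins for the values read from each line of the file, and the names `inside` and `outside` are made up for this example.

```python
# Stand-ins for jsondata["commenter"]["_id"] read from each line of the file.
ids_in_file = ["193984934", "157255102", "100365638"]

# List created inside the loop: it is replaced by a new empty list on every pass,
# so only the last id survives.
for commenter_id in ids_in_file:
    inside = []
    inside.append(commenter_id)

# List created once before the loop: it keeps every id that is appended.
outside = []
for commenter_id in ids_in_file:
    outside.append(commenter_id)

print(inside)   # ['100365638']
print(outside)  # ['193984934', '157255102', '100365638']
```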


#3

You, sir, have saved the day for my array/list!!! Thanks soooo much!!!

import json

files = "/chatinfo.txt"
commenterid = []  # created once, before the loop, so it keeps every id

with open(files) as f:
    for line in f:
        jsondata = json.loads(line)
        if "commenter" in jsondata:
            commenterid.append(jsondata["commenter"]["_id"])

print(commenterid)

Output:

['193984934', '157255102', '100365638']
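
With the ids collected, the unique count asked about in the first post could be obtained by storing the ids in a set instead of a list. The following is only a sketch: the function name `count_unique_commenters` and the sample records are invented for illustration, and it assumes one JSON object per line with the id at `commenter._id`, as in the code above.

```python
import json


def count_unique_commenters(lines):
    """Return the number of distinct commenter ids in an iterable of JSON lines."""
    seen = set()  # a set stores each id at most once
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        commenter = record.get("commenter")
        if isinstance(commenter, dict) and "_id" in commenter:
            seen.add(commenter["_id"])
    return len(seen)


# Made-up sample: three comments, two distinct commenter ids.
sample = [
    '{"commenter": {"_id": "193984934", "name": "student1"}}',
    '{"commenter": {"_id": "157255102", "name": "student2"}}',
    '{"commenter": {"_id": "193984934", "name": "student1"}}',
]
print(count_unique_commenters(sample))  # 2
```

An open file is also an iterable of lines, so `count_unique_commenters(f)` inside `with open(files) as f:` should work the same way.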

#4

I am working on a project to count the unique commenters in a chat and store the file name and the number of commenters for each chat file in a CSV. My current code opens each document and counts all the commenters in that file. However, in some files a specific commenter has commented twice or more, and I would like to count each commenter only once, based on their id. I feel like I am very close, but I am stuck. Can anyone help with this issue or suggest other ways of doing this?

Code:


import json, os
import pandas as pd
import numpy as np
from collections import Counter
TextFiles = []
FName = []
csv_rows = []
commenterid = []
unique_id = []
NC = []
for root, dirs, files in os.walk("/Users/aaaa/Desktop/"):
 for file in files:
     if file.endswith("chatinfo.txt"):
         path = "/Users/aaaa/Desktop/"
         filepath = os.path.join(path,file)
         head, filename = os.path.split(filepath)
         TextFiles.append(filepath)
         FName.append(filename)

         n_commenters = 0
         with open(filepath) as open_file:
             for line in open_file:
                 jsondata = json.loads(line)
                 if "commenter" in jsondata:
                     commenterid.append(jsondata["commenter"]["_id"])

                     list_set = set(commenterid)
                     unique_list = (list(list_set))

                 for x in list_set:
                     n_commenters += 1

                     commenterid.clear()
             csv_rows.append([filename, n_commenters])
df = pd.DataFrame(csv_rows, columns=['FileName', 'Unique_Commenters'])
df.to_csv('CommeterID.csv', index=False)

Current Output:

[Image: New_Output]

However, two of these documents have a commenter who has commented more than once. I would like to count them only once. The files I am working with are text files containing JSON.

JSON sample:

{  
   "_id":"123adfvssw",
   "content_type":"video",
   "content_id":"12345",
   "commenter":{  
      "display_name":"student2",
      "name":"student2",
      "type":"user",
   },
   "source":"chat",
   "state":"published",
   "message":{  
      "body":"Hello",
      "fragments":[  
         {  
            "text":"Hello"
         }
      ],
      "is_action":false
   },
   "more_replies":false
}
{  
   "_id":"123adfvssw",
   "content_type":"video",
   "content_id":"12345",
   "commenter":{  
      "display_name":"student",
      "name":"student",
      "type":"user",
   },
   "source":"chat",
   "state":"published",
   "message":{  
      "body":"Hi !",
      "fragments":[  
         {  
            "text":"Hi !"
         }
      ],
      "is_action":false
   },
   "more_replies":false
}

{  
   "_id":"123adfvssw",
   "content_type":"video",
   "content_id":"12345",
   "commenter":{  
      "display_name":"student2",
      "name":"student2",
      "type":"user",
   },
   "source":"chat",
   "state":"published",
   "message":{  
      "body":"How are you student?",
      "fragments":[  
         {  
            "text":"How are you student?"
         }
      ],
      "is_action":false
   },
   "more_replies":false
}

In the sample there are 3 comments in total, but only two unique commenters, since student2 commented twice in this chat. I would like to count the commenters based on their ids, and if an id appears multiple times, count it only once for the entire file. Can anyone help with this issue or suggest other ways of doing this?


#5

I actually solved this problem and found where I was going wrong. If anyone is interested, let me know and I'll post the solution here :)
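
The solution itself was not posted, so what follows is only a sketch of one possible approach, not necessarily the one used. In the code from post #4 the id list is cleared after every line, so the set never holds more than one id and every comment ends up counted. The sketch keeps one set per file instead. The function names are invented for this example, the standard `csv` module is swapped in for pandas to keep it self-contained, and it assumes one JSON object per line with the id at `commenter._id`.

```python
import csv
import json
import os


def unique_commenters_in_file(filepath):
    """Count distinct commenter ids in one file of JSON lines."""
    seen = set()  # a fresh set per file, so ids never carry over between files
    with open(filepath) as open_file:
        for line in open_file:
            if not line.strip():
                continue  # skip blank lines
            jsondata = json.loads(line)
            commenter = jsondata.get("commenter")
            if isinstance(commenter, dict) and "_id" in commenter:
                seen.add(commenter["_id"])
    return len(seen)


def collect_counts(top_dir, suffix="chatinfo.txt"):
    """Walk top_dir and return [filename, unique_count] rows."""
    rows = []
    for root, dirs, files in os.walk(top_dir):
        for name in sorted(files):
            if name.endswith(suffix):
                # join with root, not the top directory, so files in subfolders open too
                filepath = os.path.join(root, name)
                rows.append([name, unique_commenters_in_file(filepath)])
    return rows


def write_counts(rows, csv_path):
    """Write the rows to a CSV file with the same headers as the original code."""
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["FileName", "Unique_Commenters"])
        writer.writerows(rows)
```

It could be called as `write_counts(collect_counts("/Users/aaaa/Desktop/"), "CommeterID.csv")`, reusing the directory and output name from the code in post #4.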