Hello everyone,
I am currently trying to generate a co-occurrence matrix (or word-word matrix) from a column in which each row contains several comma-separated entries.
Example Data:
Row 1: “Data and Analytics, Design, Information Technology, Software”
Row 2: “Data Science, Design, FinTech, Software, Data and Analytics”
Row 3: “Media, Entertainment, Software, Web”
- First, split the lists into separate entries.
- Find the unique items: iterate through every row and append each new item to an initially empty list, unique_items.
- Iterate through the documents (in this case the rows) to find co-occurrences between items.
- Calculate the co-occurrences.
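To make the steps above concrete, here is a minimal sketch in plain Python on the example rows. The names rows, documents, unique_items and pair_counts are placeholders I made up, and treating each row as a set (so a repeated entry within one row counts only once) is an assumption, not something the data requires.

```python
from collections import Counter
from itertools import combinations

rows = [
    "Data and Analytics, Design, Information Technology, Software",
    "Data Science, Design, FinTech, Software, Data and Analytics",
    "Media, Entertainment, Software, Web",
]

# Step 1: split each row into entries and strip surrounding whitespace.
# Assumption: a set per row, so duplicates inside a row are counted once.
documents = [{entry.strip() for entry in row.split(",")} for row in rows]

# Step 2: collect the unique items across all rows.
unique_items = sorted(set().union(*documents))

# Steps 3 and 4: count every unordered pair once per row it appears in.
pair_counts = Counter()
for doc in documents:
    for a, b in combinations(sorted(doc), 2):
        pair_counts[(a, b)] += 1

print(len(unique_items))
print(pair_counts[("Design", "Software")])
```

Because each pair is stored once with its two items in sorted order, the lookup key has to be sorted as well, e.g. ("Design", "Software") rather than ("Software", "Design").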
Below you will find my current steps. My question is: is this a valid way to calculate the co-occurrences, or could it introduce a bias of some kind? Would you suggest a better approach or a best practice?
Thank you in advance
Code - work in progress (based on tutorials / examples I found)
import pandas as pd
import numpy as np
from collections import OrderedDict
# initialise sample dataframe
data = {'industry_field': ['Data and Analytics, Design, Information Technology, Software',
                           'Data Science, Design, FinTech, Software, Data and Analytics',
                           'Media, Entertainment, Software, Web']}
# Create DataFrame
df = pd.DataFrame(data)
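# drop rows with null values to avoid errors
df.dropna(inplace=True)
# split each row into a list of entries and strip the surrounding whitespace
# (the result of str.split must be assigned, otherwise it is discarded)
df_fields = df["industry_field"].str.split(",").apply(lambda items: [item.strip() for item in items])
df_fields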
# Series.items() yields an (index, value) tuple for each row of the Series
# (iteritems() was removed in pandas 2.0)
for (rowIndex, rowData) in df_fields.items():
    print('Row Index : ', rowIndex)
    print('Row Contents : ', rowData)
Second part: find the unique items and, as the last step, calculate the co-occurrences (work in progress):
# function to get unique values, preserving first-seen order
def unique(list1):
    # initialize an empty list
    unique_list = []
    # traverse all elements
    for x in list1:
        # append only if not already in unique_list
        if x not in unique_list:
            unique_list.append(x)
    return unique_list

document = [['A', 'B'], ['C', 'B'], ['A', 'B', 'C', 'D']]
# flatten the rows first, then collect the unique item names
names = unique(item for row in document for item in row)
# print list
for x in names:
    print(x)
# one counter per ordered pair of names, initialised to 0
occurrences = OrderedDict((name, OrderedDict((name, 0) for name in names)) for name in names)
# Find the co-occurrences:
for l in document:
    for i in range(len(l)):
        # every other item in the same row co-occurs with l[i]
        for item in l[:i] + l[i + 1:]:
            occurrences[l[i]][item] += 1
# Print the matrix:
print(' ', ' '.join(occurrences.keys()))
for name, values in occurrences.items():
    print(name, ' '.join(str(i) for i in values.values()))
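One alternative I am considering is a vectorised version, sketched below under a few assumptions: entries are always separated by exactly ", ", each row is turned into binary indicators (so a duplicate entry within a row counts once), and the diagonal is zeroed because I only care about pairs of different items. The names indicators, counts and co_occurrence are my own placeholders.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'industry_field': [
    'Data and Analytics, Design, Information Technology, Software',
    'Data Science, Design, FinTech, Software, Data and Analytics',
    'Media, Entertainment, Software, Web',
]})

# One row per document, one 0/1 column per unique item.
# Assumption: the separator is exactly ", " in every row.
indicators = df['industry_field'].str.get_dummies(sep=', ')

# X^T X counts, for every pair of items, the rows that contain both.
X = indicators.to_numpy(dtype=int)
counts = X.T @ X

# The diagonal holds each item's own row frequency; zero it out here.
np.fill_diagonal(counts, 0)

co_occurrence = pd.DataFrame(counts, index=indicators.columns, columns=indicators.columns)
print(co_occurrence)
```

As far as I can tell, both this and the loop version return raw counts, so items that appear in many rows (like Software here) will show high co-occurrence with almost everything. Whether that counts as a bias probably depends on whether the counts are normalised afterwards.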