How to generate a co-occurrence matrix

Hello everyone,

I am currently trying to generate a co-occurrence matrix (or word-word matrix) from a column in which each row holds several comma-separated entries.

Example Data:
Row 1: “Data and Analytics, Design, Information Technology, Software”
Row 2: “Data Science, Design, FinTech, Software, Data and Analytics”
Row 3: “Media, Entertainment, Software, Web”

  • First, split the lists into separate entries.
  • Find the unique items: iterate through every row and append each item not seen before to an initially empty list, unique_items.
  • Iterate through the documents (in this case the rows) to find co-occurrences between items.
  • Calculate the co-occurrence counts.
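
The steps above can be sketched in plain Python as follows. This is only a minimal sketch under my own assumptions: the names `rows`, `documents`, `unique_items` and `pair_counts` are illustrative, entries are separated by commas, and a pair is counted at most once per row even if an entry is repeated within that row.

```python
from collections import Counter
from itertools import combinations

rows = [
    "Data and Analytics, Design, Information Technology, Software",
    "Data Science, Design, FinTech, Software, Data and Analytics",
    "Media, Entertainment, Software, Web",
]

# Step 1: split each row into separate, whitespace-stripped entries
documents = [[item.strip() for item in row.split(",")] for row in rows]

# Step 2: collect the unique items in order of first appearance
unique_items = list(dict.fromkeys(item for doc in documents for item in doc))

# Steps 3 and 4: count every unordered pair once per row.
# set(doc) removes duplicates inside a row; sorted() gives each pair a fixed key order.
pair_counts = Counter()
for doc in documents:
    for a, b in combinations(sorted(set(doc)), 2):
        pair_counts[(a, b)] += 1

print(len(unique_items))
print(pair_counts[("Design", "Software")])
```

Because the pairs are stored with sorted keys, a lookup has to use the same order, for example `("Design", "Software")` rather than `("Software", "Design")`.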

Below are my current steps. Is this a legitimate way to calculate the co-occurrences, or could it introduce a bias of some kind? Would you suggest a better approach or a best-practice model?

Thank you in advance

Code - work in progress (based on tutorials / examples I found)

import pandas as pd
import numpy as np
from collections import OrderedDict

# initialise sample dataframe
data = {'industry_field':['Data and Analytics, Design, Information Technology, Software',
'Data Science, Design, FinTech, Software, Data and Analytics',
'Media, Entertainment, Software, Web']} 
# Create DataFrame 
df = pd.DataFrame(data) 
# drop rows with null values to avoid errors
df.dropna(inplace=True)
# select the column that holds the comma-separated entries

df_fields = df["industry_field"]


# df_fields is a Series, so items() yields an (index, value) tuple for each row
# (iteritems() was removed in pandas 2.0)
for (rowIndex, rowData) in df_fields.items():
   print('Row Index : ', rowIndex)
   print('Row Contents : ', rowData)
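
The split into separate entries is not done in the code above yet. A possible sketch with pandas, assuming the entries are comma-separated and may carry surrounding whitespace (the names `split_fields` and `unique_fields` are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'industry_field': [
    'Data and Analytics, Design, Information Technology, Software',
    'Data Science, Design, FinTech, Software, Data and Analytics',
    'Media, Entertainment, Software, Web',
]})

# split each string on commas, then strip stray whitespace around every entry
split_fields = df['industry_field'].str.split(',').apply(
    lambda items: [item.strip() for item in items]
)

# explode() yields one row per entry; unique() keeps the order of first appearance
unique_fields = split_fields.explode().unique().tolist()

print(split_fields.iloc[2])
print(len(unique_fields))
```

Each element of `split_fields` is then a plain Python list, which is the same shape as the `document` list of lists used in the second part.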

Second part: find the unique items and, as the last step, calculate the co-occurrences (work in progress):

# function to get unique values
def unique(list1):
    # initialize an empty list
    unique_list = []
    # traverse all elements
    for x in list1:
        # append x only if it is not in unique_list yet
        if x not in unique_list:
            unique_list.append(x)
    # print list
    for x in unique_list:
        print(x)
    return unique_list

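document = [['A', 'B'], ['C', 'B'], ['A', 'B', 'C', 'D']]

# collect the unique item names in order of first appearance
names = list(OrderedDict.fromkeys(item for row in document for item in row))

# nested dict: one zero-initialised counter per ordered pair of names
occurrences = OrderedDict((name, OrderedDict((name, 0) for name in names)) for name in names)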

# Find the co-occurrences:
for l in document:
    for i in range(len(l)):
        for item in l[:i] + l[i + 1:]:
            occurrences[l[i]][item] += 1

# Print the matrix:
print(' ', ' '.join(occurrences.keys()))
for name, values in occurrences.items():
    print(name, ' '.join(str(i) for i in values.values()))
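
For comparison, the same kind of matrix might also be obtainable directly from the dataframe column. The following is a sketch, not a tested best practice: it assumes the delimiter is exactly ", " and uses `Series.str.get_dummies` to build a 0/1 indicator table, so repeated entries within one row cannot inflate the counts. The names `indicators` and `cooc` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'industry_field': [
    'Data and Analytics, Design, Information Technology, Software',
    'Data Science, Design, FinTech, Software, Data and Analytics',
    'Media, Entertainment, Software, Web',
]})

# one 0/1 indicator column per distinct entry (rows x items)
indicators = df['industry_field'].str.get_dummies(sep=', ')

# (items x rows) times (rows x items): each cell counts the rows containing both items
cooc = indicators.T.dot(indicators)

# the diagonal holds each item's own row count, so set it to zero
for item in cooc.index:
    cooc.loc[item, item] = 0

print(cooc.loc['Design', 'Software'])
```

The result is symmetric, so `cooc.loc['Design', 'Software']` and `cooc.loc['Software', 'Design']` hold the same count.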
