Hello everyone!

Here is a script I wrote to count the patterns (the distinct combinations of values) that occur across several categorical variables. I needed it while trying to understand the missingness mechanism across a set of variables.

The script needs the following Python packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

I randomly generate some integers representing categorical/dummy variables:

n=100
np.random.seed(1)
df = pd.DataFrame([np.random.randint(0,3, n), np.random.randint(0,2, n), np.random.randint(0,3, n)]).T
df.head()

The output looks like this:

  0 1 2
0 1 0 1
1 0 1 2
2 0 1 1
3 1 1 1
4 1 1 0

Here is a script that finds the frequency of each pattern:

temp_df = df.copy()
patterns = {}
for i in range(temp_df.shape[0]):
    pattern = '-'.join(temp_df.iloc[0,:].values.astype('str'))
    findings = (temp_df== temp_df.iloc[0,:]).all(1)
    patterns[pattern] = findings.sum()
    temp_df = temp_df[findings != True]
    if temp_df.shape[0] < 1:
        break

This loop:

  • Builds the pattern of the first remaining row ('-'.join(temp_df.iloc[0,:].values.astype('str'))) and finds every row identical to it ((temp_df== temp_df.iloc[0,:]).all(1)).
  • Saves the frequency of that pattern (findings.sum()).
  • Drops the rows that were just counted (temp_df = temp_df[findings != True]).
  • Breaks once no rows are left, i.e. when all patterns have been found.
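
For comparison, the same counts can probably be obtained without an explicit loop. The sketch below is not part of my original script: it assumes the same toy data as above, joins each row into a label, and lets pandas count the labels. The names labels and pattern_counts are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Same toy data as in the post
n = 100
np.random.seed(1)
df = pd.DataFrame([np.random.randint(0, 3, n),
                   np.random.randint(0, 2, n),
                   np.random.randint(0, 3, n)]).T

# Build one "a-b-c" label per row (hypothetical helper name: labels)
labels = df.astype(str).apply(lambda row: '-'.join(row), axis=1)

# Count how often each distinct label occurs
pattern_counts = labels.value_counts()
print(pattern_counts)
```

One difference to keep in mind: value_counts sorts by frequency by default, whereas the loop stores patterns in order of first appearance, so the bars may come out in a different order even though the counts agree.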

Finally, I plot the pattern frequencies as a bar chart and save it:

plt.figure(figsize=(10,6))
plt.bar(patterns.keys(), patterns.values())
plt.xticks(rotation='vertical')
plt.savefig('patterns.png')

[Figure: Categorical Variable Patterns - Python]
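
Since the motivation was missingness, here is a minimal sketch of how the same idea might be applied to missing-value patterns. The data frame and its column names (age, income, city) are invented for illustration and are not from my data. Each cell is first turned into a 0/1 indicator (1 = missing), and the indicator rows are then counted as patterns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values (not from the post)
data = pd.DataFrame({
    'age':    [25, np.nan, 31, np.nan, 40],
    'income': [50, 60, np.nan, np.nan, 55],
    'city':   ['A', 'B', None, None, 'C'],
})

# 1 = missing, 0 = observed
indicators = data.isna().astype(int)

# Join each indicator row into a label and count the labels
missing_patterns = (
    indicators.astype(str)
    .apply(lambda row: '-'.join(row), axis=1)
    .value_counts()
)
print(missing_patterns)
```

Converting to indicators first matters because NaN never compares equal to itself, so an equality check like the one in the loop above would not match rows that contain NaN.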