A Fast Method to Create Synthetic Data with Python

I have been working on a project to create synthetic data, mostly using the R package synthpop. Since then, I have been thinking about a very simple algorithm that creates synthetic data with the nearest neighbor algorithm, and I have implemented it in a Python package named synloc. I discuss the practical and theoretical aspects here: Generating Synthetic Data with The Nearest Neighbors Algorithm

synloc is an algorithm to sequentially and locally estimate distributions to create synthetic values from a sample. The proposed methodology can be combined with parametric and nonparametric distributions. synloc can be installed through PyPI:

pip install synloc

An Example

Assume that we have a sample with three variables with the following distributions:

\[\begin{aligned} x &\sim Beta(0.1,\,0.1)\\ y &\sim Beta(0.1,\, 0.5)\\ z &\sim 10 * y + Normal(0,\,1) \end{aligned}\]

A sample from this distribution can be generated with the tools module in synloc:

from synloc.tools import sample_trivariate_xyz
data = sample_trivariate_xyz() # Generates a sample with size 1000 by default. 
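The internals of sample_trivariate_xyz are not shown here, so the following is only a minimal NumPy sketch of the three distributions stated above. The function name sample_xyz and its seed argument are hypothetical and not part of synloc.

```python
import numpy as np

def sample_xyz(n=1000, seed=0):
    """Draw n rows of (x, y, z) from the distributions stated above."""
    rng = np.random.default_rng(seed)
    x = rng.beta(0.1, 0.1, size=n)             # x ~ Beta(0.1, 0.1)
    y = rng.beta(0.1, 0.5, size=n)             # y ~ Beta(0.1, 0.5)
    z = 10 * y + rng.normal(0.0, 1.0, size=n)  # z ~ 10 * y + Normal(0, 1)
    return np.column_stack([x, y, z])

sample = sample_xyz()
print(sample.shape)  # (1000, 3)
```

Both Beta distributions have shape parameters below one, so most of the mass of x and y sits near 0 and 1, which is far from a multivariate normal shape.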

Creating synthetic values with Multivariate Normal Distribution

Initializing the resampler:

from synloc import LocalCov
resampler = LocalCov(data = data, K = 30)

The subsample size is set to K=30. We now estimate a multivariate normal distribution locally, around each observation, and draw “synthetic values” from each estimated distribution.

syn_data = resampler.fit() 

100%|██████████| 1000/1000 [00:01<00:00, 687.53it/s]

The default sample size is the size of the original sample (i.e., 1000). It can be changed while fitting the distributions:

syn_data = resampler.fit(100) # creates a synthetic sample of size 100

syn_data is a pandas.DataFrame where all variables are synthesized.
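To make the idea of local estimation concrete, here is a rough NumPy-only sketch: for every observation, find its K nearest rows, fit a multivariate normal to that subsample, and draw one value. The function local_mvn_resample is hypothetical, and the standardization and brute-force distance search are assumptions of this sketch; it is not synloc's actual implementation.

```python
import numpy as np

def local_mvn_resample(data, k=30, seed=0):
    """For each row, fit a multivariate normal to its k nearest rows and draw one value."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Standardize columns so that no variable dominates the distance (assumption)
    scaled = (data - data.mean(axis=0)) / data.std(axis=0)
    synthetic = np.empty_like(data)
    for i in range(len(data)):
        # Brute-force Euclidean distances from row i to every row
        dist = np.linalg.norm(scaled - scaled[i], axis=1)
        idx = np.argpartition(dist, k - 1)[:k]  # indices of the k smallest distances
        neigh = data[idx]
        mean = neigh.mean(axis=0)
        cov = np.cov(neigh, rowvar=False)
        synthetic[i] = rng.multivariate_normal(mean, cov)
    return synthetic

demo = np.random.default_rng(1).normal(size=(200, 3))
syn = local_mvn_resample(demo, k=30)
print(syn.shape)  # (200, 3)
```

Each synthetic row comes from a different local fit, which is why the method can follow shapes that a single global multivariate normal cannot.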

Comparing the original and synthetic samples using a 3-D scatter plot:

resampler.comparePlots(['x','y','z'])

Creating synthetic values with Gaussian Copula

Initializing the resampler:

from synloc import LocalGaussianCopula
resampler = LocalGaussianCopula(data = data, K = 30)

We locally estimate a Gaussian copula and draw “synthetic values” from each estimated distribution.

syn_data_copula = resampler.fit() 

100%|██████████| 1000/1000 [00:03<00:00, 272.71it/s]
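For intuition, a single local step with a Gaussian copula can be sketched as follows, assuming empirical marginals within the neighborhood. The helper gaussian_copula_draw is hypothetical, and synloc may estimate the copula differently.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf

def gaussian_copula_draw(neigh, rng):
    """Draw one synthetic row from a Gaussian copula fitted to `neigh` (k rows, d columns)."""
    neigh = np.asarray(neigh, dtype=float)
    k, d = neigh.shape
    # 1. Ranks -> pseudo-observations in (0, 1) -> normal scores
    ranks = neigh.argsort(axis=0).argsort(axis=0) + 1
    u = ranks / (k + 1)
    scores = np.vectorize(_STD_NORMAL.inv_cdf)(u)
    # 2. Dependence structure: correlation matrix of the normal scores
    corr = np.corrcoef(scores, rowvar=False)
    # 3. One correlated normal draw, mapped to uniforms
    z_new = rng.multivariate_normal(np.zeros(d), corr)
    u_new = [_STD_NORMAL.cdf(v) for v in z_new]
    # 4. Back to the data scale through each column's empirical quantiles
    return np.array([np.quantile(neigh[:, j], u_new[j]) for j in range(d)])

rng = np.random.default_rng(0)
neigh = rng.normal(size=(30, 3))
row = gaussian_copula_draw(neigh, rng)
print(row.shape)  # (3,)
```

With empirical marginals, every synthetic value stays inside the range observed in its neighborhood, while the correlation matrix carries the dependence between variables.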

Comparing the original and synthetic samples using a 3-D scatter plot:

resampler.comparePlots(['x','y','z'])

Even though the distribution of the original sample substantially differs from multivariate normal, locally estimated distributions can replicate the original distribution quite accurately.
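Besides the plots, one simple numerical check is to compare marginal quantiles of the original and synthetic samples. The helper below is a hypothetical sketch and not part of synloc; it only looks at marginals, so it says nothing about the dependence between variables.

```python
import numpy as np

def max_quantile_gap(original, synthetic, probs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Largest absolute gap between matching marginal quantiles, per column."""
    original = np.asarray(original, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    q_orig = np.quantile(original, probs, axis=0)
    q_syn = np.quantile(synthetic, probs, axis=0)
    return np.abs(q_orig - q_syn).max(axis=0)

a = np.random.default_rng(0).normal(size=(500, 3))
print(max_quantile_gap(a, a))  # [0. 0. 0.]
```

Assuming both samples are numeric DataFrames with the same columns, calling max_quantile_gap(data, syn_data_copula) should work as well, since the inputs are converted with np.asarray.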

Customized Models

The kNNResampler class can be used to estimate and resample from any distribution.

Functional Principal Component Analysis (fPCA)

For example, if the original sample is high-dimensional, the user can reduce the dimension with fPCA. This can be implemented with the FPCADataGenerator class in the Synthia package:

from synloc import kNNResampler
from synloc.tools import stochastic_rounder
from synthia import FPCADataGenerator

class LocalFPCA(kNNResampler):
    def __init__(self, data, K = 30, normalize = True, clipping = True, Args_NearestNeighbors = {}):
        super().__init__(data, K, normalize, clipping, Args_NearestNeighbors, method = self.method)
    def method(self, data):
        generator = FPCADataGenerator()
        generator.fit(data, n_fpca_components=2)
        return generator.generate(1)[0]

Using kNNResampler as the parent class, we created LocalFPCA. The key step is passing the resampling method to kNNResampler:

super().__init__(data, K, normalize, clipping, Args_NearestNeighbors, method = self.method)

In the LocalFPCA class we define the self.method:

def method(self, data):
    generator = FPCADataGenerator()
    generator.fit(data, n_fpca_components=2) # reducing dimension to 2
    return generator.generate(1)[0]

Example

resampler = LocalFPCA(data = data)
resampler.fit()
resampler.comparePlots(['x','y','z'])

100%|██████████| 1000/1000 [00:02<00:00, 346.47it/s]

Problem with Discrete Variables

So far, we have assumed that the original data set contains only continuous variables. When it also contains discrete variables, we can either round the synthetic values of those variables (stochastically or deterministically) or use a multivariate distribution that can handle mixed-type variables. In the latter case, we need to define a new subclass of kNNResampler.
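The stochastic rounding option can be sketched in a few lines of NumPy: each value is rounded up with probability equal to its fractional part, so the rounded value equals the original in expectation. The function stochastic_round below is a hypothetical illustration; synloc.tools ships its own stochastic_rounder, whose exact signature is not shown in this post.

```python
import numpy as np

def stochastic_round(values, rng):
    """Round each value up with probability equal to its fractional part."""
    values = np.asarray(values, dtype=float)
    lower = np.floor(values)
    frac = values - lower
    # rng.random() is uniform on [0, 1), so whole numbers are never rounded up
    return (lower + (rng.random(values.shape) < frac)).astype(int)

rng = np.random.default_rng(0)
print(stochastic_round([2.0, 7.0], rng))  # [2 7]
draws = stochastic_round(np.full(10_000, 2.3), rng)
print(draws.mean())  # close to 2.3
```

Deterministic rounding would map every 2.3 to 2 and shift the mean, whereas stochastic rounding keeps the mean at the cost of some extra variance.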

One solution is to use the mixedvines package, which allows the user to specify which variables are discrete.

from synloc import kNNResampler
from mixedvines.mixedvine import MixedVine # pip install mixedvines


class LocalMixedVine(kNNResampler):
    def __init__(self, data, cont_cols, K = 30, normalize = True, clipping = True, Args_NearestNeighbors = {}):
        super().__init__(data, K, normalize, clipping, Args_NearestNeighbors, method = self.method)
        self.cont_cols = cont_cols
    
    def method(self, data):
        generator = MixedVine.fit(data.values, self.cont_cols)
        return generator.rvs(1)[0]

LocalMixedVine takes the argument cont_cols, a list of booleans with one entry per variable: True if the variable is continuous, False if it is discrete. Further specification can be done following the documentation.
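Instead of typing the boolean list by hand, it can be derived from the column dtypes, under the assumption that discrete variables are stored as integers and continuous ones as floats. The helper cont_cols_from_dtypes is hypothetical and belongs to neither synloc nor mixedvines.

```python
import numpy as np

def cont_cols_from_dtypes(dtypes):
    """True for floating-point columns, False for everything else (e.g. integers)."""
    return [bool(np.issubdtype(dt, np.floating)) for dt in dtypes]

dtypes = [np.dtype("int64"), np.dtype("int64"), np.dtype("float32")]
print(cont_cols_from_dtypes(dtypes))  # [False, False, True]
```

For a DataFrame with plain NumPy dtypes, passing its dtypes attribute should give the same kind of list; columns with pandas extension dtypes (such as categoricals) would need separate handling.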

Example

import warnings
import pandas as pd
warnings.filterwarnings("ignore", category=RuntimeWarning) 

data2 = pd.read_stata('../extract.dta')
data2 = data2[['age','educ', 'annwage']]
data2 = data2[~data2.isna().any(axis=1)] # drop rows with missing values; axis must be a keyword in pandas >= 2.0
data2.age = data2.age.astype('int')
data2.educ = data2.educ.astype('int')
data2 = data2.sample(1000)

resampler = LocalMixedVine(data = data2, K = 50, cont_cols = [False, False, True])
resampler.fit()
resampler.comparePlots(['age','educ', 'annwage'])

100%|██████████| 1000/1000 [02:04<00:00,  8.05it/s]

# Original sample looks like
print(data2.head())

   age  educ  annwage
    28    12  22801.0
    34    16  16000.0
    31    11  12500.0
    36    13  19000.0
    28    16  16993.0

# Synthetic sample looks like
print(resampler.synthetic.head())

   age  educ       annwage
    29    12  16236.643091
    33    16  18432.093681
    30    11  12190.666403
    36    13  20126.695065
    27    15  23331.840903