Cluster Resampler

This notebook shows how the clusterResampler methods create synthetic samples. clusterResampler relies on the Python package k-means-constrained to cluster the data. Two methods are demonstrated: the first draws synthetic values from a multivariate normal distribution, and the second draws them from a Gaussian copula.

[2]:
from synloc import sample_circulars_xy, clusterCov, clusterGaussCopula

Data

[4]:
df = sample_circulars_xy(1000)
df.head()
[4]:
            x           y
0   -7.439214   -6.410053
1  -16.626527  -10.295054
2    6.669369   19.920039
3   16.274841    5.968006
4    7.181718   -2.006049

Using Multivariate Normal Distribution

We use the clusterCov method to create synthetic data. Three parameters define the cluster properties. The first is the number of clusters, n_clusters. The second and third are the required minimum and maximum cluster sizes, respectively. Both size parameters are optional; however, it is advisable to take the required minimum cluster size (size_min in the call below) into account when choosing the resampling method.
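To make the idea concrete, here is a minimal NumPy sketch of cluster-wise resampling from a multivariate normal distribution. It is not synloc's implementation: the function name `resample_clusters_mvn`, the toy data, and the hand-assigned cluster labels are assumptions made for illustration, and the clustering step itself (done by k-means-constrained in the library) is skipped.

```python
import numpy as np

def resample_clusters_mvn(data, labels, rng):
    """Hypothetical helper: for each cluster, estimate a mean and covariance,
    then draw as many synthetic points as the cluster has real points."""
    synthetic = []
    for k in np.unique(labels):
        members = data[labels == k]
        mean = members.mean(axis=0)
        cov = np.cov(members, rowvar=False)  # columns are variables
        synthetic.append(rng.multivariate_normal(mean, cov, size=len(members)))
    return np.vstack(synthetic)

rng = np.random.default_rng(0)

# Toy data: two well-separated blobs standing in for two clusters.
data = np.vstack([
    rng.normal(loc=[-10.0, 0.0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[10.0, 0.0], scale=1.0, size=(50, 2)),
])
labels = np.repeat([0, 1], 50)  # assumed labels; a real run would cluster first

synthetic = resample_clusters_mvn(data, labels, rng)
print(synthetic.shape)
```

One plausible reason the minimum cluster size matters in a scheme like this is that each cluster needs enough points to estimate its covariance matrix reliably.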

[5]:
syn_cov = clusterCov(df, n_clusters=20, size_min=10)
syn_cov.fit()
[5]:
              x           y
0    -10.447402    7.088786
1     -4.048904   15.440417
2     -8.741493    9.510548
3     -7.061347   14.254181
4     -5.263386   16.549055
...         ...         ...
31    -6.509278  -25.342745
32    -4.047308  -21.856602
33    -3.537834  -23.911015
34    -4.728510  -21.240394
35    -3.581509  -24.209864

1000 rows × 2 columns

[6]:
syn_cov.comparePlots(['x', 'y'])
[Figure: output of syn_cov.comparePlots(['x', 'y'])]

Using Gaussian Copula
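As background for this section, the following is a rough sketch of how sampling from a Gaussian copula can work on a single group of points. It is not synloc's implementation: the function name `gaussian_copula_sample` and the toy data are assumptions, and the marginals are modelled here with empirical quantiles, which is only one of several possible choices.

```python
import numpy as np
from statistics import NormalDist

def gaussian_copula_sample(data, n_samples, rng):
    """Hypothetical helper: sample from a Gaussian copula fitted to `data`,
    using empirical quantiles for the marginal distributions."""
    n, d = data.shape
    std_normal = NormalDist()
    # 1. Rank-transform each column to (0, 1), then to standard normal scores.
    ranks = data.argsort(axis=0).argsort(axis=0) + 1  # ranks 1..n per column
    u = ranks / (n + 1)
    z = np.vectorize(std_normal.inv_cdf)(u)
    # 2. Estimate the dependence structure from the normal scores.
    corr = np.corrcoef(z, rowvar=False)
    # 3. Draw correlated normals and map them back to uniforms.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u_new = np.vectorize(std_normal.cdf)(z_new)
    # 4. Push the uniforms through each column's empirical quantile function.
    return np.column_stack(
        [np.quantile(data[:, j], u_new[:, j]) for j in range(d)]
    )

rng = np.random.default_rng(0)

# Toy data: a skewed variable and a noisy copy of it.
x = rng.exponential(size=200)
data = np.column_stack([x, x + rng.normal(scale=0.1, size=200)])

synthetic = gaussian_copula_sample(data, 200, rng)
print(synthetic.shape)
```

Because the marginals come from empirical quantiles in this sketch, the synthetic values never fall outside the observed range of each column, while the dependence between columns is carried by the correlation of the normal scores.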

[7]:
syn_cop = clusterGaussCopula(df, n_clusters=20, size_min=10)
syn_cop.fit()
[7]:
              x           y
0      0.705895   21.506721
1      7.240778   20.261066
2      5.910154   21.063820
3      6.170739   20.511732
4      0.763095   21.702028
...         ...         ...
33     2.251692  -17.754208
34     3.807215  -18.830935
35     2.790674  -21.364349
36     0.264260  -19.779841
37    -1.980704  -20.683878

1000 rows × 2 columns

[8]:
syn_cop.comparePlots(['x', 'y'])
[Figure: output of syn_cop.comparePlots(['x', 'y'])]