Resamplers

This page provides an auto-generated summary of synloc’s API.

class synloc.kNNResampler

Finds the K nearest neighbors of each observation and creates synthetic values from them with the given method.

Parameters
  • data (pandas.DataFrame) – Original data set to be synthesized

  • method (function) – Function to be used to create synthetic values from each cluster.

  • K (int, optional) – The number of nearest neighbors used to create synthetic samples, defaults to 30

  • normalize (bool, optional) – Normalize sample before defining clusters, defaults to True

  • clipping (bool, optional) – Trim synthetic values that are greater than the maximum or smaller than the minimum of each variable, defaults to True

  • Args_NearestNeighbors (dict, optional) – Additional arguments passed to the NearestNeighbors function, if needed (see scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html), defaults to {}
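The interaction of K, normalize, clipping, and method can be illustrated with a small standard-library sketch. This is not synloc's implementation: `knn_resample`, `zscore`, and `local_gaussian` are hypothetical names, the data is a list of tuples rather than a DataFrame, and the `method` signature used here (neighbors plus a random generator, returning one row) is an assumption made for the illustration.

```python
import math
import random
import statistics


def zscore(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(r, means, sds)) for r in rows]


def local_gaussian(neighbors, rng):
    """Hypothetical `method`: one normal draw per variable from the neighborhood."""
    cols = list(zip(*neighbors))
    return tuple(rng.gauss(statistics.fmean(c), statistics.pstdev(c)) for c in cols)


def knn_resample(rows, method, k=30, normalize=True, clipping=True, seed=0):
    rng = random.Random(seed)
    # Distances are measured in the normalized space; values are drawn in the original one.
    space = zscore(rows) if normalize else rows
    lows = [min(c) for c in zip(*rows)]
    highs = [max(c) for c in zip(*rows)]
    synthetic = []
    for point in space:
        # Indices of the k observations closest to this one (itself included).
        order = sorted(range(len(rows)), key=lambda j: math.dist(point, space[j]))
        neighbors = [rows[j] for j in order[:k]]
        draw = method(neighbors, rng)
        if clipping:
            # Pull values outside the observed range back to the nearest bound.
            draw = tuple(min(max(v, lo), hi) for v, lo, hi in zip(draw, lows, highs))
        synthetic.append(draw)
    return synthetic


data_rng = random.Random(1)
original = [(data_rng.uniform(0, 10), data_rng.uniform(100, 200)) for _ in range(60)]
synthetic = knn_resample(original, local_gaussian, k=10)
```

In this sketch each observation yields one synthetic row, so the output has as many rows as the input, and with clipping enabled no synthetic value falls outside the observed range of its variable.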

comparePlots(variable_list, fig_size=None)

Creates plots that compare the original sample with the synthetic sample.

Parameters
  • variable_list (list) – A list of variables in the data set, containing at most 3 variables. The type of plot depends on the list size: 1 -> histogram, 2 -> scatter plot, 3 -> 3D scatter plot.

  • fig_size (tuple, optional) – The figure size can be adjusted, defaults to None
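The mapping from list size to plot type can be written as a small lookup. `plot_kind` is a hypothetical helper, not part of synloc, and raising an error for other list sizes is an assumption, since the page does not say how the library reacts in that case.

```python
def plot_kind(variable_list):
    """Map the number of requested variables to the plot type described above."""
    kinds = {1: "histogram", 2: "scatter plot", 3: "3D scatter plot"}
    try:
        return kinds[len(variable_list)]
    except KeyError:
        # Assumption: an empty list or more than three variables is rejected.
        raise ValueError("variable_list must contain 1 to 3 variables") from None


kind = plot_kind(["x", "y"])
```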

fit(sample_size=None) → DataFrame

Creates the synthetic sample.

Parameters

sample_size (int, optional) – Required minimum size. If not specified, the synthetic sample has the same size as the original sample. Defaults to None

Returns

Returns the synthetic sample

Return type

pandas.DataFrame

class synloc.clusterResampler

Creates a synthetic sample by clustering.

This class creates subsamples from a given sample. The subsamples are created by clustering the original sample and then sampling from each cluster. The clustering is done by the KMeansConstrained function from the k-means-constrained package.

Parameters
  • data (pandas.DataFrame) – Original data set to be synthesized

  • method (function) – Function to be used to create synthetic values from each cluster.

  • n_clusters (int, optional) – The number of clusters, defaults to 8

  • size_min (int, optional) – Required minimum cluster size, defaults to None

  • size_max (int, optional) – Required maximum cluster size, defaults to None

  • normalize (bool, optional) – Normalize sample before defining clusters, defaults to True

  • clipping (bool, optional) – Trim synthetic values that are greater than the maximum or smaller than the minimum of each variable, defaults to True
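The cluster-then-sample idea can be sketched with the standard library alone. This is an illustration, not synloc's code: the cluster labels are assumed to come from an earlier clustering step (KMeansConstrained is not called here), `resample_by_cluster` and `gaussian_per_variable` are hypothetical names, and the `method` signature is an assumption made for the example.

```python
import random
import statistics
from collections import defaultdict


def gaussian_per_variable(cluster_rows, n, rng):
    """Hypothetical `method`: n rows of independent normal draws per variable."""
    cols = list(zip(*cluster_rows))
    params = [(statistics.fmean(c), statistics.pstdev(c)) for c in cols]
    return [tuple(rng.gauss(m, s) for m, s in params) for _ in range(n)]


def resample_by_cluster(rows, labels, method, seed=0):
    rng = random.Random(seed)
    # Group the observations by their cluster label.
    clusters = defaultdict(list)
    for row, label in zip(rows, labels):
        clusters[label].append(row)
    synthetic = []
    for members in clusters.values():
        # Each cluster contributes as many synthetic rows as it has members.
        synthetic.extend(method(members, len(members), rng))
    return synthetic


rows = [(1.0, 10.0), (1.2, 11.0), (0.9, 9.5), (8.0, 50.0), (8.3, 52.0), (7.9, 49.0)]
labels = [0, 0, 0, 1, 1, 1]  # assumed output of a prior clustering step
synthetic = resample_by_cluster(rows, labels, gaussian_per_variable)
```

Because every cluster is modeled separately, the two groups in this toy data keep their own location and spread instead of being blended into a single distribution.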

comparePlots(variable_list, fig_size=None)

Creates plots that compare the original sample with the synthetic sample.

Parameters
  • variable_list (list) – A list of variables in the data set, containing at most 3 variables. The type of plot depends on the list size: 1 -> histogram, 2 -> scatter plot, 3 -> 3D scatter plot.

  • fig_size (tuple, optional) – The figure size can be adjusted, defaults to None

fit(sample_size=None) → DataFrame

Creates the synthetic sample.

Parameters

sample_size (int, optional) – Required minimum size. If not specified, the synthetic sample drawn from each cluster has the same size as that cluster. Defaults to None

Returns

Returns the synthetic sample

Return type

pandas.DataFrame
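One way to read "required minimum size" for a clustered sample is a proportional allocation that rounds up, so the combined size never falls below the request. The sketch below shows that reading only; `allocate` is a hypothetical helper, and whether synloc distributes sample_size across clusters this way is an assumption.

```python
import math


def allocate(cluster_sizes, sample_size=None):
    """Return the number of synthetic rows to draw from each cluster."""
    if sample_size is None:
        # Default described above: each cluster keeps its own size.
        return list(cluster_sizes)
    total = sum(cluster_sizes)
    # Rounding up keeps the combined size at or above the requested minimum.
    return [math.ceil(sample_size * size / total) for size in cluster_sizes]


default_counts = allocate([40, 35, 25])       # [40, 35, 25]
scaled_counts = allocate([40, 35, 25], 250)   # [100, 88, 63]
```

Under this allocation the total can slightly exceed the requested size (251 rows in the second call), which is consistent with sample_size being a minimum rather than an exact target.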