Resamplers

This page provides an auto-generated summary of synloc’s API.

class synloc.kNNResampler

Finds the K nearest neighbors of each observation and creates a synthetic value from them using the given method.

Parameters:
  • data (pandas.DataFrame) – Original data set to be synthesized

  • method (function) – Function to be used to create synthetic values from each cluster. Must accept a pandas.DataFrame (the neighbors) and return a 1D array-like object (e.g., pandas.Series, numpy.array) representing the synthetic point.

  • K (int, optional) – The number of the nearest neighbors used to create synthetic samples, defaults to 30

  • normalize (bool, optional) – Normalize sample before defining clusters, defaults to True

  • clipping (bool, optional) – Trim values above the maximum (below the minimum) of each variable, defaults to True

  • n_jobs (int, optional) – Number of CPU cores to use for parallel processing. -1 means using all processors. Defaults to -1.

  • Args_NearestNeighbors (dict, optional) – Arguments passed to scikit-learn's NearestNeighbors, if needed. See scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html. Defaults to {}
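The contract for method (a DataFrame of neighbors in, a 1D array-like out) can be illustrated with a small sketch. The function name gaussian_method and its optional rng argument are hypothetical and not part of synloc; drawing from a normal distribution fitted to the neighbors is only one possible choice.

```python
import numpy as np
import pandas as pd


def gaussian_method(neighbors: pd.DataFrame, rng=None) -> np.ndarray:
    """Hypothetical `method`: draw one point from a normal fit to the neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    mean = neighbors.mean().to_numpy()
    # Sample covariance of the neighbors (columns are variables).
    cov = np.atleast_2d(np.cov(neighbors.to_numpy(), rowvar=False))
    # One draw from N(mean, cov); the result has one value per column.
    return rng.multivariate_normal(mean, cov)


neighbors = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 1.0, 4.0, 3.0]})
point = gaussian_method(neighbors, rng=np.random.default_rng(0))
```

A function of this shape would presumably be supplied through the method parameter listed above, alongside data and K.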

comparePlots(variable_list, fig_size=None)

Creates plots comparing the original sample and the synthetic sample.

Parameters:
  • variable_list (list) – A list of variables in the data set. The list may contain at most 3 variables. The type of plot depends on the list size: 1 → histogram, 2 → scatter plot, 3 → 3D scatter plot.

  • fig_size (tuple, optional) – The figure size can be adjusted, defaults to None

fit(sample_size=None) → DataFrame

Creates a synthetic sample using parallel processing.

Parameters:

sample_size (int, optional) – Required minimum size. If not specified, the synthetic sample has the same size as the original sample. Defaults to None

Returns:

Returns the synthetic sample

Return type:

pandas.DataFrame
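As a rough picture of nearest-neighbor resampling, the sketch below builds one synthetic row per observation. It is not synloc's implementation: it runs serially, uses brute-force distances in NumPy instead of scikit-learn's NearestNeighbors, and assumes that each observation counts among its own neighbors. The name knn_resample is hypothetical.

```python
import numpy as np
import pandas as pd


def knn_resample(data: pd.DataFrame, method, K: int = 3) -> pd.DataFrame:
    """Hypothetical sketch: one synthetic row per observation."""
    values = data.to_numpy(dtype=float)
    # Standardize columns so that no variable dominates the distances.
    std = values.std(axis=0)
    std[std == 0] = 1.0
    scaled = (values - values.mean(axis=0)) / std
    rows = []
    for i in range(len(scaled)):
        dist = np.linalg.norm(scaled - scaled[i], axis=1)
        idx = np.argsort(dist)[:K]  # assumption: includes the observation itself
        # The method sees the neighbors on the original scale.
        rows.append(np.asarray(method(data.iloc[idx])))
    return pd.DataFrame(rows, columns=data.columns)


data = pd.DataFrame({"x": [0.0, 1.0, 2.0, 10.0, 11.0, 12.0],
                     "y": [0.0, 1.0, 0.0, 5.0, 6.0, 5.0]})
# Toy method: the synthetic point is the mean of the neighbors.
synthetic = knn_resample(data, method=lambda nb: nb.mean(), K=3)
```

With the toy mean method the output is deterministic, which makes the neighborhood structure easy to inspect; a random method would give a different sample on each call.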

round_integers(integer_columns: list, stochastic: bool = True) → None

Rounds variables to integers.

Parameters:
  • integer_columns (list) – The list of variables to be rounded.

  • stochastic (bool, optional) – Variables are rounded by a stochastic process, defaults to True
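The page does not say which stochastic process is used. One common scheme, shown here purely as an assumption, rounds a value up with probability equal to its fractional part, so that the rounded values are unbiased on average. The helper stochastic_round is hypothetical and not part of synloc.

```python
import numpy as np


def stochastic_round(values, rng=None):
    """Hypothetical sketch: round up with probability equal to the fractional part."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    lower = np.floor(values)
    frac = values - lower
    # A uniform draw in [0, 1) falls below frac with probability frac,
    # so exact integers (frac == 0) are never changed.
    round_up = rng.random(values.shape) < frac
    return (lower + round_up).astype(int)


rounded = stochastic_round([2.0, 2.3, 2.7, -1.5], rng=np.random.default_rng(0))
```

Compared with deterministic rounding, this keeps the expected value of each entry equal to its unrounded value.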

class synloc.clusterResampler

Creates a synthetic sample by clustering.

This class creates subsamples from a given sample. The subsamples are created by clustering the original sample and then sampling from each cluster. The clustering is done by standard KMeans with a heuristic for size_min.

Parameters:
  • data (pandas.DataFrame) – Original data set to be synthesized

  • method (function) – Function to be used to create synthetic values from each cluster.

  • n_clusters (int, optional) – The number of clusters, defaults to 8

  • size_min (int, optional) – Required minimum cluster size, defaults to None

  • normalize (bool, optional) – Normalize sample before defining clusters, defaults to True

  • clipping (bool, optional) – Trim values above the maximum (below the minimum) of each variable, defaults to True
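The effect of clipping can be sketched with plain pandas. This is an illustration rather than synloc's code, and it assumes the bounds are the per-variable minimum and maximum of the original data; the two small frames are made-up examples.

```python
import pandas as pd

original = pd.DataFrame({"age": [20.0, 35.0, 50.0],
                         "income": [1000.0, 2500.0, 4000.0]})
synthetic = pd.DataFrame({"age": [18.2, 40.1, 53.7],
                          "income": [900.0, 3000.0, 4100.0]})

# Clip each column to the [min, max] range of the same column in the original data.
clipped = synthetic.clip(lower=original.min(), upper=original.max(), axis=1)
```

Values already inside the original range are left unchanged; only the out-of-range entries are pulled back to the nearest bound.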

comparePlots(variable_list, fig_size=None)

Creates plots comparing the original sample and the synthetic sample.

Parameters:
  • variable_list (list) – A list of variables in the data set. The list may contain at most 3 variables. The type of plot depends on the list size: 1 → histogram, 2 → scatter plot, 3 → 3D scatter plot.

  • fig_size (tuple, optional) – The figure size can be adjusted, defaults to None

fit(sample_size=None) → DataFrame

Creates a synthetic sample.

Parameters:

sample_size (int, optional) – Required minimum size. If not specified, the synthetic sample size will be the cluster size. Defaults to None

Returns:

Returns the synthetic sample

Return type:

pandas.DataFrame