Resamplers

The base resampler classes are public for users who want to provide their own local sampling method.

class synloc.kNNResampler(data, method, K=30, normalize=True, clipping=True, n_jobs=-1, Args_NearestNeighbors={}, random_state=None)

Finds the K nearest neighbors of each observation and creates synthetic values from them using the given method.

Parameters:
  • data (pandas.DataFrame) – Original data set to be synthesized

  • method (function) – Function used to create a synthetic value from each set of neighbors. Must accept a pandas.DataFrame (the neighbors) and return a 1D array-like object (e.g., pandas.Series, numpy.ndarray) representing the synthetic point.

  • K (int, optional) – Number of nearest neighbors used to create synthetic samples, defaults to 30

  • normalize (bool, optional) – Normalize sample before defining clusters, defaults to True

  • clipping (bool, optional) – Trim synthetic values greater (smaller) than the maximum (minimum) of each variable, defaults to True

  • n_jobs (int, optional) – Number of CPU cores to use for parallel processing. -1 means using all processors. Defaults to -1.

  • Args_NearestNeighbors (dict, optional) – Arguments for scikit-learn's NearestNeighbors, specified if needed. See scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html , defaults to {}
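As an illustration of the `method` contract described above, here is a minimal sketch of a user-supplied sampling function. It fits a multivariate normal distribution to the neighbors and draws a single point from it. The names `make_gaussian_method` and `gaussian_method` are hypothetical and not part of synloc; only the input and output types follow the parameter description.

```python
import numpy as np
import pandas as pd


def make_gaussian_method(seed=None):
    """Build a sampling function that follows the `method` contract.

    Hypothetical helper for illustration, not part of synloc.
    """
    rng = np.random.default_rng(seed)

    def gaussian_method(neighbors: pd.DataFrame) -> pd.Series:
        # Estimate the local mean and covariance from the neighbors.
        mean = neighbors.mean().to_numpy()
        cov = np.atleast_2d(neighbors.cov().to_numpy())
        # Draw one synthetic point and label it with the original column names.
        draw = rng.multivariate_normal(mean, cov)
        return pd.Series(draw, index=neighbors.columns)

    return gaussian_method


# Toy neighborhood: 30 rows, 2 variables.
toy = pd.DataFrame(
    np.random.default_rng(0).normal(size=(30, 2)), columns=["x", "y"]
)
point = make_gaussian_method(seed=1)(toy)
```

A function built this way would then be passed as the `method` argument, for example `kNNResampler(data, method=make_gaussian_method(seed=1))`.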

comparePlots(variable_list, fig_size=None)

Creates plots comparing the original sample and the synthetic sample.

Parameters:
  • variable_list (list) – A list of variables in the data set, containing at most 3 variables. The type of plot depends on the list size: 1 -> histogram, 2 -> scatter plot, 3 -> 3D scatter plot.

  • fig_size (tuple, optional) – The figure size can be adjusted, defaults to None

compareStats()

Return variable-level quality metrics for the synthetic sample.

fit(sample_size=None)

Creates a synthetic sample using parallel processing.

Parameters:

sample_size (int, optional) – Required minimum size. If not specified, the synthetic sample size will be the sample size, defaults to None

Returns:

Returns the synthetic sample

Return type:

pandas.DataFrame
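To make the idea behind this resampler concrete, the following is a rough, single-threaded sketch of a k-NN resampling loop. It is not synloc's implementation: the function name `knn_resample_sketch` is hypothetical, z-score normalization and brute-force Euclidean distances are assumptions, and each observation is counted among its own neighbors for simplicity.

```python
import numpy as np
import pandas as pd


def knn_resample_sketch(data: pd.DataFrame, method, k: int = 5) -> pd.DataFrame:
    """Illustrative k-NN resampling loop (hypothetical, not synloc's code)."""
    values = data.to_numpy(dtype=float)
    # Assumed normalization: z-scores, used only for measuring distances.
    std = values.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    scaled = (values - values.mean(axis=0)) / std

    rows = []
    for i in range(len(data)):
        # Euclidean distance from observation i to every observation.
        dist = np.linalg.norm(scaled - scaled[i], axis=1)
        # Positions of the k closest observations (includes i itself).
        nearest = np.argsort(dist)[:k]
        # Apply the user-supplied method to the neighbors on the original scale.
        rows.append(np.asarray(method(data.iloc[nearest])))
    return pd.DataFrame(rows, columns=data.columns)


rng = np.random.default_rng(0)
original = pd.DataFrame(rng.normal(size=(20, 2)), columns=["a", "b"])
# Simplest possible method: the mean of the neighbors.
synthetic = knn_resample_sketch(original, method=lambda nb: nb.mean(), k=5)
```

The sketch produces one synthetic row per original observation, which matches the documented default when `sample_size` is not given.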

qualityReport()

Return per-variable and overall quality metrics for the synthetic sample.

round_integers(integer_columns, stochastic=True)

Rounds variables to integers.

Parameters:
  • integer_columns (list) – The list of variables to be rounded.

  • stochastic (bool, optional) – Variables are rounded by a stochastic process, defaults to True

Return type:

None
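The stochastic process used for rounding is not specified here. One common definition of stochastic rounding, shown below as a sketch under that assumption, rounds a value up with probability equal to its fractional part and down otherwise. The function name `stochastic_round` is hypothetical and not part of synloc.

```python
import numpy as np


def stochastic_round(values, rng=None):
    """Round each value down or up at random, with P(up) equal to its fractional part."""
    rng = np.random.default_rng(rng)
    values = np.asarray(values, dtype=float)
    lower = np.floor(values)
    frac = values - lower
    # Round up when a uniform draw falls below the fractional part.
    round_up = rng.random(values.shape) < frac
    return (lower + round_up).astype(int)


x = np.array([2.0, 2.25, 2.5, -1.75])
rounded = stochastic_round(x, rng=0)
```

Under this scheme the expected value of each rounded number equals the unrounded value, so column means are preserved on average, which deterministic rounding does not guarantee.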

class synloc.clusterResampler(data, method, n_clusters=8, size_min=None, normalize=True, clipping=True)

Creates a synthetic sample by clustering.

This class creates subsamples from a given sample. The subsamples are created by clustering the original sample and then sampling from each cluster. The clustering is done by standard KMeans with a heuristic for size_min.

Parameters:
  • data (pandas.DataFrame) – Original data set to be synthesized

  • method (function) – Function to be used to create synthetic values from each cluster.

  • n_clusters (int, optional) – The number of clusters, defaults to 8

  • size_min (int, optional) – Required minimum cluster size, defaults to None

  • normalize (bool, optional) – Normalize sample before defining clusters, defaults to True

  • clipping (bool, optional) – Trim synthetic values greater (smaller) than the maximum (minimum) of each variable, defaults to True
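The effect of `clipping` can be illustrated with plain pandas. This is a sketch that assumes the bounds are the per-variable minimum and maximum of the original data; the toy values and variable names are made up for the example.

```python
import pandas as pd

original = pd.DataFrame({"age": [20, 35, 50], "income": [1.0, 2.5, 4.0]})
synthetic = pd.DataFrame({"age": [18.2, 41.0, 55.7], "income": [0.4, 3.1, 4.9]})

# Clip every column of the synthetic sample to the range seen in the original.
clipped = synthetic.clip(lower=original.min(), upper=original.max(), axis=1)
```

Values inside the original range are left unchanged; only out-of-range values are moved to the nearest bound.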

comparePlots(variable_list, fig_size=None)

Creates plots comparing the original sample and the synthetic sample.

Parameters:
  • variable_list (list) – A list of variables in the data set, containing at most 3 variables. The type of plot depends on the list size: 1 -> histogram, 2 -> scatter plot, 3 -> 3D scatter plot.

  • fig_size (tuple, optional) – The figure size can be adjusted, defaults to None

compareStats()

Return variable-level quality metrics for the synthetic sample.

fit(sample_size=None)

Creates a synthetic sample.

Parameters:

sample_size (int, optional) – Required minimum size. If not specified, the synthetic sample size will be the cluster size, defaults to None

Returns:

Returns the synthetic sample

Return type:

pandas.DataFrame
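The cluster-then-sample idea can be sketched as follows, assuming cluster labels are already available (for example from KMeans) and assuming the sampling function returns a pandas.DataFrame with as many rows as the cluster it receives. The helper names `resample_by_cluster` and `bootstrap_method` are hypothetical, and the return type of `method` is an assumption not confirmed by the description above.

```python
import numpy as np
import pandas as pd


def resample_by_cluster(data: pd.DataFrame, labels, method) -> pd.DataFrame:
    """Apply `method` to each cluster and stack the results (hypothetical helper)."""
    parts = [method(cluster) for _, cluster in data.groupby(np.asarray(labels))]
    return pd.concat(parts, ignore_index=True)


def bootstrap_method(cluster: pd.DataFrame) -> pd.DataFrame:
    # Placeholder method: resample the cluster's rows with replacement,
    # returning as many rows as the cluster contains.
    return cluster.sample(n=len(cluster), replace=True, random_state=0)


data = pd.DataFrame({"x": [0.1, 0.2, 0.3, 5.1, 5.2], "y": [1.0, 1.1, 0.9, 7.0, 7.2]})
labels = [0, 0, 0, 1, 1]  # cluster assignments, e.g. as produced by KMeans
synthetic = resample_by_cluster(data, labels, bootstrap_method)
```

Because each cluster contributes as many synthetic rows as it has members, the stacked result has the same size as the original sample.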

qualityReport()

Return per-variable and overall quality metrics for the synthetic sample.