Resamplers

This page provides an auto-generated summary of synloc’s API.

class synloc.kNNResampler

Finds the nearest neighbors of each observation and creates synthetic values from them using the given method.

Parameters:

data (pandas.DataFrame) – Original data set to be synthesized
method (function) – Function to be used to create synthetic values from each cluster. Must accept a pandas.DataFrame (the neighbors) and return a 1D array-like object (e.g., pandas.Series, numpy.array) representing the synthetic point.
K (int, optional) – Number of nearest neighbors used to create synthetic samples, defaults to 30
normalize (bool, optional) – Normalize the sample before finding neighbors, defaults to True
clipping (bool, optional) – Clip values above the maximum or below the minimum of each variable, defaults to True
n_jobs (int, optional) – Number of CPU cores to use for parallel processing. -1 means using all processors. Defaults to -1.
Args_NearestNeighbors (dict, optional) – Arguments passed to the NearestNeighbors function, if needed. See scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html. Defaults to {}
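
The method parameter expects a callable that receives the neighbors as a pandas.DataFrame and returns a 1D array-like. The following is a minimal sketch of such a callable, assuming a multivariate normal fitted to the neighbors is an acceptable local model. The name draw_from_normal and the toy data are hypothetical and not part of synloc.

```python
import numpy as np
import pandas as pd

# Hypothetical example callable; the name is not part of synloc.
_rng = np.random.default_rng(0)

def draw_from_normal(neighbors: pd.DataFrame) -> pd.Series:
    """Draw one synthetic point from a normal fitted to the neighbors."""
    mean = neighbors.mean().to_numpy()
    cov = np.atleast_2d(neighbors.cov().to_numpy())
    point = _rng.multivariate_normal(mean, cov)
    return pd.Series(point, index=neighbors.columns)

# Toy neighbor set standing in for the K nearest rows of one observation.
neighbors = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 1.0, 4.0, 3.0]})
synthetic_point = draw_from_normal(neighbors)
```

A callable of this shape could then be supplied as method=draw_from_normal when constructing kNNResampler.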

comparePlots(variable_list, fig_size=None)

Creates plots to compare the original sample and the synthetic sample.

Parameters:

variable_list (list) – A list of variables in the data set, containing at most 3 variables. The type of plot depends on the list size: 1->histogram, 2->scatter plot, 3->3D scatter plot.
fig_size (tuple, optional) – Figure size, defaults to None

fit(sample_size=None) → DataFrame

Creates a synthetic sample using parallel processing.

Parameters:

sample_size (int, optional) – Required minimum size. If not specified, the synthetic sample has the same size as the original sample. Defaults to None

Returns: The synthetic sample

Return type: pandas.DataFrame
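
To make the idea behind kNNResampler concrete, here is a rough sketch of nearest-neighbor resampling written with NumPy and pandas only. It is not synloc's implementation: the function knn_resample_sketch, its arguments, and the brute-force distance computation are assumptions made for illustration, and it produces exactly one synthetic row per observation.

```python
import numpy as np
import pandas as pd

def knn_resample_sketch(data, method, k=3, normalize=True):
    """Illustrative only: one synthetic row per observation from its k neighbors."""
    values = data.to_numpy(dtype=float)
    scaled = values
    if normalize:
        # Standardize so that no single variable dominates the distances.
        scaled = (values - values.mean(axis=0)) / values.std(axis=0)
    # Brute-force pairwise Euclidean distances (fine for small data sets).
    dists = np.linalg.norm(scaled[:, None, :] - scaled[None, :, :], axis=2)
    # Indices of the k closest rows; in this sketch a row counts as its own neighbor.
    neighbor_idx = np.argsort(dists, axis=1)[:, :k]
    # Apply the user-supplied method to each neighbor set on the original scale.
    rows = [np.asarray(method(data.iloc[idx])) for idx in neighbor_idx]
    return pd.DataFrame(rows, columns=data.columns)

rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(20, 2)), columns=["x", "y"])
# The neighbor mean is used here as a deliberately simple method.
synthetic = knn_resample_sketch(data, method=lambda nb: nb.mean(), k=5)
```

Because the toy method returns the neighbor mean, every synthetic row lies inside the range of the original data.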

round_integers(integer_columns: list, stochastic: bool = True) → None

Rounds variables to integers.

Parameters:

integer_columns (list) – The list of variables to be rounded.
stochastic (bool, optional) – Variables are rounded by a stochastic process, defaults to True
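
The stochastic process is not spelled out here. One common scheme, assumed in the sketch below, rounds a value up with probability equal to its fractional part and down otherwise, so the expected value is preserved. The function stochastic_round is hypothetical and only illustrates the idea.

```python
import math
import random

def stochastic_round(value: float, rng: random.Random) -> int:
    """Round up with probability equal to the fractional part, else down."""
    lower = math.floor(value)
    fraction = value - lower
    return lower + (1 if rng.random() < fraction else 0)

rng = random.Random(42)
draws = [stochastic_round(2.3, rng) for _ in range(10_000)]
# Under this scheme, roughly 30% of the draws should be rounded up to 3.
share_rounded_up = sum(d == 3 for d in draws) / len(draws)
```

Deterministic rounding would map every 2.3 to 2, which shifts the mean of the variable; the stochastic variant avoids that bias on average.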

class synloc.clusterResampler

Creates a synthetic sample by clustering.

This class creates subsamples from a given sample by clustering the original sample and then sampling from each cluster. Clustering uses standard KMeans with a heuristic for size_min.

Parameters:

data (pandas.DataFrame) – Original data set to be synthesized
method (function) – Function to be used to create synthetic values from each cluster.
n_clusters (int, optional) – The number of clusters, defaults to 8
size_min (int, optional) – Required minimum cluster size, defaults to None
normalize (bool, optional) – Normalize the sample before defining clusters, defaults to True
clipping (bool, optional) – Clip values above the maximum or below the minimum of each variable, defaults to True
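
The effect of the clipping option can be illustrated with pandas alone. The sketch below assumes the bounds are taken from the original data, column by column; the two toy data frames are made up for the example.

```python
import pandas as pd

original = pd.DataFrame({"a": [0.0, 4.0, 10.0], "b": [0.0, 0.5, 1.0]})
synthetic = pd.DataFrame({"a": [-1.0, 5.0, 12.0], "b": [0.5, 3.0, -2.0]})

# Clip each column to the range observed in the original data.
clipped = synthetic.clip(lower=original.min(), upper=original.max(), axis=1)
```

After clipping, column a lies within [0, 10] and column b within [0, 1], so no synthetic value falls outside the range seen in the original sample.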

comparePlots(variable_list, fig_size=None)

Creates plots to compare the original sample and the synthetic sample.

Parameters:

variable_list (list) – A list of variables in the data set, containing at most 3 variables. The type of plot depends on the list size: 1->histogram, 2->scatter plot, 3->3D scatter plot.
fig_size (tuple, optional) – Figure size, defaults to None

fit(sample_size=None) → DataFrame

Creates a synthetic sample.

Parameters:

sample_size (int, optional) – Required minimum size. If not specified, the synthetic sample size is the cluster size. Defaults to None

Returns: The synthetic sample

Return type: pandas.DataFrame
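
The cluster-then-sample idea behind clusterResampler can be sketched as follows. This is not synloc's code: the cluster labels are assigned by hand instead of by KMeans, the per-cluster function resample_cluster is hypothetical, and it is assumed that each cluster contributes as many synthetic rows as it has observations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def resample_cluster(cluster: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-cluster method: normal draws matching the cluster size."""
    mean = cluster.mean().to_numpy()
    cov = np.atleast_2d(cluster.cov().to_numpy())
    draws = rng.multivariate_normal(mean, cov, size=len(cluster))
    return pd.DataFrame(draws, columns=cluster.columns)

data = pd.DataFrame(rng.normal(size=(30, 2)), columns=["x", "y"])
# Assumed cluster labels; in practice these would come from a clustering step.
labels = np.repeat([0, 1, 2], 10)

# Resample each cluster separately, then stack the results.
parts = [resample_cluster(group) for _, group in data.groupby(labels)]
synthetic = pd.concat(parts, ignore_index=True)
```

With three clusters of ten rows each, the stacked result has the same shape as the original data.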