Resamplers
This page provides an auto-generated summary of synloc’s API.
- class synloc.kNNResampler
Finds the K nearest neighbors of each observation and creates synthetic values from them using the given method.
- Parameters:
data (pandas.DataFrame) – Original data set to be synthesized
method (function) – Function to be used to create synthetic values from each cluster. Must accept a pandas.DataFrame (the neighbors) and return a 1D array-like object (e.g., pandas.Series, numpy.array) representing the synthetic point.
K (int, optional) – Number of nearest neighbors used to create each synthetic sample, defaults to 30
normalize (bool, optional) – Normalize sample before defining clusters, defaults to True
clipping (bool, optional) – Trim synthetic values that exceed the maximum or fall below the minimum of each variable, defaults to True
n_jobs (int, optional) – Number of CPU cores to use for parallel processing. -1 means using all processors. Defaults to -1.
Args_NearestNeighbors (dict, optional) – Additional arguments passed to scikit-learn's NearestNeighbors; see scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html. Defaults to {}
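The contract for method described above (a pandas.DataFrame of neighbors in, a 1D array-like out) can be illustrated with a minimal sketch. The function name sample_from_normal and its optional rng argument are hypothetical and not part of synloc; the sketch assumes that drawing one point from a multivariate normal fitted to the neighbors is an acceptable synthesis rule.

```python
import numpy as np
import pandas as pd

# Hypothetical helper, not part of synloc: draws one synthetic point from a
# multivariate normal distribution fitted to the neighbors.
def sample_from_normal(neighbors: pd.DataFrame, rng=None) -> pd.Series:
    rng = np.random.default_rng() if rng is None else rng
    mean = neighbors.mean().to_numpy()
    # atleast_2d keeps the covariance a matrix even for a single column.
    cov = np.atleast_2d(np.cov(neighbors.to_numpy(dtype=float), rowvar=False))
    draw = rng.multivariate_normal(mean, cov)
    # Return a 1D object labelled with the original column names.
    return pd.Series(draw, index=neighbors.columns)

neighbors = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 1.0, 4.0, 3.0]})
point = sample_from_normal(neighbors, rng=np.random.default_rng(0))
```

Because rng has a default, the function can still be called with the neighbors alone, which is the only argument the description above requires.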
- comparePlots(variable_list, fig_size=None)
Creates plots comparing the original sample with the synthetic sample.
- Parameters:
variable_list (list) – A list of variables in the data set, containing at most 3 variables. The type of plot depends on the list size: 1 -> histogram, 2 -> scatter plot, 3 -> 3D scatter plot.
fig_size (tuple, optional) – The figure size can be adjusted, defaults to None
- fit(sample_size=None) DataFrame
Creates a synthetic sample using parallel processing.
- Parameters:
sample_size (int, optional) – Required minimum size of the synthetic sample. If not specified, the synthetic sample has the same size as the original sample. Defaults to None
- Returns:
Returns the synthetic sample
- Return type:
pandas.DataFrame
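The idea behind this resampler can be sketched without the library: for each observation, find its K nearest neighbors, apply method to them, and stack the results. The function knn_resample_sketch below is hypothetical and is not synloc's implementation; it uses a brute-force distance computation, counts each point as its own neighbor (an assumption), and skips normalization, clipping, sample_size handling, and parallelism.

```python
import numpy as np
import pandas as pd

def knn_resample_sketch(data: pd.DataFrame, method, K: int = 3) -> pd.DataFrame:
    """Illustrative only: one synthetic row per original observation."""
    X = data.to_numpy(dtype=float)
    # Pairwise squared Euclidean distances between all observations.
    dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    rows = []
    for i in range(len(data)):
        # Indices of the K closest observations (the point itself included).
        idx = np.argsort(dists[i])[:K]
        rows.append(np.asarray(method(data.iloc[idx])))
    return pd.DataFrame(rows, columns=data.columns)

data = pd.DataFrame({"x": [0.0, 1.0, 2.0, 10.0, 11.0],
                     "y": [0.0, 1.0, 0.0, 5.0, 6.0]})
# Toy method: the mean of the neighbors, returned as a pandas.Series.
synthetic = knn_resample_sketch(data, method=lambda nb: nb.mean(), K=2)
```

The toy method is deterministic so the result is easy to check by hand; a real method would normally draw a random point from the neighbors.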
- round_integers(integer_columns: list, stochastic: bool = True) None
Rounds variables to integers.
- Parameters:
integer_columns (list) – The list of variables to be rounded.
stochastic (bool, optional) – Variables are rounded by a stochastic process, defaults to True
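The stochastic process used for rounding is not specified here. One common scheme, shown below as an assumption rather than as the library's behavior, rounds a value up with probability equal to its fractional part, so the rounded values match the original on average. The function stochastic_round is a hypothetical name.

```python
import math
import random

def stochastic_round(value: float, rng: random.Random) -> int:
    """Round down or up at random, with P(up) equal to the fractional part."""
    lower = math.floor(value)
    frac = value - lower
    return lower + (1 if rng.random() < frac else 0)

rng = random.Random(0)
# 2.3 becomes 3 about 30% of the time and 2 otherwise.
draws = [stochastic_round(2.3, rng) for _ in range(10_000)]
mean_draw = sum(draws) / len(draws)
```

Deterministic rounding would map every 2.3 to 2 and shift the mean of the column; the randomized variant avoids that bias at the cost of extra noise.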
- class synloc.clusterResampler
Creates a synthetic sample by clustering.
This class creates subsamples from a given sample. The subsamples are created by clustering the original sample and then sampling from each cluster. The clustering is done by standard KMeans with a heuristic for size_min.
- Parameters:
data (pandas.DataFrame) – Original data set to be synthesized
method (function) – Function to be used to create synthetic values from each cluster.
n_clusters (int, optional) – The number of clusters, defaults to 8
size_min (int, optional) – Required minimum cluster size, defaults to None
normalize (bool, optional) – Normalize sample before defining clusters, defaults to True
clipping (bool, optional) – Trim synthetic values that exceed the maximum or fall below the minimum of each variable, defaults to True
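The effect of clipping can be illustrated with plain pandas. This is a sketch under the assumption that the bounds are the per-variable minimum and maximum of the original data; the data frames and column names are made up for the example.

```python
import pandas as pd

original = pd.DataFrame({"age": [20, 35, 50],
                         "income": [1000.0, 2500.0, 4000.0]})
synthetic = pd.DataFrame({"age": [18.2, 41.0, 53.7],
                          "income": [900.0, 2600.0, 4100.0]})

# Clip each synthetic column to the [min, max] range seen in the original.
# axis=1 aligns the bounds (indexed by column name) with the columns.
clipped = synthetic.clip(lower=original.min(), upper=original.max(), axis=1)
```

Values inside the observed range, such as the age 41.0, are left unchanged; only out-of-range values are moved to the nearest bound.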
- comparePlots(variable_list, fig_size=None)
Creates plots comparing the original sample with the synthetic sample.
- Parameters:
variable_list (list) – A list of variables in the data set, containing at most 3 variables. The type of plot depends on the list size: 1 -> histogram, 2 -> scatter plot, 3 -> 3D scatter plot.
fig_size (tuple, optional) – The figure size can be adjusted, defaults to None
- fit(sample_size=None) DataFrame
Creates a synthetic sample.
- Parameters:
sample_size (int, optional) – Required minimum size. If not specified, the synthetic sample size will be the cluster size. Defaults to None
- Returns:
Returns the synthetic sample
- Return type:
pandas.DataFrame
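The "sample from each cluster" step can be sketched as follows. This is not synloc's implementation: the cluster labels are hard-coded instead of coming from KMeans, size_min is not enforced, and both sample_per_cluster and bootstrap are hypothetical names. The sketch also assumes that method returns a pandas.DataFrame of synthetic rows for each cluster, which the description above does not state.

```python
import numpy as np
import pandas as pd

def sample_per_cluster(data: pd.DataFrame, labels, method) -> pd.DataFrame:
    """Apply `method` to each cluster separately and stack the results."""
    parts = [method(cluster) for _, cluster in data.groupby(np.asarray(labels))]
    return pd.concat(parts, ignore_index=True)

rng = np.random.default_rng(0)

def bootstrap(cluster: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical method: resample rows with replacement, keeping cluster size.
    idx = rng.integers(0, len(cluster), size=len(cluster))
    return cluster.iloc[idx]

data = pd.DataFrame({"x": [0.0, 0.5, 1.0, 9.0, 10.0],
                     "y": [1.0, 0.0, 1.0, 7.0, 8.0]})
labels = [0, 0, 0, 1, 1]  # hard-coded here; in practice produced by clustering
synthetic = sample_per_cluster(data, labels, bootstrap)
```

Because each cluster is sampled on its own, synthetic rows never mix values from distant groups of observations.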