Resamplers
This page provides an auto-generated summary of synloc’s API.
- class synloc.kNNResampler
Finds the nearest neighbors of each observation and creates synthetic values from them using the given method.
- Parameters
data (pandas.DataFrame) – Original data set to be synthesized
method (function) – Function to be used to create synthetic values from each cluster.
K (int, optional) – The number of the nearest neighbors used to create synthetic samples, defaults to 30
normalize (bool, optional) – Normalize sample before defining clusters, defaults to True
clipping (bool, optional) – Trim values greater (smaller) than the maximum (minimum) of each variable, defaults to True
Args_NearestNeighbors (dict, optional) – Arguments for scikit-learn's NearestNeighbors, specified if needed. See scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html, defaults to {}
- comparePlots(variable_list, fig_size=None)
Creates plots comparing the original sample and the synthetic sample.
- Parameters
variable_list (list) – A list of variables in the data set. The list may contain at most 3 variables. The type of the plot depends on the list size: 1->histogram, 2->scatter plot, 3->3D scatter plot.
fig_size (tuple, optional) – The figure size can be adjusted, defaults to None
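The mapping from list size to plot type described above can be sketched as a small lookup. This is only an illustration: `plot_kind` is a hypothetical helper, not part of synloc, and raising `ValueError` for other list sizes is an assumption about how such input might be handled.

```python
def plot_kind(variable_list):
    """Return the plot type implied by the number of variables (illustrative only)."""
    kinds = {1: "histogram", 2: "scatter plot", 3: "3D scatter plot"}
    n = len(variable_list)
    if n not in kinds:
        # Assumption: lists outside 1..3 are rejected; the real method may behave differently.
        raise ValueError(f"expected 1 to 3 variables, got {n}")
    return kinds[n]


print(plot_kind(["x"]))            # histogram
print(plot_kind(["x", "y", "z"]))  # 3D scatter plot
```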
- fit(sample_size=None) → DataFrame
Creates the synthetic sample.
- Parameters
sample_size (int, optional) – Required minimum size. If not specified, the synthetic sample size equals the original sample size, defaults to None
- Returns
Returns the synthetic sample
- Return type
pandas.DataFrame
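To make the idea behind this class concrete, here is a minimal, dependency-free sketch of nearest-neighbor resampling. It is not synloc's implementation. The names `knn_resample_sketch` and `jittered_mean` are hypothetical, plain tuples stand in for a DataFrame, min-max scaling stands in for whatever normalization the library applies, and the brute-force search replaces scikit-learn's NearestNeighbors. The `method` contract assumed here (a list of neighbor rows in, one synthetic row out) is also an assumption.

```python
import math
import random


def knn_resample_sketch(rows, method, k=3, normalize=True, clipping=True):
    """Illustrative only: build one synthetic row per original row from its k nearest neighbors."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]

    def scale(row):
        # Assumption: min-max scaling, so every column contributes comparably to distances.
        if not normalize:
            return row
        return tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi))

    scaled = [scale(r) for r in rows]
    synthetic = []
    for point in scaled:
        # Brute-force search; each row counts as one of its own neighbors.
        order = sorted(range(len(rows)), key=lambda j: math.dist(point, scaled[j]))
        neighbors = [rows[j] for j in order[:k]]
        new_row = list(method(neighbors))
        if clipping:
            # Keep every synthetic value inside the observed range of its column.
            new_row = [min(max(v, l), h) for v, l, h in zip(new_row, lo, hi)]
        synthetic.append(tuple(new_row))
    return synthetic


def jittered_mean(neighbors):
    """Hypothetical `method`: neighbor mean plus Gaussian noise scaled by the neighbor spread."""
    out = []
    for col in zip(*neighbors):
        mean = sum(col) / len(col)
        spread = max(col) - min(col)
        out.append(random.gauss(mean, spread / 2))
    return out


random.seed(0)
data = [(1.0, 10.0), (1.2, 11.0), (0.9, 9.5), (5.0, 50.0), (5.2, 52.0), (4.8, 49.0)]
synthetic = knn_resample_sketch(data, jittered_mean, k=3)
print(len(synthetic))  # one synthetic row per original row
```

Because each synthetic row is generated only from a small neighborhood, local structure in the data is preserved, and clipping guarantees that no synthetic value leaves the observed range of its variable.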
- class synloc.clusterResampler
Creates a synthetic sample by clustering.
This class creates subsamples from a given sample. The subsamples are created by clustering the original sample and then sampling from each cluster. The clustering is done by the KMeansConstrained function from the k-means-constrained package.
- Parameters
data (pandas.DataFrame) – Original data set to be synthesized
method (function) – Function to be used to create synthetic values from each cluster.
n_clusters (int, optional) – The number of clusters, defaults to 8
size_min (int, optional) – Required minimum cluster size, defaults to None
size_max (int, optional) – Required maximum cluster size, defaults to None
normalize (bool, optional) – Normalize sample before defining clusters, defaults to True
clipping (bool, optional) – Trim values greater (smaller) than the maximum (minimum) of each variable, defaults to True
- comparePlots(variable_list, fig_size=None)
Creates plots comparing the original sample and the synthetic sample.
- Parameters
variable_list (list) – A list of variables in the data set. The list may contain at most 3 variables. The type of the plot depends on the list size: 1->histogram, 2->scatter plot, 3->3D scatter plot.
fig_size (tuple, optional) – The figure size can be adjusted, defaults to None
- fit(sample_size=None) → DataFrame
Creates the synthetic sample.
- Parameters
sample_size (int, optional) – Required minimum size. If not specified, the synthetic sample size will be the cluster size, defaults to None
- Returns
Returns the synthetic sample
- Return type
pandas.DataFrame
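The "cluster, then sample from each cluster" step can be sketched as follows. This is a simplified illustration, not synloc's code. Cluster labels are taken as given rather than computed with KMeansConstrained, `resample_by_cluster` and `independent_gaussian` are hypothetical names, and the assumed `method` contract (cluster rows and a size in, that many synthetic rows out) is an assumption. Each cluster yields as many synthetic rows as it contains, mirroring the default described for `sample_size`.

```python
import random
import statistics
from collections import defaultdict


def resample_by_cluster(rows, labels, method):
    """Illustrative only: group rows by cluster label and synthesize each group separately."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)

    synthetic = []
    for label in sorted(groups):
        members = groups[label]
        # Default behavior sketched here: one synthetic row per original cluster member.
        synthetic.extend(method(members, len(members)))
    return synthetic


def independent_gaussian(members, size):
    """Hypothetical `method`: draw each column independently from a normal fitted to the cluster."""
    params = [(statistics.fmean(col), statistics.pstdev(col)) for col in zip(*members)]
    return [tuple(random.gauss(mu, sigma) for mu, sigma in params) for _ in range(size)]


random.seed(0)
rows = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9), (8.0, 9.0), (8.2, 9.1)]
labels = [0, 0, 0, 1, 1]  # assumed to come from a prior clustering step
synthetic = resample_by_cluster(rows, labels, independent_gaussian)
print(len(synthetic))  # same number of rows as the original sample
```

Bounding cluster sizes with `size_min` and `size_max` matters for this kind of scheme: a cluster with too few members gives the per-cluster method very little information to fit.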