Methods

This page provides an auto-generated summary of synloc’s API.

class synloc.LocalCov(data: DataFrame, K: int = 30, normalize: bool = True, clipping: bool = True, Args_NearestNeighbors: dict = {})

This is a method for the kNNResampler class that creates synthetic samples from a multivariate normal distribution with the estimated covariance matrix.

Parameters
  • data (pandas.DataFrame) – Original data set to be synthesized

  • K (int, optional) – The number of the nearest neighbors used to create synthetic samples, defaults to 30

  • normalize (bool, optional) – Normalize sample before defining clusters, defaults to True

  • clipping (bool, optional) – Trim values above the maximum (or below the minimum) of each variable, defaults to True

  • Args_NearestNeighbors (dict, optional) – Arguments for scikit-learn's NearestNeighbors can be specified if needed (see scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html), defaults to {}
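To make the roles of K and normalize concrete, here is a minimal sketch of how a local subsample could be built. It uses plain NumPy rather than scikit-learn's NearestNeighbors, and the function name `knn_subsample`, the z-score scaling, and the Euclidean distance are assumptions made for illustration, not synloc's actual implementation.

```python
import numpy as np

def knn_subsample(X: np.ndarray, row: int, K: int = 30, normalize: bool = True) -> np.ndarray:
    """Return the K rows of X closest to X[row] (the row itself included)."""
    Z = X.astype(float)
    if normalize:
        # Assumption: z-score scaling; the library may scale differently.
        Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    # Euclidean distance from every row to the target row.
    dist = np.linalg.norm(Z - Z[row], axis=1)
    nearest = np.argsort(dist)[:K]
    # The subsample is returned on the original (unscaled) scale.
    return X[nearest]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
subsample = knn_subsample(X, row=0, K=30)
```

Normalizing first keeps a variable with a large scale from dominating the distance calculation.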

method(subsample: DataFrame)

Estimates the covariance matrix and draws samples from the estimated multivariate normal distribution.

Parameters

subsample (pandas.DataFrame) – A subsample defined by the kNNResampler class.

Returns

Synthetic values.

Return type

numpy.ndarray
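The estimation and sampling step described for this method can be sketched with NumPy alone. The helper name `sample_local_mvn` and its `size` and `seed` parameters are hypothetical and are not part of the documented signature; this is an illustration of the technique, not the library's code.

```python
import numpy as np

def sample_local_mvn(subsample: np.ndarray, size: int = 1, seed=None) -> np.ndarray:
    """Draw rows from a multivariate normal fitted to the subsample."""
    rng = np.random.default_rng(seed)
    mean = subsample.mean(axis=0)
    # rowvar=False: each column is a variable, each row an observation.
    cov = np.cov(subsample, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=size)

rng = np.random.default_rng(0)
neighbors = rng.normal(size=(30, 3))   # stand-in for a K = 30 subsample
synthetic = sample_local_mvn(neighbors, size=5, seed=1)
```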

round_integers(integer_columns: list, stochastic: bool = True) → None

Rounds variables to integers.

Parameters
  • integer_columns (list) – The list of variables to be rounded.

  • stochastic (bool, optional) – Variables are rounded by a stochastic process, defaults to True
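The page does not spell out the stochastic process. One common scheme, shown here as an assumption rather than as synloc's implementation, rounds a value up with probability equal to its fractional part, so the rounded values are unbiased on average. The function name `stochastic_round` is hypothetical.

```python
import numpy as np

def stochastic_round(values, seed=None) -> np.ndarray:
    """Round down, then add 1 with probability equal to the fractional part."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    floor = np.floor(values)
    frac = values - floor
    # Whole numbers have frac == 0 and are therefore never changed.
    round_up = rng.random(values.shape) < frac
    return (floor + round_up).astype(int)

rounded = stochastic_round([2.0, 2.3, -1.7, 5.0], seed=0)
many = stochastic_round(np.full(20000, 2.25), seed=1)
```

Under this scheme the mean of `many` stays close to 2.25, whereas deterministic rounding would map every value to 2.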

class synloc.LocalFPCA(data: DataFrame, n_fpca_components: int = 2, K: int = 30, normalize: bool = True, clipping: bool = True, Args_NearestNeighbors: dict = {})

This is a method for the kNNResampler class, based on the FPCADataGenerator class from the synthia package.

Parameters
  • data (pandas.DataFrame) – Original data set to be synthesized

  • n_fpca_components (int, optional) – The number of dimensions after PCA, defaults to 2

  • K (int, optional) – The number of the nearest neighbors used to create synthetic samples, defaults to 30

  • normalize (bool, optional) – Normalize sample before defining clusters, defaults to True

  • clipping (bool, optional) – Trim values above the maximum (or below the minimum) of each variable, defaults to True

  • Args_NearestNeighbors (dict, optional) – Arguments for scikit-learn's NearestNeighbors can be specified if needed (see scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html), defaults to {}

method(data)

Creates synthetic values using the FPCADataGenerator class from the synthia package.

Parameters

data (pandas.DataFrame) – A subsample defined by the kNNResampler class.

Returns

Synthetic values.

Return type

numpy.ndarray
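To give a feel for what generating data from a few principal components means, the sketch below uses ordinary PCA via NumPy's SVD. It is a simplified stand-in and does not reproduce synthia's FPCADataGenerator. The function name `pca_resample`, the `size` and `seed` parameters, and the choice of independent normal scores are all assumptions made for illustration.

```python
import numpy as np

def pca_resample(data: np.ndarray, n_components: int = 2, size: int = 1, seed=None) -> np.ndarray:
    """Sample scores on the leading principal components and map them back."""
    rng = np.random.default_rng(seed)
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Standard deviation of the observed scores along each retained direction.
    score_std = s[:n_components] / np.sqrt(len(data) - 1)
    # Assumption: scores are drawn as independent normals.
    scores = rng.normal(scale=score_std, size=(size, n_components))
    return mean + scores @ components

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))
synthetic = pca_resample(data, n_components=2, size=10, seed=1)
```

With n_components = 2, every synthetic row lies in the two-dimensional plane through the subsample mean spanned by the retained directions, which is the sense in which the parameter sets the number of dimensions after PCA.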

round_integers(integer_columns: list, stochastic: bool = True)

Rounds variables to integers.

Parameters
  • integer_columns (list) – The list of variables to be rounded.

  • stochastic (bool, optional) – Variables are rounded by a stochastic process, defaults to True

class synloc.LocalGaussianCopula(data: DataFrame, K: int = 30, normalize: bool = True, clipping: bool = True, Args_NearestNeighbors: dict = {})

This is a method for the kNNResampler class that creates synthetic values using a Gaussian copula.

Parameters
  • data (pandas.DataFrame) – Original data set to be synthesized

  • K (int, optional) – The number of the nearest neighbors used to create synthetic samples, defaults to 30

  • normalize (bool, optional) – Normalize sample before defining clusters, defaults to True

  • clipping (bool, optional) – Trim values above the maximum (or below the minimum) of each variable, defaults to True

  • Args_NearestNeighbors (dict, optional) – Arguments for scikit-learn's NearestNeighbors can be specified if needed (see scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html), defaults to {}

method(subsample: DataFrame)

Creates synthetic values using a Gaussian copula.

Parameters

subsample (pandas.DataFrame) – A subsample defined by the kNNResampler class.

Returns

Synthetic values.

Return type

numpy.ndarray
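A Gaussian copula keeps each variable's marginal distribution while modelling dependence through a correlation matrix on the normal scale. The sketch below is a generic textbook version written with NumPy and the standard library. The name `gaussian_copula_sample`, the rank-based marginals, and the `size` and `seed` parameters are assumptions, and the library may fit the marginals differently.

```python
import numpy as np
from statistics import NormalDist

_norm = NormalDist()
_ppf = np.vectorize(_norm.inv_cdf)  # standard normal quantile function
_cdf = np.vectorize(_norm.cdf)      # standard normal CDF

def gaussian_copula_sample(subsample: np.ndarray, size: int = 1, seed=None) -> np.ndarray:
    rng = np.random.default_rng(seed)
    n, p = subsample.shape
    # 1. Ranks -> uniforms strictly inside (0, 1) -> normal scores.
    ranks = subsample.argsort(axis=0).argsort(axis=0) + 1
    normal_scores = _ppf(ranks / (n + 1))
    # 2. Dependence structure: correlation of the normal scores.
    corr = np.corrcoef(normal_scores, rowvar=False)
    # 3. Draw correlated normals and map them to uniforms.
    z = rng.multivariate_normal(np.zeros(p), corr, size=size)
    u = _cdf(z)
    # 4. Push the uniforms through each column's empirical quantiles.
    return np.column_stack(
        [np.quantile(subsample[:, j], u[:, j]) for j in range(p)]
    )

rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, x + 0.5 * rng.normal(size=100)])
synthetic = gaussian_copula_sample(X, size=500, seed=1)
```

Because step 4 uses empirical quantiles, the synthetic values in this sketch never leave the observed range of each column.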

round_integers(integer_columns: list, stochastic: bool = True)

Rounds variables to integers.

Parameters
  • integer_columns (list) – The list of variables to be rounded.

  • stochastic (bool, optional) – Variables are rounded by a stochastic process, defaults to True

class synloc.clusterCov(data: DataFrame, n_clusters=8, size_min: Optional[int] = None, size_max: Optional[int] = None, normalize: bool = True, clipping: bool = True)

clusterCov is a method for the clusterResampler class that creates synthetic values from a multivariate normal distribution with the covariance matrix estimated from the clusters.

Parameters
  • data (pandas.DataFrame) – Original data set to be synthesized

  • n_clusters (int, optional) – The number of clusters, defaults to 8

  • size_min (int, optional) – Required minimum cluster size, defaults to None

  • size_max (int, optional) – Required maximum cluster size, defaults to None

  • normalize (bool, optional) – Normalize sample before defining clusters, defaults to True

  • clipping (bool, optional) – Trim values above the maximum (or below the minimum) of each variable, defaults to True
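The effect of the clipping option can be illustrated in a few lines of NumPy. The helper name `clip_to_observed_range` is hypothetical, and using the original data's column-wise minimum and maximum as the bounds is an assumption about how the trimming is applied.

```python
import numpy as np

def clip_to_observed_range(synthetic: np.ndarray, original: np.ndarray) -> np.ndarray:
    """Limit each synthetic column to the range seen in the original data."""
    return np.clip(synthetic, original.min(axis=0), original.max(axis=0))

original = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])
synthetic = np.array([[-1.0, 25.0], [3.0, 35.0]])
clipped = clip_to_observed_range(synthetic, original)
# clipped is [[0., 25.], [2., 30.]]
```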

method(cluster: DataFrame, size: int)

Creates synthetic values from the estimated multivariate normal distribution.

Parameters
  • cluster (pandas.DataFrame) – Cluster data

  • size (int) – Required number of synthetic observations. Size is equal to the number of observations in the cluster if not specified.

Returns

Synthetic values

Return type

pandas.DataFrame

class synloc.clusterGaussCopula(data: DataFrame, n_clusters=8, size_min: Optional[int] = None, size_max: Optional[int] = None, normalize: bool = True, clipping: bool = True)

clusterGaussCopula is a method for the clusterResampler class that creates synthetic values from a Gaussian copula.

Parameters
  • data (pandas.DataFrame) – Original data set to be synthesized

  • n_clusters (int, optional) – The number of clusters, defaults to 8

  • size_min (int, optional) – Required minimum cluster size, defaults to None

  • size_max (int, optional) – Required maximum cluster size, defaults to None

  • normalize (bool, optional) – Normalize sample before defining clusters, defaults to True

  • clipping (bool, optional) – Trim values above the maximum (or below the minimum) of each variable, defaults to True

method(cluster: DataFrame, size: int)

Creates synthetic values from a Gaussian copula.

Parameters
  • cluster (pandas.DataFrame) – Cluster data

  • size (int) – Required number of synthetic observations. Size is equal to the number of observations in the cluster if not specified.

Returns

Synthetic values

Return type

pandas.DataFrame