Critical CART Hyperparameters in Synthpop

I have been working on a project to create synthetic data for a long while. I have realized that the synthpop package was producing identical values, whereas it was supposed to be producing values from a predictive posterior distribution. I have been using the CART algorithm for its flexibility, but the model must be overfitting, even though I do not have too many variables. So, how can one prevent creating identical values? The answer was given in the article: “synthpop: An R package for generating synthetic versions of sensitive microdata for statistical disclosure control”.

Here are some recommendations:

To create synthetic values from the predicted posterior distribution, set the parameter in synthpop as follows: proper = T. (The authors call it the proper way to create synthetic data)
- In addition to setting proper = T, you can also use control=list(stop.iterations=20) to set the maximum number of iterations to 20. This can help to prevent overfitting by stopping the CART algorithm before it becomes too complex.
To prevent overfitting, set cart.minbucket=5 to make sure that there are at least 5 observations in each node.
- You can also use control=list(stop.complexity=0.1) to set the maximum complexity of the tree to 0.1. This can also help to prevent overfitting by stopping the CART algorithm before it becomes too complex.
To create smoothed continuous values, set smoothing='density' or smoothing='spline'. The package creators recommend using smoothing='spline'.
- When using the smoothing parameter, it’s important to note that you can also use smoothing=list(var1='density', var2='spline') to specify different smoothing methods for different variables.
It’s also important to keep in mind that it’s a good practice to evaluate the quality of the synthetic data using various metrics such as the mean squared error.