I have been working on a project to create synthetic data for a long while. I realized that the synthpop package was producing identical values, whereas it was supposed to draw values from a posterior predictive distribution. I have been using the CART algorithm for its flexibility, but the model must be overfitting, even though I do not have too many variables. So, how can one avoid creating identical values? The answer is given in the article “synthpop: An R package for generating synthetic versions of sensitive microdata for statistical disclosure control”.
[Read More]
An Implementation of Double Machine Learning with XGBoost in R
A Benchmark Estimate
This is an attempt to implement Double Machine Learning (DML) with the XGBoost algorithm in R. The purpose is to create a benchmark DML estimate. The user can choose among various machine learning algorithms, but optimizing their hyperparameters can be time-consuming. XGBoost is very useful in this regard. This script can be used to produce substantially accurate preliminary results. The repository is here.
[Read More]
Solution to Memory Leak in R with callr package
An example calling other packages in callr
I have been working with huge samples recently. When you work with large samples, memory leaks are a common problem. I have been using the garbage collector extensively, but it is not helping much. So, you need to write your code efficiently.
[Read More]
A Fast Method to Create Synthetic Data with Python
Python package available in PyPI: synloc
I have been working on a project to create synthetic data, mostly using the R package synthpop. Since then, I have been thinking about a very simple algorithm that creates synthetic data with the nearest neighbor algorithm, and I have created a Python package named synloc. I discuss the practical and theoretical aspects here: Generating Synthetic Data with The Nearest Neighbors Algorithm
[Read More]
Creating Sparse Adjacency Matrix from Group Membership with igraph - R Programming
Block-diagonal matrix with data.table package
It took me days to come up with an efficient solution to create an adjacency matrix from group membership. Consider the following data:
[Read More]