Density estimation: multivariate normal distribution
[79]:
from arfpy import arf
import pandas as pd
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from numpy import random
random.seed(seed=2023)
Let’s generate some multivariate normal data for which we then try to recapture the joint density with FORDE. We have 2 variables, var1 and var2, that each have a variance of 1 and exhibit a correlation coefficient of 0.8. The mean values for var1 and var2 are set to 1 and 5, respectively. As an example, we generate 2000 observations.
[80]:
mean = (1, 5)
cov = [[1, 0.8], [0.8, 1]]
df = pd.DataFrame(random.multivariate_normal(mean, cov, (2000, )))
df.columns = ['var1', 'var2']
# resulting data frame has 2000 observations of the 2 variables
df.shape
[80]:
(2000, 2)
Let’s fit the adversarial random forest and estimate the density:
[81]:
my_arf = arf.arf(x = df)
FORDE = my_arf.forde()
Initial accuracy is 0.6515
Iteration number 1 reached accuracy of 0.3485.
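The reported accuracy refers to the classifier that tries to distinguish real from synthetic observations: ARF iterates until this accuracy drops to roughly chance level (0.5 or below), at which point the forest can no longer tell the generated data apart from the original.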
We can now investigate details of the estimated density. We have two continuous variables in the data set, so we can have a look at the estimated parameters by browsing through the 'cnt' data frame.
What this data frame tells us is the following: The first column, tree, indicates the tree in the forest. Note that we get parameters for 30 trees because we have used the default value num_trees = 30 in the ARF. The second column, nodeid, gives a unique identifier to each leaf (terminal node) in the respective tree. We are interested in these parameters because we know that if the forest has converged, it is reasonable to assume that the variables in those terminal nodes are mutually independent. So to get to the joint density, we can use these univariate densities and apply some basic probability theory instead of having to model the multivariate density directly. Recall that we have generated two variables named var1 and var2; the column variable indicates for which of the variables the parameters are estimated. Finally, the estimated parameters for the mean (mean) and standard deviation (sd) are given in the respective columns of the table.
[82]:
FORDE['cnt'].iloc[:,:5]
[82]:
    | tree | nodeid | variable | mean      | sd
----|------|--------|----------|-----------|---------
0   | 0    | 5      | var1     | -1.094664 | 0.514677
1   | 0    | 5      | var2     | 2.453952  | 0.325082
2   | 0    | 7      | var1     | -1.044197 | 0.563873
3   | 0    | 7      | var2     | 2.881324  | 0.039782
4   | 0    | 11     | var1     | -0.776401 | 0.527718
... | ...  | ...    | ...      | ...       | ...
403 | 29   | 716    | var2     | 6.977322  | 0.331542
404 | 29   | 717    | var1     | 3.761665  | 0.101248
405 | 29   | 717    | var2     | 6.639773  | 0.332960
406 | 29   | 718    | var1     | 3.186145  | 0.629556
407 | 29   | 718    | var2     | 7.686327  | 0.121825

12360 rows × 5 columns
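To make the “basic probability theory” step above concrete: within a single leaf, the joint density is simply the product of the univariate densities. Below is a minimal sketch of that factorization. It is a simplification: it uses plain instead of truncated normals, and the full FORDE estimate additionally mixes the leaf densities across the forest (weighting leaves by their coverage), all of which arfpy handles internally.

import numpy as np
from scipy.stats import norm

def leaf_density(point, leaf_params):
    # leaf_params: the rows of FORDE['cnt'] belonging to one (tree, nodeid)
    # pair. Within a converged leaf, the variables are treated as mutually
    # independent, so the joint leaf density is the product of the
    # univariate densities.
    return np.prod([norm.pdf(point[row['variable']], loc=row['mean'], scale=row['sd'])
                    for _, row in leaf_params.iterrows()])

# density contribution of leaf (tree 0, nodeid 5) at the point (1, 5):
leaf_density({'var1': 1.0, 'var2': 5.0}, FORDE['cnt'].query('tree == 0 and nodeid == 5'))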
Let’s generate some data to visualize results. For each new observation, we sample a leaf from the forest, i.e., a nodeid from a tree in FORDE['cnt']. Then, the algorithm plugs the values for the mean (mean) and standard deviation (sd) into a truncated normal distribution and samples the new value.
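As an illustration of that sampling step, here is a minimal sketch using scipy.stats.truncnorm. The lower and upper bounds below are hypothetical stand-ins for the leaf’s split interval, which arfpy tracks internally when you call forge().

from scipy.stats import truncnorm

def sample_from_leaf(mean, sd, lower, upper):
    # truncnorm expects the bounds on the standardized scale
    a, b = (lower - mean) / sd, (upper - mean) / sd
    return truncnorm.rvs(a, b, loc=mean, scale=sd)

# mean/sd taken from the first row of FORDE['cnt']; the bounds are made up
sample_from_leaf(mean=-1.094664, sd=0.514677, lower=-2.0, upper=0.0)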
[83]:
df_syn = my_arf.forge(n = 2000)
Let’s plot the resulting data!
[84]:
plt.rcParams['figure.figsize'] = [30, 25]  # set the figure size before plotting
plt.subplot(2, 2, 1)
plt.plot(df.to_numpy()[:, 0], df.to_numpy()[:, 1], '.', alpha=0.5)
plt.grid()
plt.title('Original Data', fontsize = 30)
plt.subplot(2, 2, 2)
plt.plot(df_syn.to_numpy()[:, 0], df_syn.to_numpy()[:, 1], '.', alpha=0.5)
plt.grid()
plt.title('Synthesized Data', fontsize = 30)
plt.show()