Scaling the data
The wish of every data analyst is to get an idea of what the data looks like. There is an excellent way to do this in random forests.
Using metric scaling, the proximities can be projected down onto a low-dimensional Euclidean space using "canonical coordinates". D canonical coordinates will project onto a D-dimensional space. To get 3 canonical coordinates, the options are as follows:
parameter(
c DESCRIBE DATA
1 mdim=4682, nsample0=81, nclass=3, maxcat=1,
1 ntest=0, labelts=0, labeltr=1,
c
c SET RUN PARAMETERS
2 mtry0=150, ndsize=1, jbt=1000, look=100, lookcls=1,
2 jclasswt=0, mdim2nd=0, mselect=0, iseed=4351,
c
c SET IMPORTANCE OPTIONS
3 imp=0, interact=0, impn=0, impfast=0,
c
c SET PROXIMITY COMPUTATIONS
4 nprox=1, nrnn=50,
c
c SET OPTIONS BASED ON PROXIMITIES
5 noutlier=0, nscale=3, nprot=0,
c
c REPLACE MISSING VALUES
6 code=-999, missfill=0, mfixrep=0,
c
c GRAPHICS
7 iviz=1,
c
c SAVING A FOREST
8 isaverf=0, isavepar=0, isavefill=0, isaveprox=0,
c
c RUNNING A SAVED FOREST
9 irunrf=0, ireadpar=0, ireadfill=0, ireadprox=0)
Note that imp and mdim2nd have been set back to zero and nscale has been set equal to 3. nrnn is set to 50, which instructs the program to compute the 50 largest proximities for each case. Set iscaleout=1. Compiling and running gives an output with nsample rows and the following columns: case id, true class, predicted class, and 3 columns giving the values of the three scaling coordinates. Plotting the 2nd canonical coordinate vs. the first gives:
The three classes are very distinguishable. Note: if one tries to get this result by any of the present clustering algorithms, one is faced with the job of constructing a distance measure between pairs of points in 4682-dimensional space - a low payoff venture. The plot above, based on proximities, illustrates their intrinsic connection to the data.
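The metric scaling step can be sketched in a few lines. The Python below is a minimal illustration, not the Fortran program's own code: it assumes a symmetric nsample-by-nsample proximity matrix, double-centers it, and takes the leading eigenvectors scaled by the square roots of their eigenvalues (classical metric scaling). The function name scaling_coordinates and the toy proximity matrix are hypothetical and chosen only for this example.

```python
import numpy as np


def scaling_coordinates(prox, n_coords=3):
    """Classical metric scaling of a symmetric proximity matrix (sketch)."""
    prox = np.asarray(prox, dtype=float)
    # Double-center the proximities:
    # cv = 0.5 * (prox - row means - column means + grand mean)
    row_mean = prox.mean(axis=1, keepdims=True)
    col_mean = prox.mean(axis=0, keepdims=True)
    cv = 0.5 * (prox - row_mean - col_mean + prox.mean())
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cv)
    order = np.argsort(eigvals)[::-1][:n_coords]
    # Guard against tiny negative eigenvalues caused by round-off
    top_vals = np.clip(eigvals[order], 0.0, None)
    # Column j holds the j-th scaling coordinate for every case
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy data (hypothetical): two groups of three cases each,
# high proximity within a group and low proximity across groups.
groups = np.array([0, 0, 0, 1, 1, 1])
toy_prox = np.where(groups[:, None] == groups[None, :], 0.9, 0.1)
np.fill_diagonal(toy_prox, 1.0)

coords = scaling_coordinates(toy_prox, n_coords=3)
print(coords[:, 0])  # the first coordinate separates the two groups by sign
```

Plotting the second column of coords against the first corresponds to the kind of plot described above. The overall sign of each coordinate is arbitrary, so only the relative positions of the cases carry meaning.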
Prototypes
Two prototypes are computed for each class in the microarray data.
The settings are mdim2nd=15, nprot=2, imp=1, nprox=1, nrnn=20. The values of the variables are normalized to be between 0 and 1. Here is the graph:
Outliers
An outlier is a case whose proximities to all other cases are small. Using this idea, a measure of outlyingness is computed for each case in the training sample. This measure is computed separately for each class. Generally, if the measure is greater than 10, the case should be carefully inspected. Other users have found a lower threshold more useful. To compute the measure, set noutlier=1 and all other options to zero. Here is a plot of the measure:
There are two possible outliers: one is the first case in class 1, the other is the first case in class 2.
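The text above does not spell out the formula, so the following Python is only a sketch of one common formulation and should be read as an assumption: within each class, the raw score of a case is nsample divided by the sum of its squared proximities to the other cases of that class, and the raw scores are then centered by the class median and divided by the median absolute deviation. The function name outlier_measure, the choice to exclude a case's proximity to itself, the fallbacks for zero denominators, and the toy data are all hypothetical.

```python
import numpy as np


def outlier_measure(prox, labels):
    """Per-class outlier score from a proximity matrix (assumed formulation)."""
    prox = np.asarray(prox, dtype=float)
    labels = np.asarray(labels)
    n = prox.shape[0]
    measure = np.zeros(n)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        block = prox[np.ix_(idx, idx)].copy()
        np.fill_diagonal(block, 0.0)  # assumption: ignore self-proximity
        sumsq = (block ** 2).sum(axis=1)
        sumsq = np.maximum(sumsq, 1e-12)  # assumption: guard against division by zero
        raw = n / sumsq  # small proximities give a large raw score
        med = np.median(raw)
        dev = np.median(np.abs(raw - med))
        if dev == 0:
            dev = 1.0  # assumption: fall back to no rescaling
        measure[idx] = (raw - med) / dev
    return measure


# Toy data (hypothetical): proximities derived from 1-D positions.
# Case 4 carries the class 0 label but sits far from the other class 0 cases.
x = np.array([0.0, 0.1, 0.25, 0.45, 3.0, 10.0, 10.1, 10.3])
toy_labels = np.array([0, 0, 0, 0, 0, 1, 1, 1])
toy_prox = np.exp(-(x[:, None] - x[None, :]) ** 2)

scores = outlier_measure(toy_prox, toy_labels)
print(np.flatnonzero(scores > 10))  # indices of cases worth inspecting
```

Because the score is standardized within each class, the threshold of 10 mentioned above is applied to the same scale for every class, and a lower cutoff can be substituted in the last line.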

