
Random Forests: A case study - microarray data [2]

Author: unknown  Date: 2004-12-09

 

Scaling the data

Every data analyst wishes to get an idea of what the data looks like. Random forests offers an excellent way to do this.

Using metric scaling, the proximities can be projected down onto a low-dimensional Euclidean space using "canonical coordinates". D canonical coordinates project onto a D-dimensional space. To get 3 canonical coordinates, the options are as follows:

        parameter(
c               DESCRIBE DATA
     1          mdim=4682, nsample0=81, nclass=3, maxcat=1,
     1          ntest=0, labelts=0, labeltr=1,
c
c               SET RUN PARAMETERS
     2          mtry0=150, ndsize=1, jbt=1000, look=100, lookcls=1,
     2          jclasswt=0, mdim2nd=0, mselect=0, iseed=4351,
c
c               SET IMPORTANCE OPTIONS
     3          imp=0, interact=0, impn=0, impfast=0,
c
c               SET PROXIMITY COMPUTATIONS
     4          nprox=1, nrnn=50,
c
c               SET OPTIONS BASED ON PROXIMITIES
     5          noutlier=0, nscale=3, nprot=0,
c
c               REPLACE MISSING VALUES  
     6          code=-999, missfill=0, mfixrep=0, 
c
c               GRAPHICS
     7          iviz=1,
c
c               SAVING A FOREST
     8          isaverf=0, isavepar=0, isavefill=0, isaveprox=0,
c
c               RUNNING A SAVED FOREST
     9          irunrf=0, ireadpar=0, ireadfill=0, ireadprox=0)

 


Note that imp and mdim2nd have been set back to zero and nscale has been set to 3. Setting nrnn=50 instructs the program to compute the 50 largest proximities for each case. Set iscaleout=1. Compiling and running the program gives an output with nsample rows and six columns: the case id, the true class, the predicted class, and the values of the three scaling coordinates. Plotting the second canonical coordinate against the first gives:


The three classes are clearly separated. Note: anyone trying to get this result with a conventional clustering algorithm faces the job of constructing a distance measure between pairs of points in 4682-dimensional space, a low-payoff venture. The plot, based on proximities, illustrates their intrinsic connection to the data.
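The projection described in this section can be sketched in a few lines. The text does not spell out the computation, so this is only a minimal illustration under stated assumptions: 1 - prox(n,k) is treated as a squared distance, the proximity matrix is double-centered, and the leading eigenvectors, each scaled by the square root of its eigenvalue, are taken as the scaling coordinates. The function names, the power-iteration solver and the toy 6-case proximity matrix are all hypothetical and are not taken from the Fortran program.

```python
import math
import random


def double_center(prox):
    """Return 0.5 * (prox(n,k) - row mean - column mean + grand mean)."""
    n = len(prox)
    row = [sum(r) / n for r in prox]
    grand = sum(row) / n
    # prox is assumed symmetric, so column means equal row means.
    return [[0.5 * (prox[i][j] - row[i] - row[j] + grand) for j in range(n)]
            for i in range(n)]


def top_eigenpair(a, rng, iters=1000):
    """Power iteration: dominant eigenvalue and unit eigenvector of a."""
    n = len(a)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
    lam = sum(v[i] * w[i] for i in range(n))  # Rayleigh quotient
    return lam, v


def scaling_coordinates(prox, nscale, seed=0):
    """Return one row of nscale coordinates per case."""
    rng = random.Random(seed)
    b = double_center(prox)
    n = len(b)
    coords = [[0.0] * nscale for _ in range(n)]
    for d in range(nscale):
        lam, v = top_eigenpair(b, rng)
        # Assumes the leading eigenvalues are non-negative; clamp otherwise.
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Toy data (made up): two groups of three cases, proximity 0.9 within a
# group and 0.1 between groups.
groups = [0, 0, 0, 1, 1, 1]
toy_prox = [[1.0 if i == j else (0.9 if groups[i] == groups[j] else 0.1)
             for j in range(6)] for i in range(6)]
coords = scaling_coordinates(toy_prox, nscale=2)
for g, c in zip(groups, coords):
    print(g, [round(x, 3) for x in c])
```

On the toy matrix the first coordinate takes one sign for the first group and the opposite sign for the second, which is the kind of separation the plot in this section shows for the three microarray classes. For a real 81-case proximity matrix a library eigensolver would be the more usual choice; power iteration is used here only to keep the sketch dependency-free.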

Prototypes

Two prototypes are computed for each class in the microarray data.

The settings are mdim2nd=15, nprot=2, imp=1, nprox=1, nrnn=20. The values of the variables are normalized to lie between 0 and 1. Here is the graph:


Outliers

An outlier is a case whose proximities to all other cases are small. Using this idea, a measure of outlyingness is computed for each case in the training sample. This measure is different for the different classes. As a general rule, a case whose measure is greater than 10 should be carefully inspected, although some users have found a lower threshold more useful. To compute the measure, set nout=1 (presumably the option listed as noutlier in the parameter block above), and all other options to zero. Here is a plot of the measure:

There are two possible outliers: one is the first case in class 1, the other is the first case in class 2.
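The text gives the idea behind the measure but not its formula, so the following is only a sketch under an assumed definition: the raw score of a case is the number of cases divided by the sum of its squared proximities to the other cases in its own class, and the raw scores are then standardized within each class by subtracting the class median and dividing by the median absolute deviation. The function names and the toy 8-case proximity matrix are hypothetical.

```python
from statistics import median


def outlier_measure(prox, labels):
    """Per-case outlyingness, standardized within each class (assumed form)."""
    n = len(labels)
    raw = []
    for i in range(n):
        # Sum of squared proximities to the other cases of the same class.
        s = sum(prox[i][k] ** 2
                for k in range(n) if k != i and labels[k] == labels[i])
        raw.append(n / s if s > 0 else float("inf"))
    scores = [0.0] * n
    for c in set(labels):
        idx = [i for i in range(n) if labels[i] == c]
        med = median([raw[i] for i in idx])
        mad = median([abs(raw[i] - med) for i in idx])
        for i in idx:
            scores[i] = (raw[i] - med) / mad if mad > 0 else 0.0
    return scores


def symmetric(n, pairs):
    """Build a symmetric n x n matrix with 1.0 on the diagonal."""
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (i, j), v in pairs.items():
        m[i][j] = m[j][i] = v
    return m


# Toy data (made up): case 0 has small proximities to the rest of its class.
labels = [1, 1, 1, 1, 2, 2, 2, 2]
prox = symmetric(8, {
    (0, 1): 0.1, (0, 2): 0.1, (0, 3): 0.1,
    (1, 2): 0.9, (1, 3): 0.8, (2, 3): 0.7,
    (4, 5): 0.9, (4, 6): 0.8, (4, 7): 0.7,
    (5, 6): 0.85, (5, 7): 0.75, (6, 7): 0.8,
})
scores = outlier_measure(prox, labels)
flagged = [i for i, s in enumerate(scores) if s > 10]
print(flagged)
```

Standardizing within each class is what makes a single threshold such as 10 usable across classes of different tightness. In the toy data only case 0 exceeds it.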
