
Microsoft Logistic Regression Algorithm

Author: Internet work  Date: 2007-04-11

The Microsoft Logistic Regression algorithm is a variation of the Microsoft Neural Network algorithm, where the HIDDEN_NODE_RATIO parameter is set to 0. This setting will create a neural network model that does not contain a hidden layer, and that therefore is equivalent to logistic regression.
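The equivalence described above can be illustrated with a minimal sketch. It assumes a single output node with a sigmoid activation. The function names, weights, and inputs are hypothetical and are not part of Analysis Services. With no hidden layer, the forward pass of the network collapses to the logistic regression formula.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Feed x through a list of (weights, biases) layers with sigmoid activations.

    weights holds one row of input weights per node in the layer.
    """
    activations = x
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

def logistic_regression(x, coefficients, intercept):
    """Classic logistic regression: sigmoid of a linear combination of inputs."""
    return sigmoid(sum(c * xi for c, xi in zip(coefficients, x)) + intercept)

# Hypothetical coefficients and input case, chosen only for illustration.
coefficients = [0.8, -1.5]
intercept = 0.25
x = [2.0, 1.0]

# A "network" whose only layer is the output layer (no hidden layer).
no_hidden = forward(x, [([coefficients], [intercept])])[0]
direct = logistic_regression(x, coefficients, intercept)

print(no_hidden, direct)  # the two values are identical
```

Adding a hidden layer to `layers` would apply further nonlinear transformations before the output node, which is what separates a general neural network from this special case.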

Suppose the predictable column contains only two states, yet you still want to perform a regression analysis, relating input columns to the probability that the predictable column will contain a specific state. The following diagram illustrates the results you will obtain if you assign 1 and 0 to the states of the predictable column, calculate the probability that the column will contain a specific state, and perform a linear regression against an input variable.

Poorly modeled data using linear regression

The x-axis contains values of an input column. The y-axis contains the probabilities that the predictable column will be one state or the other. The problem is that linear regression does not constrain the predicted values to lie between 0 and 1, even though those are the minimum and maximum values of the column. One way to solve this problem is to perform logistic regression. Instead of fitting a straight line, logistic regression analysis fits an "S"-shaped curve that is bounded by a minimum and a maximum value. For example, the following diagram illustrates the results you will achieve if you perform a logistic regression against the same data as used for the previous example.

Data modeled by using logistic regression

Notice how the curve never goes above 1 or below 0. You can use logistic regression to describe which input columns are important in determining the state of the predictable column.
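The contrast between the two diagrams can be reproduced with a small sketch. The data set, the learning rate, and the function names below are made up for illustration. The logistic fit uses plain gradient descent, which is not necessarily the training procedure that the Microsoft algorithm uses.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_linear(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def fit_logistic(xs, ys, rate=0.1, steps=2000):
    """Gradient descent on the mean log-loss: returns (weight, bias, center)."""
    n = len(xs)
    center = sum(xs) / n  # center the input to help convergence
    w = b = 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * (x - center) + b) - y
            grad_w += err * (x - center)
            grad_b += err
        w -= rate * grad_w / n
        b -= rate * grad_b / n
    return w, b, center

# Hypothetical input values and a two-state predictable column coded as 0 and 1.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

slope, intercept = fit_linear(xs, ys)
w, b, center = fit_logistic(xs, ys)

linear_preds = [slope * x + intercept for x in xs]
logistic_preds = [sigmoid(w * (x - center) + b) for x in xs]

# The straight line leaves the [0, 1] range; the S-shaped curve does not.
print(min(linear_preds), max(linear_preds))
print(min(logistic_preds), max(logistic_preds))
```

On this data the fitted line predicts a value below 0 at the smallest input and above 1 at the largest, while every logistic prediction stays strictly between 0 and 1.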


Using the Algorithm

Use the Microsoft Neural Network Viewer to explore a logistic regression mining model.

A logistic regression model must contain a key column, one or more input columns, and one or more predictable columns.

The Microsoft Logistic Regression algorithm supports specific input column content types, predictable column content types, and modeling flags, which are listed in the following table.

Input column content types: Continuous, Cyclical, Discrete, Discretized, Key, Table, and Ordered

Predictable column content types: Continuous, Cyclical, Discrete, Discretized, and Ordered

Modeling flags: MODEL_EXISTENCE_ONLY and NOT NULL

All Microsoft algorithms support a common set of functions. However, the Microsoft Logistic Regression algorithm supports additional functions, listed in the following table.

IsDescendant
PredictAdjustedProbability
PredictHistogram
PredictProbability
PredictStdev
PredictSupport
PredictVariance

For a list of the functions that are common to all Microsoft algorithms, see Data Mining Algorithms. For more information about how to use these functions, see Data Mining Extensions (DMX) Function Reference.

Models that use the Microsoft Logistic Regression algorithm do not support drillthrough or data mining dimensions, because the structure of nodes in the mining model does not necessarily correspond directly to the underlying data.

The Microsoft Logistic Regression algorithm supports several parameters that affect the performance and accuracy of the resulting mining model. The following table describes each parameter.

Parameter Description

HOLDOUT_PERCENTAGE

Specifies the percentage of cases within the training data used to calculate the holdout error. HOLDOUT_PERCENTAGE is used as part of the stopping criteria while training the mining model.

The default is 30.

HOLDOUT_SEED

Specifies a number to use to seed the pseudo-random generator when randomly determining the holdout data. If HOLDOUT_SEED is set to 0, the algorithm generates the seed based on the name of the mining model, to guarantee that the model content remains the same during reprocessing.

The default is 0.

MAXIMUM_INPUT_ATTRIBUTES

Defines the number of input attributes that the algorithm can handle before it invokes feature selection. Set this value to 0 to turn off feature selection.

The default is 255.

MAXIMUM_OUTPUT_ATTRIBUTES

Defines the number of output attributes that the algorithm can handle before it invokes feature selection. Set this value to 0 to turn off feature selection.

The default is 255.

MAXIMUM_STATES

Specifies the maximum number of attribute states that the algorithm supports. If the number of states that an attribute has is larger than the maximum number of states, the algorithm uses the most popular states of the attribute and ignores the remaining states.

The default is 100.

SAMPLE_SIZE

Specifies the number of cases to be used to train the model. The algorithm provider uses either this number or the number of cases that remain after the holdout percentage specified by the HOLDOUT_PERCENTAGE parameter is set aside, whichever value is smaller.

In other words, if HOLDOUT_PERCENTAGE is set to 30, the algorithm will use either the value of this parameter or a value that is equal to 70 percent of the total number of cases, whichever is smaller.

The default is 10000.
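The arithmetic behind the SAMPLE_SIZE and MAXIMUM_STATES descriptions above can be sketched as follows. The function names are hypothetical, and both the integer rounding of the holdout split and the handling of ties between equally frequent states are assumptions; the source does not say how the algorithm resolves either.

```python
from collections import Counter

def effective_training_cases(total_cases, sample_size=10000, holdout_percentage=30):
    """Return the smaller of SAMPLE_SIZE and the cases left after the holdout.

    Rounding down with integer division is an assumption of this sketch.
    """
    non_holdout = total_cases * (100 - holdout_percentage) // 100
    return min(sample_size, non_holdout)

def retained_states(values, maximum_states=100):
    """Return the most frequent states of an attribute, up to maximum_states."""
    counts = Counter(values)
    return [state for state, _ in counts.most_common(maximum_states)]

# With the defaults, 5000 cases leave 70 percent (3500) for training,
# which is smaller than SAMPLE_SIZE, so 3500 cases are used.
print(effective_training_cases(5000))      # 3500
# With 100000 cases, 70 percent is 70000, so SAMPLE_SIZE (10000) is the limit.
print(effective_training_cases(100000))    # 10000
# Only the two most popular states are kept when the limit is 2.
print(retained_states(["a", "b", "a", "c", "a", "b"], maximum_states=2))  # ['a', 'b']
```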
