Hello World question (about entropy, feature selection)

Hi, I am new here and this is my first post. I have a classification
problem and I would like to use features to solve it.

Suppose I have two datasets, positive and negative, and I pre-select a
batch of features. Now I want to weight the features according to the
datasets. There are several methods for this weighting, two of which are
the chi-square test and entropy. I use entropy. The smallest entropy
value (0) means a feature occurs exclusively in one dataset, while the
largest value (1 for binary classification) means the feature occurs
equally often in both datasets. For example, f1 (+5, -0) has entropy 0
because 5 positive instances contain it while no negative instance
does; f2 (+5, -5), on the other hand, has entropy 1.
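As a minimal sketch of the entropy described above, computed from the (+, -) counts of instances that contain a feature. The function name `feature_entropy` is only illustrative and not from any particular library.

```python
import math

def feature_entropy(pos: int, neg: int) -> float:
    """Entropy (in bits) of the class labels among the instances
    that contain the feature, given its (+, -) occurrence counts."""
    total = pos + neg
    if total == 0:
        raise ValueError("feature does not occur in either dataset")
    entropy = 0.0
    for count in (pos, neg):
        if count > 0:  # a zero count contributes nothing (0 * log 0 := 0)
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

print(feature_entropy(5, 0))  # f1 (+5, -0): 0.0
print(feature_entropy(5, 5))  # f2 (+5, -5): 1.0
```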

While entropy can be useful, I found that it is not a good weighting
function on its own. Suppose I have two features, f1 (+5, -0) and
f2 (+100, -0). The entropy values for f1 and f2 are the same (0). But
suppose there are 100 positive instances and 100 negative instances in
total. Obviously f2 is better than f1, because it is not only
unambiguous but also very common in the positive set. Based on this
observation, I would like to give f2 a higher weight than f1.

This requires entropy and support to work together. Does anyone have
an idea for a weighting or scoring function that combines the two?

Thank you very much.

Sticker

How is it that feature 'f1' could have only 5 positive and 0 negative
observations?  If there are 200 instances total, what value do they
have for 'f1'?  Is 'f1' missing?

Yes. I mean there are 200 instances in + and 200 in - (400 instances
in total). f1 is observed in only 5 + instances and 0 - instances. For
the other 195 + instances f1 is missing, and all 200 - instances are
missing f1 as well. f1 can be an arbitrary feature; for instance, f1
could be a symptom of a certain disease.
Hope that makes things clearer.

Thank you

Sticker wrote:
> Yes. I mean there are 200 instances in + and 200 in - (400 instances
> in total). f1 is observed in only 5 + instances and 0 - instances. For
> the other 195 + instances f1 is missing, and all 200 - instances are
> missing f1 as well. f1 can be an arbitrary feature; for instance, f1
> could be a symptom of a certain disease.

The next step is to clarify the nature of the missing values.  If
missing values are correlated with the target variable, then perhaps it
is better to treat them as a third symbol and recalculate the entropy?
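One possible reading of this suggestion (an interpretation, not necessarily what was meant) is to treat "feature missing" as a symbol of its own and score each feature by information gain, i.e. the reduction in class entropy from knowing whether the feature is present. Information gain is a standard measure swapped in here for illustration; the function names are hypothetical, and applying the earlier f2 (+100, -0) counts to the 200/200 totals is an assumption.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(pos_with, neg_with, n_pos, n_neg):
    """Class entropy minus the expected class entropy after splitting
    the instances into 'feature present' and 'feature missing'."""
    n = n_pos + n_neg
    groups = (
        (pos_with, neg_with),                   # feature present
        (n_pos - pos_with, n_neg - neg_with),   # feature missing
    )
    gain = entropy([n_pos, n_neg])
    for pos, neg in groups:
        if pos + neg > 0:
            # each group's entropy is weighted by its share of all instances
            gain -= (pos + neg) / n * entropy([pos, neg])
    return gain

gain_f1 = information_gain(5, 0, n_pos=200, n_neg=200)
gain_f2 = information_gain(100, 0, n_pos=200, n_neg=200)
print(gain_f1, gain_f2)
```

Both features are pure where they occur, but the "missing" group for f2 is far more skewed toward the negative class, so f2 receives the larger gain. In this sense the score reflects support as well as purity.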

-Will Dwinnell

How can one measure the correlation between two instances, supposing
they are sequences rather than relational data records? The sequences
are proteins, let's say, while relational data records are database
table rows with columns and values.


I think this is similar to the similarity measure for web sessions.
Someone has used the cosine function as the measure, and I'm still
trying to figure out a way to measure the similarity of web sessions.
How?

To Jackie,
I am not so sure about the definition of web sessions. Do you mean web
logs? I know how the content of two web pages is compared using the
cosine similarity function: the sequences of words are turned into bags
of words (duplicate words are removed and the word order is discarded).
This is similar to comparing itemsets. Alternatively, you can use
Euclidean distance or the dot product as well.
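A minimal sketch of cosine similarity over bags of words, assuming each page (or session) is already tokenized into a list of strings. The function name is illustrative. This version keeps word counts; passing `set(words)` instead gives the duplicate-free variant described above.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two bag-of-words count vectors.
    Word order is ignored; only which words occur, and how often, matters."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty input
    return dot / (norm_a * norm_b)

page_1 = "data mining finds patterns in data".split()
page_2 = "patterns in data mining".split()
print(cosine_similarity(page_1, page_2))
print(cosine_similarity(set(page_1), set(page_2)))  # duplicates removed
```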

Sequence similarity is handled by the Smith-Waterman algorithm (local
alignment) and the Needleman-Wunsch algorithm (global alignment).
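For reference, a compact sketch of the score computation in both algorithms. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration; real protein alignment typically uses a substitution matrix and affine gap penalties, which this sketch leaves out. It returns only the alignment score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score: best-scoring pair of substrings.
    Cell scores are clamped at 0 so an alignment can start anywhere."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

print(needleman_wunsch_score("AC", "AGC"))               # 1
print(smith_waterman_score("XXXACGTYYY", "ZZACGTZZ"))    # 4
```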

My question is about the entropies and supports of the features. I do
not know how to combine them into a better scoring function.
