Help needed with a classification problem!! - Statistics board

This page is an excerpt and archive of the corresponding thread on 未名空间 (MITBBS).


s****i
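Posts: 197

I'm a stat PhD student at a first- or second-tier university and only recently became a PhD candidate. My advisor works on classification. He recently lined up a collaboration and got hold of a dataset, and he wants me to run classification methods on it (mainly random forest, SVM, boosting and the like). What he hopes to see is that the classification methods predict more accurately than multinomial logistic regression. But no matter how I tune the parameters of the R packages, I cannot get that result. Even in the binomial case logistic beats the classifiers, and multinomial logistic is more accurate still. Every meeting with my advisor ends in a scolding, sigh. Could anyone tell me what method or model to switch to in order to raise the accuracy of the classification methods? Thanks in advance, everyone!!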
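============== Update ==================
Thanks to those who replied below. My question was indeed a bit vague. The response in this dataset is an ordinal variable (0 1 2 3 4 5, five grades, larger means more pronounced, 0 means absent). The predictors include both numeric and categorical variables (I converted all the categorical ones with as.factor). The R packages I used are randomForest, e1071 and mlogit. The RF was built with no pruning and with replacement. For SVM I tried the radial, linear, sigmoid and polynomial kernels, and the results were the same. The model from the randomForest package does not seem to have many parameters to tune. If the response is treated as nominal, RF and SVM cannot beat logistic no matter what, which is giving me a headache; even merging 0-5 into 0-1 does not change that. Accuracy is computed by comparing the predictions with the response of a held-out test set, and I pick the model with the lower error rate. Cross-validation has been done. I used McNemar's test to check whether the models' predictions agree: for 0-1 and 0-1-2, RF, SVM and logistic show no significant difference; for 0-1-2-3-4-5, the predictions of logistic differ significantly from those of RF and SVM, but the accuracies are about the same (multinomial logistic is slightly better). I plan to go back and try the method suggested in the second post; for now I cannot think of any other approach.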
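For readers unfamiliar with the comparison described in the update, here is a minimal sketch of McNemar's test on a shared held-out test set. It is one common way to set the test up (counting the cases where exactly one of two classifiers is correct); the thread does not show how the poster configured it. The function name `mcnemar` and the toy label vectors are made up for illustration.

```python
import math

def mcnemar(y_true, pred_a, pred_b):
    """McNemar's chi-square test with continuity correction.

    Returns (b, c, statistic, p_value), where
    b = cases classifier A gets right and B gets wrong,
    c = cases classifier A gets wrong and B gets right.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        # The two classifiers never disagree in correctness.
        return b, c, 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return b, c, stat, p_value

# Toy example (made-up labels, not the poster's data).
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
pred_a = [0, 1, 1, 0, 1, 0, 1, 0, 0, 0]
pred_b = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]
b, c, stat, p_value = mcnemar(y_true, pred_a, pred_b)
print(b, c, stat, round(p_value, 4))
```

Only the discordant cases enter the statistic, so two models can have nearly equal accuracy and still differ significantly, or the reverse. With few discordant cases an exact binomial version of the test is usually preferred over this chi-square approximation.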

N****n
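Posts: 1208

I'm a newbie too.
Try R's rpart package: fit a classification tree, then compute the error rate. If it is too high, run a 10-fold cross-validation, or prune the tree.
You could also use a resampling method to change the population and then refit.
Please correct me if I'm wrong.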
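The 10-fold cross-validation step suggested above can be sketched as follows. This is a plain-Python illustration, not rpart: `fit_majority` and `predict_majority` are hypothetical stand-ins for whatever tree or other model is actually fitted, and the data are invented.

```python
import random
from collections import Counter

def kfold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(fit, predict, X, y, k=10, seed=0):
    """Cross-validated misclassification rate of a fit/predict pair."""
    errors = 0
    for test_idx in kfold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        # Fit on the other k-1 folds only, then score the held-out fold.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        errors += sum(predict(model, X[i]) != y[i] for i in test_idx)
    return errors / len(y)

# Hypothetical placeholder model: always predict the majority training class.
def fit_majority(X, y):
    return Counter(y).most_common(1)[0][0]

def predict_majority(model, x):
    return model

# Invented data: 15 cases of class 0 and 5 of class 1.
X = [[i] for i in range(20)]
y = [0] * 15 + [1] * 5
err = cv_error(fit_majority, predict_majority, X, y, k=10)
print(err)
```

The majority-class error rate is also a useful floor to report next to any real model: a classifier that cannot beat it has learned nothing from the predictors.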

[In reply to s****i's post]

l***o
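Posts: 5337

This question is too broad; nobody on earth could answer it well...
Besides, it depends on your data. Unless you don't mind overfitting, which method predicts best depends entirely on the data.
It is entirely possible that logistic regression is the best model...

[In reply to s****i's post]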

g********r
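Posts: 8017

I don't understand this problem, but I can give a Chen-school answer:
Why do classification? Because your understanding of nature is too shallow. Everything in the world is ever-changing and cannot be understood through a simple dichotomy. Master Chen has just overturned classical statistical theory. If you want to understand your data deeply, you must forget what you have learned, become a schoolchild again, and start from Master Chen's theory.

[In reply to s****i's post]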

g********r
Posts: 8017

First tell us how you compared the accuracy.

[In reply to s****i's post]

D******n
Posts: 2836

Nice. Better yet, coin a "Ligong style" (立功体) while you're at it.

[In reply to g********r's post]

l*********s
Posts: 5409

good idea

[In reply to N****n's post]

c********d
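Posts: 253

It is entirely possible for logistic regression to outperform other classification methods. It depends on the data, and I agree with the third post. Personally I think that in this situation there is no need to insist on finding a method better than the logistic model, but this is only my opinion.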

h***i
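Posts: 3844

My guess is that your boss planned to use logistic as the baseline model and use the other methods to tear logistic apart. After all your work, though, you found that logistic is already quite good and no deep research is needed... so the paper is probably in danger.. Your boss isn't willing to give up and wants to pad out a paper with this..

[In reply to s****i's post]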

h***i
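Posts: 3844

When using SVM and the like, add a kernel: build a kernel with a nonparametric method instead of using the off-the-shelf ones. That might work, but in my opinion it is just looking for trouble.

[In reply to s****i's post]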


A*******s
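Posts: 3942

Just a loose remark: the logistic loss looks a lot like the hinge loss to begin with. If logistic is not overfitting, similar performance is to be expected. What are p and N for your data?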
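To make the resemblance concrete, here is a small sketch that evaluates both losses as functions of the margin m = y * f(x), with y in {-1, +1}. The function names are chosen for illustration; the logistic loss is written in base 2 so that both losses equal 1 at m = 0, which is a presentation choice and not something the reply specifies.

```python
import math

def hinge_loss(margin):
    """Hinge loss used by the SVM: max(0, 1 - m)."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic (log) loss in base 2: log2(1 + exp(-m))."""
    return math.log2(1.0 + math.exp(-margin))

# Tabulate both losses over a range of margins.
margins = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0, 3.0]
table = [(m, hinge_loss(m), logistic_loss(m)) for m in margins]
for m, h, l in table:
    print(f"margin={m:5.1f}  hinge={h:.3f}  logistic={l:.3f}")
```

Both curves are convex and decreasing and grow roughly linearly for badly misclassified points. The main difference is that the hinge loss is exactly zero once the margin exceeds 1, while the logistic loss only approaches zero.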

[In reply to s****i's post]

d******e
Posts: 7844

These two really don't differ much, and logistic regression also has advantages in computation and in handling multiple classes.

[In reply to A*******s's post]

F****n
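Posts: 3271

I don't know whether you have taken a course on SVM. If you have, you should remember the common SVM classification example shaped like the black and white squares of a chessboard, where the two classes cannot be separated by a single polynomial curve; in that situation SVM easily outperforms regression. Conversely, if the classes can be separated by a polynomial curve, polynomial regression can always reach a perfect fit by adding terms.
My guess is that your data is of the second kind, so don't agonize over accuracy. Work on over-fitting / transferability, i.e. using CV to test the robustness of the classification.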
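The "adding terms" point can be illustrated on the smallest, 2x2 case of the chessboard pattern (the XOR layout). This is only a toy sketch under stated assumptions: a coarse grid search over weights stands in for actually fitting a logistic regression, and the four points are invented. A finer chessboard would need higher-order terms or a kernel.

```python
import itertools

# Four corners of a 2x2 chessboard; same-sign corners share a class.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [x1 * x2 for x1, x2 in points]  # +1 on one diagonal, -1 on the other

def accuracy(weights, features):
    """Accuracy of the rule: predict +1 when the weighted sum is positive."""
    correct = 0
    for feats, label in zip(features, labels):
        score = sum(w * f for w, f in zip(weights, feats))
        pred = 1 if score > 0 else -1
        correct += pred == label
    return correct / len(labels)

grid = [i / 2 for i in range(-4, 5)]  # candidate weights -2.0, -1.5, ..., 2.0

# Linear boundary: intercept, x1, x2.
linear_feats = [(1, x1, x2) for x1, x2 in points]
best_linear = max(accuracy(w, linear_feats)
                  for w in itertools.product(grid, repeat=3))

# Same model with one added term, the interaction x1 * x2.
poly_feats = [(1, x1, x2, x1 * x2) for x1, x2 in points]
best_poly = max(accuracy(w, poly_feats)
                for w in itertools.product(grid, repeat=4))

print(best_linear, best_poly)
```

No straight line classifies all four corners correctly, while the single added interaction term does. The same flexibility is what makes checking for over-fitting with cross-validation worthwhile, as the reply recommends.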

[In reply to s****i's post]

A*******s
Posts: 3942

Since the response is ordinal and your purpose is just publishing a paper,
you could try some learning-to-rank models, which I think is a very hot
topic now.

[In reply to s****i's post]

h***i
Posts: 3844

Padding out a paper, eh..
hehe

[In reply to A*******s's post]

A*******s
Posts: 3942

Definitely not the style of Master Chen.
Master Chen always uses a lot of math terminology such as functional,
continuity, measurable, prob space, etc., and he also likes to create
new terminology by combining common terms, like random constant...
let's look at his new work:

A CRV X and its n random points x_i can be expressed in X{x_i}(i=1,2,.,n).
We define a point-to-point differentiality D_j(j=i) with its range R_X for
x_i as D_j{d_ij}=|X-x_i|/R_X and a similarity S_j{s_ij}=1-D_j{d_ij}. A
product V{v_i} of the sum of D_j and the sum of S_j will be a real measure
in a range R_V. We define C{c_i}=1-[V-min(V)]/R_V as an unbiased self-weight
for the X{x_i} to the E(X). Then, we will have a convex-concave self-weight
curve, i.e. it looks like a normal curve if the X is normal. Based on
examining two properties of sample size n, we tried to unify the definitions
of the weighted and non-weighted basic statistics, in which the degree of
freedom may be defined as the sum of weights minus the self-weighted mean of
the weight. These unified definitions can be used to substitute various
optimizations in advanced statistical methodological constructions. We also
tried to infer the representativeness of arithmetical mean and propose a
self-weighted t-test for the microarray data analysis to obtain a random
variable P-value in multiple tests. A sample illustration and a series of
simulations have shown that the new algorithms are extremely accurate and
robust.

[In reply to g********r's post]

s**f
Posts: 365

Just curious: they were accepted as contributed papers for JSM. Does this
mean someone reviewed his papers and accepted them?

[In reply to A*******s's post]

g********r
Posts: 8017

JSM doesn't review them.

[In reply to s**f's post]

s**f
Posts: 365

Hehe
I had a manuscript that I felt wasn't very mature, so at my advisor's suggestion I submitted it as a poster. No confidence, alas.

[In reply to g********r's post]
