DataSciences board - Asking about an interview question
a****l
posts: 21
1
What is the point of this question? Thanks.
Given 4,000,000 samples with 1,000 features, where y is 2.5% positive and
97.5% negative, how would you take a sample from this dataset to build a
reasonable model?
z*******1
posts: 206
2
Combat Imbalanced Classes
"You can change the dataset that you use to build your predictive model to
have more balanced data.
This change is called sampling your dataset, and there are two main methods
that you can use to even up the classes:
You can add copies of instances from the under-represented class, called
over-sampling (or more formally, sampling with replacement), or
You can delete instances from the over-represented class, called
under-sampling.
These approaches are often very easy to implement and fast to run. They are
an excellent starting point.
In fact, I would advise you to always try both approaches on all of your
imbalanced datasets, just to see if it gives you a boost in your preferred
accuracy measures.
You can learn a little more in the Wikipedia article titled "Oversampling
and undersampling in data analysis"."
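The two methods quoted above can be sketched in a few lines of plain Python. This is a minimal illustration with toy tuples standing in for real feature rows, not a production resampler:

```python
import random

def oversample_minority(majority, minority, seed=0):
    """Random over-sampling: duplicate minority rows (sampling with
    replacement) until both classes are the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def undersample_majority(majority, minority, seed=0):
    """Random under-sampling: keep a random subset of majority rows
    equal in size to the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

# Toy data mimicking the 2.5% vs 97.5% split at a small scale
majority = [("neg", i) for i in range(97)]
minority = [("pos", i) for i in range(3)]

print(len(oversample_minority(majority, minority)))   # 194 (97 + 97)
print(len(undersample_majority(majority, minority)))  # 6 (3 + 3)
```

Both outputs are class-balanced; over-sampling keeps all the data at the cost of duplicated minority rows, under-sampling discards most of the majority class.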
m******r
posts: 1033
3
Let me throw out a first attempt.
Given the 2.5% vs 97.5% split, isn't this a case for imbalanced sampling?
Also, why are there so many features? Some features are obviously useless at
a glance and can be thrown away right off.
y********g
posts: 81
4
1. The class imbalance determines the ratio at which you sample the two
classes.
2. The feature count determines the minimum amount of data you need to draw
to get a meaningful model.
3. n/p is quite large in this problem, so regularization is not a big
concern; just be careful not to overfit.
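The three points above amount to a back-of-the-envelope sizing exercise. The "10 samples per feature per class" floor below is a common rule of thumb, assumed here for illustration, not part of the original question:

```python
# Sizing the sample for the interview question: 4M rows, 1000 features,
# 2.5% positive
n_total = 4_000_000
pos_rate = 0.025
n_features = 1000

n_pos = int(n_total * pos_rate)   # positives available
n_neg = n_total - n_pos           # negatives available

# Assumed rule of thumb: at least ~10 samples per feature per class
min_per_class = 10 * n_features

# Keeping all positives plus an equal-sized random negative subsample
# gives a balanced sample well above the floor, with n/p = 200
sample_size = 2 * n_pos
print(n_pos, min_per_class, sample_size, sample_size // n_features)
# 100000 10000 200000 200
```

So even a fully balanced 1:1 sample leaves 100,000 rows per class, which is why several replies below argue the positive class here is not actually scarce.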

[in reply to a****l's post above]

W*******e
posts: 590
6
Over-sampling and under-sampling techniques. From the link you provided,
these only apply to cases where the sampling is biased relative to the
population and you know it beforehand. A confusion matrix and classification
report may be one tool, together with purposely adjusting the class
probability and using the F-score as the measure.
The feature count is large, so something probably needs to be done about it
first. I feel we need to reduce the dimensionality first rather than only
shrinking the sample.
I'm a rookie here, please feel free to comment.
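The point about the F-score is easy to demonstrate by hand. This sketch computes the confusion-matrix counts and F1 directly, and shows why raw accuracy is misleading on a 2.5%/97.5% split: a model that always predicts the majority class scores 97.5% accuracy but F1 = 0 on the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Entries of the binary confusion matrix for the given positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def f1_score(y_true, y_pred):
    """F1 on the positive class: harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Degenerate "always negative" model on a 2.5% positive dataset
y_true = [1] * 25 + [0] * 975
y_pred = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, f1_score(y_true, y_pred))  # 0.975 0.0
```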


[in reply to z*******1's post above]

b*****s
posts: 11267
7
4,000,000 * 2.5% is already a luxurious positive sample size to me.
Why would you need up-sampling or down-sampling? My boss works on sampling,
but I personally feel that after up-sampling or down-sampling you can no
longer provide unbiased estimates.

[in reply to a****l's post above]

d****n
posts: 12461
8
Does sampling always have to end up at a 1:1 ratio?
a*z
posts: 294
9
Second this one:
"4,000,000 * 2.5% is already a luxurious positive sample size to me."
I would like to do dimension reduction first.
t******g
posts: 2253
10
The question is asking how to handle imbalanced samples, and then how to
build a model in that situation.
x***t
posts: 263
11
Even though 4M * 2.5% is a large absolute number, this is still a 2.5% vs
97.5% imbalanced-class problem. The usual strategies are:
1) over-sampling the minority class (drawback: overfitting; it only refines
the decision boundary without generalizing)
2) under-sampling the majority class
3) synthesizing data points
For the third, see the SMOTE and ADASYN methods. Python has a ready-made
package: imbalanced-learn.
The SMOTE and ADASYN papers:
https://www.jair.org/media/953/live-953-2037-jair.pdf
http://sci2s.ugr.es/keel/pdf/algorithm/congreso/2008-He-ieee.pdf
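The core idea behind SMOTE (option 3 above) can be sketched in pure Python: pick a random minority point, pick one of its k nearest minority neighbours, and place a synthetic point somewhere on the line segment between them. This is a toy sketch of the idea only; for real work use the imbalanced-learn package, which implements the full algorithms from the papers linked above.

```python
import math
import random

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority points by interpolating between
    a random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(X_min)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted((p for p in X_min if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + lam * (o - b) for b, o in zip(base, other)))
    return synthetic

# Minority points inside the unit square; synthetic points stay inside it,
# since each one lies on a segment between two existing minority points
X_min = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
X_new = smote_sketch(X_min, n_new=10)
print(len(X_new))  # 10
```

Because new points are interpolated rather than duplicated, SMOTE avoids the exact-copy overfitting of plain over-sampling, which is the drawback noted in 1) above.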

[in reply to a****l's post above]

S*****o
posts: 715
12
Over-sampling is very bad for decision-tree-based pipelines, since the
splitting policy is usually based on the Gini index, information gain, or
similar, all of which are affected by the class distribution. But it can
work very well in some cases; penalty-based balancing is often up-sampling
in disguise.
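The "up-sampling in disguise" remark can be made concrete with the Gini index itself: re-weighting each minority example by the imbalance ratio changes a node's impurity in exactly the same way as duplicating minority rows until the classes are even. A small numeric sketch:

```python
def gini(counts):
    """Gini impurity of a node with the given per-class counts.
    Counts may be weighted, i.e. fractional."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Raw imbalanced node: 97 negatives, 3 positives
raw = gini([97, 3])

# Penalty-based balancing: weight each positive example by 97/3 ...
weighted = gini([97, 3 * (97 / 3)])

# ... which matches over-sampling the positives up to 97 copies
oversampled = gini([97, 97])
print(round(raw, 4), round(weighted, 4), round(oversampled, 4))
# 0.0582 0.5 0.5
```

The weighted and over-sampled impurities agree, so a tree grown under class weights sees the same split scores as one grown on a duplicated-minority dataset, which is why the two approaches behave so similarly in tree pipelines.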

[in reply to x***t's post above]

x***t
posts: 263
13
That is why I recommend SMOTE or ADASYN; see the original papers for the
details.

[in reply to S*****o's post above]

a*****s
posts: 838
14
Can someone explain more about these steps and where to learn all of this?
I am learning data science from scratch, mainly self-teaching by looking
for online resources.
Thanks.

[in reply to x***t's post above]
