由买买提看人间百态

boards

本页内容为未名空间相应帖子的节选和存档,一周内的贴子最多显示50字,超过一周显示500字 访问原贴
Statistics版 - 问个R的数据处理的问题,在线等
相关主题
急问:到哪里能买或租到computing facility/server?弱弱地,一个关于SPSS的问题........
how to set the seed in R问一下, data mining是具体做什么工作?
在sas里引用变量,如v1 - v100, 如果变量的数目不是固定的,怎么办?求助!该如何准备这个电话面试?
请教一个SPSS处理数据的问题,非常感谢!!Urgent! Intra-observer variability
请教用SAS的一个数据处理的问题请教一个qualification
【R】关于R的variable typemedical device 和 drug 的clinical trial 有什么不同
用R出现怪问题。los angeles的openings (转载)
my R job问一个R&R measuring system的问题。
相关话题的讨论汇总
话题: df1话题: df2话题: varlist1话题: varlist2话题: ifelse
进入Statistics版参与讨论
1 (共1页)
v*******e
发帖数: 133
1
工作当中遇到的,纠结了半天了
df1;
v1 v2 v3 v4.......v99 v100
df2;
Varlist1 Varlist 2
V1 V10
V2 V16
V5 V39
. .
. .
需要的是把df1里面存在于df2里Varlist1的variables的missing value用Varlist 2对应
的variable来置换
比较傻的做法,手工填写:
df1$V1=ifelse(is.na(df1$V1),df1$V10, df1$V1)
and so on...
可是df2有可能change dynamically.
如何来写这个loop.
g******2
发帖数: 234
2
use data.table to make df2 a data.table, and setkey(df2, Varlist1)
for (i in df2$Varlist1) {
df1[is.na(df1[, i]), i] <- df1[is.na(df1[, i]), df2[i]$Varlist2]
}
H*H
发帖数: 472
3
LZ这个貌似没必要写loop吧,直接data frame操作:
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
df1[, df2$Varlist2], df1[, df2$Varlist1])
g******2
发帖数: 234
4
please test your code before post your answer:
df1 <- data.frame(x1=1:10,
x2=2:11,
x3=3:12,
x4=4:13)
df1$x1[2:4] <- NA
df1$x2[4:7] <- NA
df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"))
setkey(df2, Varlist1)
#HJH approach:
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
df1[, df2$Varlist2], df1[, df2$Varlist1])
#warning message, and df1$x1 still has NA
#my approach
for (i in df2$Varlist1) {
df1[is.na(df1[, i]), i] <- df1[is.na(df1[, i]), df2[i]$Varlist2]
}
v*******e
发帖数: 133
5
Thank you so much for your answer, getdown2 and HJH!
I did not get a chance to test HJH's approach. But I've just used getdown2's
approach and it worked!

【在 g******2 的大作中提到】
: please test your code before post your answer:
: df1 <- data.frame(x1=1:10,
: x2=2:11,
: x3=3:12,
: x4=4:13)
: df1$x1[2:4] <- NA
: df1$x2[4:7] <- NA
: df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"))
: setkey(df2, Varlist1)
: #HJH approach:

H**********f
发帖数: 2978
6
HJH's code also works with some minor modifications:
df1[, as.character(df2$Varlist1)] = ifelse(as.matrix(is.na(df1[, as.
character(df2$Varlist1)])), as.matrix(df1[, as.character(df2$Varlist2)]), as
.matrix(df1[, as.character(df2$Varlist1)]))
kinda ugly, but no loops

's

【在 v*******e 的大作中提到】
: Thank you so much for your answer, getdown2 and HJH!
: I did not get a chance to test HJH's approach. But I've just used getdown2's
: approach and it worked!

H*H
发帖数: 472
7
Thanks, I think there is no need to change the code. Just adjust the format
of df1 and df2 a little bit, and it works perfectly. LZ didn't provide the
exact data format and class of each column. Thus it is hard to give an exact
answer. The code is just to give an idea of removing for loop.
df1 <- as.matrix(data.frame(x1=1:10,
x2=2:11,
x3=3:12,
x4=4:13))
df1[2:4, 'x1'] <- NA
df1[4:7, 'x2'] <- NA
df2 <- data.frame(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"),
stringsAsFactors = FALSE)
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
df1[, df2$Varlist2], df1[, df2$Varlist1])
df1
x1 x2 x3 x4
[1,] 1 2 3 4
[2,] 4 3 4 5
[3,] 5 4 5 6
[4,] 6 7 6 7
[5,] 5 8 7 8
[6,] 6 9 8 9
[7,] 7 10 9 10
[8,] 8 9 10 11
[9,] 9 10 11 12
[10,] 10 11 12 13

as

【在 H**********f 的大作中提到】
: HJH's code also works with some minor modifications:
: df1[, as.character(df2$Varlist1)] = ifelse(as.matrix(is.na(df1[, as.
: character(df2$Varlist1)])), as.matrix(df1[, as.character(df2$Varlist2)]), as
: .matrix(df1[, as.character(df2$Varlist1)]))
: kinda ugly, but no loops
:
: 's

H*H
发帖数: 472
8
I am sorry if your test example doesn't work with my code, but I did test it
before I posted. Because LZ didn't provide a reproducible example, it is
difficult to give an exact answer he/she wants. Data.table is a
good package, and I used it a lot, especially for big data manipulation. The
idea behind my code is to avoid the for loop, and it is not in conflict
with data.table.

【在 g******2 的大作中提到】
: please test your code before post your answer:
: df1 <- data.frame(x1=1:10,
: x2=2:11,
: x3=3:12,
: x4=4:13)
: df1$x1[2:4] <- NA
: df1$x2[4:7] <- NA
: df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"))
: setkey(df2, Varlist1)
: #HJH approach:

k*******a
发帖数: 772
9
very good discussion, here is my way:
## create dictionary
dic <- df2$Varlist2
names(dic) <- df2$Varlist1
for (name in df2$Varlist1) {
df1[[name]] <- ifelse(is.na(df1[[name]]), df1[[dic[name]]], df1[[name]])
}
l******n
发帖数: 9344
10
用sqldf应该是最容易的吧

【在 v*******e 的大作中提到】
: 工作当中遇到的,纠结了半天了
: df1;
: v1 v2 v3 v4.......v99 v100
: df2;
: Varlist1 Varlist 2
: V1 V10
: V2 V16
: V5 V39
: . .
: . .

相关主题
【R】关于R的variable type弱弱地,一个关于SPSS的问题........
用R出现怪问题。问一下, data mining是具体做什么工作?
my R job求助!该如何准备这个电话面试?
进入Statistics版参与讨论
c******y
发帖数: 3269
11
Efficiency too low when dataset is big
That's one thing I dislike in R

【在 l******n 的大作中提到】
: 用sqldf应该是最容易的吧
l******n
发帖数: 9344
12
嗯,数据大了R handle不了,那是另外的问题。上边说的问题正好,其实excel就可以
解决,而且人家还是多先multicore的,不慢

【在 c******y 的大作中提到】
: Efficiency too low when dataset is big
: That's one thing I dislike in R

v*******e
发帖数: 133
13
工作当中遇到的,纠结了半天了
df1;
v1 v2 v3 v4.......v99 v100
df2;
Varlist1 Varlist 2
V1 V10
V2 V16
V5 V39
. .
. .
需要的是把df1里面存在于df2里Varlist1的variables的missing value用Varlist 2对应
的variable来置换
比较傻的做法,手工填写:
df1$V1=ifelse(is.na(df1$V1),df1$V10, df1$V1)
and so on...
可是df2有可能change dynamically.
如何来写这个loop.
g******2
发帖数: 234
14
use data.table to make df2 a data.table, and setkey(df2, Varlist1)
for (i in df2$Varlist1) {
df1[is.na(df1[, i]), i] <- df1[is.na(df1[, i]), df2[i]$Varlist2]
}
H*H
发帖数: 472
15
LZ这个貌似没必要写loop吧,直接data frame操作:
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
df1[, df2$Varlist2], df1[, df2$Varlist1])
g******2
发帖数: 234
16
please test your code before post your answer:
df1 <- data.frame(x1=1:10,
x2=2:11,
x3=3:12,
x4=4:13)
df1$x1[2:4] <- NA
df1$x2[4:7] <- NA
df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"))
setkey(df2, Varlist1)
#HJH approach:
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
df1[, df2$Varlist2], df1[, df2$Varlist1])
#warning message, and df1$x1 still has NA
#my approach
for (i in df2$Varlist1) {
df1[is.na(df1[, i]), i] <- df1[is.na(df1[, i]), df2[i]$Varlist2]
}
v*******e
发帖数: 133
17
Thank you so much for your answer, getdown2 and HJH!
I did not get a chance to test HJH's approach. But I've just used getdown2's
approach and it worked!

【在 g******2 的大作中提到】
: please test your code before post your answer:
: df1 <- data.frame(x1=1:10,
: x2=2:11,
: x3=3:12,
: x4=4:13)
: df1$x1[2:4] <- NA
: df1$x2[4:7] <- NA
: df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"))
: setkey(df2, Varlist1)
: #HJH approach:

H**********f
发帖数: 2978
18
HJH's code also works with some minor modifications:
df1[, as.character(df2$Varlist1)] = ifelse(as.matrix(is.na(df1[, as.
character(df2$Varlist1)])), as.matrix(df1[, as.character(df2$Varlist2)]), as
.matrix(df1[, as.character(df2$Varlist1)]))
kinda ugly, but no loops

's

【在 v*******e 的大作中提到】
: Thank you so much for your answer, getdown2 and HJH!
: I did not get a chance to test HJH's approach. But I've just used getdown2's
: approach and it worked!

H*H
发帖数: 472
19
Thanks, I think there is no need to change the code. Just adjust the format
of df1 and df2 a little bit, and it works perfectly. LZ didn't provide the
exact data format and class of each column. Thus it is hard to give an exact
answer. The code is just to give an idea of removing for loop.
df1 <- as.matrix(data.frame(x1=1:10,
x2=2:11,
x3=3:12,
x4=4:13))
df1[2:4, 'x1'] <- NA
df1[4:7, 'x2'] <- NA
df2 <- data.frame(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"),
stringsAsFactors = FALSE)
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
df1[, df2$Varlist2], df1[, df2$Varlist1])
df1
x1 x2 x3 x4
[1,] 1 2 3 4
[2,] 4 3 4 5
[3,] 5 4 5 6
[4,] 6 7 6 7
[5,] 5 8 7 8
[6,] 6 9 8 9
[7,] 7 10 9 10
[8,] 8 9 10 11
[9,] 9 10 11 12
[10,] 10 11 12 13

as

【在 H**********f 的大作中提到】
: HJH's code also works with some minor modifications:
: df1[, as.character(df2$Varlist1)] = ifelse(as.matrix(is.na(df1[, as.
: character(df2$Varlist1)])), as.matrix(df1[, as.character(df2$Varlist2)]), as
: .matrix(df1[, as.character(df2$Varlist1)]))
: kinda ugly, but no loops
:
: 's

H*H
发帖数: 472
20
I am sorry if your test example doesn't work with my code, but I did test it
before I posted. Because LZ didn't provide a reproducible example, it is
difficult to give an exact answer he/she wants. Data.table is a
good package, and I used it a lot, especially for big data manipulation. The
idea behind my code is to avoid the for loop, and it is not in conflict
with data.table.

【在 g******2 的大作中提到】
: please test your code before post your answer:
: df1 <- data.frame(x1=1:10,
: x2=2:11,
: x3=3:12,
: x4=4:13)
: df1$x1[2:4] <- NA
: df1$x2[4:7] <- NA
: df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"))
: setkey(df2, Varlist1)
: #HJH approach:

相关主题
Urgent! Intra-observer variabilitylos angeles的openings (转载)
请教一个qualification问一个R&R measuring system的问题。
medical device 和 drug 的clinical trial 有什么不同谁有这本书的电子版吗
进入Statistics版参与讨论
k*******a
发帖数: 772
21
very good discussion, here is my way:
## create dictionary
dic <- df2$Varlist2
names(dic) <- df2$Varlist1
for (name in df2$Varlist1) {
df1[[name]] <- ifelse(is.na(df1[[name]]), df1[[dic[name]]], df1[[name]])
}
l******n
发帖数: 9344
22
用sqldf应该是最容易的吧

【在 v*******e 的大作中提到】
: 工作当中遇到的,纠结了半天了
: df1;
: v1 v2 v3 v4.......v99 v100
: df2;
: Varlist1 Varlist 2
: V1 V10
: V2 V16
: V5 V39
: . .
: . .

c******y
发帖数: 3269
23
Efficiency too low when dataset is big
That's one thing I dislike in R

【在 l******n 的大作中提到】
: 用sqldf应该是最容易的吧
l******n
发帖数: 9344
24
嗯,数据大了R handle不了,那是另外的问题。上边说的问题正好,其实excel就可以
解决,而且人家还是多先multicore的,不慢

【在 c******y 的大作中提到】
: Efficiency too low when dataset is big
: That's one thing I dislike in R

c******y
发帖数: 3269
25
前面我没表达清楚
我的意思是sqldf和R的其他方法比,数据大的情况下sqldf效率比较低
R没有比较高效的sql package,这点是R比较弱的

【在 l******n 的大作中提到】
: 嗯,数据大了R handle不了,那是另外的问题。上边说的问题正好,其实excel就可以
: 解决,而且人家还是多先multicore的,不慢

l******n
发帖数: 9344
26
其实R不需要呀,你直接用高效的sql工具处理,把结果在导入R就行了。R主要是作数据
处理的,不包括BI。sqldf很多时候比较简洁易懂,用起来方便

【在 c******y 的大作中提到】
: 前面我没表达清楚
: 我的意思是sqldf和R的其他方法比,数据大的情况下sqldf效率比较低
: R没有比较高效的sql package,这点是R比较弱的

c******y
发帖数: 3269
27
sqldf很多时候比较简洁易懂,用起来方便
Agree, it's just not efficient.
Therefore, if the users ETL in R, I'd recommend other R packages instead of
sqldf, so they can get a better idea of how R manipulates data.
If users prefer SQL, I would suggest the same as you do: let the pro-
software do SQL.

【在 l******n 的大作中提到】
: 其实R不需要呀,你直接用高效的sql工具处理,把结果在导入R就行了。R主要是作数据
: 处理的,不包括BI。sqldf很多时候比较简洁易懂,用起来方便

t********m
发帖数: 939
28
A very good post! Mark.
1 (共1页)
进入Statistics版参与讨论
相关主题
问一个R&R measuring system的问题。请教用SAS的一个数据处理的问题
谁有这本书的电子版吗【R】关于R的variable type
Senior Statistician position in Santa Monica CA用R出现怪问题。
Statistician/Data Mining Position in Santa Monica, CAmy R job
急问:到哪里能买或租到computing facility/server?弱弱地,一个关于SPSS的问题........
how to set the seed in R问一下, data mining是具体做什么工作?
在sas里引用变量,如v1 - v100, 如果变量的数目不是固定的,怎么办?求助!该如何准备这个电话面试?
请教一个SPSS处理数据的问题,非常感谢!!Urgent! Intra-observer variability
相关话题的讨论汇总
话题: df1话题: df2话题: varlist1话题: varlist2话题: ifelse