v*******e 发帖数: 133 | 1 工作当中遇到的,纠结了半天了
df1;
v1 v2 v3 v4.......v99 v100
df2;
Varlist1 Varlist 2
V1 V10
V2 V16
V5 V39
. .
. .
需要的是把df1里面存在于df2里Varlist1的variables的missing value用Varlist 2对应
的variable来置换
比较傻的做法,手工填写:
df1$V1=ifelse(is.na(df1$V1),df1$V10, df1$V1)
and so on...
可是df2有可能change dynamically.
如何来写这个loop. | g******2 发帖数: 234 | 2 use data.table to make df2 a data.table, and setkey(df2, Varlist1)
for (i in df2$Varlist1) {
df1[is.na(df1[, i]), i] <- df1[is.na(df1[, i]), df2[i]$Varlist2]
} | H*H 发帖数: 472 | 3 LZ这个貌似没必要写loop吧,直接data frame操作:
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
df1[, df2$Varlist2], df1[, df2$Varlist1]) | g******2 发帖数: 234 | 4 please test your code before post your answer:
df1 <- data.frame(x1=1:10,
x2=2:11,
x3=3:12,
x4=4:13)
df1$x1[2:4] <- NA
df1$x2[4:7] <- NA
df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"))
setkey(df2, Varlist1)
#HJH approach:
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
df1[, df2$Varlist2], df1[, df2$Varlist1])
#warning message, and df1$x1 still has NA
#my approach
for (i in df2$Varlist1) {
df1[is.na(df1[, i]), i] <- df1[is.na(df1[, i]), df2[i]$Varlist2]
} | v*******e 发帖数: 133 | 5 Thank you so much for your answer, getdown2 and HJH!
I did not get a chance to test HJH's approach. But I've just used getdown2's
approach and it worked!
【在 g******2 的大作中提到】 : please test your code before post your answer: : df1 <- data.frame(x1=1:10, : x2=2:11, : x3=3:12, : x4=4:13) : df1$x1[2:4] <- NA : df1$x2[4:7] <- NA : df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4")) : setkey(df2, Varlist1) : #HJH approach:
| H**********f 发帖数: 2978 | 6 HJH's code also works with some minor modifications:
df1[, as.character(df2$Varlist1)] = ifelse(as.matrix(is.na(df1[, as.
character(df2$Varlist1)])), as.matrix(df1[, as.character(df2$Varlist2)]), as
.matrix(df1[, as.character(df2$Varlist1)]))
kinda ugly, but no loops
's
【在 v*******e 的大作中提到】 : Thank you so much for your answer, getdown2 and HJH! : I did not get a chance to test HJH's approach. But I've just used getdown2's : approach and it worked!
| H*H 发帖数: 472 | 7 Thanks, I think there is no need to change the code. Just adjust the format
of df1 and df2 a little bit, and it works perfectly. LZ didn't provide the
exact data format and class of each column. Thus it is hard to give an exact
answer. The code is just to give an idea of removing for loop.
df1 <- as.matrix(data.frame(x1=1:10,
x2=2:11,
x3=3:12,
x4=4:13))
df1[2:4, 'x1'] <- NA
df1[4:7, 'x2'] <- NA
df2 <- data.frame(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"),
stringsAsFactors = FALSE)
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
df1[, df2$Varlist2], df1[, df2$Varlist1])
df1
x1 x2 x3 x4
[1,] 1 2 3 4
[2,] 4 3 4 5
[3,] 5 4 5 6
[4,] 6 7 6 7
[5,] 5 8 7 8
[6,] 6 9 8 9
[7,] 7 10 9 10
[8,] 8 9 10 11
[9,] 9 10 11 12
[10,] 10 11 12 13
as
【在 H**********f 的大作中提到】 : HJH's code also works with some minor modifications: : df1[, as.character(df2$Varlist1)] = ifelse(as.matrix(is.na(df1[, as. : character(df2$Varlist1)])), as.matrix(df1[, as.character(df2$Varlist2)]), as : .matrix(df1[, as.character(df2$Varlist1)])) : kinda ugly, but no loops : : 's
| H*H 发帖数: 472 | 8 I am sorry if your test example doesn't work with my code, but I did test it
before I posted. Because LZ didn't provide a reproducible example, it is
difficult to give an exact answer he/she wants. Data.table is a
good package, and I used it a lot, especially for big data manipulation. The
idea behind my code is to avoid the for loop, and it is not in conflict
with data.table.
【在 g******2 的大作中提到】 : please test your code before post your answer: : df1 <- data.frame(x1=1:10, : x2=2:11, : x3=3:12, : x4=4:13) : df1$x1[2:4] <- NA : df1$x2[4:7] <- NA : df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4")) : setkey(df2, Varlist1) : #HJH approach:
| k*******a 发帖数: 772 | 9 very good discussion, here is my way:
## create dictionary
dic <- df2$Varlist2
names(dic) <- df2$Varlist1
for (name in df2$Varlist1) {
df1[[name]] <- ifelse(is.na(df1[[name]]), df1[[dic[name]]], df1[[name]])
} | l******n 发帖数: 9344 | 10 用sqldf应该是最容易的吧
【在 v*******e 的大作中提到】 : 工作当中遇到的,纠结了半天了 : df1; : v1 v2 v3 v4.......v99 v100 : df2; : Varlist1 Varlist 2 : V1 V10 : V2 V16 : V5 V39 : . . : . .
| | | c******y 发帖数: 3269 | 11 Efficiency too low when dataset is big
That's one thing I dislike in R
【在 l******n 的大作中提到】 : 用sqldf应该是最容易的吧
| l******n 发帖数: 9344 | 12 嗯,数据大了R handle不了,那是另外的问题。上边说的问题正好,其实excel就可以
解决,而且人家还是多先multicore的,不慢
【在 c******y 的大作中提到】 : Efficiency too low when dataset is big : That's one thing I dislike in R
| v*******e 发帖数: 133 | 13 工作当中遇到的,纠结了半天了
df1;
v1 v2 v3 v4.......v99 v100
df2;
Varlist1 Varlist 2
V1 V10
V2 V16
V5 V39
. .
. .
需要的是把df1里面存在于df2里Varlist1的variables的missing value用Varlist 2对应
的variable来置换
比较傻的做法,手工填写:
df1$V1=ifelse(is.na(df1$V1),df1$V10, df1$V1)
and so on...
可是df2有可能change dynamically.
如何来写这个loop. | g******2 发帖数: 234 | 14 use data.table to make df2 a data.table, and setkey(df2, Varlist1)
for (i in df2$Varlist1) {
df1[is.na(df1[, i]), i] <- df1[is.na(df1[, i]), df2[i]$Varlist2]
} | H*H 发帖数: 472 | 15 LZ这个貌似没必要写loop吧,直接data frame操作:
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
df1[, df2$Varlist2], df1[, df2$Varlist1]) | g******2 发帖数: 234 | 16 please test your code before post your answer:
df1 <- data.frame(x1=1:10,
x2=2:11,
x3=3:12,
x4=4:13)
df1$x1[2:4] <- NA
df1$x2[4:7] <- NA
df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"))
setkey(df2, Varlist1)
#HJH approach:
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
df1[, df2$Varlist2], df1[, df2$Varlist1])
#warning message, and df1$x1 still has NA
#my approach
for (i in df2$Varlist1) {
df1[is.na(df1[, i]), i] <- df1[is.na(df1[, i]), df2[i]$Varlist2]
} | v*******e 发帖数: 133 | 17 Thank you so much for your answer, getdown2 and HJH!
I did not get a chance to test HJH's approach. But I've just used getdown2's
approach and it worked!
【在 g******2 的大作中提到】 : please test your code before post your answer: : df1 <- data.frame(x1=1:10, : x2=2:11, : x3=3:12, : x4=4:13) : df1$x1[2:4] <- NA : df1$x2[4:7] <- NA : df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4")) : setkey(df2, Varlist1) : #HJH approach:
| H**********f 发帖数: 2978 | 18 HJH's code also works with some minor modifications:
df1[, as.character(df2$Varlist1)] = ifelse(as.matrix(is.na(df1[, as.
character(df2$Varlist1)])), as.matrix(df1[, as.character(df2$Varlist2)]), as
.matrix(df1[, as.character(df2$Varlist1)]))
kinda ugly, but no loops
's
【在 v*******e 的大作中提到】 : Thank you so much for your answer, getdown2 and HJH! : I did not get a chance to test HJH's approach. But I've just used getdown2's : approach and it worked!
| H*H 发帖数: 472 | 19 Thanks, I think there is no need to change the code. Just adjust the format
of df1 and df2 a little bit, and it works perfectly. LZ didn't provide the
exact data format and class of each column. Thus it is hard to give an exact
answer. The code is just to give an idea of removing for loop.
df1 <- as.matrix(data.frame(x1=1:10,
x2=2:11,
x3=3:12,
x4=4:13))
df1[2:4, 'x1'] <- NA
df1[4:7, 'x2'] <- NA
df2 <- data.frame(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"),
stringsAsFactors = FALSE)
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
df1[, df2$Varlist2], df1[, df2$Varlist1])
df1
x1 x2 x3 x4
[1,] 1 2 3 4
[2,] 4 3 4 5
[3,] 5 4 5 6
[4,] 6 7 6 7
[5,] 5 8 7 8
[6,] 6 9 8 9
[7,] 7 10 9 10
[8,] 8 9 10 11
[9,] 9 10 11 12
[10,] 10 11 12 13
as
【在 H**********f 的大作中提到】 : HJH's code also works with some minor modifications: : df1[, as.character(df2$Varlist1)] = ifelse(as.matrix(is.na(df1[, as. : character(df2$Varlist1)])), as.matrix(df1[, as.character(df2$Varlist2)]), as : .matrix(df1[, as.character(df2$Varlist1)])) : kinda ugly, but no loops : : 's
| H*H 发帖数: 472 | 20 I am sorry if your test example doesn't work with my code, but I did test it
before I posted. Because LZ didn't provide a reproducible example, it is
difficult to give an exact answer he/she wants. Data.table is a
good package, and I used it a lot, especially for big data manipulation. The
idea behind my code is to avoid the for loop, and it is not in conflict
with data.table.
【在 g******2 的大作中提到】 : please test your code before post your answer: : df1 <- data.frame(x1=1:10, : x2=2:11, : x3=3:12, : x4=4:13) : df1$x1[2:4] <- NA : df1$x2[4:7] <- NA : df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4")) : setkey(df2, Varlist1) : #HJH approach:
| | | k*******a 发帖数: 772 | 21 very good discussion, here is my way:
## create dictionary
dic <- df2$Varlist2
names(dic) <- df2$Varlist1
for (name in df2$Varlist1) {
df1[[name]] <- ifelse(is.na(df1[[name]]), df1[[dic[name]]], df1[[name]])
} | l******n 发帖数: 9344 | 22 用sqldf应该是最容易的吧
【在 v*******e 的大作中提到】 : 工作当中遇到的,纠结了半天了 : df1; : v1 v2 v3 v4.......v99 v100 : df2; : Varlist1 Varlist 2 : V1 V10 : V2 V16 : V5 V39 : . . : . .
| c******y 发帖数: 3269 | 23 Efficiency too low when dataset is big
That's one thing I dislike in R
【在 l******n 的大作中提到】 : 用sqldf应该是最容易的吧
| l******n 发帖数: 9344 | 24 嗯,数据大了R handle不了,那是另外的问题。上边说的问题正好,其实excel就可以
解决,而且人家还是多先multicore的,不慢
【在 c******y 的大作中提到】 : Efficiency too low when dataset is big : That's one thing I dislike in R
| c******y 发帖数: 3269 | 25 前面我没表达清楚
我的意思是sqldf和R的其他方法比,数据大的情况下sqldf效率比较低
R没有比较高效的sql package,这点是R比较弱的
【在 l******n 的大作中提到】 : 嗯,数据大了R handle不了,那是另外的问题。上边说的问题正好,其实excel就可以 : 解决,而且人家还是多先multicore的,不慢
| l******n 发帖数: 9344 | 26 其实R不需要呀,你直接用高效的sql工具处理,把结果在导入R就行了。R主要是作数据
处理的,不包括BI。sqldf很多时候比较简洁易懂,用起来方便
【在 c******y 的大作中提到】 : 前面我没表达清楚 : 我的意思是sqldf和R的其他方法比,数据大的情况下sqldf效率比较低 : R没有比较高效的sql package,这点是R比较弱的
| c******y 发帖数: 3269 | 27 sqldf很多时候比较简洁易懂,用起来方便
Agree, it's just not efficient.
Therefore, if the users ETL in R, I'd recommend other R packages instead of
sqldf, so they can get a better idea of how R manipulates data.
If users prefer SQL, I would suggest the same as you do: let the pro-
software do SQL.
【在 l******n 的大作中提到】 : 其实R不需要呀,你直接用高效的sql工具处理,把结果在导入R就行了。R主要是作数据 : 处理的,不包括BI。sqldf很多时候比较简洁易懂,用起来方便
| t********m 发帖数: 939 | 28 A very good post! Mark. |
|