A question about data processing in R, waiting online for answers - Statistics board

v*******e
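Posts: 133

I ran into this at work and have been stuck on it for quite a while.
df1:
V1 V2 V3 V4 ... V99 V100
df2:
Varlist1 Varlist2
V1 V10
V2 V16
V5 V39
. .
. .
For every variable in df1 that is listed in df2's Varlist1, I need to replace its missing values with the values of the corresponding variable in Varlist2.
The clumsy way is to write each one out by hand:
df1$V1 = ifelse(is.na(df1$V1), df1$V10, df1$V1)
and so on...
But df2 may change dynamically.
How should I write this loop?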

g******2
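Posts: 234

Use data.table: convert df2 to a data.table and call setkey(df2, Varlist1), then loop over the keyed column (df1 stays a plain data.frame):
for (i in df2$Varlist1) {
  # df2[i] looks up the row whose key (Varlist1) equals i
  df1[is.na(df1[, i]), i] <- df1[is.na(df1[, i]), df2[i]$Varlist2]
}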
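For anyone without data.table installed, the same lookup can probably be done in base R with match(). This is only a minimal sketch: the toy df1 and df2 below are made up for illustration, and it assumes the Varlist columns are character rather than factor.

```r
# Toy data (hypothetical, for illustration only)
df1 <- data.frame(V1 = c(1, NA, 3), V2 = c(NA, 5, 6),
                  V10 = c(10, 20, 30), V16 = c(40, 50, 60))
df2 <- data.frame(Varlist1 = c("V1", "V2"), Varlist2 = c("V10", "V16"),
                  stringsAsFactors = FALSE)

for (i in df2$Varlist1) {
  # Find the replacement column paired with i in df2
  j <- df2$Varlist2[match(i, df2$Varlist1)]
  # Overwrite only the rows where the target column is missing
  na_rows <- is.na(df1[[i]])
  df1[[i]][na_rows] <- df1[[j]][na_rows]
}
print(df1)
```

Because the pairs are read from df2 on every run, nothing needs to be edited by hand when df2 changes.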

H*H
Posts: 472

OP, it looks like there is no need for a loop here; just operate on the data frame directly:
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
                              df1[, df2$Varlist2], df1[, df2$Varlist1])

g******2
Posts: 234

Please test your code before posting your answer:
library(data.table)
df1 <- data.frame(x1 = 1:10,
                  x2 = 2:11,
                  x3 = 3:12,
                  x4 = 4:13)
df1$x1[2:4] <- NA
df1$x2[4:7] <- NA
df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"))
setkey(df2, Varlist1)
# HJH's approach:
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
                              df1[, df2$Varlist2], df1[, df2$Varlist1])
# gives a warning message, and df1$x1 still has NA
# my approach:
for (i in df2$Varlist1) {
  df1[is.na(df1[, i]), i] <- df1[is.na(df1[, i]), df2[i]$Varlist2]
}
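A likely reason the ifelse() version misbehaves on data frames: is.na() on a data frame returns a logical matrix with one entry per cell, while the data frames passed as yes and no are lists with one entry per column, so ifelse() recycles whole columns instead of cells. A small sketch with made-up toy data (base R only) shows the mismatch:

```r
# Toy data mirroring the test above (hypothetical values)
df1 <- data.frame(x1 = c(1, NA, 3), x2 = c(NA, 5, 6),
                  x3 = c(7, 8, 9), x4 = c(10, 11, 12))

test <- is.na(df1[, c("x1", "x2")])   # a 3 x 2 logical matrix
yes  <- df1[, c("x3", "x4")]          # a data frame, i.e. a list of 2 columns

is.matrix(test)   # TRUE
length(test)      # 6: one entry per cell
length(yes)       # 2: one entry per column, not per cell

# ifelse() recycles `yes` as a list, so each "cell" of the result
# ends up holding a whole column rather than a single value
res <- ifelse(test, yes, df1[, c("x1", "x2")])
is.list(res)      # TRUE: a list, not a numeric matrix
```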

v*******e
Posts: 133

Thank you so much for your answer, getdown2 and HJH!
I did not get a chance to test HJH's approach. But I've just used getdown2's
approach and it worked!

[Quoting g******2:]

: Please test your code before posting your answer:
: df1 <- data.frame(x1=1:10,
: x2=2:11,
: x3=3:12,
: x4=4:13)
: df1$x1[2:4] <- NA
: df1$x2[4:7] <- NA
: df2 <- data.table(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"))
: setkey(df2, Varlist1)
: #HJH approach:

H**********f
Posts: 2978

HJH's code also works with some minor modifications:
df1[, as.character(df2$Varlist1)] = ifelse(
  as.matrix(is.na(df1[, as.character(df2$Varlist1)])),
  as.matrix(df1[, as.character(df2$Varlist2)]),
  as.matrix(df1[, as.character(df2$Varlist1)]))
kinda ugly, but no loops

[Quoting v*******e:]

: Thank you so much for your answer, getdown2 and HJH!
: I did not get a chance to test HJH's approach. But I've just used getdown2's
: approach and it worked!
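The matrix idea can also be written without ifelse(), using a logical mask to copy values across. This is a sketch under assumptions: the toy data is made up, the column names in df2 are character, and all the columns involved are numeric (as.matrix() on mixed types would coerce everything to character).

```r
# Toy data (hypothetical, for illustration only)
df1 <- data.frame(V1 = c(1, NA, 3), V2 = c(NA, 5, 6),
                  V10 = c(10, 20, 30), V16 = c(40, 50, 60))
df2 <- data.frame(Varlist1 = c("V1", "V2"), Varlist2 = c("V10", "V16"),
                  stringsAsFactors = FALSE)

# Pull target and source columns out as same-shaped matrices
targets <- as.matrix(df1[, df2$Varlist1, drop = FALSE])
sources <- as.matrix(df1[, df2$Varlist2, drop = FALSE])

# Copy source values into exactly the cells where the target is NA
holes <- is.na(targets)
targets[holes] <- sources[holes]

df1[, df2$Varlist1] <- targets
print(df1)
```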

H*H
Posts: 472

Thanks, I think there is no need to change the code. Just adjust the format
of df1 and df2 a little, and it works perfectly. The OP didn't provide the
exact data format or the class of each column, so it is hard to give an exact
answer. The code is only meant to show the idea of removing the for loop.
df1 <- as.matrix(data.frame(x1 = 1:10,
                            x2 = 2:11,
                            x3 = 3:12,
                            x4 = 4:13))
df1[2:4, 'x1'] <- NA
df1[4:7, 'x2'] <- NA
df2 <- data.frame(Varlist1 = c("x1", "x2"), Varlist2 = c("x3", "x4"),
                  stringsAsFactors = FALSE)
df1[, df2$Varlist1] <- ifelse(is.na(df1[, df2$Varlist1]),
                              df1[, df2$Varlist2], df1[, df2$Varlist1])
df1
      x1 x2 x3 x4
 [1,]  1  2  3  4
 [2,]  4  3  4  5
 [3,]  5  4  5  6
 [4,]  6  7  6  7
 [5,]  5  8  7  8
 [6,]  6  9  8  9
 [7,]  7 10  9 10
 [8,]  8  9 10 11
 [9,]  9 10 11 12
[10,] 10 11 12 13

[Quoting H**********f:]

: HJH's code also works with some minor modifications:
: df1[, as.character(df2$Varlist1)] = ifelse(
:   as.matrix(is.na(df1[, as.character(df2$Varlist1)])),
:   as.matrix(df1[, as.character(df2$Varlist2)]),
:   as.matrix(df1[, as.character(df2$Varlist1)]))
: kinda ugly, but no loops

H*H
Posts: 472

I am sorry if your test example doesn't work with my code, but I did test it
before I posted. Because the OP didn't provide a reproducible example, it is
difficult to give the exact answer he or she wants. data.table is a
good package, and I use it a lot, especially for big data manipulation. The
idea behind my code is to avoid the for loop, and it does not conflict
with data.table.

[Quoting g******2:]

k*******a
Posts: 772

Very good discussion. Here is my way:
## create a dictionary: names are the Varlist1 variables, values their Varlist2 replacements
dic <- as.character(df2$Varlist2)
names(dic) <- as.character(df2$Varlist1)
for (name in names(dic)) {
  df1[[name]] <- ifelse(is.na(df1[[name]]), df1[[dic[[name]]]], df1[[name]])
}
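Since df2 may change dynamically, the dictionary idea could be wrapped in a small helper that takes the mapping as an argument. The function name fill_from_partner and the toy data are hypothetical, made up for this sketch. Note one difference from the loop: every replacement reads from the original df, so a chained mapping (V1 from V10, V10 from V39) would not cascade.

```r
# Hypothetical helper: fill NAs in each Varlist1 column from its Varlist2 partner
fill_from_partner <- function(df, map) {
  targets <- as.character(map$Varlist1)
  sources <- as.character(map$Varlist2)
  # Fail early if the mapping mentions columns that df does not have
  stopifnot(all(c(targets, sources) %in% names(df)))
  df[targets] <- Map(function(target, source) {
    ifelse(is.na(df[[target]]), df[[source]], df[[target]])
  }, targets, sources)
  df
}

# Toy data (hypothetical, for illustration only)
df1 <- data.frame(V1 = c(1, NA, 3), V2 = c(NA, 5, 6),
                  V10 = c(10, 20, 30), V16 = c(40, 50, 60))
df2 <- data.frame(Varlist1 = c("V1", "V2"), Varlist2 = c("V10", "V16"),
                  stringsAsFactors = FALSE)

out <- fill_from_partner(df1, df2)
print(out)
```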

l******n
Posts: 9344

Using sqldf is probably the easiest way.

[Quoting v*******e:]

: I ran into this at work and have been stuck on it for quite a while.
: df1:
: V1 V2 V3 V4 ... V99 V100
: df2:
: Varlist1 Varlist2
: V1 V10
: V2 V16
: V5 V39
: . .
: . .
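If one did go the SQL route, the per-column logic maps naturally onto COALESCE(target, source). The sketch below only builds the query text from df2 in base R; the column names are made up, and actually running the query through sqldf is an untested assumption here (it would need the sqldf package installed and a data frame named df1 in scope).

```r
# Hypothetical mapping and column names, for illustration only
df2 <- data.frame(Varlist1 = c("V1", "V2"), Varlist2 = c("V10", "V16"),
                  stringsAsFactors = FALSE)
all_cols <- c("V1", "V2", "V10", "V16")   # stands in for names(df1)

# For each column, find its replacement source (NA if it has none)
src <- df2$Varlist2[match(all_cols, df2$Varlist1)]

# Mapped columns become COALESCE(target, source); the rest pass through
select_terms <- ifelse(is.na(src),
                       all_cols,
                       sprintf("COALESCE(%s, %s) AS %s", all_cols, src, all_cols))
query <- paste("SELECT", paste(select_terms, collapse = ", "), "FROM df1")
print(query)

# Assumption, not run here: sqldf::sqldf(query) would then return the filled table
```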

c******y
Posts: 3269

Its efficiency is too low when the dataset is big.
That's one thing I dislike about R.

[Quoting l******n:]

: Using sqldf is probably the easiest way.

l******n
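Posts: 9344

Well, R not being able to handle big data is a separate issue. The problem above is a good fit for it; in fact Excel
could solve it too, and Excel is multithreaded and multicore as well, so it isn't slow.

[Quoting c******y:]

: Its efficiency is too low when the dataset is big.
: That's one thing I dislike about R.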

c******y
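Posts: 3269

I didn't express myself clearly earlier.
What I meant is that, compared with other approaches in R, sqldf is relatively inefficient when the data is large.
R has no reasonably efficient SQL package, and that is one of R's weaker points.

[Quoting l******n:]

: Well, R not being able to handle big data is a separate issue. The problem above is a good fit for it; in fact Excel
: could solve it too, and Excel is multithreaded and multicore as well, so it isn't slow.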

l******n
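Posts: 9344

Actually, R doesn't need one. Just do the processing with an efficient SQL tool and then import the result into R. R is mainly for
data processing, not BI. sqldf is often concise and easy to understand, and convenient to use.

[Quoting c******y:]

: I didn't express myself clearly earlier.
: What I meant is that, compared with other approaches in R, sqldf is relatively inefficient when the data is large.
: R has no reasonably efficient SQL package, and that is one of R's weaker points.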

c******y
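Posts: 3269

"sqldf is often concise and easy to understand, and convenient to use."
Agreed; it's just not efficient.
So if users do their ETL in R, I'd recommend other R packages instead of
sqldf, so that they get a better idea of how R manipulates data.
If users prefer SQL, I would suggest the same as you do: let professional
software handle the SQL.

[Quoting l******n:]

: Actually, R doesn't need one. Just do the processing with an efficient SQL tool and then import the result into R. R is mainly for
: data processing, not BI. sqldf is often concise and easy to understand, and convenient to use.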

t********m
Posts: 939

A very good post! Mark.
