generating percentile-percentage charts - DataSciences board - Weiming (未名) archive

c***z posts: 6348	1
I spent some time generating this kind of chart from raw data. There may be better ways of doing it, but I will post my method in the hope of drawing out better ideas.

The raw table has three columns: clinic | age | count. It records, for each clinic, how many patients fall in each age category.

The target table has three columns: clinic | age_percentile | count_percentage. It records the percentage of patients in each age category, with the categories expressed as percentiles (e.g. if there are only two age categories, the percentiles are 50 and 100).

Here is the R code (I know Scala code would be simpler, but my company is not using it):

    library(reshape2)  # assumption: melt() below comes from reshape2; the post does not say

    # order by clinic and age
    visits <- visits[with(visits, order(clinic, age)), ]

    # percentiles of age
    percentiles <- by(visits$age, list(visits$clinic),
                      function(x) trunc(rank(x) / length(x) * 100),
                      simplify = TRUE)

    # percentages of count
    percentages <- by(visits$count, list(visits$clinic),
                      function(x) x / sum(x),
                      simplify = TRUE)

    # put them together
    patient_percentiles <- cbind(row.names(percentiles), percentiles, percentages)
    patient_percentiles <- data.frame(patient_percentiles)

    # unpack list elements
    patient_percentiles <- with(patient_percentiles,
                                cbind(melt(percentiles), melt(percentages)))

    # clean up
    patient_percentiles <- patient_percentiles[, c(2, 1, 3)]
    colnames(patient_percentiles) <- c("clinic", "age_percentiles", "count_percentages")
f***8 posts: 571	2
Could you post some data? I am not quite sure what "Raw table has three columns: clinic | age | count, which records the age of patients, rather, how many of each age category." means. I also have a feeling that dplyr might make this more concise.
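c***z posts: 6348	3
Sorry, here is an example:

clinic | age | count
A | 12 | 3
A | 18 | 2
B | 22 | 4
B | 40 | 2

That is, clinic A has 3 patients aged 12 and 2 patients aged 18; clinic B has 4 patients aged 22 and 2 patients aged 40. Thanks for the reply, I will go look at dplyr.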
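For readers following along, here is a minimal base-R sketch of the recipe from the first post applied to the four-row example. It uses ave() instead of the by()/melt() combination in the posted code, so it is a swapped-in technique and not the poster's own method. The column names age_percentile and count_percentage follow the target table described in the thread.

```r
# The four-row example from the thread.
visits <- data.frame(
  clinic = c("A", "A", "B", "B"),
  age    = c(12, 18, 22, 40),
  count  = c(3, 2, 4, 2)
)

# Percentile of each age category within its clinic:
# rank / number of categories * 100, truncated as in the first post.
visits$age_percentile <- ave(
  visits$age, visits$clinic,
  FUN = function(x) trunc(rank(x) / length(x) * 100)
)

# Share of the clinic's patients that falls in each age category.
visits$count_percentage <- ave(
  visits$count, visits$clinic,
  FUN = function(x) x / sum(x)
)

print(visits)
```

Each clinic here has two age categories, so the percentiles come out as 50 and 100, which matches the description in the first post. The shares are 3/5 and 2/5 for clinic A, and 4/6 and 2/6 for clinic B.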
c***z posts: 6348	4
Sorry, I forgot one step:

    # add up for each percentile
    patient_percentiles_fin <- aggregate(count_percentages ~ clinic + age_percentiles,
                                         FUN = sum, data = patient_percentiles)
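c***z posts: 6348	5
The boss has come up with something new: this time it is cumulative percentages.

    # Note: the column indices below imply a wide table (column 1 = clinic,
    # columns 2..102 = one column per percentile); that reshaping step is
    # not shown in the thread.
    patient_percentiles_cum <- patient_percentiles_fin[, c(1, 102)]
    colnames(patient_percentiles_cum)[2] <- "top.0"
    for (k in 1:100) {
      # k <- 1
      temp <- patient_percentiles_fin[, c(102:(102 - k))]
      top <- apply(temp, 1, FUN = sum)
      top <- data.frame(top)
      patient_percentiles_cum <- cbind(patient_percentiles_cum, top)
      colnames(patient_percentiles_cum)[2 + k] <- paste("top", k, sep = ".")
    }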
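A possible alternative to the column-by-column loop, sketched in base R under the assumption that the aggregated table is kept in long format (one row per clinic and percentile). The data frame fin below is made-up illustration data, not taken from the thread, and cum_from_top is an invented column name. Sorting each clinic from the highest percentile down and taking a running sum gives the share of patients at or above each percentile, which appears to be what the top.k columns accumulate.

```r
# Hypothetical long-format table: one row per clinic and age percentile.
fin <- data.frame(
  clinic            = c("A", "A", "B", "B", "B"),
  age_percentiles   = c(50, 100, 33, 66, 100),
  count_percentages = c(0.6, 0.4, 0.5, 0.3, 0.2)
)

# Sort within each clinic from the top percentile downwards.
fin <- fin[order(fin$clinic, -fin$age_percentiles), ]

# Running sum per clinic: share of patients at or above each percentile.
fin$cum_from_top <- ave(fin$count_percentages, fin$clinic, FUN = cumsum)

print(fin)
```

The last row of each clinic should sum to 1, which is a quick sanity check on the input percentages.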
H****E posts: 254	6
http://stats.stackexchange.com/questions/8225/how-to-summarize-
c***z posts: 6348	7
Thanks a lot for the tip!

[Quoting H****E:]
: http://stats.stackexchange.com/questions/8225/how-to-summarize-
f***8 posts: 571	8
Synthetic data:

    library(dplyr)  # version: ≥0.3
    set.seed(123)
    visits <- data_frame(clinic = sample(LETTERS[1:5], 20, replace = TRUE)) %>%
      group_by(clinic) %>%
      mutate(age = sample(1:50, length(clinic), replace = FALSE),
             count = sample(1:100, length(clinic), replace = TRUE)) %>%
      arrange(clinic, age)

My approach:

    patient_percentiles2 <- visits %>%
      group_by(clinic) %>%
      mutate(age.percentile = as.integer(min_rank(age) / length(age) * 100),
             count.percentage = count / sum(count)) %>%
      select(clinic, age.percentile, count.percentage)

Just a rough first attempt; comments are welcome!

[Quoting c***z:]
: Sorry, here is an example
: clinic | age | count
: A | 12 | 3
: A | 18 | 2
: B | 22 | 4
: B | 40 | 2
: That is, clinic A has 3 patients aged 12 and 2 patients aged 18; clinic B has 4 patients aged 22 and 2 patients aged 40.
: Thanks for the reply, I will go look at dplyr.
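One small difference between the two posted versions is how ties are ranked. Base rank() averages tied ranks by default, while dplyr's min_rank() gives ties the smallest rank, which is equivalent to rank(ties.method = "min"). This only matters if the same age appears more than once within a clinic. A base-R sketch with a made-up age vector (not data from the thread):

```r
# Hypothetical ages for one clinic, with one tie.
ages <- c(10, 20, 20, 30)

# Default rank(): tied values share the average rank (2.5 here).
pct_avg <- trunc(rank(ages) / length(ages) * 100)

# ties.method = "min": tied values share the smallest rank (2 here),
# which is what dplyr::min_rank() computes.
pct_min <- trunc(rank(ages, ties.method = "min") / length(ages) * 100)

print(rbind(pct_avg, pct_min))
```

With unique ages per clinic, as in the examples posted in the thread, the two rankings agree.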
c***z posts: 6348	9
Thanks a lot! Definitely will try it out.

[Quoting f***8:]
: Synthetic data:
: library(dplyr) # version: ≥0.3
: set.seed(123)
: visits <- data_frame(clinic=sample(LETTERS[1:5], 20, replace=TRUE)) %>%
: group_by(clinic) %>%
: mutate(age=sample(1:50, length(clinic), replace=FALSE),
: count=sample(1:100, length(clinic), replace=TRUE)) %>%
: arrange(clinic, age)
: My approach:
: patient_percentiles2 <- visits %>%
c***z posts: 6348	10
Yes, it works like a charm! Thanks a lot!

[Quoting f***8:]
: Synthetic data:
: library(dplyr) # version: ≥0.3
: set.seed(123)
: visits <- data_frame(clinic=sample(LETTERS[1:5], 20, replace=TRUE)) %>%
: group_by(clinic) %>%
: mutate(age=sample(1:50, length(clinic), replace=FALSE),
: count=sample(1:100, length(clinic), replace=TRUE)) %>%
: arrange(clinic, age)
: My approach:
: patient_percentiles2 <- visits %>%
c***z posts: 6348	11
And it is beautiful in style, I can feel the flow. :)
f***8 posts: 571	12
The credit goes to Hadley Wickham.

[Quoting c***z:]
: And it is beautiful in style, I can feel the flow. :)
