DataSciences board - generating percentile-percentage charts
c***z
Posts: 6348
1
I spent some time generating this kind of chart from raw data. There may be
better ways of doing it, but I'll post my method as a starting point, in the
hope of drawing out better ideas.
The raw table has three columns: clinic | age | count. It records patient ages
as the number of patients in each age category at each clinic.
The target table has three columns: clinic | age_percentile | count_percentage.
It records the percentage of patients in each age category, with the
categories expressed as percentiles (e.g. if there are only two age
categories, the percentiles would be 50 and 100).
Here is the R code (I know Scala code would probably be simpler, but my
company is not using it):
# melt() below is assumed to come from the reshape2 package
# (the post does not name the package)
library(reshape2)

# order by clinic and age
visits <- visits[with(visits, order(clinic, age)), ]

# percentiles of age within each clinic
percentiles <- by(visits$age,
                  list(visits$clinic),
                  function(x) trunc(rank(x) / length(x) * 100),
                  simplify = TRUE)

# percentages of count within each clinic
percentages <- by(visits$count,
                  list(visits$clinic),
                  function(x) x / sum(x),
                  simplify = TRUE)

# put them together
patient_percentiles <- cbind(row.names(percentiles),
                             percentiles,
                             percentages)

patient_percentiles <- data.frame(patient_percentiles)

# unpack list elements
patient_percentiles <- with(patient_percentiles,
                            cbind(melt(percentiles),
                                  melt(percentages)))

# clean up: reorder and rename the columns
patient_percentiles <- patient_percentiles[, c(2, 1, 3)]
colnames(patient_percentiles) <- c("clinic", "age_percentiles",
                                   "count_percentages")
f***8
Posts: 571
2
Could you post some data? I'm not quite sure what
"Raw table has three columns: clinic | age | count, which records the age of
patients, rather, how many of each age category."
means.
I have a feeling dplyr might make this more concise?
c***z
Posts: 6348
3
Sorry, here is an example:
clinic | age | count
A | 12 | 3
A | 18 | 2
B | 22 | 4
B | 40 | 2
That is, clinic A has 3 patients aged 12 and 2 aged 18; clinic B has 4 patients aged 22 and 2 aged 40.
Thanks for the reply, I'll take a look at dplyr.
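
For reference, a minimal base R sketch of what the target table would look like for these four rows, assuming the percentile is trunc(rank / n * 100) within each clinic as in the first post. It uses ave() rather than by() and melt(), so it is a swapped-in technique and not the original poster's code; the name `target` is made up for illustration.

```r
# sample rows from the example above
visits <- data.frame(clinic = c("A", "A", "B", "B"),
                     age    = c(12, 18, 22, 40),
                     count  = c(3, 2, 4, 2))

# percentile of each age category within its clinic
age_percentile <- ave(visits$age, visits$clinic,
                      FUN = function(x) trunc(rank(x) / length(x) * 100))

# share of the clinic's patients that falls in each age category
count_percentage <- ave(visits$count, visits$clinic,
                        FUN = function(x) x / sum(x))

# `target` is a hypothetical name for the resulting table
target <- data.frame(clinic = visits$clinic,
                     age_percentile = age_percentile,
                     count_percentage = count_percentage)
print(target)
```

With two age categories per clinic the percentiles come out as 50 and 100, and clinic A's shares are 0.6 and 0.4.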
c***z
Posts: 6348
4
Sorry, I forgot one step:
# add up for each percentile
patient_percentiles_fin <- aggregate(count_percentages
                                     ~ clinic + age_percentiles,
                                     FUN = sum,
                                     data = patient_percentiles)
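
A small sketch of when this step changes anything, assuming the raw table can contain tied ages within a clinic so that two rows land on the same percentile. The three rows below are invented for illustration and are not from the thread.

```r
# hypothetical intermediate table: two rows of clinic A share percentile 50
patient_percentiles <- data.frame(
  clinic            = c("A", "A", "A"),
  age_percentiles   = c(50, 50, 100),
  count_percentages = c(0.2, 0.4, 0.4))

# sum the percentages of rows that share a clinic and a percentile
patient_percentiles_fin <- aggregate(count_percentages
                                     ~ clinic + age_percentiles,
                                     FUN = sum,
                                     data = patient_percentiles)
print(patient_percentiles_fin)
```

The two rows at percentile 50 collapse into one row with 0.6, leaving one row per clinic and percentile.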
c***z
Posts: 6348
5
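The boss has a new request: this time they want cumulative percentages.
# note: the column indices appear to assume a wide table, with clinic in
# column 1 and one column per percentile in columns 2 to 102
patient_percentiles_cum <- patient_percentiles_fin[, c(1, 102)]
colnames(patient_percentiles_cum)[2] <- "top.0"
for (k in 1:100) {
  # take the top k + 1 percentile columns, counting down from column 102
  temp <- patient_percentiles_fin[, c(102:(102 - k))]

  # row-wise sum gives the cumulative percentage for each clinic
  top <- apply(temp, 1, FUN = sum)
  top <- data.frame(top)

  patient_percentiles_cum <- cbind(patient_percentiles_cum, top)

  colnames(patient_percentiles_cum)[2 + k] <- paste("top", k, sep = ".")
}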
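
An alternative sketch, not the poster's method: if the table is kept in long format (one row per clinic and percentile), the cumulative share from the top percentile downward could be computed with cumsum() inside ave(). The data and the names `fin` and `cum_percentage` are invented for illustration.

```r
# hypothetical long-format table: one row per clinic and percentile
fin <- data.frame(clinic            = c("A", "A", "A", "B", "B"),
                  age_percentiles   = c(33, 66, 100, 50, 100),
                  count_percentages = c(0.5, 0.3, 0.2, 0.75, 0.25))

# sort so that the highest percentile comes first within each clinic
fin <- fin[order(fin$clinic, -fin$age_percentiles), ]

# running total from the top percentile downward, restarted per clinic
fin$cum_percentage <- ave(fin$count_percentages, fin$clinic, FUN = cumsum)
print(fin)
```

Each clinic's running total ends at 1, which is a quick sanity check on the percentages.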
H****E
Posts: 254
c***z
Posts: 6348
7
Thanks a lot for the tip!

[Quoting H****E's post:]
: http://stats.stackexchange.com/questions/8225/how-to-summarize-
f***8
Posts: 571
8
Synthetic data:
library(dplyr) # version: ≥0.3
set.seed(123)
# note: data_frame() is deprecated in newer releases; tibble() is its replacement
visits <- data_frame(clinic=sample(LETTERS[1:5], 20, replace=TRUE)) %>%
  group_by(clinic) %>%
  mutate(age=sample(1:50, length(clinic), replace=FALSE),
         count=sample(1:100, length(clinic), replace=TRUE)) %>%
  arrange(clinic, age)
My approach:
patient_percentiles2 <- visits %>%
  group_by(clinic) %>%
  mutate(age.percentile=as.integer(min_rank(age)/length(age)*100),
         count.percentage=count/sum(count)) %>%
  select(clinic, age.percentile, count.percentage)
Just a starting point; comments and corrections welcome!
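
One difference between the two versions that may be worth checking: base rank() averages tied ranks by default, while dplyr's min_rank() is documented as equivalent to rank(ties.method = "min"). The two agree when ages are unique within a clinic, as in the synthetic data (sampled without replacement). A small base R sketch with made-up ages:

```r
# made-up ages with one tie
ages <- c(10, 10, 20, 30)

# base rank() averages tied ranks by default: 1.5 1.5 3 4
avg_rank <- rank(ages)

# ties.method = "min" gives tied values the smallest rank: 1 1 3 4
min_rank_base <- rank(ages, ties.method = "min")

# percentile as computed in the base R version of the thread
pct_avg <- trunc(avg_rank / length(ages) * 100)

# percentile as computed in the dplyr version, using the base equivalent
pct_min <- as.integer(min_rank_base / length(ages) * 100)

print(rbind(pct_avg, pct_min))
```

With the tie, the base version assigns 37 to both tied rows while the min-rank version assigns 25, so the two approaches can diverge on data with repeated ages.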

[Quoting c***z's post:]
: sorry, here is an example
: clinic | age | count
: A | 12 | 3
: A | 18 | 2
: B | 22 | 4
: B | 40 | 2
: That is, clinic A has 3 patients aged 12 and 2 aged 18; clinic B has 4 patients aged 22 and 2 aged 40.
: Thanks for the reply, I'll take a look at dplyr.

c***z
Posts: 6348
9
Thanks a lot! Will definitely try it out.

[Quoting f***8's post:]
: Synthetic data:
: library(dplyr) # version: ≥0.3
: set.seed(123)
: visits <- data_frame(clinic=sample(LETTERS[1:5], 20, replace=TRUE)) %>%
: group_by(clinic) %>%
: mutate(age=sample(1:50, length(clinic), replace=FALSE),
: count=sample(1:100, length(clinic), replace=TRUE)) %>%
: arrange(clinic, age)
: My approach:
: patient_percentiles2 <- visits %>%

c***z
Posts: 6348
10
Yes, it works like a charm! Thanks a lot!

[Quoting f***8's post:]
: Synthetic data:
: library(dplyr) # version: ≥0.3
: set.seed(123)
: visits <- data_frame(clinic=sample(LETTERS[1:5], 20, replace=TRUE)) %>%
: group_by(clinic) %>%
: mutate(age=sample(1:50, length(clinic), replace=FALSE),
: count=sample(1:100, length(clinic), replace=TRUE)) %>%
: arrange(clinic, age)
: My approach:
: patient_percentiles2 <- visits %>%

c***z
Posts: 6348
11
And it is beautiful in style, I can feel the flow. :)
f***8
Posts: 571
12
The credit goes to Hadley Wickham.

[Quoting c***z's post:]
: And it is beautiful in style, I can feel the flow. :)