Discussion roundup on dataset - 话题女王

d*******1
Posts: 293

From topic: Statistics board - how to get variable names and # of variable names in sas dataset

I want to draw a frequency distribution plot for each column in a dataset. I do not know
how many columns there are or what they are called, since both change from dataset to dataset.
So I want to write a macro that draws a plot for one column of a given dataset
and then call it in a loop:
%macro drawplot(dataset, column_name);
and use the loop to draw the plots for all columns.
The problems are:
1. How can I get the name of a column (if I know it is the #i column in the
dataset)?
2. How can I get the # of columns in the dataset?

f*******h
Posts: 1269

From topic: Faculty board - Dataset of 2200 faculty in 50 top US Computer Science Prog

Dataset of 2200 faculty in 50 top US Computer Science Programs
Alexandra Papoutsaki*, Hua Guo, Danae Metaxa-Kakavouli, Connor Gramazio,
Jeff Rasley, Wenting Xie, Guan Wang, Jeff Huang
Brown University, Providence, RI, USA
The Dataset
We provide the first free dataset of all professors in 50 top US Computer
Science Graduate Programs. We believe that we offer a valuable resource to
the academic community and to anyone who is interested in the shape of the
Computer Science in the most competitive i...

s**c
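Posts: 6

From topic: Statistics board - Question for the experts: after combining multiple datasets, how can I tell which records came from which dataset?

I used SAS proc append to combine more than 2,000 datasets.
Some of the datasets are empty, so I cannot use the record order in the combined dataset to work out which records came from
which dataset.
Thanks in advance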

g******7
Posts: 1433

From topic: Statistics board - A question in splitting dataset

I have a dataset, and I want to randomly split it into two datasets.
For example,
Obs Policy #
1 67
2 67
3 67
4 78
5 78
....
10000000 77821178
10000001 77821178
10000002 77821178
I want all rows with the same (unique) policy # to land in one split dataset (e.g. all the
67s in the first dataset), not spread across both (e.g. obs 1 in
dataset1, obs 2 in dataset2). How would I do that in SAS?
Sorry that I can't type Chinese.
Thanks!
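
One way to keep each policy # in a single split is to randomize at the policy level rather than the row level. The sketch below shows that logic in Python rather than SAS; the function name `split_by_key`, the 50/50 fraction and the seed are illustrative assumptions, and the fraction applies to the number of policies, not the number of rows.

```python
import random

def split_by_key(rows, key, fraction=0.5, seed=0):
    """Randomly split rows so that all rows sharing a key land in the same part."""
    rng = random.Random(seed)
    # Draw once per unique key, not once per row.
    unique_keys = sorted({row[key] for row in rows})
    rng.shuffle(unique_keys)
    cut = int(len(unique_keys) * fraction)
    first_keys = set(unique_keys[:cut])
    first = [row for row in rows if row[key] in first_keys]
    second = [row for row in rows if row[key] not in first_keys]
    return first, second

policies = [67, 67, 67, 78, 78, 77821178, 77821178, 77821178]
rows = [{"obs": i + 1, "policy": p} for i, p in enumerate(policies)]
part1, part2 = split_by_key(rows, "policy")
```

In SAS the same idea would mean assigning the random draw once per distinct policy # and then joining that assignment back to the rows.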

y******u
Posts: 1460

From topic: CS board - RGB-Depth Human Activity Video Dataset

Dear fellow researchers,
I would like to introduce a new RGB-Depth human daily activity video
benchmark. It has 12 action categories such as make a phone call, eat a meal,
mop the floor, etc. It has 1189 manually segmented and annotated
video segments in total. Synchronized RGB and depth images are provided at 30 FPS.
The activities are performed by 30 subjects.
You can now download this video dataset from two sources:
1) Dropbox: you can log into the dropbox using the following public
account an...

u*********e
Posts: 9616

From topic: Database board - SSRS report failing to display dataset string

Hi,
I am working on an SSRS report based on a stored procedure I created.
The returned dataset is shown in the picture.
I created a dataset using a TableAdapter. I can preview the dataset without any
issue (in pic2).
Later I bound the dataset using ObjectDataSource and selected the data source
in the rdlc file.
I dragged and dropped the values from the fields onto the table.
When I run the report, the first two columns display values correctly.
However, the remaining columns all return empty strings.
Any idea why?

P*****P
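Posts: 57

From topic: XML board - help: a question about c# xml schema dataset

I'm a beginner, so please bear with me if the question is muddled.
I want to serialize some data into an xml stream. The xml schema is done and the dataset class has been
auto-generated; I just don't know how to "assign" the data into the dataset.
m_xmldataset.Tables.Add()
m_xmldataset.Tables[0].Rows.Add(??) -- this is where I don't know what to do. Any help from the experts is
much appreciated.
I have also looked at some simple examples, such as creating a simple dataset and adding data row by row, but
I don't understand how to do this with a dataset generated from an xml schema. I'm confused.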

y******u
Posts: 1460

From topic: Computation board - RGB-Depth Human Activity Video Dataset

z**********i
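Posts: 12276

From topic: Statistics board - A question about dataset merge

I have two very large datasets, one 1.3G and the other 11G, and I am merging them in SAS.
It has been running for more than 2 days, and the new dataset is already over 500G.
Can someone with experience tell me how much longer it will take?
Is there a good way to handle datasets this large?
Many thanks!!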

A*******a
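Posts: 60

From topic: Statistics board - A dataset name that is too long...

A dataset in SQL has a 33-character name, just over the 32 characters SAS allows,
so reading it with SAS always fails with an error saying the name is too long. Is there a way to let SAS
read this dataset without renaming it? Thanks~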

S********a
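Posts: 359

From topic: Statistics board - [Baozi] Question about creating a raw SAS dataset

For example, I removed some variables from a raw dataset through data manipulation and so created a new
dataset. How do I then output this dataset from the output window as another raw dataset
(i.e. with the .sas7bdat extension)?
Baozi as thanks!!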

a***r
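Posts: 420

From topic: Statistics board - [Help] Large Dataset Management

I need to produce a text file in the format
FAMID IID F M Sex SNP1 SNP2 SNP3...
to use as the input file for a program (MACH; the biostatistics experts on this board probably know it).
There are 2.5 million SNPs and 100 IIDs.
The original data is stored in one very long dataset with one observation per SNP per IID (250
million observations).
To produce the file above, the most direct approach is probably to run proc transpose and related steps
on the original dataset, build a dataset in the format above, and then export it.
But I only made a small change to two variables of the original dataset, and it has been running since this morning
and still has not finished (on a server). The server runs 32-bit linux.
I don't know how long it will take to finish the proc sql and proc transpose I have planned.
I have no experience at all with data this large and am really at a loss.
Is SAS a suitable choice for producing such a text file? If so, is there a better way? Or
should I choose other software and methods?
I am sincerely asking for advice; any pointers are welcome!
Thanks in advance~bow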

a***r
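Posts: 420

From topic: Statistics board - [SAS] Computing R2 for specific pairs of variables in a fairly large dataset

What I mean is: suppose there are two datasets holding various measurements before and after an experiment.
dataset1:
obs var1.1 var1.2 var1.3 var1.4...
1 2 3 3 2
2 1 2 2 2
......
dataset2:
obs var2.1 var2.2 var2.3 var2.4...
1 5 2 3 5
2 4 2 2 4
...
I want the R2 for var1.1-var2.1; var1.2-var2.2...
and to store the results in a new dataset.
My first idea is to combine the two datasets into one, call it comb,
then run proc corr data=comb outp=out; var _numeric_;
and then select and keep only pairs such as var1.1-var2.1... from out.
But this feels like a big waste of space.
Does anyone have a better suggestion?
For example, use a macro to compute only the pairs needed, but then I don't know how to combine all the R2 values into one dat...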

r****5
Posts: 618

From topic: DataSciences board - R describe dataset

I have a dataset and want to generate a description of it, like the table below. What statement does this? Thanks
--------------------------------
brief function output for dataset
--------------------------------
this dataset has 445 rows 89 attributes
real valued attribute
------------------------
dataset
-----------------------
attribute_ID, attribute_name.....0 ,2, No(471)

n********e
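Posts: 1630

From topic: DataSciences board - A question about evaluation on a small dataset

I am a beginner practicing ML. I am using a small dataset of 400 data points for classification
(0 or 1), with python and sklearn.
Because the dataset is unbalanced, I used stratified shuffle split in grid search CV
training to find the best estimator (scoring = f1).
After that, when I take the best estimator clf from several different algorithms and evaluate performance,
what strategy should I use?
1. I used the whole dataset, just once, to get predictions, then compared them to get accuracy,
precision and recall. This gives very high scores, above 0.9.
2. I also used stratified shuffle split to create 1000 folds, trained and then tested on each fold,
and averaged the accuracy, precision and recall. This gives much lower results, only
0.3-0.6.
Which one can be used as the evaluation sc...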
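
The gap between the two strategies is what one would expect if strategy 1 scores the model on the same rows it was tuned and trained on. A minimal pure-Python sketch (no sklearn; the 1-nearest-neighbour classifier and the signal-free random labels are assumptions chosen only to make the effect obvious) shows a model that memorizes its training data scoring perfectly on it while doing no better than guessing on held-out points:

```python
import random

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

rng = random.Random(0)
# 400 points with one feature and labels that carry no signal at all.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(400)]
train, held_out = data[:300], data[300:]

train_acc = accuracy(train, train)        # scored on the rows it memorized
held_out_acc = accuracy(train, held_out)  # scored on unseen rows
```

Under that reading, the averaged held-out scores from strategy 2 are the ones that estimate performance on new data.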

u*********e
Posts: 9616

From topic: Database board - SSRS report failing to display dataset string

Thanks for responding. I use VS to design the rdlc file. It was a straight
Fields!Stmt_Created_Date.Value dropped into a textbox. The fields "Group_Code"
and "Group_Member_Code" display without any problem, but the
rest do not.
I did some research. One person raised an issue similar to mine. His
stored procedure created a table variable, used a cursor to combine different
rows into one row, put that into the table variable, and selected the result as
the return set. That's very much like min...

L*******r
Posts: 1011

From topic: DotNet board - DataReader vs. DataSet

My words on this:
The dataset approach is more scalable, especially for a busy service (a lot of
connections). But datareader is faster if there are not that many requests.
hehe, dataset just "looks faster" for a busy web service.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnbda/html/bdadotnetarch031.asp
In all of the preceding tests, we saw that DataReader outperformed DataSet. As
mentioned earlier, the DataReader offers better performance because it avoids
the performance and memory overhead ass

s**********0
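Posts: 41

From topic: Statistics board - Question for the experts: after combining multiple datasets, how can I tell which records came from which dataset?

Before the append, add a variable to each of your datasets that records the dataset name; after combining, you will know the source.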
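
The idea in this reply can be sketched language-independently. The Python below is only an illustration of the logic, not SAS; the function name `append_with_source`, the column name `source` and the sample data are assumptions.

```python
def append_with_source(datasets):
    """Concatenate named datasets, tagging every record with its source name."""
    combined = []
    for name, records in datasets.items():
        for record in records:
            combined.append({**record, "source": name})
    return combined

datasets = {
    "ds1": [{"x": 1}, {"x": 2}],
    "ds2": [],            # an empty dataset simply contributes no rows
    "ds3": [{"x": 3}],
}
combined = append_with_source(datasets)
```

Because the tag travels with each record, empty inputs no longer make the origin ambiguous.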

y******0
Posts: 401

From topic: Statistics board - A question about dataset merge

Something is wrong with your code. Maybe it is a cartesian merge.
Never merge two large datasets many-to-many. I update a dataset by
merging 3 medical claims datasets (each about 20-50G) one-to-many.
It takes about 50 minutes.

A*******s
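Posts: 3942

From topic: Statistics board - [SAS] How to quickly delete temporary datasets and macro variab created in a macro

Thanks. One more question for the experts: if I write a self-referential/
self-nested macro, what happens when the temporary datasets created at different levels of the macro have the same name?
Does sas have a notion of local datasets, like local macro variables, so that datasets created
at different macro levels do not affect each other?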

P****D
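Posts: 11146

From topic: Statistics board - How to check that two large datasets are identical

If there are differences, then with datasets this large, adding those two options would be a disaster: the output fills up
instantly, then it asks whether you want to clear it, then it fills up again instantly...
I think it is better to go with 大胖猫's approach: if any result shows that the two datasets differ, the original poster
should go back and check the program that generated them.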

b******s
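Posts: 345

From topic: Statistics board - How can I get a dataset like this?

Is there a simpler method? In my post I showed only part of the existing dataset and only part of the
dataset I want. The real dataset has more than 400 rows, and the twins information I need to add covers 26
twins, so I would have to write 26 blocks like the one below. Is there a simpler way? Thanks!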
if id=1019 then do;
id=9002;
output;
end;

a**y
Posts: 335

From topic: DotNet board - DataReader vs. DataSet

In my experience, DataReader does the job fine most of the time, except that
sometimes I need some complex calculation over a big data set. I could probably
achieve the same goal with SQL, but that would make the database too
slow. In this case, dataset is used. We do sacrifice some memory, but
we save the database.

c**t
Posts: 2744

From topic: DotNet board - Is it possible to query a C# dataSet?

DataSet is your friend. You can think of DataSet as an in-memory database; you
may have "linked tables" from different data sources; you may have your own
index, pk...

N********n
Posts: 8363

From topic: Programming board - C#: DataSet vs. Class

Google ".net strong typed dataset". You can represent dataset in a
class and Visual Studio or some 3rd party tools can generate it for
you w/ a few drag&drops.

x**n
Posts: 461

From topic: Programming board - dot net Q: dataset, entity data model, LINQ, entity framework

There is no guru in this world. The difference is just experience.
If your platform has native support for active record, you may use dataset.
Generally, if your system is simple enough, you may go with dataset. Most of the time,
if you start from domain modeling and you want your domain model (the core of
your application) to last a long time, model it as persistence ignorant
(PI), not knowing how to persist itself. Then use an OR/M to map it to
persistence (most of the time a relational database). The most po...

s********l
Posts: 245

From topic: Statistics board - How to open SAS dataset

The professor gave me a SAS dataset. It is supposed to open by
double-clicking it. But the person who cleaned this dataset used the
proc format procedure, so although I tried several times, I still cannot open it.
Could anyone give me some suggestions? Thanks

l*g
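Posts: 46

From topic: Statistics board - How do I validate whether known logistic regression models predict my own dataset well?

There are several such equations, each containing different variables; the outcome is always death.
I tried plugging the required X values from my own dataset into them to compute Y, which is the individual death rate,
and then I need to find a cutoff point to decide how to classify (death or not).
The problems are: 1) How do I find the cutoff point? The hint is to use a sensitivity/specificity test,
but I don't understand how. I tried eyeballing the distribution of deaths in the dataset against the computed death rates
to estimate a cutoff point, but that does not meet the requirement.
2) After classifying, I use that as the outcome (0/1) and build a model with the same variables, hoping to reproduce
the known model so that I can run diagnostics, but I cannot reproduce it...
Sorry, I'm quite a novice... what I did is rather messy. Any advice is appreciated! Thanks!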
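
One common reading of "use sensitivity/specificity to find the cutoff" is to try every candidate cutoff and keep the one that maximizes sensitivity + specificity - 1 (Youden's J). That this is what the hint intended is an assumption, and the function name and sample numbers below are made up for illustration. A minimal Python sketch:

```python
def best_cutoff(probs, outcomes):
    """Return (cutoff, sensitivity, specificity) maximizing sensitivity + specificity - 1.

    Assumes outcomes contains at least one 1 (death) and at least one 0.
    """
    positives = sum(outcomes)
    negatives = len(outcomes) - positives
    best = None
    for cutoff in sorted(set(probs)):
        # Classify as death (1) when the predicted rate reaches the cutoff.
        tp = sum(p >= cutoff and y == 1 for p, y in zip(probs, outcomes))
        tn = sum(p < cutoff and y == 0 for p, y in zip(probs, outcomes))
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1:]

# Hypothetical predicted death rates and observed outcomes.
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90]
outcomes = [0, 0, 1, 0, 0, 1, 1, 1]
cutoff, sens, spec = best_cutoff(probs, outcomes)
```

Other criteria exist (for example, fixing a minimum sensitivity), so the choice of J here is one option rather than the required method.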

z********n
Posts: 710

From topic: Statistics board - how to output cumulative percent to a dataset from Proc Freq?

proc freq data=x;
table var1;
by var2;
run;
I need to get a new dataset from this freq table with cumulative percent in
it.
How can I do it?
proc freq data=x;
table var1/noprint out=freq;
by var2;
run;
By running the previous code, I got percent not cumulative one.
The dataset is large. Thank you!!!
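
For reference, the quantity being asked for can be derived from the per-level counts with a running sum inside each by-group. The Python sketch below only illustrates that arithmetic and is not SAS; the function name and the sample data are hypothetical.

```python
from collections import Counter
from itertools import accumulate

def cumulative_percent(values):
    """Return rows of (value, count, percent, cumulative percent), sorted by value."""
    counts = Counter(values)
    total = sum(counts.values())
    levels = sorted(counts)
    running = accumulate(counts[level] for level in levels)
    return [
        (level, counts[level], 100 * counts[level] / total, 100 * cum / total)
        for level, cum in zip(levels, running)
    ]

# One table per by-group, mirroring "by var2" (hypothetical data).
data = {"g1": ["a", "a", "b", "c"], "g2": ["x", "y"]}
tables = {group: cumulative_percent(vals) for group, vals in data.items()}
```

The same running sum could be applied to an output dataset that holds only count and percent.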

z**********i
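Posts: 12276

From topic: Statistics board - A question about dataset merge

One dataset is a claims file, covering both inpatient and physician claims, so each patient has several claims.
The other dataset is a breast cancer file with these patients' breast cancer diagnosis information, including the diagnosis date;
each person may also have several rows (not sure).
It looks like a many-to-many merge.
What I want to look at is each patient's overall health status in the 1 year before the cancer diagnosis.
Many thanks for everyone's help!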

t***1
Posts: 18

From topic: Statistics board - A question about dataset merge

Dataset merge can't handle many-to-many, only one-to-many, can it?
For merging large datasets: 1. run sql on the by-variable for each of
them to get a rough idea of how many obs are going to be created; 2. index
the data and drop all variables except the by-variable and the index.
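
Step 1 can be made concrete: for a SQL-style join, each key contributes (rows on the left) x (rows on the right), so the output size can be estimated from the key counts alone before running anything expensive. A small Python sketch of that count, with a hypothetical function name and made-up patient ids:

```python
from collections import Counter

def joined_row_count(left_keys, right_keys):
    """Number of rows a join on the key would produce for matching keys."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    # Each key contributes (rows on the left) x (rows on the right).
    return sum(n * right[key] for key, n in left.items() if key in right)

claims = ["p1", "p1", "p1", "p2", "p2", "p3"]   # hypothetical patient ids
diagnoses = ["p1", "p1", "p2", "p4"]
rows = joined_row_count(claims, diagnoses)      # 3*2 + 2*1 = 8
```

If the estimate is far larger than either input, the join is effectively many-to-many and one side probably needs to be reduced to one row per key first.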

h******e
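Posts: 1791

From topic: Statistics board - How to force-combine two datasets?

The column names in the two datasets are different. I want to force a vertical concatenation and discard the variable names
of the second dataset. How do I do that?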

b*********e
Posts: 29

From topic: Statistics board - [Question] How to sort this dataset?

data Test;
input input $ outcome $ @@;
datalines;
A 0 A 0 A 0
A 1 A 1 A 1
A 2 A 2 A 2
B 0 B 0 B 0
B 1 B 1 B 1
B 2 B 2 B 2
;
run;
proc sort data=test;
by input outcome;
run;
data test2; set test;
input class @@;
cards;
1 2 3 1 2 3
1 2 3 1 2 3
1 2 3 1 2 3
;
run;
proc sort data = test2;
by input class;
run;
I forget how to save two of the three variables in one dataset into another dataset.
You could consider using sql.

n****u
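Posts: 229

From topic: Statistics board - sas dataset -> xml

I wonder whether anyone on this board has done this: outputting a sas dataset to an xml file.
I looked at the sas website and followed the example there, but the xml is not in the format we need.
For example, the original dataset has id, birthday, labtest1, labtest2
The xml that comes out is

But the format we want is slightly different

Any suggestions? I also tried xmlmap without success; I don't know whether the approach is wrong or I just wrote it incorrectly.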

s*******d
Posts: 132

From topic: Statistics board - How to set initial dataset to zero in a SAS macro?

In a sas macro, I want to set up an empty dataset, like
%macro use;
data nlp0;
input estimate appstderr true mse relabias nlpupper nlplower nlpcp;
lines;
;
run;
….calculations..
%mend;
SAS doesn't allow this. The error: The macro USE generated CARDS (data lines)
for the DATA step, which could cause incorrect results. The DATA step and
the macro will stop executing.
Oh my lady Gaga!
The problem is that I need to clear the dataset from time to time so that new results
won't mix with old ones.
Any

g*******r
Posts: 270

From topic: Statistics board - Can computed quantiles (e.g. Q1, median, Q3) be saved in a dataset?

data a;
input group $ x @@;
cards;
a 1 a 2 a 4 a 2 a -2 a 10 a 4 a 0 a 3 a 2 a 1 a 4 a 5 a 3 a 21 a
b 3 b 1 b 3 b 9 b 12 b 3 b 4 b 33 b 4 b 7 b 23 b -11 b 8 b 0 b 5
c 12 c 2 c 3 c 4 c 21 c 34 c 5 c 7 c -10 c 15 c 5 c -2 c 6
d . . .
e . . .
.
.
,
;
I have a dataset and want to compute the quantiles (e.g. Q1, MEDIAN, Q3) within each group and save the results
in a dataset for use in the next step. Thanks!
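
The shape of the result being asked for is one row per group holding Q1, median and Q3. The Python sketch below shows that grouping step only; it is not SAS, the sample values are hypothetical, and `statistics.quantiles` uses its own default interpolation rule, which may differ from the quantile definition SAS applies.

```python
from statistics import quantiles

def quartiles_by_group(rows):
    """Map each group to its (Q1, median, Q3)."""
    groups = {}
    for group, x in rows:
        groups.setdefault(group, []).append(x)
    return {g: tuple(quantiles(xs, n=4)) for g, xs in groups.items()}

# Hypothetical (group, x) pairs in the same long layout as the data step above.
rows = [("a", v) for v in [1, 2, 3, 4, 5, 6, 7]] + [("b", v) for v in [10, 20, 30]]
result = quartiles_by_group(rows)
```

Storing `result` as a table keyed by group gives the per-group lookup needed for the next step.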

A*******s
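Posts: 3942

From topic: Statistics board - [SAS] How to quickly delete temporary datasets and macro variab created in a macro

In a macro I created a bunch of datasets and local macro variables. To avoid taking up space, is there
a convenient way to delete them at the end of the macro? Thanks
Also, in multiple-layer nested macros, is there a principle for avoiding datasets with duplicate names?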

A*******s
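Posts: 3942

From topic: Statistics board - [SAS] How to quickly delete temporary datasets and macro variab created in a macro

I skimmed it; it seems to be about global/local macro variables.
The problem I want to solve is how to keep datasets created in mutually nested macro programs at different levels
from conflicting with each other. It would be nice if these datasets had local/global scope like macro variables do.
But I have not found relevant material on google yet.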

A*******s
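Posts: 3942

From topic: Statistics board - [SAS] How to quickly delete temporary datasets and macro variab created in a macro

After thinking about it, the solution I can come up with is to use a global macro variable &layer. Each time
a macro program is called, &layer is incremented by 1, all temporary datasets created inside the macro are named
with the prefix _&layer, and at the end of the macro all datasets matching _&layer: are deleted.
I don't know whether there is a simpler way.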

A*******s
Posts: 3942

From topic: Statistics board - [SAS] how to process tree-structure dataset

Suppose we have a binary-tree-structured dataset containing ids and their parent
ids, like:
id parent
1 .
2 1
3 1
4 2
5 2
6 4
7 4
8 5
9 5
For example, 1 is the parent of 2 and 3, and 8 and 9 are the children of 5...
I want to transform the dataset as follows:
id offspring
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
2 4
2 5
2 6
2 7
2 8
2 9
3 .
4 6
4 7
5 8
5 9
6 .
7 .
8 .
9 .
any idea?
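
The transformation amounts to listing every descendant of each id. A minimal Python sketch of that traversal, run on the nine-row example above, is given here as an illustration of the logic rather than a SAS solution; the function name is hypothetical and missing values (".") are represented as None.

```python
def offspring_table(parent_of):
    """List (id, offspring) pairs for every descendant; leaves get (id, None)."""
    children = {}
    for node, parent in parent_of.items():
        children.setdefault(node, [])
        if parent is not None:
            children.setdefault(parent, []).append(node)

    def descendants(node):
        # Depth-first walk collecting children, grandchildren, and so on.
        found = []
        for child in children[node]:
            found.append(child)
            found.extend(descendants(child))
        return found

    table = []
    for node in sorted(children):
        kids = sorted(descendants(node))
        table.extend((node, kid) for kid in kids)
        if not kids:
            table.append((node, None))
    return table

parent_of = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 4, 7: 4, 8: 5, 9: 5}
table = offspring_table(parent_of)
```

The recursion depth equals the depth of the tree, so a very deep hierarchy would call for an iterative walk instead.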

a********a
Posts: 346

From topic: Statistics board - how to save dataset generated by a function in R

x=function(n){
sim=rnorm(n)
return(sim)
}
x(n=10)
> x(n=10)
[1] 0.3466314 0.7477493 1.1274950 0.3848275 0.9549582 1.1843009
[7] 0.3804086 -1.4802425 1.4219162 1.5410326
> sim
Error: object 'sim' not found
I generate the dataset 'sim' in an R function. As you can see from the result,
I can see the data I generated, but why can't sim be found? Since I
cannot get sim, I cannot save the data as a text file. Do you know how
to save a dataset generated within a function?
Thanks

s******r
Posts: 1524

From topic: Statistics board - How to tell whether a dataset is empty?

It should be 0 if the dataset is empty. It will reduce the processing time
if the dataset is huge.

r*****g
Posts: 99

From topic: Statistics board - How to change sas dataset column order

A question:
I have a dataset with the following variables: var1, var2, var4, var3
I want to move var3 in front of var4.
Other than exporting the dataset to excel and switching the order there, is there simple
sas code that solves this?
Thanks!

S********a
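Posts: 359

From topic: Statistics board - How to check that two large datasets are identical

For example, there are two large datasets, 300 variables and 600000 obs each, which I generated with two different methods
for the same purpose. How do I check that the two datasets are exactly identical?
Thanks.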

a***r
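Posts: 420

From topic: Statistics board - [Help] Large Dataset Management

The raw text is a final report produced by genomestudio, a 20G text file.
I read it into SAS with infile and input, and the resulting dataset is 30G...
SNP is a char variable, not yet coded as num; it is currently in the form "G G", so in terms of levels
there should be 16.
I also doubted whether a dataset this large would work, since just reading it in took 4 or 5 hours, but I
pressed on anyway.
If I need to learn other software for data management I am happy to, but I don't know what to learn.
There is also a 778G final report still to come, 380G once converted to a dataset, which I have not processed yet ...
Thanks!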

a***r
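Posts: 420

From topic: Statistics board - [Help] Large Dataset Management

Yes, the original dataset is already in that format.
Because I need text input in the format above, I wanted to build a dataset in that format and then export it.
It now looks like that may not work.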

b******s
Posts: 345

From topic: Statistics board - How can I get a dataset like this?

The current dataset looks like this; it is the output of a sampling step (even when the same ID is repeated, its famid1
values differ):
obs ID famid famid1
1 1002 2 1
2 1003 3 2
3 1003 3 3
4 1010 10 4
5 1044 37 5
6 1044 37 6
7 1089 49 7
What I want is this:
9020 and 9080 are the twins corresponding to 1010 and 1044. 9020 and 9080 (the twins information) are added after each
occurrence of 1010 and 1044, with the same famid1 as that occurrence.
obs ID famid famid1
1 1002 2 1
2 1003 3 2
3 1003 3 3
4 1010 10 4
5 9020 10 4
6 1044 37 5
7 9080 37 5
8 1044 ...

a****g
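Posts: 8131

From topic: Statistics board - building prediction models from large dataset

A question for everyone:
What exactly is the difference between model building on a large dataset and on an ordinary dataset?
For the metrics used to compare models, such as aic, what are the specific differences and pros and cons?
thanks a lot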

r*****g
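Posts: 99

From topic: Statistics board - SAS code help: how to find the ids that are in another dataset

I want to find, in the whole dataset, the ids that are also in another dataset. The code below is wrong; how should I
fix it?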
data two;
set whole;
if id is in dataset1 then status=1;
else status=2;
run;
Much appreciated!
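
What the pseudocode line "if id is in dataset1" is reaching for is a membership lookup against the ids of the other dataset. A minimal Python sketch of that logic follows; the function name and sample ids are hypothetical, and in SAS the same effect would need a merge, a join, or a lookup structure rather than this literal syntax.

```python
def flag_membership(whole, other_ids):
    """Add status=1 when the id appears in other_ids, else status=2."""
    lookup = set(other_ids)  # set membership is O(1) on average
    return [{**row, "status": 1 if row["id"] in lookup else 2} for row in whole]

whole = [{"id": 1}, {"id": 2}, {"id": 3}]
dataset1_ids = [2, 3, 5]
two = flag_membership(whole, dataset1_ids)
```

Building the lookup once and reusing it for every row keeps the cost linear in the size of the two datasets.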
