A SAS programming question about trade volume - Statistics board

F****3
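Posts: 1504

The data are trade records:
Investor Date Time
ABC 21 01:58:28
XYZ 21 02:39:12
ABC 22 03:11:13
.
.
.
I want to pick out anyone who traded 10 times within any one-minute interval and flag them as a "robot". How should I do this?
In SAS I only know the handful of common things. I can't even read the IML programs the experts on this board write, which is quite discouraging... While I'm at it: is it worth studying SAS programming systematically, even though I haven't needed it so far (perhaps because I don't know how to use it)? Mainly, I read about it before, never used it, and forgot it all!
Thanks!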

l****u
Posts: 529

I have not checked this:
proc sql;
  create table a as
  select distinct a.investor, 1 as flag
  from yourdata as a, yourdata as b
  where a.investor=b.investor and a.date=b.date
    and a.time<=b.time<=a.time+60
  group by a.investor, a.date, a.time
  having count(a.investor) >=10;
quit;

F****3
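Posts: 1504

Thanks!!!
Very enlightening. I'll put it on the server today. A single file is 160 GB, so I hope the machine doesn't blow up.
a.time<=b.time<=a.time+60
Does this condition essentially build a matrix?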
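
On the matrix question, a minimal sketch on made-up data may help. The table toy, its five rows, and the output table pairs are invented for illustration, and the range condition is written with BETWEEN, which expresses the same window. The self-join conceptually pairs every row with every other row and keeps the pairs that share an investor and date and fall inside the 60-second window, so in effect it does behave like a filtered matrix, and the work can grow with the square of the row count.

```sas
/* Toy data: these rows are invented for illustration only */
data toy;
  input investor $ date time :time8.;
  format time time8.;
  datalines;
ABC 21 01:58:28
ABC 21 01:58:40
ABC 21 01:59:10
ABC 21 02:05:00
XYZ 21 01:58:30
;
run;

proc sql;
  /* one output row per (a, b) pair that satisfies the window condition */
  create table pairs as
  select a.investor, a.date, a.time as atime, b.time as btime
  from toy as a, toy as b
  where a.investor=b.investor and a.date=b.date
    and b.time between a.time and a.time+60
  order by investor, date, atime, btime;
quit;
```

For these five rows the join keeps 8 pairs. The GROUP BY and HAVING in the query above then count how many pairs each starting trade has.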

j*****g
Posts: 36

The data set is too big. The SQL step will take forever to run. You could try a DATA step with BY and RETAIN statements together with the LAG function.

F****3
Posts: 1504

How would I do this with retain? If a given investor traded 10 times or more within any 60-second interval on that day, then he is labeled as a robot.
How do I write this? I have never used retain... Thanks!

[In reply to j*****g]

l****u
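Posts: 529

data two;
  set one(rename=(time=atime id=aid date=adate)) nobs=n;
  /* compare the current trade with every later record */
  do i=_N_ to n;
    set one(rename=(time=btime id=bid date=bdate)) point=i;
    /* IF-THEN, not a subsetting IF, which would end the loop at the first non-match */
    if aid=bid and adate=bdate and atime<=btime<=atime+60 then output;
  end;
run;
proc sort data=two; by aid adate atime;
run;
data three;
  set two;
  by aid adate atime;
  keep aid adate indicator;
  retain cnt;
  /* count the trades that fall in the window starting at atime */
  if first.atime then cnt=0;
  cnt+1;
  if last.atime and cnt>=10 then do;
    indicator='robot';
    output;
  end;
run;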

j******o
Posts: 127

How is your data sorted? By date and time?

F****3
Posts: 1504

My data is sorted by date and then time, but I should be able to re-sort it with proc sort; it just takes some time. Does this matter much?
The labeling of investors as robots is specific to a given date: if on the next day the investor never trades more than 10 times over any 60-second interval, then he is NOT labeled as a robot on that day.

[In reply to j******o]

F****3
Posts: 1504

Thank you so much!!!
I'll go back and study it carefully. I have never used retain before....

[In reply to l****u]

j*****g
Posts: 36

Not tested, and not perfect either (the first 10 trades are not labeled), but it should be more efficient if it works:
proc sort data = dataset0 out = dataset1;
  by investor date time;
run;
data dataset2(drop = count lag10inv lag10date);
  set dataset1;
  by investor date;
  /* call LAG on every row so the queues stay in step */
  lag10time = lag10(time);
  lag10inv = lag10(investor);
  lag10date = lag10(date);
  /* ignore a lagged time that belongs to another investor or date */
  if lag10inv ne investor or lag10date ne date then lag10time = .;
  retain count 0 group "not robot";
  /* reset per investor and date, since the label is date-specific */
  if first.date then do;
    count = 0;
    group = "not robot";
  end;
  if count >= 10 then do;
    count = 0;
    group = "not robot";
  end;
  if count > 0 and count < 10 then do;
    count = count + 1;
    if not missing(lag10time) and time - lag10time <= 60 then do;
      count = 1;
    end;
  end;
  if count = 0 then do;
    if not missing(lag10time) and time - lag10time <= 60 then do;
      count = 1;
      group = "robot";
    end;
  end;
run;

F****3
Posts: 1504

I'm so grateful!! Thank you!
I'll submit your program to the server tonight as well. The first program has already been running for two days. The second program was submitted last night and is still running, though it has finished its first part.
The SAS level on the Statistics board is really impressive... I must study hard!

[In reply to j*****g]

c**d
Posts: 104

Maybe you can try PROC SUMMARY first, since it is very efficient at summarizing large data without sorting:
/* step 1: generate a new data set that includes all unique combinations */
/* of Investor, Date, and Time, with their frequency */
proc summary data = yourdata nway noprint;
  class Investor Date Time;
  output out = freq_data;
run;
/*
part of your new data (proc summary returns it sorted, so you don't need to sort;
the count is stored in _FREQ_):
Investor Date Time _FREQ_ (# traded)
ABC 21 01:58:28 3
ABC 21 01:59:28 7
ABC 21 02:00:00 4
ABC 21 02:01:00 4
*/
/* step 2: after you reduce your data size, you can modify the proc sql or
data step code above to get your result */
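
Step 2 above is left open. One way it might look, as a rough sketch rather than c**d's own code: each row of freq_data now stands for _FREQ_ trades, so a LAG-based check no longer applies directly, and this version instead slides a 60-second window over each investor-date group and sums _FREQ_ inside it. The toy rows, the data set names trades and robots, the array size of 86400 (one slot per whole second of a day, which assumes Time has no fractional seconds), and the reliance on PROC SUMMARY returning rows in class-variable order are all assumptions.

```sas
/* Toy data: invented rows; ABC makes 10 trades within 42 seconds on date 21 */
data trades;
  input investor $ date time :time8.;
  format time time8.;
  datalines;
ABC 21 01:58:28
ABC 21 01:58:28
ABC 21 01:58:28
ABC 21 01:58:40
ABC 21 01:58:40
ABC 21 01:58:40
ABC 21 01:58:40
ABC 21 01:59:10
ABC 21 01:59:10
ABC 21 01:59:10
ABC 22 03:11:13
XYZ 21 02:39:12
;
run;

/* step 1 from the post: one row per investor, date, time with its count */
proc summary data=trades nway;
  class investor date time;
  output out=freq_data;
run;

/* step 2 (sketch): weighted sliding window within each investor-date group */
data robots(keep=investor date);
  array t[86400] _temporary_;   /* distinct trade times in the group */
  array f[86400] _temporary_;   /* number of trades at each time */
  n = 0;
  /* load one investor-date group into the arrays */
  do until (last.date);
    set freq_data;
    by investor date;
    n + 1;
    t[n] = time;
    f[n] = _freq_;
  end;
  /* advance hi, and move lo forward until the window spans at most 60 seconds */
  lo = 1;
  total = 0;
  robot = 0;
  do hi = 1 to n while (robot = 0);
    total = total + f[hi];
    do while (t[hi] - t[lo] > 60);
      total = total - f[lo];
      lo = lo + 1;
    end;
    if total >= 10 then robot = 1;
  end;
  if robot then output;
run;
```

With the toy rows, robots should contain the single investor-date pair ABC, 21. Each group is read once and each array element enters and leaves the window at most once, so the second step is a single pass over freq_data.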

F****3
Posts: 1504

Thank you for your help. I'll try your method right now.
The server is still running. I asked the SAS people, but I don't know what the SPDE they mention is...
The proc summary you suggested may be somewhat similar to the proc freq approach mentioned below:

I would do a proc freq on the investor id, then create multiple data sets with all of the observations for any one investor id. I'd limit the size of the data sets to .5M observations. That may be 10 different investors in a data set or 100K investors in a data set, depending on how many observations each investor has. Obviously investors with a count <11 need not be included at all.

Sort the data set by investor date time, then spin through with a DATA step. Compare the datetime with the LAG10 of the datetime. If the difference is < 60, then output the observation to a data set that will contain the investor id and Date of those observations to delete.

Finally, merge together the data set with the investor/Date combinations to remove and DELETE the intersection. You can do that with MERGE or PROC SQL, whichever you like.

Given the size of the master data set, you will probably want to index it by a composite index containing investor & Date. One other idea that might speed this up: you may want to put the master data set into a SPDE data set across many directories. It will take a while to load, but retrieval is very fast, and you will do a lot of retrieving. You might be able to pull out each investor into a single temporary data set and process it very quickly. SPDE can return data in sorted order, so you could eliminate the PROC SORT completely. Create a temporary file of those investor/Date combinations to delete and APPEND it to the master list of investor/Date combinations to delete. Once all of the searching is done, use SQL or the MODIFY statement to remove the observations.

[In reply to c**d]

l****u
Posts: 529

Could this be simpler? So far I have not found a logic error.
proc sort data=yourdata;
  by id date time;
run;
data yourdata;
  set yourdata;
  lagid=lag10(id);
  lagtime=lag10(time);
  lagdate=lag10(date);
  if id=lagid and date=lagdate and time-lagtime<=60 then do;
    indicator='robot';
    output;
  end;
run;

l****u
Posts: 529

Thinking about it again, the following change would be better:
lagid=lag9(id);
lagtime=lag9(time);
lagdate=lag9(date);

[In reply to l****u]

c**d
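Posts: 104

There is nothing wrong with your answer. His problem is that the data is too big to run proc sort first. The server will collapse. For example, our data warehouse has a monitoring table that holds minute-by-minute data for all patients. We usually consider sort and sql only after we reduce or aggregate the data.
His task is still just a simple count.

[In reply to l****u]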

F****3
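Posts: 1504

I'm running it on a server. proc sort takes somewhat long, but it does finish.
The first method doesn't seem to work: it has been running for four days with no result and is about to hit the CPU time limit. The later ones all seem to run, but I don't know whether the results are correct.
Another question: is
proc summary data = yourdata nway noprint;
  class Investor Date Time;
  output out = freq_data;
run;
equivalent to:

proc sql;
  create table new (compress=yes) as
  select distinct id, date, time, count(stock_id) as visit_count
  from old (keep=id date time stock_id)
  group by id, date, time
  order by id, date, time;
quit;
Thanks!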

F****3
Posts: 1504

Thanks so much. I'll put your method on the server today too.
I'm now hogging 6 of the university's nodes all by myself, haha.

[In reply to l****u]

o****o
Posts: 8077

I could actually write you a very fast program that does what you need; if the I/O is fast enough, it would take 1-2 hours.
But I would charge for this.

[In reply to F****3]

F****3
Posts: 1504

What advanced method is that? So mysterious!!!
Would the legendary baozi do? Though I still don't know how to send baozi... haha

[In reply to o****o]

o****o
Posts: 8077

A real fee. If you are willing, contact me by private message.

[In reply to F****3]

j******o
Posts: 127

If the data can be sorted by ID, Date, Time, try the code below. If sorting takes too long, I think a single pass over the raw data in a data step with a 10-second window, combined with techniques such as hash or dynamic arrays, should finish faster.
-------------------------------------------
/* Please test. Assumes your data ("have") has no duplicates and that "Stock" does not matter. */
proc sort data=have(keep=investor date time) out=sorted_have;
  by investor date time;
run;

/* Only output the robot investor list in the final data "obtain". */
data obtain;
  set sorted_have(keep=investor date time);
  by investor date time;
  lag9_date=lag9(date);
  lag9_time=lag9(time);
  if first.investor then do; obs_count=.; robot=.; end;
  obs_count+1;
  retain robot;
  if obs_count>=10 then do;
    /* 10-second window as posted; the original question asks for 60 seconds */
    if date=lag9_date and 0<=time-lag9_time<10 then robot=1;
    /* The next line handled windows that cross midnight. It is truncated in
       the archive, so it is kept here only as a comment:
       else if date-lag9_date=1 and -86399 ... */
  end;
  if last.investor and robot=1;
  keep investor;
run;
-------------------------------------------

[In reply to F****3]

t********m
Posts: 939

Marking this thread. Everyone here is an expert!

F****3
Posts: 1504

Thank you so much!!!
I have submitted yours as well. The jobs for the earlier methods have finished, and I'm checking the results on my desktop.
Also, could proc expand do this too? The university server seems to be holding up: everything except the first method has now finished running.

[In reply to j******o]

v********9
Posts: 35

Could you first use PROC TRANSPOSE with BY INVESTOR DATE, and then use an ARRAY to do pairwise comparisons?
I don't know how much time and memory PROC TRANSPOSE would take.
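
A minimal sketch of how the transpose idea above might be written, under several assumptions: the input (here a made-up data set named sorted) is already ordered by investor, date, and time; the names sorted, wide, robots, tt, and the prefix t are invented; and the widest investor-date group is small enough for one row per group to be practical. Because the times in each row are in ascending order, comparing each time with the one nine positions earlier is enough, so a full pairwise comparison is not needed.

```sas
/* Toy data, already in investor/date/time order; the rows are invented */
data sorted;
  input investor $ date time :time8.;
  format time time8.;
  datalines;
ABC 21 01:58:28
ABC 21 01:58:30
ABC 21 01:58:32
ABC 21 01:58:34
ABC 21 01:58:36
ABC 21 01:58:38
ABC 21 01:58:40
ABC 21 01:58:42
ABC 21 01:58:44
ABC 21 01:58:46
ABC 22 03:11:13
XYZ 21 02:39:12
;
run;

/* one row per investor and date; trade times become columns t1, t2, ... */
proc transpose data=sorted out=wide(drop=_name_) prefix=t;
  by investor date;
  var time;
run;

data robots(keep=investor date);
  set wide;
  array tt[*] t:;   /* all transposed time columns */
  robot = 0;
  /* the 10th trade counting back from position i is at position i-9 */
  do i = 10 to dim(tt) while (robot = 0);
    if not missing(tt[i]) and tt[i] - tt[i-9] <= 60 then robot = 1;
  end;
  if robot;
run;
```

With the toy rows, robots should hold only ABC on date 21. The wide table has as many time columns as the busiest investor-date group has trades, which is the main cost to watch on a large file.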

F****3
Posts: 1504

How do you think this program should be written? The sort job I submitted a few days ago died, reporting insufficient space. I have now submitted another one. Let's see what happens...

[In reply to v********9]

h***x
Posts: 586

Quite a few people use SAS to study stocks these days. Do you want to pick out these robots and study their behavior patterns? Just curious!

[In reply to F****3]

s*********e
Posts: 1051

For sorting large data, create an index first.

[In reply to F****3]

s******r
Posts: 1524

I worked on a project similar to yours. I do not think your data set would be larger than mine. I used a database other than SAS, but the basic logic should be similar. For one thing, if there are multiple transactions per second, try grouping by second first; it would reduce the volume a lot.

[In reply to F****3]

F****3
Posts: 1504

Thank you! You are right. I did a proc summary as suggested by a man of great wisdom above, and it shrank the size by quite a bit.

[In reply to s******r]

F****3
Posts: 1504

Sorry, could you explain? I didn't understand the -86399...
I'm a real beginner.

[In reply to j******o]

F****3
Posts: 1504

It is still at the exploratory stage. I want to see how high-frequency trading algorithms make money. It seems many banks use SAS...

[In reply to h***x]

s******8
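Posts: 102

Let me try too:
Your problem is that the data is too big but still has to be sorted, so work on the sorting method. If you know the date range, first split the data by day, then sort and check each day, and finally combine the results.
Suppose the earliest date is in macro variable day1 and the last date in macro variable day2:
/* %sysfunc turns the dates into integers so that %do can iterate over them */
%let date1=%sysfunc(mdy(1,1,1990));
%let date2=%sysfunc(mdy(12,31,2012));
%macro trybest(day1=&date1,day2=&date2);
  /* step 1: split the data into one data set per day */
  data %do i=&day1 %to &day2; dt_&i %end;;
    set yourdata;
    select(date);
      %do i=&day1 %to &day2;
        when(&i) output dt_&i;
      %end;
      otherwise put "ERROR: other date found " date;
    end;
    drop date;
  run;
  /* step 2: sort and check each non-empty day */
  %do i=&day1 %to &day2;
    %let dsid=%sysfunc(open(dt_&i,i));
    %let nobs=%sysfunc(attrn(&dsid,nobs));
    %let rc=%sysfunc(close(&dsid));
    %if &nobs > 0 %then %do;
      proc sort data=dt_&i;
        by id time;
      run;
      data rob_&i;
        set dt_&i;
        by id time;
        retain robot 0;
        /* call LAG on every row, outside the condition */
        lag9_id=lag9(id);
        lag9_time=lag9(time);
        if first.id then robot=0;
        if id=lag9_id and time-lag9_time le 60 then robot=1;
        if last.id and robot=1 then output;
        keep id;
      run;
    %end;
  %end;
  /* step 3: combine the daily results */
  data allrobot;
    set rob_:;
  run;
  proc sort data=allrobot nodupkey;
    by id;
  run;
%mend;
%trybest;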

j******o
Posts: 127

It assumes that some "clever" robots operate across midnight. If you know that cannot happen, delete that line.

[In reply to F****3]
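
A small worked example of where -86399 comes from. SAS stores a time value as the number of seconds since midnight, so when the earlier trade falls just before midnight and the later one just after, the raw difference of the two times is a large negative number. The sketch assumes, as the code in this thread does, that date is a day count in which consecutive days differ by 1; the values are made up.

```sas
data _null_;
  /* earlier trade: 23:59:59 on day 21; later trade: 00:00:00 on day 22 */
  lag9_date = 21;
  lag9_time = '23:59:59't;   /* stored as 86399 seconds after midnight */
  date = 22;
  time = '00:00:00't;        /* stored as 0 seconds after midnight */
  raw = time - lag9_time;    /* difference of the time parts alone */
  /* true elapsed seconds: add 86400 for each day between the two trades */
  elapsed = (date - lag9_date)*86400 + raw;
  put raw= elapsed=;
run;
```

The log shows raw=-86399 elapsed=1: the two trades are one second apart, but the time parts alone differ by -86399. That is the most negative value the raw difference can take when times are whole seconds.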

F****3
Posts: 1504

Thanks for the help! With everyone's help, I finally got this done!!!

[In reply to s******r]

F****3
Posts: 1504

Thanks for the help! With everyone's help, I finally got this done!!!

[In reply to l****u]

F****3
Posts: 1504

I only just saw this. Thank you so much!
I'll try your macro!
Actually the sorting is fine; I have already sorted the data by id, date, time.

[In reply to s******8]
