由买买提看人间百态

boards

本页内容为未名空间相应帖子的节选和存档,一周内的贴子最多显示50字,超过一周显示500字 访问原贴
Programming版 - Help! Read random number of lines in a input file.
相关主题
random number generator in C++急! Python 如何从文件读取数据(整数) ~~在线等
multiple random number generator 问一个小程序python得
请问释放容器内存的方法A C++ exception question
一种新型的推测编程ostream& operator << (ostream& s, int cnt) error
STL感觉实在太变态了delete files in windows
[合集] C++ STL question还是awk牛B
github有地方能看到一个Project行数么你们吵着要javascript的: How to delete the last line in the file in node.js?
How to write a VB Macro to convert text to hyperlink in Excelhelp understanding code (random number)
相关话题的讨论汇总
话题: random话题: lines话题: file话题: read话题: 1000
进入Programming版参与讨论
1 (共1页)
s*********x
发帖数: 1923
1
Hi, guys,
I need your expert opinions on this problem. I need to do a random sampling
of a input file. Suppose this
file is consisted of 1000 lines (1000 data). I want to randomly choose 60%,
70% and 85% of them and do
some analysis. Does anyone have any suggestions? Thanks!
h*******e
发帖数: 225
2
才1000行,怎么搞都行啊。生成1000*60%, *70%, *85%个[1,1000]的随机数,然后一行
行读,记个行号,把对应的挑出来。

sampling
,

【在 s*********x 的大作中提到】
: Hi, guys,
: I need your expert opinions on this problem. I need to do a random sampling
: of a input file. Suppose this
: file is consisted of 1000 lines (1000 data). I want to randomly choose 60%,
: 70% and 85% of them and do
: some analysis. Does anyone have any suggestions? Thanks!

s*********x
发帖数: 1923
3
Sorry I have to type in English. My question is that if you generate (60% of
1000) random numbers, some
of them will definitely be the same. Then you are not covering 60% of the
population, maybe only 55%.
What I need is 600 different random numbers between 1 to 1000.
Is my understanding correct?

【在 h*******e 的大作中提到】
: 才1000行,怎么搞都行啊。生成1000*60%, *70%, *85%个[1,1000]的随机数,然后一行
: 行读,记个行号,把对应的挑出来。
:
: sampling
: ,

g*****g
发帖数: 34805
4
You first read all the lines into memory, 1000 is nothing.
Then you generate random numbers, you can keep all generated
numbers in a hashtable, and make sure you get enough distinct.

of

【在 s*********x 的大作中提到】
: Sorry I have to type in English. My question is that if you generate (60% of
: 1000) random numbers, some
: of them will definitely be the same. Then you are not covering 60% of the
: population, maybe only 55%.
: What I need is 600 different random numbers between 1 to 1000.
: Is my understanding correct?

h*******e
发帖数: 225
5
that's a different problem. there's a well-known technique to generate say
600 numbers from 1 to 1000 uniformly. for simplicity, you can just use stl's
random_shuffle to get a random permutation of 1 to 1000, and use the first
600 numbers.

of

【在 s*********x 的大作中提到】
: Sorry I have to type in English. My question is that if you generate (60% of
: 1000) random numbers, some
: of them will definitely be the same. Then you are not covering 60% of the
: population, maybe only 55%.
: What I need is 600 different random numbers between 1 to 1000.
: Is my understanding correct?

s*********x
发帖数: 1923
6
Oh, man! That's exactly what I need! Thank you!

's
first

【在 h*******e 的大作中提到】
: that's a different problem. there's a well-known technique to generate say
: 600 numbers from 1 to 1000 uniformly. for simplicity, you can just use stl's
: random_shuffle to get a random permutation of 1 to 1000, and use the first
: 600 numbers.
:
: of

m******t
发帖数: 2416
7

Surely you meant _map_ the file into memory? Plainly
reading all the lines could be rather, uhm, hazardous, if
we don't know how long a line can be.

【在 g*****g 的大作中提到】
: You first read all the lines into memory, 1000 is nothing.
: Then you generate random numbers, you can keep all generated
: numbers in a hashtable, and make sure you get enough distinct.
:
: of

g*****g
发帖数: 34805
8
Something like
while((str=bufferedReader.readLine())!=null) {
arrList.add(str);
}
I don't think long line matters.

【在 m******t 的大作中提到】
:
: Surely you meant _map_ the file into memory? Plainly
: reading all the lines could be rather, uhm, hazardous, if
: we don't know how long a line can be.

h*******n
发帖数: 2052
9
把1000个数字随机打乱, 然后取前600个
r****t
发帖数: 10904
10
U use python for such problems?
lines = open("file","r").read()
mylines = random.sample(lines, int(len(lines)*0.6))
...(do whatever u want to mylines)

sampling
,

【在 s*********x 的大作中提到】
: Hi, guys,
: I need your expert opinions on this problem. I need to do a random sampling
: of a input file. Suppose this
: file is consisted of 1000 lines (1000 data). I want to randomly choose 60%,
: 70% and 85% of them and do
: some analysis. Does anyone have any suggestions? Thanks!

相关主题
[合集] C++ STL question急! Python 如何从文件读取数据(整数) ~~在线等
github有地方能看到一个Project行数么 问一个小程序python得
How to write a VB Macro to convert text to hyperlink in ExcelA C++ exception question
进入Programming版参与讨论
r****t
发帖数: 10904
11
如果不读完,怎么能知道一共有几行?即使是 map memory 了也需要读完才能开始
ramdom sample. 需不需要 worry file size 还不知道, one problem at a time.

【在 m******t 的大作中提到】
:
: Surely you meant _map_ the file into memory? Plainly
: reading all the lines could be rather, uhm, hazardous, if
: we don't know how long a line can be.

m******t
发帖数: 2416
12

Of course they matter. I have seen lines as long as
a couple of hundreds MB each.
We can _assume_ we won't run into any of them, and that's
probably usually a good assumption, but that's different
from they don't matter.

【在 g*****g 的大作中提到】
: Something like
: while((str=bufferedReader.readLine())!=null) {
: arrList.add(str);
: }
: I don't think long line matters.

m******t
发帖数: 2416
13

The idea is that memory mapping won't keep the whole thing in
the memory - the OS can feel free to swap out all the rest
of the pages exception for the one currently being accessed.

【在 r****t 的大作中提到】
: 如果不读完,怎么能知道一共有几行?即使是 map memory 了也需要读完才能开始
: ramdom sample. 需不需要 worry file size 还不知道, one problem at a time.

g*****g
发帖数: 34805
14
I think you know beforehand the likehood of running into
such trouble by looking at the file size.
At worst you run into out of memory exception, and you
correct it. Well, we all run into them in some case no matter
how carefully you design the code.

【在 m******t 的大作中提到】
:
: The idea is that memory mapping won't keep the whole thing in
: the memory - the OS can feel free to swap out all the rest
: of the pages exception for the one currently being accessed.

r****t
发帖数: 10904
15
I understand your point about using mmap. 但是就像前段时间这里讨论过的一样,
mmap 也不能彻底解决大文件问题,你没法 map 8G 文件,或者你要是可以的话,你没
法 map 8T 文件。
My question is: mmap might not help till u starts to deal with 很大的文件,
而且不一定 help performance, 因为一样需要先知道文件有多少行。

【在 m******t 的大作中提到】
:
: The idea is that memory mapping won't keep the whole thing in
: the memory - the OS can feel free to swap out all the rest
: of the pages exception for the one currently being accessed.

r****t
发帖数: 10904
16
另外一个办法 (不用知道一共有几行,magicfat 的启发) 是,
如果行数比较多,根据大树定理
你从头读到尾,对每一行以 60% 的机会选出应该也是可以的,只是最终你得到的行数
接近 60%, 可能不能完全等于某个指定的行数,这时候你甩掉多余的行,或者回头找
几行也行(这个最终处理比较麻烦,但是你可能更本不需要拿出精确的行数?
def select(line):
.... if randome()<= 0.6:
........ return line
....lese:
........ return False
mylines = ifilter(select, open("myfile"))
这样 mylines 是个 iterator, 这个code 不会把整个文件读入内存,所以可能 memory 上
面节省一些。缺点是不能得到精确的行数,要求 60% 基本上得到 60%, 但是多几行少几行都可能。

sampling
,

【在 s*********x 的大作中提到】
: Hi, guys,
: I need your expert opinions on this problem. I need to do a random sampling
: of a input file. Suppose this
: file is consisted of 1000 lines (1000 data). I want to randomly choose 60%,
: 70% and 85% of them and do
: some analysis. Does anyone have any suggestions? Thanks!

r****t
发帖数: 10904
17
stupid me. 这么写比较傻, 不喜欢搞 one-liner。但是下面足够了(python >=2.5):
mylines = ifilter(lambda x: x if random()<=.6 else False,
....................open("myfile"))

【在 r****t 的大作中提到】
: 另外一个办法 (不用知道一共有几行,magicfat 的启发) 是,
: 如果行数比较多,根据大树定理
: 你从头读到尾,对每一行以 60% 的机会选出应该也是可以的,只是最终你得到的行数
: 接近 60%, 可能不能完全等于某个指定的行数,这时候你甩掉多余的行,或者回头找
: 几行也行(这个最终处理比较麻烦,但是你可能更本不需要拿出精确的行数?
: def select(line):
: .... if randome()<= 0.6:
: ........ return line
: ....lese:
: ........ return False

c*****t
发帖数: 1879
18
你这个算法有可能拿到多于或者少于需要的 sample 。
其实比较简单的办法就是 random_shuffle line[N],N 是总共 data 的行数。
就算 N 是几百万,也是小意思。然后 sort 一下 line。然后就读一行,如果该
行不在 line[] 里,扔掉。在 line[] 里留下。这办法可以处理很大的文件。。。
上面是 sampling w/o replacement 。sampling w/ replacement 可以同理。
唯一麻烦的是,如果某 algorithm 对 order 比较 sensitive 。。。

【在 r****t 的大作中提到】
: stupid me. 这么写比较傻, 不喜欢搞 one-liner。但是下面足够了(python >=2.5):
: mylines = ifilter(lambda x: x if random()<=.6 else False,
: ....................open("myfile"))

r****t
发帖数: 10904
19
已经说清楚了行数可能不准这个问题,行数准的解法在更前面的帖子里。实际上 LZ
不一定需要行数准确,而是需要 sample probability 为常数。
你这个方法的问题是,任给一个文件怎么拿到这个 N? 直接 assume N is known 不太合理。

【在 c*****t 的大作中提到】
: 你这个算法有可能拿到多于或者少于需要的 sample 。
: 其实比较简单的办法就是 random_shuffle line[N],N 是总共 data 的行数。
: 就算 N 是几百万,也是小意思。然后 sort 一下 line。然后就读一行,如果该
: 行不在 line[] 里,扔掉。在 line[] 里留下。这办法可以处理很大的文件。。。
: 上面是 sampling w/o replacement 。sampling w/ replacement 可以同理。
: 唯一麻烦的是,如果某 algorithm 对 order 比较 sensitive 。。。

c*****t
发帖数: 1879
20
可以先读一遍,或者 wc -l,这都很快。
这东西拿 C 写都很容易。。。

太合理。

【在 r****t 的大作中提到】
: 已经说清楚了行数可能不准这个问题,行数准的解法在更前面的帖子里。实际上 LZ
: 不一定需要行数准确,而是需要 sample probability 为常数。
: 你这个方法的问题是,任给一个文件怎么拿到这个 N? 直接 assume N is known 不太合理。

相关主题
ostream& operator << (ostream& s, int cnt) error你们吵着要javascript的: How to delete the last line in the file in node.js?
delete files in windowshelp understanding code (random number)
还是awk牛B我的Visual C++ .net是不是有问题?
进入Programming版参与讨论
r****t
发帖数: 10904
21
所以需要先读一边才行么。
wc 其实很慢(我也没想到)这儿根本不能采用,上回我的 python code 干活都干完(sanitize some csv file)了, wc 算个行数花 3, 4 倍的时间还没完。

【在 c*****t 的大作中提到】
: 可以先读一遍,或者 wc -l,这都很快。
: 这东西拿 C 写都很容易。。。
:
: 太合理。

c*****t
发帖数: 1879
22
flex 的 test suite 里的 wc 超快(专门就是显示 flex 的速度的)。你用
那个就好了。可以达到 10x 以上的速度。主要是平常用的 wc 是手写的,没
block read,还 N 多 jump 。
你的办法在正规的 sampling 里不能用。多了少了都不行。

(sanitize some csv file)了, wc 算个行数花 3, 4 倍的时间还没完。

【在 r****t 的大作中提到】
: 所以需要先读一边才行么。
: wc 其实很慢(我也没想到)这儿根本不能采用,上回我的 python code 干活都干完(sanitize some csv file)了, wc 算个行数花 3, 4 倍的时间还没完。

r****t
发帖数: 10904
23
需要准确的 sample 个数,用我的第一个解就行了。
flex test suite 是什么东东?从来没听说过。。。

【在 c*****t 的大作中提到】
: flex 的 test suite 里的 wc 超快(专门就是显示 flex 的速度的)。你用
: 那个就好了。可以达到 10x 以上的速度。主要是平常用的 wc 是手写的,没
: block read,还 N 多 jump 。
: 你的办法在正规的 sampling 里不能用。多了少了都不行。
:
: (sanitize some csv file)了, wc 算个行数花 3, 4 倍的时间还没完。

h*******e
发帖数: 225
24
flex带的test samples

【在 r****t 的大作中提到】
: 需要准确的 sample 个数,用我的第一个解就行了。
: flex test suite 是什么东东?从来没听说过。。。

m******t
发帖数: 2416
25

Not when you don't keep all the lines in memory
without any apparent benefit.

【在 g*****g 的大作中提到】
: I think you know beforehand the likehood of running into
: such trouble by looking at the file size.
: At worst you run into out of memory exception, and you
: correct it. Well, we all run into them in some case no matter
: how carefully you design the code.

m******t
发帖数: 2416
26

That's some funny logic though. Why do we switch to 64-bit?
It can't handle 31498 trillion gazillion google bungle number
of bytes anyway. 8-)
It doesn't have to be real huge files. And yes, it does help
performance.
Trust me, I would know. I wrote a log parsing program (in java)
a little while ago that does exactly this - build an index of
fseek positions for every 100 lines. On a 200MB files,
the difference between readLine() and a memory mapped buffer
was beyond visible.

【在 r****t 的大作中提到】
: I understand your point about using mmap. 但是就像前段时间这里讨论过的一样,
: mmap 也不能彻底解决大文件问题,你没法 map 8G 文件,或者你要是可以的话,你没
: 法 map 8T 文件。
: My question is: mmap might not help till u starts to deal with 很大的文件,
: 而且不一定 help performance, 因为一样需要先知道文件有多少行。

s*********x
发帖数: 1923
27
will the random.sample get the real random number of lines or pseduo-random
number? I tried STL
random_shuffle and found it's not a real random shuffle... meaning if you
run it twice, the random
"shuffled" numbers are in the same order. then the sampling procedure is not
right.

【在 r****t 的大作中提到】
: U use python for such problems?
: lines = open("file","r").read()
: mylines = random.sample(lines, int(len(lines)*0.6))
: ...(do whatever u want to mylines)
:
: sampling
: ,

r****t
发帖数: 10904
28
random.sample in python uses random.random pseduo-RNG, 但是在 linux 上面
random.random 自动使用 /dev/urandom seed,所以 run it a second time, the
order should probably (not /dev/random yet) be different.
random.random() uses Mersenne Twister as the core generator,应该是很多地方
都在使用的。

random
not

【在 s*********x 的大作中提到】
: will the random.sample get the real random number of lines or pseduo-random
: number? I tried STL
: random_shuffle and found it's not a real random shuffle... meaning if you
: run it twice, the random
: "shuffled" numbers are in the same order. then the sampling procedure is not
: right.

t****t
发帖数: 6806
29
STL random_shuffle calls system std::rand() if you don't supply a rand
object yourself.
so the caller is actually responsible for guaranteeing the randomness.

random
not

【在 s*********x 的大作中提到】
: will the random.sample get the real random number of lines or pseduo-random
: number? I tried STL
: random_shuffle and found it's not a real random shuffle... meaning if you
: run it twice, the random
: "shuffled" numbers are in the same order. then the sampling procedure is not
: right.

r****t
发帖数: 10904
30

这个是 coconut 前一段时间提出来的标准,我只是转述了一下。
那好吧,>200 M help performance, still big files.

【在 m******t 的大作中提到】
:
: That's some funny logic though. Why do we switch to 64-bit?
: It can't handle 31498 trillion gazillion google bungle number
: of bytes anyway. 8-)
: It doesn't have to be real huge files. And yes, it does help
: performance.
: Trust me, I would know. I wrote a log parsing program (in java)
: a little while ago that does exactly this - build an index of
: fseek positions for every 100 lines. On a 200MB files,
: the difference between readLine() and a memory mapped buffer

相关主题
大家看看怎么把这几行matlab 代码译成cmultiple random number generator
请问用mmap分配的共享内存如何回收?请问释放容器内存的方法
random number generator in C++一种新型的推测编程
进入Programming版参与讨论
r****t
发帖数: 10904
31
so you mean one needs to take care of seeding in his code?

【在 t****t 的大作中提到】
: STL random_shuffle calls system std::rand() if you don't supply a rand
: object yourself.
: so the caller is actually responsible for guaranteeing the randomness.
:
: random
: not

t****t
发帖数: 6806
32
STL algorithm is ... algorithm only. randomness is not a part of the shuffle
algorithm.

【在 r****t 的大作中提到】
: so you mean one needs to take care of seeding in his code?
1 (共1页)
进入Programming版参与讨论
相关主题
help understanding code (random number)STL感觉实在太变态了
我的Visual C++ .net是不是有问题?[合集] C++ STL question
大家看看怎么把这几行matlab 代码译成cgithub有地方能看到一个Project行数么
请问用mmap分配的共享内存如何回收?How to write a VB Macro to convert text to hyperlink in Excel
random number generator in C++急! Python 如何从文件读取数据(整数) ~~在线等
multiple random number generator 问一个小程序python得
请问释放容器内存的方法A C++ exception question
一种新型的推测编程ostream& operator << (ostream& s, int cnt) error
相关话题的讨论汇总
话题: random话题: lines话题: file话题: read话题: 1000