B******5 Posts: 4676 | 1 This one is rather interesting. Results on my machine:
java allsum=1.8658666E16
real 13.31
user 14.86
sys 7.54
c++ allsum=1.86587e+16
real 35.99
user 35.89
sys 0.01
c++ allsum=1.86587e+16
real 14.50
user 14.44
sys 0.02
-O2 build
c++ allsum=1.86587e+16
real 14.08
user 14.03
sys 0.00
-O3 build
c++ allsum=1.86587e+16
real 14.50
user 14.44
sys 0.02 |
|
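t****t Posts: 6806 | 2 Here is a complete set. Never mind no optimization at all; even optimizing the wrong way leaves you far behind. This run is on a different machine, a Xeon
5670 @ 2.93 GHz.
EDIT: added a cache-aware version.
######## No optimization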
$ g++461 11.C
$ time a.out
c++ allsum=1.86587e+16
37.868u 0.010s 0:37.88 99.9% 0+0k 0+0io 0pf+0w
######## -O2, the most common setting
$ g++461 -O2 11.C
$ time a.out
c++ allsum=1.86587e+16
10.012u 0.018s 0:10.03 99.9% 0+0k 0+0io 0pf+0w
######## -O3, with SIMD allowed
$ g++461 -O3 -funsafe-math-optimizations 11.C
$ time a.out
c++ allsum=1.86587e+16
8.649u 0.010s 0:08.66 99.8% 0+0k 0+0io 0pf+0w
######## -O3, SIMD allowed, plus unroll lo... Read full post |
|
y***d Posts: 2330 | 3 Don't forget -ffast-math...
java allsum=1.8658666E16
real 0m9.719s
g++ -O3 test.cpp
c++ allsum=1.86587e+16
real 0m9.116s
g++ -O3 -ffast-math test.cpp
c++ allsum=1.86587e+16
real 0m6.029s
g++ -O3 -march=native -mtune=native -ffast-math test.cpp
c++ allsum=1.86587e+16
real 0m4.888s
g++ -O3 -march=native -mtune=native -ffast-math test.cpp -funsafe-math-optimizations -funroll-loops -fprefetch-loop-arrays
c++ allsum=1.86587e+16
real 0m4.235s |
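A note on why -ffast-math (or -funsafe-math-optimizations) matters for a summation loop: floating-point addition is not associative, so GCC will not reorder or vectorize a floating-point reduction unless one of these flags permits it. Below is a minimal sketch of the effect. The function names `sum_in_order` and `sum_two_lanes` and the input values are made up for illustration and are not taken from the benchmark code; the two-lane version only imitates what a 2-wide SIMD reduction does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Strict left-to-right summation: the order the source code spells out,
// and the order the compiler must keep without -ffast-math.
double sum_in_order(const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        s += v[i];
    }
    return s;
}

// Two independent partial sums (even and odd indices) combined at the end.
// This imitates a 2-wide SIMD reduction (illustrative only).
double sum_two_lanes(const std::vector<double>& v) {
    assert(v.size() % 2 == 0);  // keep the sketch simple: even length only
    double lane0 = 0.0;
    double lane1 = 0.0;
    for (std::size_t i = 0; i < v.size(); i += 2) {
        lane0 += v[i];
        lane1 += v[i + 1];
    }
    return lane0 + lane1;
}
```

With the input {1e16, 1.0, 1.0, 1.0}, the in-order sum loses every 1.0 to rounding while the two-lane sum keeps two of them, so the results differ by 2. The benchmark prints allsum with only six significant digits, so a difference of that size would not show up in the output above.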
|
j*****j Posts: 201 | 4 My solution; all test cases passed.
vector<vector<int> > combinationSumrec(vector<int> candidates, int start,
int end,int target){
    vector<vector<int> > toreturn;
    // target reached: return one empty combination
    if (target == 0){
        vector<int> temp;
        toreturn.push_back(temp);
        return toreturn;
    }
    // overshot the target: no combination
    if (target < 0)
        return toreturn;
    if (start > end)
        return toreturn;
    if (start == end){
        if (target%(candidates[start... Read full post |
|
L***n Posts: 6727 | 5 Tried GotoBLAS2. My machine:
$ cat /proc/cpuinfo | grep model\ name | head -1
model name : Intel(R) Core(TM) i7 CPU Q 740 @ 1.73GHz
$ time -p java jmatrix
java allsum=1.8658666E16
real 11.69
user 12.68
sys 6.65
C++ with optimization flags
$ g++ -O3 -funsafe-math-optimizations -funroll-loops -fprefetch-loop-arrays -march=native cmatrix.cpp -o cmatrix
$ time -p ./cmatrix
c++ allsum=1.86587e+16
real 8.04
user 8.00
sys 0.03
Trivially using GotoBLAS inside the outermost loop (that is, replacing the inner double loop with a BLAS Level 2 call)
g++ -O2 -funroll-loops -fprefetch-loop-ar... Read full post |
|
t*****z Posts: 812 | 6 Such a large performance gap for m[2000][2000] x v[2000]; how embarrassing for C++...
Test environment
AMD Athlon(tm) 64 FX-53 Processor
Memory: 8GB
Test results
[~]$javac jmatrix.java
[~]$/usr/bin/time -p java jmatrix
java allsum=1.8658666E16
real 27.90
user 26.82
sys 0.17
[~]$g++ cmatrix.cpp
[~]$/usr/bin/time -p ./a.out
c++ allsum=1.86587e+16
real 70.89
user 69.99
sys 0.32
Test code at
http://ping80life.blogspot.com/2012/01/java-c.html |
|
s*****u Posts: 164 | 8 $ javac jmatrix.java
$ time -p java jmatrix
java allsum=1.8658666E16
real 32.97
user 32.46
sys 0.23
$ g++ -O3 -arch x86_64 cmatrix.cpp
$ time -p ./a.out
c++ allsum=1.86587e+16
real 16.48
user 16.17
sys 0.10 |
|
i*****o Posts: 1714 | 10 Heh, I tried it in JavaScript and it took just under three minutes, while the unoptimized C++ took 48 seconds.
$ time node main.js
js allsum: 18658666000000000
real 2m54.334s
user 2m54.260s
sys 0m0.998s
$ time ./a.out
c++ allsum=1.86587e+16
real 0m48.793s
user 0m48.720s
sys 0m0.061s
Java and -O3 C++ took 17 and 15 seconds respectively. |
|
t*****z Posts: 812 | 11 Such a large performance gap for m[2000][2000] x v[2000]; how embarrassing for C++ and Fortran...
Test environment
AMD Athlon(tm) 64 FX-53 Processor
Memory: 8GB
Test results
[~]$javac jmatrix.java
[~]$/usr/bin/time -p java jmatrix
java allsum=1.8658666E16
real 27.90
user 26.82
sys 0.17
[~]$g++ cmatrix.cpp
[~]$/usr/bin/time -p ./a.out
c++ allsum=1.86587e+16
real 70.89
user 69.99
sys 0.32
Test code at
http://ping80life.blogspot.com/2012/01/java-c.html |
|
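h**6 Posts: 4160 | 13 For the first problem, flip it around: at the two ends of the array, find the two subarrays that are shortest in total while their combined sum reaches allsum-maxsum. What lies between them is the
array you are looking for. |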
|
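g*****i Posts: 2162 | 14 The equivalent-transformation idea is nice; 小尾羊 brought it up before as well.
The two basic cases are both easy: finding the shortest subarray starting from the left whose sum exceeds allsum-maxsum, and the same starting from the right only.
The question is what to do when both the left and the right end must be included, which looks like the circular-array case. How do you find the shortest one then? That problem is
almost identical to the original one, so it does not feel any easier. |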
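One way to handle the "both ends" case can be sketched with two pointers, but only under assumptions the thread does not confirm: that all elements are non-negative, and that the goal is the shortest prefix plus suffix whose combined sum reaches `need` (standing for allsum-maxsum). The function name `shortest_ends` is hypothetical. With non-negative elements, the shortest usable suffix can only shrink as the prefix grows, so a single sweep is enough; with negative elements this monotonicity fails and the sketch does not apply.

```cpp
#include <cassert>
#include <vector>

// Returns the smallest p + s such that the first p elements plus the last s
// elements (disjoint) sum to at least `need`, or -1 if even the whole array
// falls short. ASSUMPTION: every element of `a` is non-negative.
int shortest_ends(const std::vector<long long>& a, long long need) {
    const int n = static_cast<int>(a.size());
    long long suf = 0;                 // sum of the current suffix
    for (int i = 0; i < n; ++i) {
        assert(a[i] >= 0);
        suf += a[i];
    }
    int s = n;                         // current suffix length (starts as whole array)
    long long pre = 0;                 // sum of the current prefix
    int best = -1;
    for (int p = 0; p <= n; ++p) {
        if (p > 0) pre += a[p - 1];    // extend the prefix by one element
        // Keep prefix and suffix disjoint.
        while (p + s > n) { suf -= a[n - s]; --s; }
        // Shrink the suffix from its left side while the sum still reaches `need`.
        while (s > 0 && pre + suf - a[n - s] >= need) { suf -= a[n - s]; --s; }
        if (pre + suf >= need && (best < 0 || p + s < best)) best = p + s;
    }
    return best;
}
```

Under the same assumptions, the middle subarray left over has length `a.size() - shortest_ends(a, need)`. The sweep is linear because the suffix pointer only ever moves in one direction.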
|
t*****z Posts: 812 | 15 Java is still a tiny bit faster
[~]$ /usr/bin/time -p ./a.out
c++ allsum=1.86587e+16
real 29.73
user 29.17
sys 0.16 |
|
t*****z Posts: 812 | 16 Java is still a tiny bit faster
[~]$g++ -O3 cmatrix.cpp
[~]$/usr/bin/time -p ./a.out
c++ allsum=1.86587e+16
real 29.73
user 29.17
sys 0.16 |
|
m*******l Posts: 12782 | 17 java allsum=1.8658666E16
real 783.07
user 781.82
sys 0.20 |
|
m*******l Posts: 12782 | 18 c++ allsum=1.86587e+16
real 37.14
user 37.08
sys 0.02 |
|
y***d Posts: 2330 | 19 Yeah, basically the -funsafe-math-optimizations and cache optimization mentioned earlier, plus
native, are enough
g++ -O3 test.cpp -funsafe-math-optimizations -funroll-loops -fprefetch-loop-arrays -march=native -mtune=native
c++ allsum=1.86587e+16
real 0m4.253s
real 0m4.300s
real 0m4.296s |
|
g*****y Posts: 7271 | 20 I enabled OpenMP in VC2008, added the following line
before the loop, and recompiled:
#pragma omp parallel for default(none) \
private(k,i,j,sum) shared(m,x,y,allsum)
Now it takes 2 seconds (previously 10 seconds). But the result
is quite different because of the multithreading.
In the JVM, how do you utilize multiple cores? |
|
g*****y Posts: 7271 | 21 I modified the OpenMP directive a little, and now
the result is correct and it takes about 2 seconds
on an i7 860 @ 2.8GHz (4 cores).
Just put the following line before the k loop:
#pragma omp parallel for reduction(+:allsum) \
private(k,i,j,sum) shared(m,x,y)
There seems to be a JOMP project for Java. Has anybody tested
its performance? |
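The difference between the two directives is that the first one lists allsum as shared, so every thread updates it without synchronization, while the reduction clause gives each thread a private copy and adds the copies together at the end. A minimal sketch of the second form follows. The benchmark source is not quoted in this thread, so the kernel below is a hypothetical stand-in: the function name `repeated_matvec_sum`, the fill values, and the flat matrix layout are all assumptions, and only the names m, x, sum and allsum follow the directive above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the benchmark kernel: multiply an n x n matrix
// by a vector `reps` times and accumulate every row result into allsum.
// Build with OpenMP enabled (e.g. g++ -fopenmp); without it the pragma is
// ignored and the loop simply runs serially.
double repeated_matvec_sum(int n, int reps) {
    assert(n > 0 && reps > 0);
    const std::size_t un = static_cast<std::size_t>(n);
    std::vector<double> m(un * un);    // assumed fill: m[i][j] = i + j
    std::vector<double> x(un);         // assumed fill: x[j] = j
    for (std::size_t i = 0; i < un; ++i) {
        x[i] = static_cast<double>(i);
        for (std::size_t j = 0; j < un; ++j) {
            m[i * un + j] = static_cast<double>(i + j);
        }
    }

    double allsum = 0.0;
    // Each thread accumulates into its own copy of allsum; the copies are
    // summed when the loop ends. m and x are only read, so sharing is safe.
    #pragma omp parallel for reduction(+:allsum)
    for (int k = 0; k < reps; ++k) {
        for (std::size_t i = 0; i < un; ++i) {
            double sum = 0.0;          // declared inside the loop, so private
            for (std::size_t j = 0; j < un; ++j) {
                sum += m[i * un + j] * x[j];
            }
            allsum += sum;
        }
    }
    return allsum;
}
```

Declaring the loop variables and sum inside the loop makes them private automatically, which removes the need for the explicit private(...) list. Note that the reduction adds the per-thread partial sums in an unspecified order, so for general floating-point data the last digits may still vary between runs.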
|
c****r Posts: 576 | 22 In MATLAB the matrix computation takes only a few tenths of a second, with no loops.
tic;
s = m * repmat((0:1999)',1,2000);
allsum = sum(s(:)) + sum(0:1999);
toc
Elapsed time is 0.478149 seconds |
|
t*****z Posts: 812 | 23 Java is still a tiny bit faster, but they are about the same now
[~]$ /usr/bin/time -p ./a.out
c++ allsum=1.86587e+16
real 29.73
user 29.17
sys 0.16 |
|