f****r posts: 20 | 1 A question for everyone:
When processing very large data (terabytes and petabytes), or when the data cannot be loaded directly into memory, what are the common approaches (besides MapReduce)? After some Googling, I found that external storage or a database can be used for computation under memory constraints. How are these methods actually implemented?
Please recommend reference books or papers.
Thanks for any pointers. |
N**D posts: 10322 | 2 You need to define the problem first.
【quoted from f****r】 : A question for everyone: : When processing very large data (terabytes and petabytes), or when the data cannot be loaded directly into memory, what are the common approaches (besides MapReduce)? After some Googling, I found that external storage or a database can be used for computation under memory constraints. How are these methods actually implemented? : Please recommend reference books or papers. : Thanks for any pointers.
|
f****r posts: 20 | 3 I don't have a very specific problem. I just want to know, generally and systematically, how such problems are solved.
Thanks
【quoted from N**D】 : You need to define the problem first.
|
N**D posts: 10322 | 4 Work for goog, you will know.
It is mostly trade secrets.
【quoted from f****r】 : I don't have a very specific problem. I just want to know, generally and systematically, how such problems are solved. : Thanks
|
D*******a posts: 3688 | 5 Without a concrete problem, the only general answer is divide and conquer.
【quoted from f****r】 : I don't have a very specific problem. I just want to know, generally and systematically, how such problems are solved. : Thanks
|
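As a minimal sketch of that divide-and-conquer idea on data that does not fit in memory: process fixed-size chunks independently and merge the partial results. The function name, chunk size, and the choice of value counting as the per-chunk work are illustrative, not from any particular package.

```python
from collections import Counter

def count_in_chunks(lines, chunk_size=100_000):
    """Divide: read the input in memory-sized chunks.
    Conquer: count values within each chunk, then merge
    the partial counts into a running total."""
    total = Counter()
    chunk = []
    for line in lines:
        chunk.append(line.strip())
        if len(chunk) == chunk_size:
            total.update(Counter(chunk))  # merge partial result
            chunk = []
    if chunk:  # leftover partial chunk
        total.update(Counter(chunk))
    return total
```

In real use, `lines` would be an open file handle, so only one chunk is ever resident in memory.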
f****r posts: 20 | 6 Does Google mainly use MapReduce/GFS?
【quoted from N**D】 : Work for goog, you will know. : It is mostly trade secrets.
|
f****r posts: 20 | 7 So it comes down to analyzing each problem on its own?
Then what strategies are commonly used in general?
【quoted from D*******a】 : Without a concrete problem, the only general answer is divide and conquer.
|
D*******a posts: 3688 | 8 This kind of question has no single answer.
The strategy for multiplying large matrices and the one for sorting a large array are basically two different things.
【quoted from f****r】 : So it comes down to analyzing each problem on its own? : Then what strategies are commonly used in general?
|
N**D posts: 10322 | 9 How would I know?
【quoted from f****r】 : Does Google mainly use MapReduce/GFS?
|
f****r posts: 20 | 10 Thanks,
How about sorting extremely large data and getting the frequency count of each unique value? Furthermore, the joint counts of each unique combination of several variables (columns).
Do you have any recommendations for books on those kinds of problems you mentioned?
Thanks again.
【quoted from D*******a】 : This kind of question has no single answer. : The strategy for multiplying large matrices and the one for sorting a large array are basically two different things.
|
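For the joint-count question above, one hedged sketch: when the rows don't fit in memory but the number of *distinct* column combinations does, a single streaming pass with a counter is enough, and no sort is needed. The function name and the use of positional column indices are made up for illustration.

```python
from collections import Counter

def joint_counts(rows, cols):
    """One pass over the rows, counting each distinct combination
    of the selected columns. Memory use is bounded by the number
    of distinct combinations, not by the number of rows."""
    counts = Counter()
    for row in rows:
        key = tuple(row[c] for c in cols)  # the joint value
        counts[key] += 1
    return counts
```

With a single column in `cols` this is exactly the per-value frequency count; in real use `rows` would be a `csv.reader` streaming from the file.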
D*******a posts: 3688 | 11 Just look up some Hadoop examples online.
【quoted from f****r】 : Thanks, : How about sorting extremely large data and getting the frequency count of each unique value? Furthermore, the joint counts of each unique combination of several variables (columns). : Do you have any recommendations for books on those kinds of problems you mentioned? : Thanks again.
|
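A minimal Hadoop-streaming-style sketch of the value-count job being discussed: pure Python that only mimics the map / sort-by-key / reduce contract, not the actual Hadoop API. In Hadoop Streaming these two functions would be separate scripts reading stdin and writing tab-separated lines; the framework supplies the sort between them.

```python
def mapper(lines):
    """Map step: emit (value, 1) for each input record."""
    for line in lines:
        yield line.strip(), 1

def reducer(pairs):
    """Reduce step: input arrives sorted by key (as Hadoop
    guarantees); sum the counts over each run of equal keys."""
    current, total = None, 0
    for key, n in pairs:
        if key != current:
            if current is not None:
                yield current, total
            current, total = key, 0
        total += n
    if current is not None:
        yield current, total
```

Locally, `reducer(sorted(mapper(lines)))` reproduces the whole pipeline on small data.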
f****r posts: 20 | 12 What if Hadoop is not used?
I mean, with just one ordinary computer, how do you handle this kind of problem?
【quoted from D*******a】 : Just look up some Hadoop examples online.
|
w***g posts: 5958 | 13 Then you need a stream algorithm. Though that computer of yours must be quite impressive, if it can store petabytes.
【quoted from f****r】 : What if Hadoop is not used? : I mean, with just one ordinary computer, how do you handle this kind of problem?
|
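One classic stream algorithm for frequency counting under a fixed memory budget is the Misra-Gries heavy-hitters sketch: one pass, at most k-1 counters, and any item occurring more than n/k times in a stream of length n is guaranteed to survive (with its count underestimated by at most n/k). A minimal version, assuming hashable items and a caller-chosen budget k:

```python
def misra_gries(stream, k):
    """Misra-Gries heavy hitters: keep at most k-1 counters.
    Seen item with a counter -> increment it; free slot -> open a
    counter; otherwise decrement every counter, dropping zeros."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

This answers "which values are frequent" with bounded memory; exact counts for all values still need external sorting or a second pass.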
f****r posts: 20 | 14 Thanks.
Actually, the data set I use is not as big as a petabyte, but around 18*10G.
I would like to evaluate some data mining techniques (like CART, boosted trees, etc.) for classification. There are many open source packages, but most of them are really not good at dealing with large, high-dimensional data sets.
So, I am wondering what kind of technologies commercial data mining packages use to deal with scalability and dimensionality, since most of them claim they can dea
【quoted from w***g】 : Then you need a stream algorithm. Though that computer of yours must be quite impressive, if it can store petabytes.
|
f****r posts: 20 | 15 Any other suggestions and comments?
Thanks
【quoted from f****r】 : Thanks. : Actually, the data set I use is not as big as a petabyte, but around 18*10G. : I would like to evaluate some data mining techniques (like CART, boosted trees, etc.) for classification. : There are many open source packages, but most of them are really not good at dealing with large, high-dimensional data sets. : So, I am wondering what kind of technologies commercial data mining packages use to deal with scalability and dimensionality, since most of them claim they can dea
|
g*****g posts: 34805 | 16 With only one ordinary computer, external sorting is about all you can do.
【quoted from f****r】 : What if Hadoop is not used? : I mean, with just one ordinary computer, how do you handle this kind of problem?
|
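The external sort suggested above can be sketched as: sort memory-sized chunks, spill each sorted run to a temporary file, then k-way merge the runs. The chunk size and temp-file handling here are illustrative; a production version would also cap the number of runs merged at once.

```python
import heapq
import tempfile

def external_sort(lines, chunk_size=100_000):
    """External merge sort: sort chunks that fit in memory, spill
    each to disk as a sorted run, then merge the runs lazily with
    heapq.merge, yielding values in globally sorted order."""
    runs, chunk = [], []

    def spill():
        f = tempfile.TemporaryFile("w+")  # one sorted run on disk
        f.writelines(v + "\n" for v in sorted(chunk))
        f.seek(0)
        runs.append(f)

    for line in lines:
        chunk.append(line.rstrip("\n"))
        if len(chunk) == chunk_size:
            spill()
            chunk = []
    if chunk:
        spill()
    readers = [(line.rstrip("\n") for line in f) for f in runs]
    yield from heapq.merge(*readers)  # k-way merge of sorted runs
```

Once the data is in sorted order, the frequency counts from earlier in the thread fall out of a single pass over runs of equal values.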
g*****g posts: 34805 | 17 18*10G? Dump it into a DB, that's all.
【quoted from f****r】 : Thanks. : Actually, the data set I use is not as big as a petabyte, but around 18*10G. : I would like to evaluate some data mining techniques (like CART, boosted trees, etc.) for classification. : There are many open source packages, but most of them are really not good at dealing with large, high-dimensional data sets. : So, I am wondering what kind of technologies commercial data mining packages use to deal with scalability and dimensionality, since most of them claim they can dea
|
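The dump-it-into-a-DB approach can be sketched with SQLite: load the rows once, then let the database's own external algorithms handle the GROUP BY. The in-memory database, table name, and two-column schema here are purely for illustration; a real run would use a disk-backed database file and the actual columns.

```python
import sqlite3

def db_joint_counts(rows):
    """Load rows into SQLite and compute joint counts with GROUP BY.
    The DB handles spilling to disk, so the result set only needs to
    hold one row per distinct (a, b) combination."""
    con = sqlite3.connect(":memory:")  # use a file path for real data
    con.execute("CREATE TABLE data (a TEXT, b TEXT)")
    con.executemany("INSERT INTO data VALUES (?, ?)", rows)
    return con.execute(
        "SELECT a, b, COUNT(*) FROM data GROUP BY a, b ORDER BY a, b"
    ).fetchall()
```

The same SELECT, issued per tree node, is one plausible way the statistics for building tree models could be pulled from the DB, at the cost of repeated scans unless the grouped columns are indexed.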
f****r posts: 20 | 18 Thanks for your two replies.
If I dump it into a DB, do I then use SQL to query the DB for whatever statistics (counts) are needed to build tree models?
Will the query time be an issue for large computations? Could you elaborate?
For the external sorting you mentioned and the stream algorithms another poster mentioned, what are some good reference books and worked examples?
Thanks again.
【quoted from g*****g】 : 18*10G? Dump it into a DB, that's all.
|
N**D posts: 10322 | 19 Just ask your boss.
【quoted from f****r】 : Thanks for your two replies. : If I dump it into a DB, do I then use SQL to query the DB for whatever statistics (counts) are needed to build tree models? : Will the query time be an issue for large computations? Could you elaborate? : For the external sorting you mentioned and the stream algorithms another poster mentioned, what are some good reference books and worked examples? : Thanks again.
|
k**x posts: 74 | 20 SPRINT for CART
【quoted from f****r】 : Thanks. : Actually, the data set I use is not as big as a petabyte, but around 18*10G. : I would like to evaluate some data mining techniques (like CART, boosted trees, etc.) for classification. : There are many open source packages, but most of them are really not good at dealing with large, high-dimensional data sets. : So, I am wondering what kind of technologies commercial data mining packages use to deal with scalability and dimensionality, since most of them claim they can dea
|