f****r posts: 20 | 1 A question for everyone:
When processing very large data (terabytes and petabytes), or when the data cannot be loaded directly into memory, what are the common approaches (besides MapReduce)? After some Googling, I found that external storage or a database can be used for computation under memory constraints. How are these methods actually implemented?
Please recommend reference books or papers.
Thanks for any pointers. |
N**D posts: 10322 | 2 You need to define the problem first.
【quoted from f****r】 : A question for everyone: : When processing very large data (terabytes and petabytes), or when the data cannot be loaded directly into memory, what are the common approaches (besides MapReduce)? After some Googling, I found that external storage or a database can be used for computation under memory constraints. How are these methods actually implemented? : Please recommend reference books or papers. : Thanks for any pointers.
|
f****r posts: 20 | 3 I don't have a very specific problem. I just want to know, generally and systematically, how such problems are solved.
Thanks
【quoted from N**D】 : You need to define the problem first.
|
N**D posts: 10322 | 4 Work for goog, you will know.
It is mostly trade secrets.
【quoted from f****r】 : I don't have a very specific problem. I just want to know, generally and systematically, how such problems are solved. : Thanks
|
D*******a posts: 3688 | 5 Without a concrete problem, the only general answer is divide and conquer.
【quoted from f****r】 : I don't have a very specific problem. I just want to know, generally and systematically, how such problems are solved. : Thanks
|
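As a minimal sketch of that divide-and-conquer idea on data that does not fit in memory: process fixed-size chunks independently and merge the partial results. The function name, chunk size, and the choice of value counting as the per-chunk work are illustrative, not from any particular package.

```python
from collections import Counter

def count_in_chunks(lines, chunk_size=100_000):
    """Divide: read the input in memory-sized chunks.
    Conquer: count values within each chunk, then merge
    the partial counts into a running total."""
    total = Counter()
    chunk = []
    for line in lines:
        chunk.append(line.strip())
        if len(chunk) == chunk_size:
            total.update(Counter(chunk))  # merge partial result
            chunk = []
    if chunk:  # leftover partial chunk
        total.update(Counter(chunk))
    return total
```

In real use, `lines` would be an open file handle, so only one chunk is ever resident in memory.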
f****r posts: 20 | 6 Does Google mainly use MapReduce/GFS?
【quoted from N**D】 : Work for goog, you will know. : It is mostly trade secrets.
|
f****r posts: 20 | 7 So it comes down to analyzing each problem on its own?
Then what strategies are commonly used in general?
【quoted from D*******a】 : Without a concrete problem, the only general answer is divide and conquer.
|
D*******a posts: 3688 | 8 This kind of question has no single answer.
The strategy for multiplying large matrices and the one for sorting a large array are basically two different things.
【quoted from f****r】 : So it comes down to analyzing each problem on its own? : Then what strategies are commonly used in general?
|
N**D posts: 10322 | 9 How would I know?
【quoted from f****r】 : Does Google mainly use MapReduce/GFS?
|
f****r posts: 20 | 10 Thanks,
How about sorting extremely large data and getting the frequency count of each unique value? Furthermore, the joint counts of each unique combination of several variables (columns).
Do you have any recommendations for books on those kinds of problems you mentioned?
Thanks again.
【quoted from D*******a】 : This kind of question has no single answer. : The strategy for multiplying large matrices and the one for sorting a large array are basically two different things.
|
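For the joint-count question above, one hedged sketch: when the rows don't fit in memory but the number of *distinct* column combinations does, a single streaming pass with a counter is enough, and no sort is needed. The function name and the use of positional column indices are made up for illustration.

```python
from collections import Counter

def joint_counts(rows, cols):
    """One pass over the rows, counting each distinct combination
    of the selected columns. Memory use is bounded by the number
    of distinct combinations, not by the number of rows."""
    counts = Counter()
    for row in rows:
        key = tuple(row[c] for c in cols)  # the joint value
        counts[key] += 1
    return counts
```

With a single column in `cols` this is exactly the per-value frequency count; in real use `rows` would be a `csv.reader` streaming from the file.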
D*******a posts: 3688 | 11 Just look up some Hadoop examples online.
【quoted from f****r】 : Thanks, : How about sorting extremely large data and getting the frequency count of each unique value? Furthermore, the joint counts of each unique combination of several variables (columns). : Do you have any recommendations for books on those kinds of problems you mentioned? : Thanks again.
|
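A minimal Hadoop-streaming-style sketch of the value-count job being discussed: pure Python that only mimics the map / sort-by-key / reduce contract, not the actual Hadoop API. In Hadoop Streaming these two functions would be separate scripts reading stdin and writing tab-separated lines; the framework supplies the sort between them.

```python
def mapper(lines):
    """Map step: emit (value, 1) for each input record."""
    for line in lines:
        yield line.strip(), 1

def reducer(pairs):
    """Reduce step: input arrives sorted by key (as Hadoop
    guarantees); sum the counts over each run of equal keys."""
    current, total = None, 0
    for key, n in pairs:
        if key != current:
            if current is not None:
                yield current, total
            current, total = key, 0
        total += n
    if current is not None:
        yield current, total
```

Locally, `reducer(sorted(mapper(lines)))` reproduces the whole pipeline on small data.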
f****r posts: 20 | 12 What if Hadoop is not used?
I mean, with just one ordinary computer, how do you handle this kind of problem?
【quoted from D*******a】 : Just look up some Hadoop examples online.
|
w***g posts: 5958 | 13 Then you need a stream algorithm. Though that computer of yours must be quite impressive, if it can store petabytes.
【quoted from f****r】 : What if Hadoop is not used? : I mean, with just one ordinary computer, how do you handle this kind of problem?
|
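One classic stream algorithm for frequency counting under a fixed memory budget is the Misra-Gries heavy-hitters sketch: one pass, at most k-1 counters, and any item occurring more than n/k times in a stream of length n is guaranteed to survive (with its count underestimated by at most n/k). A minimal version, assuming hashable items and a caller-chosen budget k:

```python
def misra_gries(stream, k):
    """Misra-Gries heavy hitters: keep at most k-1 counters.
    Seen item with a counter -> increment it; free slot -> open a
    counter; otherwise decrement every counter, dropping zeros."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

This answers "which values are frequent" with bounded memory; exact counts for all values still need external sorting or a second pass.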
f****r posts: 20 | 14 Thanks.
Actually, the data set I use is not as big as a petabyte, but around 18*10G.
I would like to evaluate some data mining techniques (like CART, boosted trees, etc.) for classification. There are many open source packages, but most of them are really not good at dealing with large, high-dimensional data sets.
So, I am wondering what kind of technologies commercial data mining packages use to deal with scalability and dimensionality, since most of them claim they can dea
【quoted from w***g】 : Then you need a stream algorithm. Though that computer of yours must be quite impressive, if it can store petabytes.
|
f****r posts: 20 | 15 Any other suggestions and comments?
Thanks
【quoted from f****r】 : Thanks. : Actually, the data set I use is not as big as a petabyte, but around 18*10G. : I would like to evaluate some data mining techniques (like CART, boosted trees, etc.) for classification. : There are many open source packages, but most of them are really not good at dealing with large, high-dimensional data sets. : So, I am wondering what kind of technologies commercial data mining packages use to deal with scalability and dimensionality, since most of them claim they can dea
|
g*****g posts: 34805 | 16 With only one ordinary computer, external sorting is about all you can do.
【quoted from f****r】 : What if Hadoop is not used? : I mean, with just one ordinary computer, how do you handle this kind of problem?
|
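The external sort suggested above can be sketched as: sort memory-sized chunks, spill each sorted run to a temporary file, then k-way merge the runs. The chunk size and temp-file handling here are illustrative; a production version would also cap the number of runs merged at once.

```python
import heapq
import tempfile

def external_sort(lines, chunk_size=100_000):
    """External merge sort: sort chunks that fit in memory, spill
    each to disk as a sorted run, then merge the runs lazily with
    heapq.merge, yielding values in globally sorted order."""
    runs, chunk = [], []

    def spill():
        f = tempfile.TemporaryFile("w+")  # one sorted run on disk
        f.writelines(v + "\n" for v in sorted(chunk))
        f.seek(0)
        runs.append(f)

    for line in lines:
        chunk.append(line.rstrip("\n"))
        if len(chunk) == chunk_size:
            spill()
            chunk = []
    if chunk:
        spill()
    readers = [(line.rstrip("\n") for line in f) for f in runs]
    yield from heapq.merge(*readers)  # k-way merge of sorted runs
```

Once the data is in sorted order, the frequency counts from earlier in the thread fall out of a single pass over runs of equal values.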
g*****g posts: 34805 | 17 18*10G? Dump it into a DB, that's all.
【quoted from f****r】 : Thanks. : Actually, the data set I use is not as big as a petabyte, but around 18*10G. : I would like to evaluate some data mining techniques (like CART, boosted trees, etc.) for classification. : There are many open source packages, but most of them are really not good at dealing with large, high-dimensional data sets. : So, I am wondering what kind of technologies commercial data mining packages use to deal with scalability and dimensionality, since most of them claim they can dea
|
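The dump-it-into-a-DB approach can be sketched with SQLite: load the rows once, then let the database's own external algorithms handle the GROUP BY. The in-memory database, table name, and two-column schema here are purely for illustration; a real run would use a disk-backed database file and the actual columns.

```python
import sqlite3

def db_joint_counts(rows):
    """Load rows into SQLite and compute joint counts with GROUP BY.
    The DB handles spilling to disk, so the result set only needs to
    hold one row per distinct (a, b) combination."""
    con = sqlite3.connect(":memory:")  # use a file path for real data
    con.execute("CREATE TABLE data (a TEXT, b TEXT)")
    con.executemany("INSERT INTO data VALUES (?, ?)", rows)
    return con.execute(
        "SELECT a, b, COUNT(*) FROM data GROUP BY a, b ORDER BY a, b"
    ).fetchall()
```

The same SELECT, issued per tree node, is one plausible way the statistics for building tree models could be pulled from the DB, at the cost of repeated scans unless the grouped columns are indexed.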
f****r posts: 20 | 18 Thanks for your two replies.
If I dump it into a DB, do I then use SQL to query the DB for whatever statistics (counts) are needed to build tree models?
Will the query time be an issue for large computations? Could you elaborate?
For the external sorting you mentioned and the stream algorithms another poster mentioned, what are some good reference books and worked examples?
Thanks again.
【quoted from g*****g】 : 18*10G? Dump it into a DB, that's all.
|
N**D posts: 10322 | 19 Just ask your boss.
【quoted from f****r】 : Thanks for your two replies. : If I dump it into a DB, do I then use SQL to query the DB for whatever statistics (counts) are needed to build tree models? : Will the query time be an issue for large computations? Could you elaborate? : For the external sorting you mentioned and the stream algorithms another poster mentioned, what are some good reference books and worked examples? : Thanks again.
|
k**x posts: 74 | 20 SPRINT for CART
【quoted from f****r】 : Thanks. : Actually, the data set I use is not as big as a petabyte, but around 18*10G. : I would like to evaluate some data mining techniques (like CART, boosted trees, etc.) for classification. : There are many open source packages, but most of them are really not good at dealing with large, high-dimensional data sets. : So, I am wondering what kind of technologies commercial data mining packages use to deal with scalability and dimensionality, since most of them claim they can dea
|