Roundup of discussions about dataflow - 话题女王


m*****n
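Posts: 204

From thread: Programming board - Is anyone familiar with Google Cloud Dataflow?

I recently built a small tool with Dataflow, so I can't claim to know it well.
For a beginner it is very easy to pick up. Putting together a complex MR
pipeline is convenient, and during development you can run small data sets
off the cloud with a quick turnaround. I haven't used Flume or the like, so I
can't say whether it is better than them. Performance is about the same as a
comparable Hadoop job. The downside is the proprietary API. Cloudera seems to
be using the Spark API to wrap around Dataflow.
That might be worth a look.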

m***h
Posts: 77

From thread: Programming board - What is Google's Dataflow?

Just saw this on Hacker News: stream processing, competing with Spark and Flink.
https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison

k*******6
Posts: 103

From thread: Programming board - Is MapReduce obsolete? Google releases Cloud Dataflow

It has been obsolete for years...
http://techcrunch.com/2014/06/25/google-launches-cloud-dataflow

w**z
Posts: 8232

From thread: Programming board - Is MapReduce obsolete? Google releases Cloud Dataflow

It doesn't look that way from the article. It's using a different algorithm
from MapReduce.
http://www.infoworld.com/t/hadoop/why-google-cloud-dataflow-no-

f********x
Posts: 99

From thread: Programming board - Flink Sparks Next Wave of Distributed Data Processing

The world beyond batch: Streaming 101: A high-level tour of modern
data-processing concepts
http://radar.oreilly.com/2015/08/the-world-beyond-batch-streami
by Tyler Akidau August 5, 2015
Editor’s note: This is the first post in a two-part series about the
evolution of data processing, with a focus on streaming systems, unbounded
data sets, and the future of big data.
Streaming data processing is a big deal in big data these days, and for good
reasons. Amongst them:
Businesses crave ever more tim...

f********x
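Posts: 99

From thread: Programming board - Flink Sparks Next Wave of Distributed Data Processing

The real expert here is the OP. I'm just a small-timer, stuck at the level of
studying all these dazzling technologies day in and day out.
The choice of technology depends mainly on the specific problem you need to
solve. Describe your requirements and we can all discuss them in depth.
If you are simply studying, it doesn't matter whether you pick Spark or Flink,
or even the aging MapReduce or the state-of-the-art Dataflow, because their
programming models differ little and even the syntax is nearly the same. For
example, you could use a combination like this:
The Spark book:
http://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analys
+
Spark AMPCamp training:
http://ampcamp.berkeley.edu/
+
Flink batch documentation:
http://ci.apache.org/projects/flink/flink-docs-master/apis/prog
+
Flink streaming documentation:
http://ci.apache.org/projects/flink/flink-docs-master/apis/stre
+
Flink online trainin...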

f********x
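Posts: 99

From thread: Programming board - Isn't Spark unable to do true stream processing precisely because it needlessly insisted on making RDDs immutable?

The SDK is open source, so you won't be locked in to one execution engine. Google has actually been planning to unify this market for a long time.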
Dataflow over Spark:
http://googlecloudplatform.blogspot.com/2015/01/easily-run-data
Dataflow over Flink:
http://googlecloudplatform.blogspot.com/2015/03/announcing-Goog
Genome analysis pipeline over Dataflow:
http://github.com/googlegenomics/dataflow-java

f******2
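Posts: 2455

From thread: Programming board - Apache Beam bs Apache spark

I read the goal description for the incubation proposal. It feels like Google
still just wants to open-source a shell and win customers over that way. I
doubt it will succeed.
First, they criticize Spark here:
https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
Then, here they try to bring Spark under their own programming model:
https://wiki.apache.org/incubator/BeamProposal
It shows no regard at all for how Databricks feels.
On top of that, there is no plan whatsoever to open-source Dataflow's server
side. It's as if Azure said: we've open-sourced the Azure client, and it's an
Apache project, so stop using AWS.
You can't turn things around in cloud computing this way.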

f*****d
Posts: 2285

From thread: JobHunting board - Looks like Google announced a new product at I/O to replace MapReduce?

http://googlecloudplatform.blogspot.com/2014/06/sneak-peek-goog
In today's world, information is being generated at an incredible rate.
However, unlocking insights from large datasets can be cumbersome and costly,
even for experts.
It doesn’t have to be that way. Yesterday, at Google I/O, you got a sneak
peek of Google Cloud Dataflow, the latest step in our effort to make data
and analytics accessible to everyone. You can use Cloud Dataflow:
for data integration and preparation (e.g. in prepara...

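Posts: 1

From thread: JobHunting board - Seeking advice on a Google Cloud team

I'm in team match right now. The first few teams HR offered were all in the
office productivity suite, which doesn't interest me much. Then I said
Dataflow looked pretty good, and HR put me in touch with the Dataprep team
(because Dataflow isn't in the Bay Area). The team is very new and still in
the BETA stage. I spoke on the phone with the HM and the Tech Lead (promoted
to T5 only at the end of the year...). My impression is that the team mainly
works on the integration of Trifacta and GCP and does not touch data
processing infra. Also, from what the HM said, over the past year the Dataprep
code was mostly written by the tech lead alone (which is probably why he was
promoted to T5). Both said there is a lot to do, but the call was short and we
didn't get to go deep; everything we discussed was very general.
Does any expert on this forum know something about Dataprep? Any advice would
be greatly appreciated. My main concerns are:
1. The team is too small and the tech lead only just made T5, so I worry that
if I join there will be no senior person to mentor me. I'm T4.
2. The HM and the lead are both in Seattle. Because the Bay Area is close to
Trifacta, the new positions are all in the Bay Area. That would put me far
away from the HM and the lead.
3. The team's work is mainly integration; the details will have to wait for a second call. Personally, as for this area...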

a****l
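Posts: 8211

From thread: CS board - A look back at IT history and a look ahead

LabVIEW is not as magical as you make it sound. At its core it is not much
different from C or Java (it is closer to Java); it simply adopts a dataflow
design approach. The basic elements of other languages all have one-to-one
counterparts in LabVIEW.
As for LabVIEW "letting me do work that looks hard to outsiders, when in fact
with a little study I could quickly put it to use, and data acquisition,
instrument control and so on all became easy": the key to its simple
instrument control is that the company produces all the hardware and software
along the whole chain, from the boards to the GUI. Naturally everything is
easy, and one button click gets you everything. Any other high-level language
could do exactly the same; as long as you have the API, what can't you do?
LabVIEW can control other companies' instruments only because everyone has
adopted common control protocols.
Besides, these "foolproof" LabVIEW features are really just for show to
beginners. If you actually want to develop practical software, in the end you
still have to build it yourself from basic blocks such as for and while.
LabVIEW really just provides some high-level APIs; it is no big deal. LabVIEW
is one company's proprietary software and cannot possibly develop in the GNU
direction you are hoping for.
That said, I do think LabVIEW's dataflow design approach is well suited to
where modern CPUs are heading. There is a lot of discussion about this inside
the company behind LabVIEW these days.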


N*****m
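Posts: 42603

From thread: Programming board - Apache Beam bs Apache spark

It's just Dataflow's DSL being open-sourced, which happened several weeks ago.
Now it has been given the name Beam.
The engine is not open source; you can run pipelines on their own DF Service,
or use Spark or Flink as the engine.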
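
Editor's sketch of the "one pipeline definition, several engines" idea described above. This is a toy model in Python, not the Beam or Dataflow API: `Pipeline`, `local_runner`, and `threaded_runner` are hypothetical names made up for illustration, and the thread pool merely stands in for a different execution engine.

```python
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """Hypothetical pipeline: an ordered list of per-element steps."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        # Record the step; nothing runs until a runner executes the pipeline.
        self.steps.append(fn)
        return self


def apply_steps(steps, element):
    # Push one element through every recorded step, in order.
    for fn in steps:
        element = fn(element)
    return element


def local_runner(pipeline, data):
    """Execute sequentially in the current thread."""
    return [apply_steps(pipeline.steps, x) for x in data]


def threaded_runner(pipeline, data, workers=4):
    """Execute on a thread pool; a stand-in for a different engine."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(lambda x: apply_steps(pipeline.steps, x), data))


# The pipeline is defined once and handed to either runner unchanged.
pipeline = Pipeline().map(str.strip).map(str.lower).map(len)
words = ["  Beam ", "Spark", " FLINK"]
local_result = local_runner(pipeline, words)
threaded_result = threaded_runner(pipeline, words)
```

Both runners should return the same list; only the execution strategy differs, which is roughly the separation between SDK and engine that the post is describing.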


s**u
Posts: 9035

From thread: Biology board - Clinical Research Fellow opening

Department: STAR (Shock, Trauma & Anesthesiology Research)
Schedule:
Shift:
Hours:
Job Details: Description: The Charles McMathias, Jr National Study Center
for Trauma and EMS of the Shock Trauma and Anesthesiology Research Organized
Research Center (STAR ORC) at the University of Maryland Baltimore, School
of Medicine, is seeking a highly motivated, enthusiastic and capable
Research Fellow interested in Emergency Medical Services to work as part of
a multi-disciplinary team, requiring self-...

S**********e
Posts: 1325

From thread: MedicalCareer board - Clinical Research Fellow opening(ZZ) (repost)

[The following text is reposted from the Medicalpractice board]
From: shmu (shmu), Board: Medicalpractice
Title: Clinical Research Fellow opening(ZZ)
Posted at: BBS 未名空间站 (Tue Dec 13 09:01:25 2011, US Eastern)
From: shmu (shmu), Board: Biology
Title: Clinical Research Fellow opening
Posted at: BBS 未名空间站 (Tue Dec 13 08:59:36 2011, US Eastern)
Department: STAR (Shock, Trauma & Anesthesiology Research)
Schedule:
Shift:
Hours:
Job Details: Description: The Charles McMathias, Jr National Study Center
for Trauma and EMS of the Shock Trauma and Anesthesiology Researc...

M*******n
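Posts: 10087

From thread: Military board - IT people in China are getting pretty arrogant these days (repost)

[The following text is reposted from the Dreamer board]
From: Dreamer (不要问我从哪里来), Board: Dreamer
Title: IT people in China are getting pretty arrogant these days
Posted at: BBS 未名空间站 (Wed Aug 12 16:32:07 2015, US Eastern)
I went back to China not long ago and chatted with IT peers there. My
impression is that people working in IT in China are quite arrogant these
days. They keep saying that China's technology is now no worse than
America's, that Alibaba's big data is ahead of Amazon's, that Silicon Valley
is finished, and so on.
In reality, what Chinese companies do mostly comes down to taking American
open-source software, integrating it, and tweaking parameters. As for
Alibaba's so-called big data, leaving aside how inflated it is, does it use
any software developed in-house? From traditional databases, to Hadoop and
Spark, which are popular in big data today, to Google Dataflow, billed as the
next generation: which of these was created by a Chinese company?
They merely have more users, and already they look down on American
technology. If these American companies had that many users, they could
support them just as well. After all, they wrote the code themselves; if you
can get there by tweaking parameters, the people who know the core
implementation can change it even more easily. Apart from Huawei, how many
Chinese companies have a full set of their own core technology? They haven't
built anything yet and are already this restless. Sigh.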

s******c
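Posts: 1920

From thread: JobHunting board - Please recommend some MapReduce papers

The FlumeJava paper that Google published later is really the same thing that
was packaged up at this year's I/O as the next-generation MapReduce: Dataflow.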

d********w
Posts: 363

From thread: JobHunting board - A shortcut to improving system design skills

Here is some quality reading.
Basics and Algorithms
The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of
Thumb (1997): This paper (and the original one proposed 10 years earlier)
illustrates a quantitative formula to calculate whether a data page should
be cached in memory or not. It is a delight to read Jim Gray's approach to an
array of related problems, e.g. how big a page should be.
AlphaSort: A Cache-Sensitive Parallel External Sort (1995): Sorting is one
of the most essential algorithms in...

f*****d
Posts: 2285

From thread: JobHunting board - Pinterest Software Engineer position for Data/Hadoop

Why does Pinterest still use Hadoop & Hive? Why not use Spark or Google
Cloud Dataflow?

f*****d
Posts: 2285

From thread: JobHunting board - How is Data bricks?

https://cloud.google.com/dataflow/
[Quoting wookoong (悟空):]
: What does the workflow that ysd mentioned refer to? What solutions do Google and Databricks each have?
:
: ...........

s******c
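Posts: 1920

From thread: JobHunting board - How is Data bricks?

Google has been building Flume, known externally as Dataflow, for years; they just never publicized it much
and only published one very misleading paper.
As a result, Spark stole the spotlight.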

S*******w
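Posts: 24236

From thread: JobHunting board - What's the deal with Databricks as a company?

On one side there is the open-source Flink competing with it, and on the other there is Google's Dataflow.
The founders can make some money, but rank-and-file employees hoping to join and get a share will have a harder time.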

j******6
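Posts: 8

From thread: JobHunting board - Help choosing a team at Google

I've been matched with three teams so far. I've talked with all the managers
and I'm still confused about how to choose. The second one seems more
interesting, but I'm not clear on the specifics. I'd appreciate any input from
insiders or people who know these teams. Thanks 🙏
1. Gsuite - Identity and Access Management
2. Tech Infra Data Processing and Analysis
(Opportunity to contribute to Beam (open source) and to Cloud & Dataflow, as
well as work with the TensorFlow & ecosystem and CloudML)
3. Cloud, Enterprise Admin Platform team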

s*****r
Posts: 43070

From thread: JobHunting board - I feel Amazon's market cap will surpass Google's sooner or later

dataflow

s******c
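Posts: 1920

From thread: Stock board - Amazon is not in a short-term pullback; it is the end of a legend

Just look at Google I/O:
cloud and Android are Google's two big priorities going forward.
Google's and Amazon's cloud services are still quite different. The most common use of AWS is to buy the virtual machines they offer directly,
while Google Cloud is more high-level, e.g. BigQuery and Dataflow. It's hard to say which is right; or rather,
each probably has sectors and customers it fits especially well.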

m******s
Posts: 118

From thread: CS board - 2006 ACM fellows [zz]

http://newswire.ascribe.org/cgi-bin/behold.pl?ascribeid=20070108.114012&time=12%2049%20PST&year=2007&public=0
2006 ACM Fellows
Eric W. Allender - Rutgers University
For contributions to computational complexity theory.
Arvind - Massachusetts Institute of Technology
For contributions to dataflow computing and verification.
Mikhail J. Atallah - Purdue University
For contributions to parallel and distributed computation.
Ming-Syan Chen - Nation

B*********L
Posts: 700

From thread: Database board - Can't get the SSIS import and export wizard to work

Many thanks!
I made the change following the reply below. It works!
Maurice Maglalang Wednesday, January 23, 2008 7:25:22 AM
change ERROR OUTPUT within your dataflow task to IGNORE FAILURE on the
offending field within your source. do the same to your destination then
you'll be good to go bro...ignore those worthless answers above...l8

p***c
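Posts: 5202

From thread: Database board - Can a data warehouse be run entirely on hand-written procedures, views, and so on?

I attended a talk the other day where the speaker said "some older
engineers..." and the audience went "ouch..."; he was talking about exactly
this kind of stubborn person.
Hahaha.
Your teacher is just bragging. As the poster above said, at most he could use SQL to write:
the ETL dataflow part. I agree with this one; SQL is often very powerful, but
once you run into other sources such as XML, JSON, or CSV, is he going to
write that in SQL? It would wear him out.
the report data modeling part.
For cubes, without SSAS or something like it, such as Cognos, which IBM acquired, how would he write that in SQL? It would kill him.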

F****n
Posts: 3271

From thread: Java board - JTable is too weak and should be rewritten

It's a dataflow-based visualization package that can be used to visualize
anything.

o**2
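Posts: 168

From thread: Java board - NoThread concurrency

You're right, the business logic has been moved into those objects. This is
where FM starts to deliver benefits to the programmer:
1. These objects are active, meaning you no longer need to arrange threads and
the like. It is much like the relationship between a browser and a website:
you send a request and get a response. It also makes dataflow convenient.
2. These objects have names. You access them by name and don't need to manage references to them.
3. Inside these objects, your business logic runs in a single-threaded
environment, which reduces race conditions.
Those are the points that come to mind first.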

a****l
Posts: 8211

From thread: Programming board - After reading this article, my brain can't quite keep up

The problem is that text programs are inherently sequential, so you run into
all kinds of problems when moving to the parallel world.
Switch to dataflow programming, and many of those problems can simply be avoided.

a****l
Posts: 8211

From thread: Programming board - Are CUDA and Hadoop the two more promising technologies for parallel and distributed computing?

I think they are all focusing on the wrong path. Dataflow is the way to go.

r*********r
Posts: 3195

From thread: Programming board - Are CUDA and Hadoop the two more promising technologies for parallel and distributed computing?

Dataflow is such a generic word. What specific technology does it refer to?


c******o
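Posts: 1277

From thread: Programming board - Programs written with async programming + FP are far too hard to read

Haven't you noticed that with the future monad you don't have to care about
sequential versus parallel at all? Your dataflow takes care of all of that
automatically.
This in itself is called the dataflow concurrency paradigm.
If you need transactions, add STM/agents; if it is distributed, add
location-neutral actors.
These are all areas where FP is strong right now.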
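
Editor's sketch of the point above, in Python rather than Scala. It uses the standard `concurrent.futures` module, whose futures have no monadic `map`/`flatMap`, so the dependent step simply calls `result()`. The functions `fetch_price`, `fetch_quantity`, and `total_cost` and their data are hypothetical. The idea it tries to show: you only state which values depend on which, and independent work is free to overlap.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_price(item):
    # Stand-in for a slow remote lookup (hypothetical data).
    return {"apple": 3, "pear": 5}[item]


def fetch_quantity(item):
    # Another independent lookup (hypothetical data).
    return {"apple": 4, "pear": 2}[item]


def total_cost(pool, item):
    # The two lookups share no data dependency, so both are submitted
    # immediately and may run concurrently.
    price = pool.submit(fetch_price, item)
    quantity = pool.submit(fetch_quantity, item)
    # The product depends on both values; result() waits only here.
    return price.result() * quantity.result()


with ThreadPoolExecutor(max_workers=4) as pool:
    costs = {item: total_cost(pool, item) for item in ("apple", "pear")}
```

Nothing in `total_cost` says "run these in parallel"; the ordering falls out of which value needs which, which is the sense of "dataflow" the post has in mind.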

c******o
Posts: 1277

From thread: Programming board - Programs written with async programming + FP are far too hard to read

http://doc.akka.io/docs/akka/snapshot/scala/futures.html
http://doc.akka.io/docs/akka/snapshot/scala/dataflow.html

c******o
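Posts: 1277

From thread: Programming board - akka/scala/jvm

I've been using Scala for a while, and recently I also taught myself Clojure and practiced a bit with Akka/Spark on my own.
Here are some new thoughts.
Our new backend is Play/Scala. We originally wanted to add Akka, but in practice we ended up using very little of it.
Does that mean Akka is bad or useless? I used to wonder about that too, but I now think I had misunderstood.
Akka is very good, definitely a killer app. I used to think it was useless and that we never used it; that was a misunderstanding.
First, I am in fact using Akka indirectly. The future in Scala 2.10.x is what
used to be Akka's future, Scala's own earlier future has been deprecated, Play
itself is built on Akka, and so is Spark.
Second, we genuinely have no need for actors. The main use of actors is
distributed fault-tolerant systems; ordinary async futures/dataflow are more
than enough. If you need shared data or transactions, agents are enough.
Actors can handle both of those cases, but you don't necessarily have to use them.
Third, Akka actually contains a lot; actors are just the most general piece.
There are futures (which internally are really all callbacks),...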

k**********g
Posts: 989

From thread: Programming board - Let me talk about Go's intended competitors

Ideally, these frameworks should have supported two important models:
(1) dataflow model (similar to FRP, functional reactive programming, but
without the "leaky abstractions"), and
(2) Task dependency DAG model
Microsoft PPL (parallel patterns library) supports #1 out of the box.
I implemented the second model twice, in 2012 (without PPL) and 2013 (on top
of PPL). Relying on PPL simplifies the management of the thread pool,
especially for its initialization, automatically spinning up new thread...
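
Editor's sketch of model (2), the task dependency DAG, in Python. It is not the poster's implementation and has nothing to do with PPL: the task names and the `run_dag` helper are hypothetical, tasks run sequentially in one thread, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) is used only to obtain a dependency-respecting order.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: name -> (dependency names, function of their results).
tasks = {
    "load": ((), lambda: [3, 1, 2]),
    "sort": (("load",), lambda xs: sorted(xs)),
    "total": (("load",), lambda xs: sum(xs)),
    "report": (("sort", "total"), lambda s, t: f"{s} sums to {t}"),
}


def run_dag(tasks):
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(deps) for name, (deps, _) in tasks.items()}
    results = {}
    # static_order() yields every task after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        deps, fn = tasks[name]
        results[name] = fn(*(results[d] for d in deps))
    return results


results = run_dag(tasks)
```

A parallel version would hand each ready task to a thread pool instead of calling it inline, which is presumably the part the post says a library like PPL makes easier.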

s******c
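Posts: 1920

From thread: Programming board - Is MapReduce obsolete? Google releases Cloud Dataflow

MapReduce is like the assembly language of the cloud era. However fancy the
new stuff is, there is no getting around it; there are just higher-level
abstractions on top. Underneath, it is still running MapReduce or some variant of MapReduce.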
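
Editor's sketch of the three phases that the higher-level abstractions are said to sit on top of, as a single-process Python toy. The function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative only; this is not Hadoop's or Dataflow's API, and a real system would distribute each phase across machines.

```python
from collections import defaultdict


def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: collapse each key's values into a single count.
    return {key: sum(values) for key, values in groups.items()}


lines = ["Dataflow replaces MapReduce", "MapReduce still runs underneath"]
counts = reduce_phase(shuffle(map_phase(lines)))
```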

f********x
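Posts: 99

From thread: Programming board - Isn't Spark unable to do true stream processing precisely because it needlessly insisted on making RDDs immutable?

It takes the one who tied the bell to untie it: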
http://cloud.google.com/dataflow/

r***t
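Posts: 104

From thread: Programming board - Is anyone familiar with Google Cloud Dataflow?

Our company is getting ready to trial it as a replacement for MR. Could anyone familiar with it share the pros and cons?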

l*******m
Posts: 1096

From thread: Programming board - What is Google's Dataflow?

It was donated to Apache.

g****s
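Posts: 340

From thread: Programming board - Apache Beam bs Apache spark

I think this strategy is pretty good: run on Flink or Spark on your own
cluster, and use Cloud Dataflow if you want higher efficiency.
That said, Google's stack is very unusual; the build system has been open
source for months and is still in beta. Open-sourcing a higher-level
technology will take a lot of time.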

N*****m
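Posts: 42603

From thread: Programming board - There's so little learning material for Scalding!

Beam isn't very mature yet; probably the only runner without pitfalls is their own Dataflow.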

g****t
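Posts: 31659

From thread: Programming board - This article analyzing languages is very good

(1)
Nim's backend is C, you can choose either gcc or clang, and the build produces an exe.
That is more convenient than Julia.
Nim's competitor is C, not easy-to-use languages like Julia. It is as fast as C,
and reportedly has very little overhead.
Of course, Nim was not designed as a data processing language. But if you need C's speed
and find C too outdated as a language, these are worth considering.
Nim and Rust are both roughly in the same category as Swift. They all position themselves as
system languages.
(2)
When you go deep into one narrow area of algorithm work, it is easy to start wanting a new language.
That is different from what you need for job hunting.
Here is another one for you:
https://github.com/frankmcsherry/differential-dataflow
http://cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf
This guy has written a lot of Rust.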

S*******w
Posts: 24236

From thread: Programming board - Have any of you experts played with Apache Beam?

The biggest driver is Dataflow from Google Cloud.

p******k
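Posts: 11

From thread: Quant board - SecDB/Slang: Goldman Sachs's money-making weapon

http://ponyhawk.org/?p=10
SecDB stands for security DB, and Slang for security language. SecDB is a
database platform, and Slang is the language used on that platform. When I
searched online for information about it, I found plenty of discussion, but
all of it high-level; once you dig in, everything turns vague and there is
nothing concrete to learn from. The reason may be that its owner is another
mysterious company: Goldman Sachs. A big name in finance once said that
Goldman Sachs is a lot like a certain party: to outsiders it looks powerful
and mysterious, and although it has not done anything bad, nearly everything
it does gets associated with evil. Yet when just such an "evil" player is
said, by word of mouth, to possess a money-making weapon called "SecDB/Slang",
the weapon itself becomes something the world chases after. And the many
conflicting accounts only add to its legend as a "magic weapon".
Since this is a discussion of an IT system, I think IT people are best placed
to comment. After carefully going through the discussion in note [1] and the
description in note [2], I think the following excerpt deserves further analysis:
"They could also calculate the side effects of propo...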

D******n
Posts: 2836

From thread: Statistics board - A simple SAS data flow analyzer

https://github.com/dashagen/sas-dataflow

D******n
Posts: 2836

From thread: Statistics board - A simple SAS data flow analyzer

https://github.com/dashagen/sas-dataflow/wiki
