How to implement a fuzzy join - Programming board - 未名存档



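mw (posts: 525)	1
I have two tables, both stored on disk as CSV files, 1 million rows each, and the first column of each is a timestamp. How do I implement a fuzzy join that produces a new 1-million-row table, where each row is a table 1 row plus the corresponding table 2 row whose timestamp is less than or equal to the table 1 timestamp?
I feel there is no clever trick here, and the only way is to read the two tables line by line in alternation.
Reading both tables into memory at once and then building indexes or the like would need too much memory.
Does anyone have a good approach?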
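mw (posts: 525)	2
More than 70 people have viewed this and nobody has an opinion? Did I not explain it clearly? Experts, please don't hold back your criticism.
[Quoting mw:]
: I have two tables, both stored on disk as CSV files, 1 million rows each, and the first column of each is a timestamp.
: How do I implement a fuzzy join that produces a new 1-million-row table, where each row is a table 1 row plus
: the corresponding table 2 row whose timestamp is less than or equal to the table 1 timestamp?
: I feel there is no clever trick here, and the only way is to read the two tables line by line in alternation.
: Reading both tables into memory at once and then building indexes or the like would need too much memory.
: Does anyone have a good approach?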
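g*****g (posts: 34805)	3
I don't see what is hard about this. With 1M records, even at 1K per record, that is only 1G, and memory is plentiful these days. If each record is large, storing just a timestamp plus a file offset takes only a few bytes per row. Simply sort both tables in memory, O(N log N), then do a merge, O(M+N), and you are done.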
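A minimal sketch of the index-then-merge idea in the post above, not g*****g's own code. It assumes integer timestamps in the first column and file objects opened in binary mode, so that `tell()` returns byte offsets. The names `build_index`, `read_row`, and `asof_join` are hypothetical and chosen only for illustration.

```python
import io

def build_index(f):
    """Return a sorted list of (timestamp, byte_offset) pairs for a binary file."""
    index = []
    offset = f.tell()
    line = f.readline()
    while line:
        if line.strip():  # skip blank lines
            index.append((int(line.split(b",")[0]), offset))
        offset = f.tell()
        line = f.readline()
    index.sort()  # O(N log N); only small tuples are held in memory
    return index

def read_row(f, offset):
    """Seek to a byte offset and return that row as text, without the newline."""
    f.seek(offset)
    return f.readline().rstrip(b"\r\n").decode()

def asof_join(f1, f2):
    """Yield each table 1 row joined with the latest table 2 row whose ts <= ts1."""
    idx1, idx2 = build_index(f1), build_index(f2)
    j = 0
    latest = None  # offset of the latest table 2 row seen with ts <= current ts1
    for ts1, off1 in idx1:
        # Merge step, O(M+N): the pointer into idx2 only ever moves forward.
        while j < len(idx2) and idx2[j][0] <= ts1:
            latest = idx2[j][1]
            j += 1
        row2 = read_row(f2, latest) if latest is not None else ""
        yield read_row(f1, off1) + "," + row2

if __name__ == "__main__":
    # Tiny unsorted in-memory "files" standing in for the two CSV tables.
    t1 = io.BytesIO(b"5,a\n1,b\n10,c\n")
    t2 = io.BytesIO(b"4,x\n0,y\n9,z\n")
    for row in asof_join(t1, t2):
        print(row)
```

Only the (timestamp, offset) pairs are sorted in memory; full rows are fetched from disk by seeking. If both files are already sorted by timestamp, the index is unnecessary and a single streaming pass is enough.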
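r***6 (posts: 401)	4
    #!/usr/bin/python3
    # Streaming join; both files must already be sorted ascending by timestamp.
    f1 = open("f1.csv")
    f2 = open("f2.csv")
    line2 = ""                  # latest f2 row with timestamp <= ts1
    nextline2 = f2.readline()   # lookahead: first f2 row not yet matched
    for line1 in f1:
        ts1 = int(line1.split(",")[0])
        # Advance f2 while its next row still has timestamp <= ts1.
        while nextline2 and int(nextline2.split(",")[0]) <= ts1:
            line2 = nextline2
            nextline2 = f2.readline()
        print(line1.rstrip("\n") + "," + line2.rstrip("\n"))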
d****n (posts: 1637)	5
Nice! Otherwise, try the Linux command-line tool "join":
http://www.folkstalk.com/2012/02/join-command-in-unixlinux-exam
[Quoting r***6:]
: #!/usr/bin/python
: f1 = open("f1.csv")
: f2 = open("f2.csv")
: ts2 = 0
: line2 = "\n"
: for line1 in f1:
:     ts1 = int(line1.split(",")[0])
:     while ts2 <= ts1:
:         nextline2 = f2.readline()
:         if not nextline2:
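D****r (posts: 309)	6
I don't quite understand your request:
"each row is a table 1 row plus the corresponding table 2 row whose timestamp is less than or equal to the table 1 timestamp"
It would be better to post some sample lines from each table here, along with the result format you want.
Apart from rc256's Python solution, I think shell scripting could handle this easily, e.g. using awk with a field separator, comparing the values, and printing the matching rows together.
[Quoting mw:]
: I have two tables, both stored on disk as CSV files, 1 million rows each, and the first column of each is a timestamp.
: How do I implement a fuzzy join that produces a new 1-million-row table, where each row is a table 1 row plus
: the corresponding table 2 row whose timestamp is less than or equal to the table 1 timestamp?
: I feel there is no clever trick here, and the only way is to read the two tables line by line in alternation.
: Reading both tables into memory at once and then building indexes or the like would need too much memory.
: Does anyone have a good approach?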
r***6 (posts: 401)	7
Another solution is to use an R stats package. Essentially, what you need is last observation carried forward (LOCF). Many stats packages do that.
[Quoting D****r:]
: I don't quite understand your request:
: "each row is a table 1 row plus the corresponding table 2 row whose timestamp is less than or equal to the table 1 timestamp"
: It would be better to post some sample lines from each table here, along with the
: result format you want.
: Apart from rc256's Python solution, I think shell scripting could handle this
: easily, e.g. using awk with a field separator, comparing the values, and
: printing the matching rows together.
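To illustrate what "last observation carried forward" means for this join, here is a small Python sketch rather than R code. It assumes the table 2 timestamps are already sorted ascending and fit in memory as a list. The function name `locf_lookup` and its parameters are hypothetical.

```python
from bisect import bisect_right

def locf_lookup(times, values, query_times, default=None):
    """For each query time, return the value at the latest time <= query.

    `times` must be sorted ascending; `values[i]` is the observation at `times[i]`.
    Queries earlier than every observation get `default`.
    """
    out = []
    for q in query_times:
        i = bisect_right(times, q)  # number of observations with time <= q
        out.append(values[i - 1] if i > 0 else default)
    return out

if __name__ == "__main__":
    table2_times = [0, 4, 9]
    table2_rows = ["y", "x", "z"]
    print(locf_lookup(table2_times, table2_rows, [1, 5, 10, -1]))
```

Each lookup is a binary search, so the query timestamps from table 1 do not need to be sorted. When they are sorted, the streaming merge shown earlier in the thread does the same job in a single pass.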
