DataSciences board - OCR job from recruiter - it is interesting but I can't do it, yet
c***z (Posts: 6348)
If you are interested, please let me know at h******[email protected] as
early as possible.
Job Title: Data Scientist with Machine Learning and Natural Language
Processing experience
Company: BITS
Task 1: Extend NIST Scientific Text Extraction System
Description of Tasks
I. Implement a distributed PDF-to-image conversion subsystem that converts
pages of scientific articles to individual images.
II. Implement a distributed optical character recognition (OCR)-based text
extraction subsystem that extracts text from images of individual pages and
prepares it for further processing by error-correcting, machine learning,
and natural language systems.
III. Develop installation scripts for the developed system and the required
tools to facilitate installation on Linux virtual machines.
IV. Configure Linux virtual machine images with the developed system and
the necessary software tools and libraries for deployment in a distributed
virtualized environment such as cloud computing.
V. Develop system documentation for deployment, maintenance, and
operation.
Deliverables:
The deliverables for the tasks under Task 1 are:
1. PDF-to-Image Conversion subsystem in Python, using ImageMagick to
perform the actual image conversion. Distribution of computation is
implemented using Redis and Thoonk to create a distributed job queue system
in which a publisher node enters PDF ids from a fileserver into the job
queue for distributed worker nodes to fetch and convert into images. Images
should be returned to the fileserver as completed work units in zip files
containing the images. The subsystem should be fault tolerant and include
the necessary error handling and logging to disk to allow for uninterrupted
operation over long periods of time. Failure to perform an image conversion
should not prevent the system from continuing, nor should information about
the failure be lost. (A job-queue sketch follows this list.)
2. OCR-based Text Extraction subsystem in Python, using OCRopus to extract
the text from the image files. Distribution of computation is implemented
using Redis and Thoonk to create a distributed job queue system in which a
publisher node enters work unit identifiers (work units are generated during
image conversion) from a fileserver into the job queue for distributed
worker nodes to fetch and process. Extracted text should be added to the
zip file-based work unit and sent back to the fileserver. The subsystem
should be fault tolerant and include the necessary error handling and
logging to disk to allow for uninterrupted operation over long periods of
time. Failure to perform a text extraction should not prevent the system
from continuing, nor should information about the failure be lost. (A
worker sketch follows this list.)
3. Command-line installation scripts in Python that make use of existing
packaging and distribution facilities associated with Linux and Python
libraries when available. (An install-script sketch follows this list.)
4. A Linux virtual machine image, compatible with the existing VMware-based
infrastructure, that has been configured for rapid deployment. The VM image
should contain a current patched version of Linux with the developed code
and its prerequisites installed via the installation script previously
developed.
5. System documentation, in Microsoft Word, for the entire Scientific
Text Extraction System. Documentation shall include an overview of the
architecture, data flow, use of and integration with Redis, deployment,
maintenance, and operation of the application.
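
For deliverable 1, here is a minimal sketch of the publisher/worker pattern
described above. It assumes a plain redis-py list in place of Thoonk's job
feed; the queue name, fileserver mount point, and DPI are illustrative, and
ImageMagick (which needs Ghostscript for PDF input) does the actual
conversion:

import logging
import subprocess
import zipfile
from pathlib import Path

import redis

logging.basicConfig(filename="pdf2img.log", level=logging.INFO)
r = redis.Redis(host="localhost", port=6379)

QUEUE = "pdf_jobs"          # assumed queue name
FILESERVER = Path("/data")  # assumed fileserver mount point

def publish(pdf_ids):
    """Publisher node: enter PDF ids into the job queue."""
    for pdf_id in pdf_ids:
        r.rpush(QUEUE, pdf_id)

def work_forever():
    """Worker node: fetch ids, convert pages, zip results, log failures."""
    while True:
        _, pdf_id = r.blpop(QUEUE)  # block until a job arrives
        pdf_id = pdf_id.decode()
        try:
            pages = FILESERVER / f"{pdf_id}-pages"
            pages.mkdir(exist_ok=True)
            # ImageMagick performs the actual conversion: one PNG per page.
            subprocess.run(
                ["convert", "-density", "300",
                 str(FILESERVER / f"{pdf_id}.pdf"),
                 str(pages / "page-%04d.png")],
                check=True)
            # Return the completed work unit to the fileserver as a zip.
            with zipfile.ZipFile(FILESERVER / f"{pdf_id}.zip", "w") as zf:
                for png in sorted(pages.glob("*.png")):
                    zf.write(png, png.name)
            logging.info("converted %s", pdf_id)
        except Exception:
            # A failed conversion must not stop the worker, and the
            # failure must not be lost -- log it and move on.
            logging.exception("conversion failed for %s", pdf_id)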
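
For deliverable 2, the queue now carries work-unit ids produced by the
conversion stage. The sketch below shows only the queueing and
fault-tolerance behavior the deliverable fixes; ocr_page is a placeholder
for the real OCRopus pipeline (binarize, segment, predict), not its actual
API:

import logging
import zipfile
from pathlib import Path

import redis

logging.basicConfig(filename="ocr.log", level=logging.INFO)
r = redis.Redis()
QUEUE = "ocr_jobs"          # assumed queue name
FILESERVER = Path("/data")  # assumed fileserver mount point

def ocr_page(png_bytes: bytes) -> str:
    raise NotImplementedError("call OCRopus here")

def work_forever():
    while True:
        _, unit_id = r.blpop(QUEUE)
        unit_id = unit_id.decode()
        unit = FILESERVER / f"{unit_id}.zip"
        try:
            # Append extracted text into the same zip-based work unit.
            with zipfile.ZipFile(unit, "a") as zf:
                for name in [n for n in zf.namelist() if n.endswith(".png")]:
                    try:
                        text = ocr_page(zf.read(name))
                        zf.writestr(name.replace(".png", ".txt"), text)
                    except Exception:
                        # One bad page is logged, never fatal.
                        logging.exception("OCR failed: %s/%s", unit_id, name)
            logging.info("processed %s", unit_id)
        except Exception:
            logging.exception("work unit failed: %s", unit_id)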
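
For deliverable 3, an install script in this spirit could simply lean on
the system package manager and pip; the package names here are illustrative
assumptions:

import subprocess

APT_PACKAGES = ["imagemagick", "ghostscript", "redis-server"]
PIP_PACKAGES = ["redis", "thoonk"]

def main():
    # Prefer existing Linux/Python packaging facilities, per the spec.
    subprocess.check_call(["apt-get", "install", "-y", *APT_PACKAGES])
    subprocess.check_call(["pip", "install", *PIP_PACKAGES])

if __name__ == "__main__":
    main()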
Task 2: Develop Graphical User Interface for Computational Soft Materials
Workbench for Multiscale Modeling.
I. Working with MML-specified prototype workbench code, extend the existing
C++-based GUI to design and implement menu bar items and dialog boxes that
can be connected to MML-specified libraries and tools.
Deliverables:
The deliverables for the tasks under Task 2 are:
1. A prototype of the workbench that can be used to illustrate key user
interface concepts. It consists of menus for ZENO and help, a toolbar, an
interface for Python, and dialogs for the Amorphous Builder, Trajectory
Analysis Tool, LAMMPS and GROMACS simulations, Coarse-Mapping Tool,
Coarse-Grain Structure Tool, Coarse-Grained Force Field Assignment, and
ZENO. (A GUI stand-in sketch follows.)
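
The workbench GUI itself is C++, but the menu-and-dialog wiring the
deliverable describes can be sketched with PyQt5 as a Python stand-in; the
menu names follow the deliverable, everything else is an illustrative
assumption:

import sys
from PyQt5.QtWidgets import (QAction, QApplication, QDialog, QLabel,
                             QMainWindow, QToolBar, QVBoxLayout)

class ZenoDialog(QDialog):
    """Placeholder dialog, to be connected to an MML library call."""
    def __init__(self, parent=None):
        super().__init__(parent)
        self.setWindowTitle("ZENO")
        layout = QVBoxLayout(self)
        layout.addWidget(QLabel("ZENO parameters go here"))

class Workbench(QMainWindow):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("Soft Materials Workbench")
        self.addToolBar(QToolBar("Main", self))
        # Menu bar item wired to a dialog, as Task 2 describes.
        zeno_menu = self.menuBar().addMenu("&ZENO")
        run = QAction("Run...", self)
        run.triggered.connect(lambda: ZenoDialog(self).exec_())
        zeno_menu.addAction(run)
        self.menuBar().addMenu("&Help")

if __name__ == "__main__":
    app = QApplication(sys.argv)
    w = Workbench()
    w.show()
    sys.exit(app.exec_())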
Task 3: Develop Computational Soft Materials Workbench for Multiscale
Modeling.
I. Develop core application components.
II. Connect GUI to algorithms and tools.
III. Develop visualization of molecular structures.
IV. Implement facilities for reading and writing files in atomistic
formats.
V. Implement classes to interface to algorithms and tools for molecular
modeling.
VI. Create functionality for Molecular Modeling workflows.
VII. Write documentation for workbench system.
Deliverables:
The deliverables for the tasks under Task 3 are:
1. Identified APIs for GUI, data conversion, molecular visualization, and
extensibility. Implement a class library to integrate with these APIs.
2. Interface classes to connect functionality to menus for ZENO and help,
a toolbar, an interface for Python, and multiple dialogs (Amorphous Builder
, Trajectory Analysis Tool, LAMMPS and GROMACS simulations, Coarse-Mapping
Tool, Coarse-Grain Structure Tool, Coarse-Grained Force Field Assignment,
and ZENO).
3. 3D visualization, rotation, zoom, selection of individual elements,
and display lists of molecular structures. Visualization of the grouping of
highlighted elements into coarse-grained elements.
4. Functionality to read and write atomistic data in a variety of domain
formats: CML, PDB, XYZ, LAMMPS (Data and Input), GROMACS (Data, Input, and
Trajectory), and Coarse Grain (Mapping and Force Field Table). (An XYZ
reader/writer sketch follows this list.)
5. Classes to interface with molecular modeling algorithms (Amorphous
Builder and Coarse-Graining) and molecular modeling tools (LAMMPS, GROMACS,
Coarse-Grained Structure Building Tool, ZENO, and Trajectory Analysis Tool).
6. Workflow Functionality that supports a variety of molecular
calculations and computations.
7. Documentation of the workbench, and creation of GUI Help menus and web
pages.
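
As a taste of deliverable 4, here is a reader/writer for XYZ, the simplest
of the atomistic formats listed (line 1: atom count; line 2: comment; then
one "element x y z" record per atom); the other formats would get analogous
but larger parsers:

from typing import List, Tuple

Atom = Tuple[str, float, float, float]

def read_xyz(path: str) -> Tuple[str, List[Atom]]:
    """Read an XYZ file into a comment string and a list of atoms."""
    with open(path) as f:
        n = int(f.readline())
        comment = f.readline().rstrip("\n")
        atoms = []
        for _ in range(n):
            el, x, y, z = f.readline().split()[:4]
            atoms.append((el, float(x), float(y), float(z)))
    return comment, atoms

def write_xyz(path: str, comment: str, atoms: List[Atom]) -> None:
    """Write atoms back out in the same XYZ layout."""
    with open(path, "w") as f:
        f.write(f"{len(atoms)}\n{comment}\n")
        for el, x, y, z in atoms:
            f.write(f"{el} {x:.6f} {y:.6f} {z:.6f}\n")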
QUALIFICATIONS OF CONTRACTOR KEY PERSONNEL
All contractor personnel working under this task order shall be designated
as Key Personnel. All Contractor Key Personnel working under this task order
must meet the following minimum qualifications.
• Minimum of 5 years of experience with a scripting language, such as
Python, JavaScript, Perl, or PHP
• Minimum of 5 years of experience with system languages such as C or C++
• Minimum of 5 years of experience with Agile methodologies, such as XP or
Scrum
• Minimum of 5 years of experience with a combination of SQL and NoSQL
databases
• Minimum of 5 years of experience developing web applications with HTML5,
CSS3, JavaScript, jQuery, and Web 2.0 technologies
• Minimum of 3 years of experience developing RESTful interfaces
• Minimum of 1 year of experience setting up virtual machines and
installing and making Debian packages
• Minimum of 5 years of experience developing graphical user interfaces
• Minimum of 5 years of experience with XML technologies