Document Digitization/Conversion Service
CSI offers document conversion services in both the e2e (Electronic-To-Electronic) and the p2e (Paper-To-Electronic) domains. Digitization of paper documents (p2e) involves scanning content as images, after which OCR, using both generic and custom-made in-house tools, extracts and reformats the electronic text. Our customized XML-based processing framework keeps these services reliable and accurate. We also provide content enhancement services such as image editing and conversion, and the creation of hyperlinks, footnotes and tables of contents, among others, allowing maximum flexibility in the use of content.
Our Core Competencies in Document Conversion:
- Document extraction from virtually any format
- OCR cleaning
- Indexing
- Output generation in formats such as SGML, XML, HTML, Word and RTF
- Any DTD, any schema, any specification
- Content tagging
- Document intake via FTP, CD or DVD, paper stack or any other medium
Here is a summary of our working procedure:
- We review your sample document
- We prepare specifications as per your requirement
- We prepare samples
- You review specifications and samples
- We refine specifications and samples till you approve
- We develop in-house tools for automation at various levels
- We start production
- Our quality assurance cell performs quality checks
- We deliver converted documents following the agreed schedule
The solution can be divided into the following phases:
Scanning
- Establish scanning setup
- Identify scanning attributes as per document quality
- Scanning of B&W pages
- Scanning of color pages
- Cropping, Cleaning & Saving scanned files
Digital Content Re Mastering (DCRM)
- Image Preprocessing
- Image clean up through splitting, cropping, de-skewing
- Layout Analysis & Text Recognition
- Tuning OCR Engine for text extraction and image extraction
- Logical Structure Analysis
- Structure analysis of pages as per titles, authors, sections, captions, tables, manifest etc.
- Text Flow & Article Recognition
- Establish reading order within clusters as well as within pages
- Identify multiple articles within the same page or articles that span multiple pages
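As an illustration of the reading-order step above, here is a minimal Python sketch. It assumes each recognized text block carries the coordinates of its top-left corner, groups blocks into columns by their left edge, and then reads each column top to bottom. The `Block` class, the `reading_order` function and the `column_gap` threshold are hypothetical names chosen for this example; they do not describe CSI's in-house tools.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A recognized text block (hypothetical structure for illustration)."""
    text: str
    x: float  # left edge of the block on the page
    y: float  # top edge of the block on the page


def reading_order(blocks, column_gap=50.0):
    """Return blocks column by column, each column read top to bottom.

    Blocks whose left edges lie within `column_gap` of the first block
    in a column are treated as belonging to that column (an assumption
    that only holds for simple, non-overlapping column layouts).
    """
    columns = []
    for block in sorted(blocks, key=lambda b: b.x):
        if columns and abs(block.x - columns[-1][0].x) <= column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y))
    return ordered


page_blocks = [
    Block("B2", 310, 100),
    Block("A2", 10, 300),
    Block("A1", 12, 100),
    Block("B1", 305, 20),
]
ordered_texts = [b.text for b in reading_order(page_blocks)]
```

Real pages with headlines spanning several columns, or articles continuing on later pages, need more than this simple geometric rule, which is one reason a manual correction phase follows.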
Manual Corrections
- Correction of content overlapping
- Correction of page content
- Correction of page attributes
- Correction of page type
- Correction of inter-content linking etc.
XML Tagging
- Identification of tags and metadata contents
- Creation of Master XML DTD
- Creation of Master XML
- The Master XML contains the page format, page content and metadata
- Creation of HTML pages from the Master XML using a customized tool developed in-house
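To make the last step concrete, the following Python sketch shows one way HTML pages could be generated from a Master XML file using only the standard library. The element and attribute names (`document`, `meta`, `title`, `page`, `block`, `role`) are assumptions made for this example; the actual Master XML DTD is defined per project and is not shown here.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical Master XML: metadata plus per-page content and attributes.
MASTER_XML = """<document>
  <meta><title>Sample Journal</title></meta>
  <page number="1" type="article">
    <block role="title">First Article</block>
    <block role="body">Body text with &lt;angle&gt; brackets.</block>
  </page>
</document>"""


def page_to_html(page, doc_title):
    """Render one <page> element as a small standalone HTML document."""
    number = escape(page.get("number", ""))
    parts = [
        "<html><head>",
        f"<title>{escape(doc_title)} - page {number}</title>",
        "</head><body>",
    ]
    for block in page.findall("block"):
        # Title blocks become headings; everything else becomes a paragraph.
        tag = "h1" if block.get("role") == "title" else "p"
        parts.append(f"<{tag}>{escape(block.text or '')}</{tag}>")
    parts.append("</body></html>")
    return "\n".join(parts)


def master_to_html(xml_text):
    """Map each page number in the Master XML to its generated HTML."""
    root = ET.fromstring(xml_text)
    doc_title = root.findtext("meta/title", default="")
    return {
        page.get("number"): page_to_html(page, doc_title)
        for page in root.findall("page")
    }


html_pages = master_to_html(MASTER_XML)
```

Keeping page format, content and metadata in a single Master XML means other output formats can be produced from the same source by swapping the rendering function.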