
Document Complexity in XML conversion
During the conversion process, Exegenix software uses its built-in knowledge of typographic principles to identify constructs such as sections, paragraphs, quotes, lists, tables, and footnotes, and applies a variety of techniques across the entire document to form a complete, cohesive internal representation of its structure. Because the approach of Exegenix Conversion Solutions is automated, it produces the best results on documents that employ commonly used graphical objects, traditional ways of formatting these objects, and regular, repeating patterns. Three broad document categories indicate the increasing difficulty of conversion:
Note that these categories do not refer to the traditional concept of a document's "complexity". For example, a document that has many levels of nested lists can be converted as readily as one that has only a single level, provided that the lists are formatted in regular ways. In fact, estimating the ease of conversion is an art rather than a science, and Exegenix uses a combination of the following guidelines on text flow, graphical subtlety, and other considerations:
Text flow
Text flow describes the path that the eye takes when reading the information on the page, and can be characterized as:
Difficulties in correctly identifying and understanding the text flow in the original document can result in XML output that contains out-of-order text or elements, requiring post-conversion processing.
Graphical subtlety
Graphical subtlety describes the ease with which text and graphical objects can be distinguished within the original document, and the degree to which graphics are used to convey textual information. It can be characterized as:
Difficulties in distinguishing between text and graphics in the original document can result in XML output that contains text captured as a graphic, or textual parts of a graphic captured separately, requiring post-conversion processing.
Other considerations
Other considerations that help determine the level of conversion difficulty include:
Frequency of images: Images tend to break the text flow, convey non-parsable information, and require manual design.
Mathematics: Complex math can be handled by Exegenix in a variety of ways, depending on customer requirements.
Ratio of white/dark space: Text-heavy documents are easier to convert, and less likely to have unconventional text flows.
Semantic structure: Any structure that is conveyed via content alone cannot be detected, although it may be possible to recover it in post-conversion processing.
Vertical markets: Exegenix Conversion Solutions deal best with the widely observed formatting conventions that span disciplines. Certain types of very specialized documents will initially fall into the idiosyncratic category but, as our development team incorporates their conventions, will move into the conventional category.
It's also important to note that the easier a document is to convert, the more richly structured the output format can be: easily convertible documents can be rendered into richly structured XML or SGML; more difficult documents may be rendered into HTML that captures the formatting but not the structure of the original document; and the most difficult documents can be rendered sensibly only into text format.
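The relationship between conversion difficulty and achievable output format can be sketched as a simple lookup. This is only an illustration, not part of any Exegenix product: the names Difficulty, TARGET_FORMATS, and target_formats are hypothetical, and the three level labels are assumed placeholders for the relative difficulty levels described in the text.

```python
from enum import Enum


class Difficulty(Enum):
    """Hypothetical labels for relative conversion difficulty."""
    EASY = 1
    MORE_DIFFICULT = 2
    MOST_DIFFICULT = 3


# Output formats the text describes as sensible for each level.
# The mapping itself is an illustrative assumption drawn from that description.
TARGET_FORMATS = {
    Difficulty.EASY: ("XML", "SGML"),       # richly structured output
    Difficulty.MORE_DIFFICULT: ("HTML",),   # formatting kept, structure lost
    Difficulty.MOST_DIFFICULT: ("text",),   # plain text only
}


def target_formats(difficulty: Difficulty) -> tuple:
    """Return the output formats suggested for a given difficulty level."""
    return TARGET_FORMATS[difficulty]
```

For example, target_formats(Difficulty.MORE_DIFFICULT) returns a tuple containing only "HTML", reflecting that such documents keep their formatting but not their structure.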