Classes of XML Output
There are three general classes of XML
output generated by conversion tools. In increasing order of utility, they
are:
- Formatting-based
- Structural
- Semantic
Formatting-based markup
Just as HTML is designed to deliver documents to Web browsers,
formatting-based XML markup is designed to deliver content that
looks a specific way in a specific output format, for example, 8½
x 11 Portrait. Change the paper size, and the page breaks and text
flow change with it, altering the markup substantially.
This type of markup says nothing about the content itself, just
how it should look when published. Because the markup is closely tied to a
specific authoring application or output format, it is poorly suited to
publishing on other devices or to indexing for context-sensitive
searching.
Furthermore, if the formatting codes are inconsistently ordered,
so that "bold italic" sometimes appears as "italic bold", the markup is very
difficult to post-process into more useful forms.
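To illustrate the ordering problem, here is a minimal Python sketch. It is not taken from any conversion tool: it assumes the formatting codes arrive as nested `<b>` and `<i>` elements, and the helper name `style_runs` is hypothetical. It treats the codes around each run of text as an unordered set, so "bold italic" and "italic bold" compare as equal.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of formatting-only tag names; a real file may use others.
FORMATTING_TAGS = {"b", "i"}

def style_runs(element, active=frozenset()):
    """Yield (text, styles) pairs, where styles is an order-independent set."""
    if element.tag in FORMATTING_TAGS:
        active = active | {element.tag}
    if element.text and element.text.strip():
        yield element.text.strip(), active
    for child in element:
        yield from style_runs(child, active)
        # Tail text sits after the child's closing tag, so it only
        # carries the styles of the current element.
        if child.tail and child.tail.strip():
            yield child.tail.strip(), active

bold_italic = ET.fromstring("<p><b><i>Warning</i></b></p>")
italic_bold = ET.fromstring("<p><i><b>Warning</b></i></p>")

# The serialized markup differs, but the normalized runs are identical.
same_markup = ET.tostring(bold_italic) == ET.tostring(italic_bold)
same_runs = list(style_runs(bold_italic)) == list(style_runs(italic_bold))
print(same_markup, same_runs)  # False True
```

Even this small normalization step has to be written per tag vocabulary, which is part of why formatting-based output is costly to post-process.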
This is the type of markup generated by all off-the-shelf XML
conversion tools, as well as "Save as XML" from word processors or desktop
publishing tools.
Example:
|
<font name="Arial" size="20pt"><b>Formatting Based Markup: Is it useful in a
practical way, or just the Status Quo?</b></font>
<font name="TimesNewRoman" size="12pt">Most conversion technologies leverage
specific combinations of formatting codes to produce their output, or simply
dump the formatting codes discovered in an XML-like syntax.</font>
<font name="Arial" size="10pt">Formatting codes are rarely consistently
applied by authors, nor consistently encoded in data files by authoring
applications.</font>
<font name="TimesNewRoman" size="12pt">Output from conversion technologies
that deliver or rely on formatting-based code is therefore inconsistent. Where
consistency of data is important to long-term management and republishing of
content, formatting-based markup is simply not good enough.</font>
|
Formatting Based Markup: Is it useful in a practical way, or just the Status
Quo?
Most conversion technologies leverage specific combinations of formatting
codes to produce their output, or simply dump the formatting codes discovered
in an XML-like syntax.
Formatting codes are rarely consistently applied by authors, nor consistently
encoded in data files by authoring applications.
Output from conversion technologies that deliver or rely on formatting-based
code is therefore inconsistent. Where consistency of data is important to
long-term management and republishing of content, formatting-based markup is
simply not good enough. |
Structural markup
Markup that represents document structures is not
tied to a specific application, output medium or device.
Common structures, like sections, titles, paragraphs, lists,
tables, etc., transcend specific formatting combinations. Together with a style
sheet for each output device or medium, large volumes of structure-based markup
can be published on demand in a consistent, pleasing way.
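The idea of one style sheet per output medium can be sketched in Python. This is only an illustration under stated assumptions: the two dictionaries stand in for real style sheets (for example XSLT or CSS), the sample content is abbreviated, and the names `HTML_STYLE`, `TEXT_STYLE`, and `render` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Structural markup: the tags say what each piece is, not how it looks.
DOCUMENT = """
<section>
  <title>Structural Markup: A Better Way</title>
  <sectionbody>
    <para>Deriving structure from input data results in common tagging.</para>
    <para>Input formatting is retained.</para>
  </sectionbody>
</section>
"""

# Hypothetical "style sheets": one template per structure, per output medium.
HTML_STYLE = {"title": "<h1>{}</h1>", "para": "<p>{}</p>"}
TEXT_STYLE = {"title": "== {} ==", "para": "  {}"}

def render(xml_source, style):
    """Apply a style sheet to every element it has a template for."""
    root = ET.fromstring(xml_source)
    lines = []
    for element in root.iter():  # document order
        template = style.get(element.tag)
        if template is not None:
            lines.append(template.format((element.text or "").strip()))
    return "\n".join(lines)

print(render(DOCUMENT, HTML_STYLE))
print(render(DOCUMENT, TEXT_STYLE))
```

The source document is never edited to target a new medium. Only a new mapping is added, which is what makes publishing on demand practical for large volumes.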
Additionally, structure-based markup provides the proper
foundation for increasingly sophisticated markup, which also stores information
about the actual meaning of the content, known as semantic markup.
This is the type of markup generated by Exegenix Conversion
Solutions.
Example:
|
<section>
<title font-family="Arial" font-size="20pt" font-weight="bold">Structural
Markup: A Better Way</title>
<sectionbody>
<para font-family="TimesNewRoman" font-size="12pt">Deriving structure from
input data results in output that has common tagging despite a diverse input
dataset.</para>
<para font-family="TimesNewRoman">Input formatting is retained, and is
available for use in post processing or republishing of converted
material.</para>
<para font-family="TimesNewRoman" font-size="12pt">Regardless of minor
variations in input formatting, structure is unambiguous and consistent, and
therefore, a better choice for long-term storage and management of
content.</para>
</sectionbody>
</section>
|
Structural Markup: A Better Way
Deriving structure from input data results in output that has common tagging
despite a diverse input dataset.
Input formatting is retained, and is available for use in post processing or
republishing of converted material.
Regardless of minor variations in input formatting, structure is unambiguous
and consistent, and therefore, a better choice for long-term storage and
management of content. |
Semantic markup
Semantic markup is the most difficult type of
markup to produce, because it requires human intervention to interpret the text
and identify content that matches specific requirements. Often, the person
making the determination must be a subject-matter expert in order to properly
understand what they are reading, and apply the proper semantic tagging.
Consider the example below, and how one would attempt to find
the term "Semantic Tagging". With the semantically tagged content
shown, it is possible to restrict the search to return only those documents
where the search phrase "Semantic Tagging" is found in the
subject of a concept.
Structure-based markup allows for search on structures, like
section title or figure caption, and serves as a
foundation for semantic searches. Combining the two makes it possible to
search for a term in a concept in Chapter 1. Fewer results are returned,
and those results match what was actually wanted.
With formatting-based markup, no context-sensitive search or
indexing is possible. Search queries return hundreds, possibly thousands, of
results, leaving the reader to check every result to find the document they
want.
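The difference between the two kinds of search can be sketched in Python. This is a minimal illustration, not a real indexing system: the three-document corpus is invented for the purpose, the element names follow the concept example in this section, and the function names `keyword_search` and `subject_search` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-corpus, invented for this illustration.
DOCUMENTS = {
    "doc-a": "<concept><subject>Semantic Tagging</subject>"
             "<premise><para>Structural tags form the basis.</para></premise></concept>",
    "doc-b": "<concept><subject>Page Layout</subject>"
             "<premise><para>Unlike Semantic Tagging, layout is visual.</para></premise></concept>",
    "doc-c": "<section><title>Notes</title><sectionbody>"
             "<para>Semantic Tagging is mentioned in passing.</para></sectionbody></section>",
}

def keyword_search(phrase):
    """Match the phrase anywhere in the text, ignoring all markup."""
    return [name for name, xml in DOCUMENTS.items()
            if phrase in "".join(ET.fromstring(xml).itertext())]

def subject_search(phrase):
    """Match the phrase only inside the subject of a concept."""
    hits = []
    for name, xml in DOCUMENTS.items():
        root = ET.fromstring(xml)
        # iter("concept") also matches when the root element is the concept.
        for concept in root.iter("concept"):
            if any(phrase in (s.text or "") for s in concept.findall("subject")):
                hits.append(name)
                break
    return hits

print(keyword_search("Semantic Tagging"))  # ['doc-a', 'doc-b', 'doc-c']
print(subject_search("Semantic Tagging"))  # ['doc-a']
```

The keyword search returns every document that mentions the phrase, while the subject-restricted search returns only the document that is actually about it.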
Semantically tagged output can be costly to achieve, but it makes
possible the most sophisticated content publishing and indexing
applications.
Exegenix Conversion Solutions deliver structural markup
automatically, and provide tools to add semantic value to content during the
conversion process.
Example:
|
<concept>
<subject>Semantic Tagging</subject>
<title font-family="Arial" font-size="20pt" font-weight="bold">Semantic
Tagging: The Ultimate</title>
<premise>
<para font-family="TimesNewRoman" font-size="12pt">Structural tags form the
basis of semantic tagging.</para>
<para>Semantic tagging enables extremely accurate indexing and search.</para>
</premise>
<proof>
<para font-family="TimesNewRoman" font-size="12pt">Consider searching for the
term Semantic Tagging in the subject of a concept, versus a simple keyword
search on the phrase Semantic Tagging across a large dataset.</para>
</proof>
<conclusion>
<para>Semantic Tagging rules!</para>
</conclusion>
</concept>
|
Semantic Tagging: The Ultimate
Structural tags form the basis of semantic tagging.
Semantic tagging enables extremely accurate indexing and search.
Consider searching for the term Semantic Tagging in the subject of a concept,
versus a simple keyword search on the phrase Semantic Tagging across a large
dataset.
Semantic Tagging rules! |