What does acid-free paper for the digital age look like?
Most organisations have archives of information which are
vital to their operation. An increasing proportion of this
is stored in some digital form or other. What, though, is
the right form?
The advent of developments like the World-Wide Web offers at
least the possibility for "the paperless office" to move
from that sick joke which makes large numbers of trees fall
over, toward reality. Documents can, in principle, be called
up immediately by any worker anywhere in the world at any
time -- or sold directly to the public over networks.
Internet technology changes at a frenzied pace -- remember,
the World-Wide Web only broke out of academia less than two
years ago, and now Guardian OnLine is spotty with http://s.
The Web is already having its first, fairly genteel, row
over who sets standards.
How can you avoid your valuable information -- in some
cases, effectively the entire value of a company -- being
sucked into a gulf of incompatibility?
As Kent Summers, Marketing Manager of Electronic Book
Technologies, puts it: "There are three classes of document.
First, there's the ad-hoc, which is entirely content-centred
-- like a plain text email message. Third is material where
the presentation is everything -- like your classic content-
free Public Relations output. In between are mission-
critical documents -- where form and content are both
important."
It's these last that matter. For them we need the electronic
equivalent of acid-free paper: permanent, portable and
presentable.
Plain 8-bit text won't hack it -- especially if your
organisation is international and not composed entirely of
people who enjoy battling with Microsoft's "code page"
kludge.
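To make the code-page problem concrete, here is a minimal Python sketch. The single byte is an invented example, and the three code pages are chosen purely for illustration: the same stored byte comes out as a different character depending on which code page the reader's machine assumes.

```python
# One stored byte, read under three different 8-bit code pages.
raw = bytes([0xE9])  # a single byte from a hypothetical 8-bit text file


def readings(data):
    """Decode the same bytes under several code pages."""
    pages = ["cp1252", "cp437", "cp850"]  # illustrative choice of pages
    return {page: data.decode(page) for page in pages}


result = readings(raw)
for page, text in result.items():
    print(page, repr(text))
```

Nothing in the file itself says which reading was intended, which is why plain 8-bit text travels so badly between countries.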
First, rule out closed proprietary technologies. A tool like
Lotus Notes may be very fine for the company that uses it --
but information in Notes databases can only be shared with
others who have invested in the right version of Notes.
QuarkXPress may be a very fine desk-top publishing program
for everyone from the Retired Gentledonkeys' Refuge
Association to the Daily Mail -- but have you ever tried
extracting nice plain text to export to another application?
Look at the guts of a Quark file with a view to hacking the
content out, and weep!
At a stroke, this reduces the choices to two families of
formats.
The most widely-used family, still, is probably PostScript,
Adobe's page description language. Adobe's Acrobat multi-
platform presentation format is its close cousin. Though key
parts of the software "engines" which process these are
proprietary, PostScript itself is a language, and anyone
prepared to learn it can write it -- or write other programs
to process it. Adobe freely distributes Acrobat viewers for
Mac, Windows and Sun operating systems.
So Acrobat looks like a good bet: you can make a document as
pretty as you please, and have it read on any common
personal computer. Readers can even integrate an Acrobat
viewer into their World-Wide Web browser. And unlike, say,
Quark, the actual text is there in the file, in order from
start to end. It is rather wrapped up in a baffling Reverse
Polish Notation language, and if you open a PostScript
document as a text file you may have to go down five pages
to find the first actual word. But if every Acrobat viewer
disappeared from the face of the infosphere, a competent
Forth programmer could still extract the content in a day or
two.
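As a rough illustration of that rescue job, the sketch below assumes the simplest possible case: text stored as parenthesised strings, each followed by the `show` operator. The sample page is made up, and real files are a good deal nastier (escaped or nested parentheses, kerning operators, re-encoded fonts, compressed streams in Acrobat files), so treat this as a starting point rather than a working converter.

```python
import re

# A made-up fragment in the style of a very simple PostScript page.
SAMPLE = """%!PS
/Times-Roman findfont 12 scalefont setfont
72 720 moveto (Acid-free paper) show
72 700 moveto (for the digital age) show
showpage
"""

# A string literal in parentheses, followed by the show operator.
# Assumption: no nested or escaped parentheses inside the string.
SHOW = re.compile(r"\(([^()\\]*)\)\s*show\b")


def extract_text(postscript):
    """Return the shown strings in the order they appear in the file."""
    return SHOW.findall(postscript)


print(extract_text(SAMPLE))
```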
The volume of documents is such that to find anything useful
we need to employ machines to index and catalogue them, and
at a pinch do a brute-force search. This is very difficult
with PostScript. If someone decides to make the space
between "Manchester" and "Guardian" smaller than that
between other words, a major programming effort is needed to
detect from the PostScript or Acrobat file that the two
words are actually adjacent. To deduce the semantics of a
PostScript file, you have to guess the designer's
intentions.
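The guessing game can be sketched in a few lines of Python. Everything here is hypothetical: the `Fragment` record, the positions and widths, and the thresholds are invented to show the shape of the problem, not how any real extractor works. A page-description file records where each piece of text sits, so the program has to infer word breaks from the gaps.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float      # left edge, in points
    width: float  # width of the set text, in points


def join_fragments(fragments, space_threshold):
    """Rebuild a line, inserting a space wherever the gap between
    consecutive fragments is at least space_threshold points."""
    ordered = sorted(fragments, key=lambda f: f.x)
    if not ordered:
        return ""
    out = ordered[0].text
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.x - (prev.x + prev.width)
        out += (" " if gap >= space_threshold else "") + cur.text
    return out


line = [
    Fragment("The", 72.0, 18.0),         # ends at x = 90
    Fragment("Manchester", 93.0, 60.0),  # gap of 3 points, ends at 153
    Fragment("Guardian", 154.0, 48.0),   # designer tightened the gap to 1
]

print(join_fragments(line, space_threshold=2.5))  # words run together
print(join_fragments(line, space_threshold=0.5))  # words separated
```

Lowering the threshold rescues this line, but the same setting would break up a headline set with wide letter-spacing. No single number is right for every designer.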
Enter the Standard Generalized Markup Language family --
SGML. Originally developed for formatting legal texts,
this was adopted by the US Department of Defense as the
standard for the enormous quantities of documentation that
its suppliers must produce. It has, perhaps, suffered from
this -- remember Ada, the computer language that's DoD as a
doornail?
But SGML is in tune with the philosophy of the Unix
operating system, which sees the entire universe as a stream
of bytes to be piped through processes. An SGML document is
just one byte after another, with all the mark-up enclosed
in <pointy brackets> so that it's easily ignored by a search
program which wants to look only at content. Some of those
bytes may be pointers to files containing graphics; but,
until computers can search through graphics files to find
those depicting Queen Victoria, the real information is in
the text.
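A search program that ignores the mark-up might look something like the following sketch. It assumes that anything between a `<` and the next `>` is a tag, which glosses over comments, entities and other corners of real SGML; the sample document and its `FIG` element are invented for the example.

```python
import re

# Anything between pointy brackets is treated as mark-up.
# Assumption: no ">" characters hidden inside attribute values.
TAG = re.compile(r"<[^>]*>")


def content_only(marked_up):
    """Drop the mark-up and keep the content."""
    return TAG.sub("", marked_up)


def find_in_content(marked_up, phrase):
    """Case-insensitive search that looks only at the content."""
    return phrase.lower() in content_only(marked_up).lower()


doc = '<H2>Queen <EM>Victoria</EM> in Manchester</H2> <FIG SRC="victoria.gif">'

print(content_only(doc))
print(find_in_content(doc, "Queen Victoria"))  # the tag in the middle is ignored
print(find_in_content(doc, "victoria.gif"))    # file names in tags are not content
```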
The mark-up in a well-structured SGML document tags the
semantics -- the meaning -- of the content: this here is a
<H2>chapter header</H2>, that's a <H3>subsidiary header</H3>, and that's
<EM>emphasised</EM>. The type style is decided only when it's output
or displayed. You might want to output the headers in the
typeface Gill Sans for a British audience and something
horrid like Avant Garde for the Americans, or to use
s p a c i n g as emphasis for Germans. No problem. A few
key-strokes to change the "style-sheet" accompanying the
document, and it's done.
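The separation of meaning from appearance can be sketched as a toy renderer. This is an illustration only: the two style-sheets are hypothetical Python dictionaries, plain-text effects stand in for typefaces, and nested elements are not handled. It bears no relation to the style-sheet format of any real SGML system.

```python
import re

# Matches one non-nested element of the three kinds used in the example.
ELEMENT = re.compile(r"<(H2|H3|EM)>(.*?)</\1>", re.S)


def spaced(text):
    """Letter-space the text, as in German emphasis."""
    return " ".join(text)


# Hypothetical style-sheets: each maps a tag to a rendering function.
BRITISH = {"H2": str.upper, "H3": str.title, "EM": lambda t: "*" + t + "*"}
GERMAN = dict(BRITISH, EM=spaced)  # same headers, different emphasis


def render(document, style_sheet):
    """Replace each tagged element with its styled text."""
    return ELEMENT.sub(lambda m: style_sheet[m.group(1)](m.group(2)), document)


doc = "<H2>Archives</H2> must be <EM>kept</EM>."

print(render(doc, BRITISH))
print(render(doc, GERMAN))
```

The stored document never changes; only the dictionary handed to `render` does.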
Because SGML is a stream of characters, it will transfer
seamlessly to Unicode, the standard for truly international
text whose 16-bit characters cover every writing system in
use, including Ancient Greek, the Cherokee syllabary, and all
three Japanese systems.
And -- don't say it too loud -- the language of the World-
Wide Web, Hyper-Text Markup Language, is a "dumbed-down" but
perfectly valid version of SGML. Robert Cailliau, one of the
two founders of the Web, relates the way HTML got accepted,
despite being an Official Standard, to "the Parmentier
trick: import the potato plant, and grow it behind fences,
behaving as if it is a great secret, so people will go in
and steal it. If you say you're going to give them South
American plant roots to eat, it won't work."
HTML, even with the extensions which Netscape introduced
without consulting the standards committee, presently cramps
designers' style. If your documents are stored in SGML,
though, it's easy to output new HTML versions which take
advantage of any enhancements which come along.
There is, of course, a catch. SGML and HTML are perceived as
"tecchy". Writing the mark-up as plain text comes
naturally to people who dream in Algol, but not to many
others.
A small flood of what-you-see-is-what-you-get HTML editors,
from Quarterdeck's WebAuthor to Microsoft's Internet
Assistant, is reaching the market just now. Visual SGML
editors exist, but are not widely known. Bob Rosenthal, who
advised Microsoft on how to deal with Arabic and Hebrew,
this week launched Accent, a program which will process
words in any language you please, using an SGML-Unicode
combination. It marks up what language each passage is in,
so that a search program can recognise that the German
"Boot" (boat) is not the same as any of the English
"Boot" words.
And what do you do if you have a massive archive of
PostScript-family documents? Hire someone with programming
and librarianship skills as a "reverse designer", and be
very nice to them, because your future will depend on them.
Then you can start thinking or worrying about Virtual
Reality Mark-up Language (yes, it's real!), which is a
different picture altogether.