What does acid-free paper for the digital age look like?

Most organisations have archives of information which are vital to their operation. An increasing proportion of this is stored in some digital form or other. What, though, is the right form?

The advent of developments like the World-Wide Web offers at least the possibility for "the paperless office" to move from that sick joke which makes large numbers of trees fall over, toward reality. Documents can, in principle, be called up immediately by any worker anywhere in the world at any time -- or sold directly to the public over networks.

Internet technology changes at a frenzied pace -- remember, the World-Wide Web only broke out of academia less than two years ago, and now Guardian OnLine is spotty with http://s. The Web is already having its first, fairly genteel, row over who sets standards.

How can you avoid your valuable information -- in some cases, effectively the entire value of a company -- being sucked into a gulf of incompatibility?

As Kent Summers, Marketing Manager of Electronic Book Technologies, puts it: "There are classes of document. First, there's the ad-hoc, which is entirely content-centred -- like a plain text email message. Third is material where the presentation is everything -- like your classic content-free Public Relations output. In between are mission-critical documents -- where form and content are both important."

It's these last that matter. For them we need the electronic equivalent of acid-free paper: permanent, portable and presentable.

Plain 8-bit text won't hack it -- especially if your organisation is international and not composed entirely of people who enjoy battling with Microsoft's "code page" kludge.
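
To make the code-page problem concrete, here is a minimal Python sketch. The byte value and the three code pages are illustrative assumptions, not anything taken from a real archive: the point is only that one stored 8-bit byte reads as three different characters, depending on which code page the reader's machine assumes.

```python
import unicodedata

# One stored byte; the value 0xE9 is an arbitrary illustrative choice.
RAW = bytes([0xE9])

# Western Windows, original IBM PC (DOS), and Cyrillic Windows code pages.
CODE_PAGES = ["cp1252", "cp437", "cp1251"]

def readings(raw: bytes) -> dict[str, str]:
    """Decode the same bytes under several 8-bit code pages."""
    return {cp: raw.decode(cp) for cp in CODE_PAGES}

if __name__ == "__main__":
    for cp, text in readings(RAW).items():
        # Print the Unicode character name so the output is plain ASCII.
        print(cp, unicodedata.name(text))
```

Nothing in the file itself says which reading was intended; that knowledge lives outside the document, which is exactly what makes it a poor archive format.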

First, rule out closed proprietary technologies. A tool like Lotus Notes may be very fine for the company that uses it -- but information in Notes databases can only be shared with others who have invested in the right version of Notes. QuarkXPress may be a very fine desk-top publishing program for everyone from the Retired Gentledonkeys' Refuge Association to the Daily Mail -- but have you ever tried extracting nice plain text to export to another application? Look at the guts of a Quark file with a view to hacking the content out, and weep!

At a stroke, this reduces the choices to two families of formats.

The most widely-used family, still, is probably PostScript, Adobe's page description language. Adobe's Acrobat multi-platform presentation format is its close cousin. Though key parts of the software "engines" which process these are proprietary, PostScript itself is a language, and anyone prepared to learn it can write it -- or write other programs to process it. Adobe freely distributes Acrobat viewers for Mac, Windows and Sun operating systems.

So Acrobat looks like a good bet: you can make a document as pretty as you please, and have it read on any common personal computer. Readers can even integrate an Acrobat viewer into their World-Wide Web browser. And unlike, say, Quark, the actual text is there in the file, in order from start to end. It is rather wrapped up in a baffling Reverse Polish Notation language, and if you open a PostScript document as a text file you may have to go down five pages to find the first actual word. But if every Acrobat viewer disappeared from the face of the infosphere, a competent Forth programmer could still extract the content in a day or two.

The volume of documents is such that to find anything useful we need to employ machines to index and catalogue them, and at a pinch do a brute-force search. This is very difficult with PostScript. If someone decides to make the space between "Manchester" and "Guardian" smaller than that between other words, a major programming effort is needed to detect from the PostScript or Acrobat file that the two words are actually adjacent. To deduce the semantics of a PostScript file, you have to guess the designer's intentions.
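
A rough Python sketch of that guessing game follows. It assumes a much-simplified model in which a page description has already been reduced to runs of text with horizontal positions; the `Run` type, the coordinates and the thresholds are all hypothetical, and a real page description is far messier. The indexer has to choose a gap size above which it believes a word break was intended.

```python
from typing import NamedTuple

class Run(NamedTuple):
    """A fragment of text as a page description might place it (hypothetical model)."""
    text: str
    x: float      # left edge, in points
    width: float  # advance width, in points

def join_runs(runs: list[Run], gap_threshold: float) -> str:
    """Rebuild a line, inserting a space wherever the gap between consecutive
    runs exceeds gap_threshold. The threshold is a guess at the designer's intent."""
    out = []
    prev_end = None
    for run in sorted(runs, key=lambda r: r.x):
        if prev_end is not None and run.x - prev_end > gap_threshold:
            out.append(" ")
        out.append(run.text)
        prev_end = run.x + run.width
    return "".join(out)

# "Manchester" and "Guardian" are set tighter (1.5pt gap) than the other words (3pt gap).
LINE = [
    Run("The", 0.0, 18.0),
    Run("Manchester", 21.0, 60.0),
    Run("Guardian", 82.5, 48.0),
    Run("said", 133.5, 22.0),
]

if __name__ == "__main__":
    print(join_runs(LINE, gap_threshold=2.0))
    print(join_runs(LINE, gap_threshold=1.0))
```

With a threshold of 2 points the two tightly set words fuse into one; with 1 point they separate. Neither number is in the file: the program is guessing, and a threshold that suits one designer's page will fail on another's.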

Enter the Standard Generalized Markup Language family -- SGML. Originally developed for formatting legal texts, this was adopted by the US Department of Defense as the standard for the enormous quantities of documentation that its suppliers must produce. It has, perhaps, suffered from this -- remember Ada, the computer language that's DoD as a doornail?

But SGML is in tune with the philosophy of the Unix operating system, which sees the entire universe as a stream of bytes to be piped through processes. An SGML document is just one byte after another, with all the mark-up enclosed in <pointy brackets> so that it's easily ignored by a search program which wants to look only at content. Some of those bytes may be pointers to files containing graphics; but, until computers can search through graphics files to find those depicting Queen Victoria, the real information is in the text.
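
How easily the mark-up can be ignored is worth a short Python sketch. It uses the standard library's `html.parser` as a stand-in for a real SGML parser, which is an assumption of convenience, and the sample document is invented. The search function looks only at content and never sees the tags.

```python
from html.parser import HTMLParser

class ContentOnly(HTMLParser):
    """Collect character data and discard everything inside <pointy brackets>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)

def content_of(document: str) -> str:
    """Return the text of a marked-up document with whitespace normalised."""
    parser = ContentOnly()
    parser.feed(document)
    parser.close()
    return " ".join("".join(parser.chunks).split())

def mentions(document: str, phrase: str) -> bool:
    """Case-insensitive search over content only, never over mark-up."""
    return phrase.lower() in content_of(document).lower()

# Invented sample document.
SAMPLE = "<H2>Archives</H2>\n<P>Victoria was <EM>not</EM> amused.</P>"

if __name__ == "__main__":
    print(content_of(SAMPLE))
    print(mentions(SAMPLE, "not amused"))
```

The phrase "not amused" is found even though a tag sits in the middle of it, and a search for "H2" finds nothing, because the tag names are not part of the content.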

The mark-up in a well-structured SGML document tags the semantics -- the meaning -- of the content: this here is a <H2>chapter header</H2>, that's a <H3>subsidiary header</H3>, and that's <EM>emphasised</EM>. The type style is decided only when it's output or displayed. You might want to output the headers in the typeface Gill Sans for a British audience and something horrid like Avant Garde for the Americans, or to use s p a c i n g as emphasis for Germans. No problem. A few key-strokes to change the "style-sheet" accompanying the document, and it's done.
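
The separation of meaning from appearance can be sketched in a few lines of Python. Everything here is illustrative: the three "style sheets" are plain dictionaries invented for the purpose, the bracketed typeface names merely stand in for real typesetting, and the regular expression copes only with simple, non-nested tags.

```python
import re

# Hypothetical style sheets: each maps a semantic tag to a presentation rule.
STYLE_SHEETS = {
    "british": {"H2": lambda t: f"[Gill Sans] {t}", "EM": lambda t: f"*{t}*"},
    "american": {"H2": lambda t: f"[Avant Garde] {t}", "EM": lambda t: f"*{t}*"},
    "german": {"H2": lambda t: f"[Gill Sans] {t}", "EM": lambda t: " ".join(t)},
}

# Matches <TAG>content</TAG>; nested elements are out of scope for this sketch.
TAGGED = re.compile(r"<(\w+)>(.*?)</\1>", re.S)

def render(document: str, sheet: dict) -> str:
    """Replace each tagged element with its styled form; unknown tags keep bare content."""
    def styled(match: re.Match) -> str:
        tag, content = match.group(1).upper(), match.group(2)
        rule = sheet.get(tag, lambda t: t)
        return rule(content)
    return TAGGED.sub(styled, document)

DOC = "<H2>Acid-free bits</H2> This is <EM>vital</EM>."

if __name__ == "__main__":
    for audience, sheet in STYLE_SHEETS.items():
        print(audience, "->", render(DOC, sheet))
```

The document itself never changes. Only the dictionary handed to `render` does, which is the whole appeal of keeping the semantics in the file and the typography outside it.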

Because SGML is a stream of characters, it will transfer seamlessly to Unicode, the standard for truly international text whose 16-bit characters cover every writing system in use, including Ancient Greek, the Cherokee alphabet, and all three Japanese systems.

And -- don't say it too loud -- the language of the World-Wide Web, Hyper-Text Markup Language, is a "dumbed-down" but perfectly valid version of SGML. Robert Cailliau, one of the two founders of the Web, relates the way HTML got accepted, despite being an Official Standard, to "the Parmentier trick: import the potato plant, and grow it behind fences, behaving as if it is a great secret, so people will go in and steal it. If you say you're going to give them South American plant roots to eat, it won't work."

HTML, even with the extensions which Netscape introduced without consulting the standards committee, presently cramps designers' style. If your documents are stored in SGML, though, it's easy to output new HTML versions which take advantage of any enhancements which come along.

There is, of course, a catch. SGML and HTML are perceived as "tecchy". Writing the mark-up as plain text comes naturally to people who dream in Algol, but not to many others.

A small flood of what-you-see-is-what-you-get HTML editors, from Quarterdeck's WebAuthor to Microsoft's Internet Assistant, is reaching the market just now. Visual SGML editors exist, but are not widely known. Bob Rosenthal, who advised Microsoft on how to deal with Arabic and Hebrew, this week launched Accent, a program which will process words in any language you please, using an SGML-Unicode combination. It marks up what language each passage is in, so that a search program can recognise that the German "Boot" (boat) is not the same as any of the English "Boot" words.
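
Why language mark-up helps a search program can be shown with a small Python sketch. The data structure below is hypothetical and is not Accent's actual format: it simply assumes each passage arrives already labelled with a language code, and builds an index keyed on language and word together.

```python
from collections import defaultdict

# Hypothetical passages, each already labelled with its language.
PASSAGES = [
    ("de", "Das Boot liegt im Hafen"),
    ("en", "Boot the machine, then boot the intruder out"),
]

def build_index(passages):
    """Map (language, lower-cased word) to the numbers of the passages containing it."""
    index = defaultdict(set)
    for number, (lang, text) in enumerate(passages):
        for word in text.split():
            index[(lang, word.strip(".,;:!?").lower())].add(number)
    return index

def find(index, lang, word):
    """Return the passages in which the word occurs in the given language."""
    return sorted(index.get((lang, word.lower()), set()))

if __name__ == "__main__":
    idx = build_index(PASSAGES)
    print(find(idx, "de", "Boot"))
    print(find(idx, "en", "boot"))
```

A search for the German word returns only the German passage, and a search for the English word only the English one. Without the language label, both passages would match both queries.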

And what do you do if you have a massive archive of PostScript-family documents? Hire someone with programming and librarianship skills as a "reverse designer", and be very nice to them, because your future will depend on them.

Then you can start thinking or worrying about Virtual Reality Mark-up Language (yes, it's real!), which is a different picture altogether.


Written: 11 May 1995
An edited and doubtless thus improved version of this article appeared in the Guardian OnLine section.
This version is © copyright 1996 Mike Holderness; moral rights are asserted.

It's scary to think that VRML was an exotic shadow on a newspaper's horizon in May 1995.
