How to render documents

by Jon Roland

This is a discussion of how to convert printed documents into online documents, for those who may want to do it themselves.

The first step is to obtain a clean original. Old books may be found at antiquarian booksellers or booksellers specializing in the kinds of documents you want. There are also facsimile editions that work well. Old books are sometimes blotched or speckled with age, and this can make scanning difficult. Facsimile editions will have whiter paper that will provide more contrast. However, facsimile editions, being copies, will not, in general, have the resolution of the original, and this may cause recognition errors.

The second step is to make sure you have written permission to reprint. In the United States, documents that were published before 1923 are in the public domain and no permission is needed. If published before 1964 they may be in the public domain if the copyright was not renewed, which is likely for scholarly works. Remember that translations are new works, so it is the date of publication of the translation that matters for copyright, or the copyright date if earlier than the publication date, and it is the translator or his publisher from whom one has to obtain permission. Most copyrights of publishers no longer in business have been acquired by a publisher still in business, but sometimes it can take a bit of research to track down the current owner, if any, to obtain permission. For more on this see http://onlinebooks.library.upenn.edu/faq.html.

The third step is to have a good computer, scanner, and software. The computer should have plenty of hard disk space, as each page, scanned at 300 dpi, will take up to about 100 K for the image file, and perhaps 6 K for the edited file. Flatbed scanners work best. Digital cameras can be used, but they will need to have at least 10 megapixels before they are fully satisfactory for that purpose. Scanners should have a resolution of at least 400x400 dpi, and higher resolutions are desirable. Good scanners are now available at low cost. We prefer smaller ones, such as the Epson Perfection 2450 or 3200, or the Casio YC-430, that can be carried into a library, together with a notebook computer, for documents that can only be found in libraries. Larger scanners can be faster, and have scanning plates large enough to scan two facing pages of a book in a single scan, and these may be preferable if portable use is not contemplated.
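As a rough sizing aid, the per-page figures above can be turned into an upper-bound estimate. This is only a sketch: the function name and the 300-page example are illustrative assumptions, and the 100 K and 6 K values are the approximate figures quoted above.

```python
def storage_estimate_kb(pages, image_kb=100, edited_kb=6):
    """Upper-bound disk use in K: one image file plus one edited file per page."""
    return pages * (image_kb + edited_kb)


# A hypothetical 300-page book scanned at 300 dpi.
print(storage_estimate_kb(300))  # 31800
```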

We selected the Epson Perfection 2450 for its speed, quality, and because it has a transparency reading capability built into its cover. However, its maximum interpolated resolution of 12800 dpi is not sufficient to read and recognize microfilm or microfiche. A resolution of at least 38400 dpi will be needed for that. The best scanning solution for microfilm and microfiche is the ImageMouse microform scanner from http://www.eyecom.com/ImageMousePlus.htm, which is portable enough to transport as carry-on luggage if the flight attendant is not too diligent.

The main skill required is proofreading ability. The user needs to be able to spot errors visually, which means knowing how to rapidly recognize errors in spelling and punctuation. The spelling checker won't catch scanning errors that result in a word in the dictionary. If one is good at proofreading and familiar with the material, he can avoid having to compare the recognized text with the original, word by word, and spot the punctuation errors, which are the most common kind and the easiest to overlook. A good supplemental correction tool is Afterscan from http://www.afterscan.com.

You have to be careful in scanning some old books. Pressing them flat against the scanning plate can split the pages from the binding. You want to avoid doing that more than once, which means getting a good scan the first time.

The best OCR software we have found is FineReader from ABBYY (formerly BIT Software). The latest version we can recommend is 6.0, because 7.x doesn't handle small caps properly: it converts them to lower case rather than upper case when the "small caps" attribute is removed, which is not the desired result. See ABBYY's site ( http://www.abbyy.com/ ) or their U.S. branch ( http://www.abbyyusa.com ). We have also used Xerox's TextBridge Pro ( http://www.xerox.com/scansoft/textbridge/ ) and Caere's OmniPage Pro ( http://www.caere.com/ ), which are good packages, but we prefer FineReader for several reasons: it is more accurate, especially on rough originals; it supports multiple languages and character sets; it allows the user to define new ones; and it allows one to train the software to recognize antique characters and ligatures that other programs can't handle. It can also be used to rapidly scan multiple pages of a book-length document, then do the character recognition later without manual intervention, such as overnight. It has a good spelling checker which flags uncertain characters. We save its output as HTML and do further editing and conversion from that format.

Under Windows we use the text editor UltraEdit-32, which is available at a moderate cost ( http://www.idmcomp.com/ ), to do some preliminary editing of the initial HTML output from FineReader. Under Linux we use the Kate editor. We first delete the first line, which specifies the type of HTML used, because we are going to change it when we use the HTML editor. Next we delete spaces after the <P> and before the </P> tags, and move spaces and some kinds of punctuation outside certain kinds of tag pairs like <I></I>, etc.
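These preliminary edits can also be scripted rather than done by hand. The following is a minimal sketch, not the procedure we actually use. It assumes the first line is a DOCTYPE declaration, handles only the <I></I> pair, and treats periods, commas, semicolons, and colons as the punctuation to move. The function name and those choices are illustrative assumptions.

```python
import re


def preliminary_cleanup(html):
    """Apply three preliminary edits to OCR-produced HTML."""
    # 1. Drop the first line if it declares the document type.
    lines = html.splitlines()
    if lines and lines[0].lstrip().upper().startswith("<!DOCTYPE"):
        lines = lines[1:]
    text = "\n".join(lines)

    # 2. Delete spaces just after <P ...> and just before </P>.
    text = re.sub(r"(<p\b[^>]*>)[ \t]+", r"\1", text, flags=re.IGNORECASE)
    text = re.sub(r"[ \t]+(</p>)", r"\1", text, flags=re.IGNORECASE)

    # 3. Move leading spaces, and trailing spaces or punctuation,
    #    outside each <I>...</I> pair (punctuation set is an assumption).
    def move_outside(match):
        open_tag, body, close_tag = match.groups()
        lead = re.match(r"\s*", body).group(0)
        rest = body[len(lead):]
        trail = re.search(r"[\s.,;:]*$", rest).group(0)
        core = rest[:len(rest) - len(trail)]
        return f"{lead}{open_tag}{core}{close_tag}{trail}"

    return re.sub(r"(<i>)(.*?)(</i>)", move_outside, text,
                  flags=re.IGNORECASE | re.DOTALL)


sample = "<P> See <I>Federalist, </I>No. 10. </P>"
print(preliminary_cleanup(sample))  # <P>See <I>Federalist</I>, No. 10.</P>
```

Other tag pairs, such as <B></B>, could be handled by the same pattern, and the result should still be checked by eye.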

The next step is to use a good HTML editor to further edit the pages and reorganize the text into HTML files that are organized by section of the total document. We prefer the Linux editor Quanta Plus, part of the kdewebdev package, although we sometimes use it in conjunction with other editors like OpenOffice http://www.openoffice.org that can save to HTML. At this writing we are also converting all HTML files to XHTML 1.1.

For viewing HTML files we recommend Firefox from http://www.mozilla.org, a better browser than MS Internet Explorer. For viewing SGML and XML documents we recommend Doczilla http://www.doczilla.com/ from Citec http://www.citec.fi/.

In general, a long document like a book should be divided into several files, dictated to some extent by the layout of the original work into books, chapters, sections, or whatever. However, one must balance that natural division by a standard of trying to hold each file below about 30-50K in size, to reduce downloading time for dialup users, and this might mean splitting longer divisions of the book into sections. One should also adopt a file naming convention that enables one to readily recognize the section labels, and that keeps the files in the same listing order as they will appear when organized under a top-level frontispiece and table of contents file. Numbers which are a part of the filename should have leading zeroes to allow a uniform name format that extends up to the maximum number to be used.
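To illustrate the naming convention, here is a minimal sketch that pads section numbers to the width of the largest number, so that an alphabetical listing matches reading order, and flags files over the size target. The "ch" prefix, the ".htm" extension, and the function names are hypothetical choices for illustration.

```python
def section_filenames(prefix, count, ext=".htm"):
    """Return names like ch01.htm ... ch12.htm, padded to the largest number's width."""
    width = len(str(count))
    return [f"{prefix}{n:0{width}d}{ext}" for n in range(1, count + 1)]


def oversized(sizes_in_bytes, limit_kb=50):
    """Return the names whose size exceeds the limit (1 K taken as 1024 bytes)."""
    return [name for name, size in sizes_in_bytes.items() if size > limit_kb * 1024]


names = section_filenames("ch", 12)
print(names[0], names[-1])     # ch01.htm ch12.htm
print(sorted(names) == names)  # True
```

Without the leading zero, ch10.htm would sort ahead of ch2.htm in a directory listing.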

Footnotes should be converted to endnotes at the end of each HTML section file, and renumbered consecutively. The endnote superscripts should be numbers with no spaces following their text, so the exponent doesn't get separated from its text by a line break, and they might be made bold so that they stand out better. Since the exponents are, in general, supplied by the editor and not the original author, we put them in square brackets [], which also works better for the text versions. We recommend using numbers with leading zeroes for the link names, to maintain a uniformity of link naming. Three digits are usually sufficient. Each endnote should be a separate paragraph rather than making them an ordered list. They should be separated from the text body by a horizontal rule, and the endnotes should be reduced in size by one font size increment.
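The endnote conventions above can be generated mechanically. This is a minimal sketch under stated assumptions: the "r" and "n" anchor prefixes, the use of id attributes, and the <small> element for the one-step size reduction are our illustrative choices, not the only way to do it.

```python
def endnote_marker(n):
    """Bold, bracketed superscript link, to be placed directly after the annotated text."""
    return f'<sup><b><a id="r{n:03d}" href="#n{n:03d}">[{n}]</a></b></sup>'


def endnote_paragraph(n, text):
    """One endnote as its own paragraph, linking back to its marker."""
    return f'<p><small><a id="n{n:03d}" href="#r{n:03d}">[{n}]</a> {text}</small></p>'


def endnote_section(notes):
    """Horizontal rule followed by consecutively numbered endnotes."""
    parts = ["<hr>"]
    for n, text in enumerate(notes, start=1):
        parts.append(endnote_paragraph(n, text))
    return "\n".join(parts)


print(endnote_marker(7))
# <sup><b><a id="r007" href="#n007">[7]</a></b></sup>
```

Because the marker is emitted with no leading space, it stays attached to the word it follows when the line wraps.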

Following the endnotes, we recommend links to the Previous and Next pages in the document, to the Contents, to the Text Version of the file, and perhaps to the top or home page of the site and perhaps an intermediate level page.

We also recommend placing link tags before each section or paragraph in the page, so that references to the document from the index or other documents can go directly to the right section or paragraph. The index will have to be edited to replace page references with section and paragraph references.
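Adding a numbered anchor to every paragraph is tedious by hand, so it is a natural candidate for a script. This is a minimal sketch: the "p" prefix, the three-digit numbering, and the function name are assumptions. It places the anchor just inside each opening <P> tag rather than before it, an illustrative choice that keeps the markup valid under strict document types.

```python
import re
from itertools import count


def number_paragraphs(html, prefix="p"):
    """Insert a sequentially numbered anchor at the start of every paragraph."""
    counter = count(1)

    def add_anchor(match):
        return f'{match.group(0)}<a id="{prefix}{next(counter):03d}"></a>'

    return re.sub(r"<p\b[^>]*>", add_anchor, html, flags=re.IGNORECASE)


print(number_paragraphs("<P>One</P>\n<P>Two</P>"))
# <P><a id="p001"></a>One</P>
# <P><a id="p002"></a>Two</P>
```

An index entry can then link to, for example, the second paragraph of a file by appending #p002 to the file name.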

Unless you plan to upload a partial rendering, you may want to do the Contents last. When done, test all the links from the Contents page, and also all the Next and Previous links at the end of each page.
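Link testing can be partly automated for links between local files. The sketch below uses only the Python standard library. It checks that each relative link on a page points to an existing file and, where the link names an anchor, that the anchor exists in the target. External links are skipped, and the class and function names are illustrative assumptions.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urldefrag


class LinkCollector(HTMLParser):
    """Collect link targets and anchor names from one HTML file."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        for key in ("id", "name"):
            if attrs.get(key):
                self.anchors.add(attrs[key])


def scan(path):
    collector = LinkCollector()
    collector.feed(Path(path).read_text(encoding="utf-8", errors="replace"))
    return collector


def broken_links(page):
    """Return the local links on a page whose file or anchor is missing."""
    page = Path(page)
    problems = []
    for href in scan(page).hrefs:
        if "://" in href or href.startswith("mailto:"):
            continue  # external links are outside the scope of this sketch
        target, fragment = urldefrag(href)
        target_path = page.parent / unquote(target) if target else page
        if not target_path.is_file():
            problems.append(href)
        elif fragment and fragment not in scan(target_path).anchors:
            problems.append(href)
    return problems
```

Running broken_links on the Contents file and then on each section file covers the Contents, Next, and Previous links. It does not replace clicking through the pages in a browser, which also confirms that each link goes to the intended place.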

To create the text version, which is needed by many people who will want to work with the documents, we recommend you view the HTML file with a browser, the window sized so that the column width is about that of the intended column width of the text file, and then copy and paste the entire file to a text editor. We then use the Linux editor Kate or the Windows editor UltraEdit-32 to format the paragraphs of the file we pasted in from the browser to a column width of 72. One should also look for additional scanning errors, which may have been missed in doing the edits using the OCR and HTML editors. What usually shows up at this stage are errors in punctuation. Any errors found are then corrected in both the text and HTML files.
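The reformatting to a column width of 72 can also be done with a short script instead of an editor command. This is a minimal sketch using Python's standard textwrap module. It assumes paragraphs in the pasted text are separated by blank lines, and the function name is illustrative.

```python
import re
import textwrap


def reflow(text, width=72):
    """Re-wrap each blank-line-separated paragraph to the given column width."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    return "\n\n".join(wrapped) + "\n"
```

Note that this joins every line within a paragraph, so verse, tables, and other preformatted passages would need to be excluded and handled by hand.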

As a final step, you will probably want to concatenate the separate text files for each section into a single, long text file for the entire document, which is preferred by some people, and perhaps also convert that long text file into one or more ZIP or self-extracting executable files, to facilitate downloading. You may want to split these compressed files into files not exceeding 1.4 MB in size to allow them to be saved and distributed on floppy disks.
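The concatenation, compression, and splitting steps can be sketched with the Python standard library. The function names and the 1,400,000-byte chunk size are assumptions for illustration. The split pieces are plain byte slices of the archive, so they must be joined back together in order before the archive can be opened.

```python
import zipfile
from pathlib import Path


def concatenate(section_files, combined_path):
    """Join the section text files, in the order given, into one long file."""
    with open(combined_path, "w", encoding="utf-8") as out:
        for name in section_files:
            out.write(Path(name).read_text(encoding="utf-8"))
            out.write("\n")


def zip_file(path, archive_path):
    """Compress a single file into a ZIP archive."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
        z.write(path, arcname=Path(path).name)


def split_file(path, chunk_bytes=1_400_000):
    """Write numbered pieces (.001, .002, ...) no larger than chunk_bytes each."""
    data = Path(path).read_bytes()
    parts = []
    for start in range(0, len(data), chunk_bytes):
        part = Path(f"{path}.{start // chunk_bytes + 1:03d}")
        part.write_bytes(data[start:start + chunk_bytes])
        parts.append(part)
    return parts
```

A typical sequence would be concatenate(names, "book.txt"), then zip_file("book.txt", "book.zip"), then split_file("book.zip") only if the archive is too large for one disk. The file names here are hypothetical.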

You may want to further render these files into other formats, such as Word, WordPerfect, RTF, PDF, DocBook, or TeX. We export to PDF to provide files that will print out properly for those who want print editions. Remember to observe the file extension name conventions so that browsers can recognize what kinds of files they are and how to view or download them.

Don't forget to include any needed copyright and permission notices, credits, and disclaimers.

Productivity tips

One way to improve productivity, for documents that have uniform margins, is to make use of the scanner's control utility to mask out the extraneous text, such as page numbers and headers. One can then just scan each page as rapidly as one's manual dexterity permits. On the scanner we use, a dextrous user can scan a page, or two pages of an open book, in about 35 seconds, hitting the scan button for the next page before the scan head returns to its parked position.

To provide an example of how long it takes to render a document, we timed the rendering of 100 pages of a 300-page book with 2-column pages of 56 lines each, 44 characters per line, in fairly sharp print. The computer had a 2.0 GHz Pentium 4 processor running Windows XP. Scanner resolution was set at 300 dpi. Scanning took 60 minutes, using the mask described above. Recognition took 10 minutes. There was an average of about 20 errors per page, mainly footnote superscripts. Correcting them and saving the document as an HTML file took 200 minutes. Editing with the HTML editor took 300 minutes, and mostly consisted of rearranging the text into separate files for each section, changing the footnotes to endnotes, and creating the links from the superscripts to the endnotes. Creating the text files and making a few additional corrections took 200 minutes. In other words, it took about 8 minutes per page to do everything, which would be about 40 man-hours to render a 300-page book. Individual productivity will vary.
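The arithmetic behind those round figures, worked from the stage timings above, comes out as follows. The variable names are ours. The exact results are 7.7 minutes per page and 38.5 hours, which round to the 8 minutes and 40 hours quoted.

```python
# Minutes spent on each stage for the 100-page sample.
stage_minutes = {
    "scanning": 60,
    "recognition": 10,
    "correcting OCR output": 200,
    "HTML editing": 300,
    "text files": 200,
}
sample_pages = 100
book_pages = 300

total_minutes = sum(stage_minutes.values())
minutes_per_page = total_minutes / sample_pages
book_hours = total_minutes * book_pages / sample_pages / 60

print(total_minutes)     # 770
print(minutes_per_page)  # 7.7
print(book_hours)        # 38.5
```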

If one has a network of computers and more than one person available to do the work, FineReader can be used with one person on one computer doing the scanning, saving the image files to a shared drive, a second computer doing the recognizing as each scanned image file becomes available, and up to about six additional computers and personnel doing the editing at a rate that could keep up with the scanning. Of course a lot depends on the quality of the original, and rougher originals will result in more errors and require more time to edit them.

Productivity will be less in the beginning, due to factors such as adding new words to the dictionary and new symbols to the training database. One may also need some time to become familiar with the material and learn to recognize the most common kinds of errors that will occur in scanning that material.

Conversion from one format to another

Text to HTML

The best tool we have found for doing this is AscToHtm from http://www.jafsoft.com. It is not perfect, but does the bulk of the work, leaving the editor with only minor cleanup.

PDF to Text

Adobe Acrobat (http://www.adobe.com) supports converting from various file formats into PDF, but does not support conversion from PDF or PostScript to other formats. The best Windows tool we have found for converting PDF character files is an Acrobat Reader plugin called PDFtoAll from http://www.pdftoall.com/, which can save to HTML.

FineReader can also load PDF files like any other image files, if they are not password-secured, but then one must go through the recognition process. Sometimes this can be the best and fastest solution for PDF image files, that is, those that are just images in a PDF wrapper, rather than PDF-encoded character files. However, FineReader produces huge TIFF files from them, so we use another tool to break out the images into separate TIFF files for loading into FineReader.

Multipage PDF to separate TIFF

In Windows, which we still use because FineReader doesn't have a Linux version yet, we use a tool called PDF2TIF from http://www.verypdf.com/ which breaks out the images in a multipage PDF file into separate image files and converts them to the standard TIFF format, which is ideal for opening in FineReader. Of course, this will also work for a single-page PDF image file.

MS Word to other formats

Saving an MS Word document as an HTML file produces terrible HTML. It works better to use WordPerfect from Corel http://www.corel.com to aid in the conversion. First save the MS Word file in RTF or Word 6.0/7.0 format. Then load it into WordPerfect, do any further needed editing, and save it as an HTML document. Edit the HTML document to get rid of the extraneous things like duplicated endnote numbers, breaks, font size settings, etc. Before saving as a text file, set the left and right margins to, say, 2.0" so that the column width will be about 72, and save as "DOS text". Then clean up the text version with the text editor. Saving to RTF or PostScript, however, is best done directly from the MS Word file using MS Word.

There is also a version of WordPerfect 8 for Linux. See http://constitution.org/comp/linux.htm.

Additional tools

It is sometimes necessary to clean up bad HTML, such as that produced by Microsoft Word and FrontPage. There is a good freeware tool for this called Tidy.

There are also a number of tools at http://constitution.org/intrtool.htm which are highly useful for managing web sites where digital documents might be published.

Editorial recommendations

One of the first decisions you will want to make is whether or to what extent to follow the formats of the original printed work in designing the HTML version. Our recommendation is to do so when it is relevant to scholarship, but not necessarily just to conform to decisions made by the typesetter rather than by the original author. We recommend correcting obvious typographical errors and misspellings. If you are concerned about scholarly precision, you might want to include a table of changes to indicate where you have changed things.

The parts of the original work which you may want to modify most are the Table of Contents and the Index. Obviously, the Table of Contents may or may not include page numbers, but if it does, or you wish to preserve page numbers for the convenience of cites to pages, you may want to insert page numbers in the body of the text in braces {}. You need to decide whether you want to show links on the Contents page to just the HTML pages or also to the text or other versions, and if the latter, whether you want to put the links on things like colored buttons rather than on the division numbers or on the title itself. You can set border="0" on the button images if you don't want the buttons surrounded by a visible border that changes color after the page linked to has been accessed and is in the cache. You may also want to expand the Contents to include sections, if they were omitted from the original. If your division of the document into files does not correspond exactly to the divisions in the Contents, you will want to modify the Contents to show links to each file, and provide some kind of title for each such file. You may want to enclose such added material in square brackets [] to indicate that it was added by the editor. You may also want to eliminate the default underscoring of text links by inserting a style rule to that effect into your HTML document, and perhaps set the default vlink attribute for visited links to some other value. We like #CC0000.

We recommend that endnote size be reduced by 1 unit, and that relative and not absolute font size be used.

Some people like to put endnotes in a separate file and display them in a separate frame. We recommend against this. It is easier for people to download the endnotes in the same file as the text, and keep them together. To make sure a linked endnote is positioned at the top of the window, append at least 24 break <BR> tags at the end of the file, because most browsers will not scroll past the bottom of the file, and the last few endnotes would otherwise land lower in the window.

If it is likely that the document will be referenced by other documents down to the paragraph level, we recommend that each paragraph or other such element be preceded by a bookmark tag, numbered in sequence, such as by paragraph or even sentence or clause number, and that references to the element use a structural reference of the form bb:cc:ss:pp:ss, or some substring thereof, where bb is the book or volume number, cc is the chapter number, the first ss is the section number, pp is the paragraph number, and the final ss is the sentence or clause number. In an index this reference string can be shown for the benefit of those who might want to find the same element in an original printed document.
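As an illustration of the structural reference, the sketch below builds the reference string and a matching bookmark tag. Several details are assumptions on our part: two-digit zero padding, references that keep only the leading components (book, then chapter, and so on), and a hypothetical "ref-" id scheme that replaces the colons with hyphens.

```python
def structural_ref(*numbers):
    """Format up to five numbers (book, chapter, section, paragraph, sentence) as bb:cc:ss:pp:ss."""
    if not 1 <= len(numbers) <= 5:
        raise ValueError("expected between one and five components")
    return ":".join(f"{n:02d}" for n in numbers)


def bookmark(*numbers):
    """Bookmark tag for the element, using a hypothetical 'ref-' id scheme."""
    ref = structural_ref(*numbers)
    return f'<a id="ref-{ref.replace(":", "-")}"></a>'


print(structural_ref(1, 3, 2, 7))  # 01:03:02:07
print(bookmark(1, 3, 2, 7))        # <a id="ref-01-03-02-07"></a>
```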

If the primary document contains references to another document on the same site in an endnote, we recommend keeping the original page numbers used by the author in the printed document, but appending a structural reference as above in square brackets with a link to the element in the referenced document.

For scholarly documents, we recommend keeping the HTML simple, avoiding cascading style sheets, and omitting graphics or other material that may identify the publishing organization but that distract the reader and slow downloads.

Useful links

Greek Ligatures — Tables of ancient and medieval character combinations and abbreviations, useful for rendering manuscripts into Greek fonts.
Distributed Proofreaders — Get volunteer help for the final stages of rendering public domain works.
639-1 Language Codes — Use these to encode foreign languages in HTML pages.
639-2 Language Codes — Use these to encode foreign languages in HTML pages.

Original URL: http://constitution.org/lg/render.htm
Maintained: Jon Roland of the Constitution Society
Original date: 1997/06/20