Forty-Five Years Of Digitizing Ebooks - Project Gutenberg's Practices


Title: Forty-Five Years Of Digitizing Ebooks - Project Gutenberg's Practices
Author: Newby, Gregory B.
Language: English

*** Start of this LibraryBlog Digital Book "Forty-Five Years Of Digitizing Ebooks - Project Gutenberg's Practices" ***

FORTY-FIVE YEARS OF DIGITIZING EBOOKS

By Gregory B. Newby

ABSTRACT

(eBooks). This document offers elements of the story of Project
surrounding procedures for making them as widely available as possible.
and accessible.

HISTORICAL ROOTS

S. Hart had been granted access to a powerful mainframe computer at
the University of Illinois at Urbana-Champaign, and realized that his
greatest impact would be by digitizing and distributing free literature

Michael took a printed copy of the United States Declaration of
where he sat at the teletype terminal and typed this first eBook. He
distributed it via email to the people he knew about via the Internet's
predecessor, Arpanet, which was available at UIUC. At that moment, the
first eBook had been freely distributed to the online community of the
day.

Digitization and production techniques, at the time of this first eBook,
were ad hoc and informal. A single eBook producer would edit a single
file, from a single source. The first eBook's printed source was a
single sheet of paper, without hyphenation, a book cover, images, or
other characteristics of book-length sources. In 1971, capitalization
was not an issue, as only upper case letters were available in the
character set used by the system.

Figure 1: Top view of a Model 33 Teletype, salvaged from the computer
laboratory where Michael Hart typed the first eBook. The paper roll was
where output would be printed.

[Illustration: 0002]

During the next twenty years, from approximately 1971 to 1991, techniques
of digitization were dramatically improved and regularized. Ongoing
developments since then have tracked the available technologies for
eBook creation and use, as well as preferences and interests of the many
volunteers who would produce those eBooks.

refined and clearly articulated, have remained flexible (see the
).

EMPHASIS ON THE PUBLIC DOMAIN

free and unencumbered redistribution of literary works. Access to
literary works enables literacy, which in turn opens the door to
education and, it is hoped, opportunity. Interest in literary works
that could be freely redistributed led to an emphasis on books and other
items that are in the public domain.

The public domain is, today, understood to be those items that are not
operates, is defined as a temporary monopoly by authors (or their
agents), in order to benefit from commercial potential and thereby
foster continued creation:

"To promote the Progress of Science and useful Arts, by securing for
limited Times to Authors and Inventors the exclusive Right to their
respective Writings and Discoveries" (United States Constitution,

ITEMS ARE IN THE PUBLIC DOMAIN FOR ONE OF THREE REASONS

1. They are ineligible for copyright. In the US, this includes works
created by the US Government;

2. Their copyright term has expired; or

3. They are granted to the public domain by the creator or their agent
(i.e., the rights holder).

Because of its emphasis on literary works, Project Gutenberg has mostly
focused on items for which the copyright term has expired. Until 1998,
this included items published 75 years earlier. For example, items from
1920 entered the public domain when their copyrights expired in 1995.
The US Copyright Term Extension Act of 1998 changed the term to 95 years
for most literary works, so new items (from 1923 onwards) will not enter
the public domain before 2019.
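
The term arithmetic above can be sketched in a few lines of Python. This
is only an illustration, not Project Gutenberg's clearance procedure: the
function names are hypothetical, and the sketch assumes the simplified
rule that a US copyright runs through the end of the calendar year that
falls a fixed number of years after publication.

```python
def copyright_expiry_year(publication_year: int, term_years: int) -> int:
    """Year in which the copyright term ends (simplified fixed-term rule)."""
    return publication_year + term_years


def first_public_domain_year(publication_year: int, term_years: int) -> int:
    """First full calendar year in which the work is in the public domain."""
    return copyright_expiry_year(publication_year, term_years) + 1


# 75-year term: the copyright on a 1920 item expires in 1995.
print(copyright_expiry_year(1920, 75))
# 95-year term: a 1923 item is not in the public domain before 2019.
print(first_public_domain_year(1923, 95))
```

Works with other histories (unpublished items, non-renewed items, and so
on) follow different rules and are not covered by this sketch.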

[Illustration: 0003]

Figure 2: Michael Hart's sunroom workspace in his Urbana home

There are over one million published works from before 1923, and
these are the main items that Project Gutenberg continues to digitize
and distribute. In addition, there were approximately one million works
published in the United States from 1923 to 1964 but not renewed. Those
items entered the public domain when their first copyright term ended,
28 years after publication. The copyright procedures utilized are online

COLLECTION DEVELOPMENT POLICY AND EARLY MARKUP

The eBook collection, and all other aspects of Project Gutenberg, relies
on volunteers to grow. Therefore, selection of items is done mainly
by volunteers. Project Gutenberg seeks to limit duplication in the
collection, and instead prefers to add items not already in the
collection. Improvements to existing items are ongoing, mainly when
errata reports are submitted by readers.

It took over two decades to release the first 100 eBooks, with #100
being published in 1994. Most of those first eBooks were collected
through personal interaction with Hart. He would guide or participate in
the digitization process, often developing procedures to deal with new
characteristics. Footnotes and endnotes, italics and underscores, bold
text, and different fonts all presented challenges for representation as
plain text. Primitive markup techniques were developed, such as using an
underscore character to surround underscored text, _like this_.
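
As an illustration of how such lightweight markup can later be carried
into richer formats, the Python sketch below converts underscore-delimited
spans to HTML. The function name is hypothetical, and rendering the spans
as <i> elements is an assumption made for the example, not a documented
Project Gutenberg rule.

```python
import html
import re


def underscores_to_html(text: str) -> str:
    """Convert _marked spans_ in plain text to <i>...</i> in HTML."""
    # Escape &, < and > first so the plain text cannot inject markup.
    escaped = html.escape(text, quote=False)
    # A span is an underscore, one or more non-underscore characters on
    # the same line, and a closing underscore.
    return re.sub(r"_([^_\n]+)_", r"<i>\1</i>", escaped)


print(underscores_to_html("Emphasis is marked _like this_ & nothing more."))
```

A pattern this simple will also match identifiers such as file_name_here,
so a real converter would need more context than a single regular
expression provides.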

It was not until the mid-1990s that hypertext markup language (HTML) was
first used, and at the time it was decided that Project Gutenberg eBooks
should be wholly self-contained. A zip file would include all of the
needed images, and external links were discouraged.

Throughout the entire history of Project Gutenberg, volunteers have been
encouraged to work on items they are interested in, and to make their
own decisions about how to best represent the content.

PROOFREADING

The first eBooks were created by typing the text of printed books into
word processor or text editing programs, and then submitting the files
for final formatting and redistribution. Typists would perform basic
formatting, including:

Omitting page headers/footers and pagination;

Spelling correction (spelling modernization was optional,
and some transcribers preferred to leave the original
spelling);

De-hyphenation;

Relocating any footnotes to endnotes;

Adding basic markup or emphasis, as described above;

Standard formatting for headings and chapters. Chapter
titles would have two blank lines before, and one blank line
after;

Line and paragraph formatting, including line endings with
a carriage return + line feed at approximately 72 characters,
no paragraph indentation (unless it is a block quote or
similar), and a blank line between paragraphs.
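
Several of these conventions are mechanical enough to sketch in code. The
following Python example is an illustration only. The function names are
hypothetical, and the hyphen handling is deliberately naive, since a
line-ending hyphen may belong to a genuinely hyphenated word. It joins
words split across line breaks, re-wraps each paragraph at 72 characters,
separates paragraphs with a blank line, and ends every line with a
carriage return + line feed.

```python
import re
import textwrap


def dehyphenate(raw: str) -> str:
    """Join words split by a hyphen at a line break (naive rule)."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", raw)


def format_plain_text(raw: str, width: int = 72) -> str:
    """Re-wrap paragraphs to `width` columns with CRLF line endings."""
    joined = dehyphenate(raw)
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [
        " ".join(p.split()) for p in re.split(r"\n\s*\n", joined) if p.strip()
    ]
    wrapped = [textwrap.fill(p, width=width) for p in paragraphs]
    # No paragraph indentation; one blank line between paragraphs.
    body = "\r\n\r\n".join(w.replace("\n", "\r\n") for w in wrapped)
    return body + "\r\n"


sample = "The first eBooks were cre-\nated by typing.\n\nSecond paragraph."
print(repr(format_plain_text(sample)))
```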

Plain text eBooks, which were the only major format until HTML became
more frequent by the mid- to late-1990s, were designed to be viewed on
computer monitors with fixed-width fonts with 80-character lines. Plain
text is still provided for nearly all Project Gutenberg eBooks today,
although HTML and other formats are also provided.

Once an item is typed into an electronic file, and basic formatting
is completed, one or more rounds of proofreading will help to improve
quality by catching typos, poor formatting, and inconsistency of
presentation. In practice, all eBooks published by Project Gutenberg
still have errors, even if they are far better than 99% accurate. For
example, an eBook that is 99.99% accurate (i.e., "four nines") will
still have one wrong character in 10,000. That amounts to approximately
30 errors in a typical 50,000 word novel of roughly 300,000 characters.
Proofreading is, by definition, asymptotic. Subsequent rounds of
proofreading improve an eBook, but that eBook is still likely to contain
some errors.
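
The relationship between accuracy and remaining errors is simple
arithmetic, sketched below in Python. The function name and the figure of
six characters per word (five letters plus a space) are assumptions made
for illustration; real books vary.

```python
def expected_errors(word_count: int, one_error_per_chars: int,
                    chars_per_word: int = 6) -> float:
    """Estimate the wrong characters left in a text of `word_count` words."""
    total_chars = word_count * chars_per_word
    return total_chars / one_error_per_chars


# One wrong character in 10,000 corresponds to 99.99% accuracy.
print(expected_errors(50_000, 10_000))    # 30.0
# One wrong character in 100,000 corresponds to 99.999% accuracy.
print(expected_errors(50_000, 100_000))   # 3.0
```

Each additional "nine" of accuracy removes nine tenths of the remaining
errors, which is why later proofreading rounds find fewer and fewer
problems without ever reaching zero.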

Errors in eBooks often reflect errors in their printed sources, and
Project Gutenberg encourages fixing those errors.

EVOLUTION IN PROOFREADING: DISTRIBUTED PROOFREADERS

From 2002 to 2004 an important innovation was developed, in support of
the creation of new Project Gutenberg eBooks. This was Distributed
Proofreaders, an early example of what is now known as crowdsourcing.
Through Distributed Proofreaders, volunteers engage in a portion of
the eBook creation process - whether it is copyright clearances,
proofreading (a page at a time!), or the formatting, checking, and
finalization before uploading. Those portions, when coordinated
together, lead to the creation of new eBooks from printed sources.

Distributed Proofreaders has become the single largest source for new
eBooks to the collection, accounting for approximately half of all
titles. Distributed Proofreaders has also innovated substantially in
the use of HTML+CSS (cascading style sheets) for very attractive
presentation of eBooks in Web browsers.

SCANNING

By the early 1990s, scanning and optical character recognition (OCR)
started to become widely available. Hart received a full scanning
station via a grant from a computer manufacturer, which was used to
produce several of the first 100 eBooks. The scanner was a flatbed
model, which required the user to hold the book open, scan a page (or
pair of pages) for ingest to the OCR software, then flip to the next
page.

The OCR software would then automatically recognize the characters from
the scan, and create an editable view of the text. Proofreading and
formatting would then occur in the same way as for a typed text.

A few years later, Project Gutenberg worked with Distributed
Proofreaders to acquire sheet-fed scanners. These scanners, which are
still in operation, are faster. They also tend to produce an image
that is properly aligned, versus the skewing that sometimes occurs
with flatbed scanners. An important difference is that the printed
books are damaged: prior to scanning, the spines of the books are cut
off, in order for the individual pages to be ingested by the scanner.

[Illustration: 0006]

Figure 3: Image from the Doré illustrations of Dante's Inferno

It has been Project Gutenberg's intention to make all the original
images from the scanners available, alongside the finished eBook. This
is to have a more complete record of the eBook's source(s), and also to
facilitate improvements by finding typos. Most eBook producers to date
have chosen not to provide the scans, however.

Scanners are used for images within printed books, which are typically
included as JPEG, GIF or PNG items within HTML and other formats. Inline
images may be at a lower resolution, and then clickable to obtain higher
resolution images. Color scanners are used, whenever possible, for color
images.

Project Gutenberg has no prohibition against using items scanned by
other parties. Several excellent sources of scans are freely available,
including Google Books, Gallica, and The Internet Archive. Scans, and
raw OCR output (if available), may then be transformed into Project
Gutenberg eBooks by volunteers.

From approximately 1994 to 2004, procedures for digitization became
more clearly articulated. This included the notion that a copyright
"clearance" was the necessary first step for starting any new eBook
for contribution to Project Gutenberg. The "copyright how-to" mentioned
above was developed and refined, with guidance from a number of lawyers
with expertise in US copyright law.

Project Gutenberg has always operated within the copyright laws of the
making it clear that readers in other countries must follow the
laws that apply to them. Project Gutenberg affiliates, which operate
completely independently, exist to emphasize the literary works and
languages of different countries, and they follow the copyright laws of
the country or region in which they operate.

Generally, copyright clearance is simple. Items published prior to 1923,
anywhere in the world, are in the public domain in the US. Prior to
1993, all copyright clearance actions required mailing a photocopy of
the title page and verso (reverse) page of a candidate book to Michael
Hart or Greg Newby, but then an online system was developed that
accepted scans of those pages. A database maintains records of cleared
items, and who submitted them. A few other copyright rules are sometimes
applied, for items published after 1923.

Sometimes, copyrighted items are submitted by authors. For many years,
Project Gutenberg was one of few online repositories of user-contributed
literary works, and therefore accepted items from contemporary authors.
The two requirements for such content were:

1. A perpetual, worldwide, non-exclusive, irrevocable license be granted
to Project Gutenberg, for unlimited redistribution of the item; and

2. The item must be made available as plain text, (valid) HTML, or both.

However, user-contributed content is generally no longer accepted for
portal, operated by an affiliate, The World EBook Library, is available
use any license they wish (such as a Creative Commons license), and can
provide items in PDF or other formats. This simplifies the process for
the authors, and removes the need for Project Gutenberg's volunteers to
be involved with user-contributed content.

MULTIPLE SOURCES

Project Gutenberg encourages the use of multiple printed sources to
create an eBook. For many historical works, including the US Declaration
of Independence (the first Project Gutenberg eBook), there are
variations in the printed sources. Another early example is the works of
William Shakespeare. Project Gutenberg has several different versions of
Shakespeare, including one based on the first edition folios. It has
been typical, throughout the modern history of publishing, for different
versions of a book to have variations.

In practice, the majority of Project Gutenberg eBooks rely on a single
printed source. However, even those items might benefit from other
sources - such as when some pages are missing, or illustrations come
from a different version, or when typos/errata reports come from other
sources.

It is a principle of Project Gutenberg that the eBooks in the collection
are denoted as Project Gutenberg eBooks. Even if the publisher imprint
and frontispiece from a printed work are included, there is no assurance
that the content exactly matches that printed work. And, in fact,
it will not match: minimally, the header/footer will be removed, and
paragraphs will flow together such that they span the pages of the
printed source. Many other adjustments are typically made, as mentioned
above.

For this reason, Project Gutenberg's online catalog metadata does not
include a citation to the source(s) used to create an eBook. Instead,
Project Gutenberg should be cited as the publisher. For example, a
bibliographic citation might have a form such as this:

Carroll, Lewis. "Alice's Adventures in Wonderland." Urbana, Illinois:

OTHER CONTENT TYPES

Project Gutenberg is, arguably, the oldest continuously operating online
content project in the world. From 1971 until the mid-1990s, there were
relatively few online resources for literary content. For this reason,
and also due to a general willingness to experiment and reach out to
broader audiences, Project Gutenberg has a great variety in the content
types offered.

Among the first 100 items, there are mathematical constants and a
musical performance. Government publications, notably the 1990 US Census
and the CIA World Factbook from 1990 onwards, were also included. The
next few hundred items include movies, photographs of ancient cave
paintings, and the first non-English items (Virgil's Aeneid, Cicero's
Orations, and Caesar's Commentaries, all in Latin).

Hundreds of audio eBooks are in the collection. Many were automatically
generated via text-to-speech software. There are also a number
of readings/performances by human readers, including from Project
Gutenberg's partner, LibriVox (www.librivox.org). Today, automated
text-to-speech is accessible by most people with a computer or
mobile phone, so there is less emphasis on that format. Human
readings/performances continue to be of interest, especially when the
performance, as well as the original Project Gutenberg source eBook, is
granted to the public domain.

LANGUAGES OTHER THAN ENGLISH

Non-English languages have some additional characteristics that were not
well-suited for the plain text ASCII of Project Gutenberg's early days.
By the early 1990s, it was necessary to display accented characters, to
accommodate languages such as French and Spanish. Later, languages such
as Chinese would require entirely separate character sets.

OCR software may be poorly suited for several non-English languages, or
may fail due to older styles of typesetting (the old German "Fraktur" is
notorious in this regard).

Also, it is necessary to have proofreaders who are fluent in the
language, to assure the eBook is enjoyable and reasonably free of
errors. Despite these challenges, nearly 20% of the collection is in
a language other than English, with 65 separate languages or dialects
other than English. This emphasis on language diversity continues today,
and is limited only by the willingness of volunteers to submit copyright
clearances and prepare items for distribution.

Table 1: Language counts as of August 1, 2016, for 52615 eBooks.

# of eBooks  Language code  Language or dialect
-----------  -------------  -------------------
      43095  en             English
       2711  fr             French
       1469  de             German
       1421  fi             Finnish
        739  nl             Dutch
        678  it             Italian
        540  pt             Portuguese
        504  es             Spanish
        427  zh             Chinese
        219  el             Greek
        128  sv             Swedish
        112  hu             Hungarian
        112  eo             Esperanto
        102  la             Latin
         66  da             Danish
         60  tl             Tagalog
         31  pl             Polish
         31  ca             Catalan
         22  ja             Japanese
         17  no             Norwegian
         11  cy             Welsh
         10  cs             Czech
          9  ru             Russian
          7  is             Icelandic
          7  fur            Friulian
          6  te             Telugu
          6  he             Hebrew
          6  enm            Middle English
          6  bg             Bulgarian
          4  sr             Serbian
          4  ang            Old English
          4  af             Afrikaans
          3  nai            North American Indian
          3  nah            Nahuatl
          3  ilo            Iloko
          3  ceb            Cebuano
          2  ro             Romanian
          2  nav            Navajo
          2  myn            Mayan Languages
          2  mi             Maori
          2  grc            Greek, Ancient
          2  gla            Gaelic, Scottish
          2  ga             Irish
          2  fy             Frisian
          2  arp            Arapaho
          1  yi             Yiddish
          1  sl             Slovenian
          1  sa             Sanskrit
          1  rmr            Calo
          1  oji            Ojibwa
          1  oc             Occitan
          1  nap            Napoletano-Calabrese
          1  lt             Lithuanian
          1  ko             Korean
          1  kld            Gamilaraay
          1  kha            Khasi
          1  iu             Inuktitut
          1  ia             Interlingua
          1  gl             Galician
          1  fa             Farsi
          1  et             Estonian
          1  csb            Kashubian
          1  br             Breton
          1  bgi            Giangan
          1  ar             Arabic
          1  ale            Aleut
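
As a check on the "nearly 20%" figure, the share of non-English items can
be computed from the totals in Table 1. The Python sketch below is an
illustration: the function names are hypothetical, and only three rows of
the table are reproduced as sample input.

```python
def parse_language_rows(table_text: str):
    """Parse rows of the form '<count> <code> <name>' into tuples."""
    rows = []
    for line in table_text.strip().splitlines():
        count, code, name = line.split(maxsplit=2)
        rows.append((int(count), code, name.strip()))
    return rows


def non_english_share(total_ebooks: int, english_ebooks: int) -> float:
    """Percentage of the collection that is not in English."""
    return 100 * (total_ebooks - english_ebooks) / total_ebooks


sample = """
43095 en English
2711 fr French
2 grc Greek, Ancient
"""
rows = parse_language_rows(sample)
english = next(count for count, code, _ in rows if code == "en")
# 52615 is the collection total stated in the caption of Table 1.
print(round(non_english_share(52615, english), 1))   # 18.1
```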

EVOLUTION OF MASTER SOURCE FORMATS

Plain text was the first master source type/format for Project
Gutenberg, and remains important today. Plain text is readable on any
device. Plain text is printable, and efficient to store (including
for compression, or sharing by email). For decades, the International
Organization for Standardization (ISO) has provided standard
computerized encoding for the basic American standard codes (ASCII) and
extensions for accents and other special characters (Latin1 or ISO
8859-1). Encoding exists for other languages, and Unicode (with 8- and
16-bit encoding forms, UTF-8 and UTF-16) provides encoding for larger
groups of characters.
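
The practical difference between these encodings shows up as soon as a
text contains a character outside ASCII. The short Python sketch below
uses only the standard codecs; the helper name and the sample characters
are arbitrary choices for illustration.

```python
def encoded_size(text: str, encoding: str):
    """Bytes needed to store `text`, or None if it cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


for sample in ("e", "é", "中"):
    for encoding in ("ascii", "latin-1", "utf-8", "utf-16-le"):
        print(repr(sample), encoding, encoded_size(sample, encoding))
```

The accented letter cannot be stored as ASCII but fits in one byte of
Latin1, while the Chinese character can be stored only in the Unicode
encodings.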

Within the first few hundred Project Gutenberg eBooks, some encoding was
offered which seemed promising, but did not withstand the test of time.
An early PostScript file was rendered unusable due to insertion of the
Project Gutenberg standard header; a dictionary included markup that,
today, might be reminiscent of XML or reStructuredText, but without any
sort of codebook for proper presentation; a few word processor native
formats, including WordStar and WordPerfect, were used but are no longer
readable with modern computers.

Even HTML (along with XML-based formats) was viewed with skepticism, since the
longevity of formats is notoriously difficult to predict when they first
become available.

For these reasons, Project Gutenberg still prefers to make plain text
available for essentially every eBook. The only exceptions are those
for which no plain text encoding is reasonable - such as Chinese, or
mathematical texts, or music. In this way, the collection is "future
proof," so that even if all content cannot be fully represented as text,
the files themselves will still be readable and enjoyable to read.

Figure 4: Typical text view, showing fixed-length lines and spacing
among components.

A CONNECTICUT YANKEE IN KING ARTHUR'S COURT

by MARK TWAIN (Samuel L. Clemens)

PREFACE

The ungentle laws and customs touched upon in this tale are
historical, and the episodes which are used to illustrate
them are also historical. It is not pretended that these
laws and customs existed in England in the sixth century;
no, it is only pretended that inasmuch as they existed in
the English and other civilizations of far later times, it
is safe to consider that it is no libel upon the sixth
century to suppose them to have been in practice in that day
also. One is quite justified in inferring that whatever one
of these laws or customs was lacking in that remote time,
its place was competently filled by a worse one.

Today, Project Gutenberg's plain text offerings are most often derived
automatically from another master format. The most common master format
is HTML, which offers advantages of ubiquity and ease of authoring.
LaTeX is also used as a master, mainly for mathematical texts.
reStructuredText (RST) was encouraged by Project Gutenberg, due to the
ease of conversion to other formats. However, RST has not been widely
adopted by eBook producers.

DERIVATIVE FORMATS

The ubiquity of reading devices - from mobile phones, to tablets, to
electronic paper - was predicted by Project Gutenberg. Rather than
creating separate master files for each native format for the devices,
automatic conversion is applied to one of the master formats. For years,
Java-format eBooks were automatically created, and these were usable on
many mobile phones.

Today, EPUB and MOBI (also known as Kindle) formats are the most common.
Free software for conversion, called ebookmaker (later, epubmaker) is
used to create derivative formats. This helps to assure compatibility
for different reader devices.

UPLOADING A NEW EBOOK

Volunteers upload the master format for their completed eBook to the
Project Gutenberg server, where it undergoes automated and manual
checks before the new eBook is posted and announced online. Prior to the
upload, the copyright clearance must be completed.

Upon uploading, automated checks include:

HTML checks for validity of the HTML encoding (via the W3C
validator);

HTML checks for internal link structure;

Spelling checks (English, with limited support for other
languages);

Typo/scanno checks (seeking common scanner/OCR errors, such
as "he" for "be" and vice-versa);

Conversion checks.
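
One of the checks listed above, verifying internal link structure, can be
illustrated with Python's standard html.parser module. This is a sketch
of the general idea, not the check Project Gutenberg actually runs; the
class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect link targets (id / a name) and internal '#fragment' links."""

    def __init__(self):
        super().__init__()
        self.targets = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if attr_map.get("id"):
            self.targets.add(attr_map["id"])
        if tag == "a" and attr_map.get("name"):
            self.targets.add(attr_map["name"])
        href = attr_map.get("href")
        if tag == "a" and href and href.startswith("#") and len(href) > 1:
            self.links.append(href[1:])


def broken_internal_links(html_text: str):
    """Return the fragment names that are linked to but never defined."""
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return sorted(set(collector.links) - collector.targets)


page = '<h2 id="ch1">Chapter I</h2><a href="#ch1">I</a> <a href="#ch2">II</a>'
print(broken_internal_links(page))   # ['ch2']
```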

The conversion check consists of using the epubmaker application to
automatically generate derived formats. Ideally, resulting files will
include:

Plain text in UTF-8 encoding;

Automatically generated HTML (if HTML is not the master
format);

EPUB and MOBI.

For HTML, EPUB and MOBI, pairs of files are generated: one with images,
and one without. The set of files without images is intended to be
friendlier to readers with limited bandwidth, or without the necessary
storage space for any images included with the eBook.

After uploading, a team of human experts - known as the "whitewashers,"
after a scene in Mark Twain's "The Adventures of Tom Sawyer" - does
final formatting, attaches the Project Gutenberg header and footer, and

CATALOGING AND MIRRORING

The Project Gutenberg catalog database includes metadata from
within each eBook: the author, title, available file formats,
upload/publication date, language, etc. Human catalogers eventually add
additional metadata, including Library of Congress Subject Headings.
This catalog is available for free download in machine-readable form
(XML/RDF or MARC).

Organizations that desire to redistribute Project Gutenberg's content,
freely and without limitations, are invited to do so. The catalog may
be used for this purpose, and various mechanisms are available
to automatically maintain a copy of the collection itself (i.e.,
"mirroring"), including for generated content.

"NO SWEAT OF THE BROW COPYRIGHT"

An important innovation during the evolution of Project Gutenberg was
to clarify the notion of "authorship" and its critical role for
establishing copyright. In early days, it was common to think that
applying HTML markup, or reformatting, or spelling changes, qualified
an item for a new copyright. Historically, some print publishers even
claimed new copyrights simply for typesetting a new edition.

Today, we know US copyright is based on the creative expression of ideas
through authorship. Markup and spelling changes do not qualify. As a
result, Project Gutenberg volunteers are able to "harvest" public domain
materials on the Internet, once they are determined to match public
domain print materials. This is not a frequent occurrence, however,
since most volunteers prefer to work on items that are not yet
digitized.

Similarly, Project Gutenberg claims no copyright on the "sweat of the
brow" labor which is applied to make eBooks from print sources. There
were a few earlier items where such copyright was claimed erroneously,
but this is no longer done.

EBOOKS, OR PICTURES OF BOOKS?

Project Gutenberg has over 50,000 eBooks in its collection. This is far
fewer than Google Books, or The Internet Archive, or other large-scale
digitization projects of historical items. An important distinction
is that Project Gutenberg engages in the proofreading, formatting,
markup/encoding, and other activities described above. Those other very
large projects are primarily devoted to scanning, and then provide raw
OCR output with a few automatically-generated formats.

Such items are only partial eBooks - really, they are pictures (scans)
of books, with some additional automated features. These are valuable,
but do not provide the reading experience or quality of presentation
that Project Gutenberg strives for. Using current technology, it takes
human intellect and effort to convert a picture of a book to a true,
functional, eBook.

PAST INNOVATIONS AND FUTURE INITIATIVES

Project Gutenberg has evolved its practices over the years, and has
often been a leader in the creation and distribution of eBooks. Some
past innovations include the following, and all are still in active use
today:

Development of an open content trademark license
(1991-1993), which is intended to guarantee to readers that public
domain items remain free, while placing restrictions on the
trademarked name "Project Gutenberg" to protect against
abusive practices by those who would sell the public domain
items;

File/directory-based access to the collection, guaranteeing
ease of copying (by file, or subcollection, or the entire
collection), mirroring, and large-scale redistribution
(1994);

Anonymous access for all readers, requiring no logins or
authorization for any items (1994);

Web-based access to content, and development of procedures
to assure HTML is valid and well-formed (1996);

The Copyright How-To, including the Rule 6 How-To for
non-renewed items (2000 & 2008);

Support of Distributed Proofreaders (2002-2004), for
crowdsourced proofreading and other aspects of new eBook
creation;

Implementation of eBook reader formats, for free use on
mobile phones, tablets, and other devices (2009);

Free redistribution of metadata as a separate download (2007
& 2012);

Integration with Google Drive, Dropbox, and other mechanisms
for readers to employ "cloud" storage for eBooks (2013);

Fully automated conversion from master formats to eBook
formats (2013).

Project Gutenberg has ongoing initiatives to improve service offerings
to readers. There are no definite timelines for these, and assistance
(or partnerships!) is always of interest. Some future initiatives may
include:

Continued efforts to separate the "collection" from the
"interface," making it easier for different Web-based skins
to be used to access content;

Mechanisms for creation of personal bookshelves, "shopping
carts" or other reading lists, for users to more easily
track items of interest;

Crowdsourced reviews, errata and improvements to eBooks,
including capabilities for forked versions, versioning, and
other techniques common among developers of free software;

Improvements in ability to identify and filter items by the
author's death date, which is the most common criterion for
public domain status in countries other than the US;

Better tracking of sources used, including for harvested
scans; even with no guarantee of faithfulness to a
particular print source, information about source is
frequently requested;

More languages, more formats, and additional content types;

Encouragement of innovative ideas by Project Gutenberg's
readers and other fans;

Ongoing evolution in the utility of Project Gutenberg eBooks
for future reading devices.

APPRECIATION FOR VOLUNTEERS

Project Gutenberg is thankful to the tens of thousands of volunteers
who, over more than 45 years, have contributed to the creation and
distribution of free electronic books. It is through the efforts
of these volunteers that Project Gutenberg has been successful, and
continues to thrive.

[Illustration: 0015]

*** End of this LibraryBlog Digital Book "Forty-Five Years Of Digitizing Ebooks - Project Gutenberg's Practices" ***
