Carnegie Mellon University
Pittsburgh, Pa. 15213
December 1, 2001
412-268-2597
rr@cmu.edu
The goal of the Million Book Digital Library Project is to create a universal, free-to-read
digital library containing over one million scanned books, with optical
character recognition where possible to support full-text searching. Such a resource will lead to the democratization of
knowledge by making a unique library resource available on the web to
scholars, students, and citizens around the world. Online search allows users to locate
relevant information quickly and reliably, thus enhancing students’ willingness
to undertake research and their success in it.
This 24x7x365 resource would also provide an excellent testbed for
language processing research in areas such as machine translation,
summarization, intelligent indexing, and information mining.
A portion of the content would
include out-of-copyright, pre-1920 materials. A “best books” feature of the project would involve
requesting permission to scan titles listed in the core collection development tool Books for
College Libraries. A
preliminary Carnegie Mellon University Libraries pilot suggests that 22% of the
80,000 titles might become available. Further, when 80% of the million books
are finished, scholars will be recruited to review collections in their
disciplines and to select remaining books of importance.
Mirroring the site at several locations worldwide will protect the
integrity and availability of the data. Several models for sustainability are
being explored and are discussed in this report. Usability studies would also
be conducted to ensure that the materials are easy to locate, navigate, and
use. Appropriate metadata for navigation and management would also be created.
The National Science Foundation is providing funding for scanners, computers,
servers, and software. These
resources from NSF are augmented by almost twenty to one, since China and India will
provide the necessary manpower (2,000 man-years each over a four-year
period) as their contribution to this project, assisting in the selection of
documents, software development, and the digitizing of these materials. Indigenous Chinese and Indian materials
would form a portion of the content scanned, as would English-language materials
already resident in those countries.
In addition, U.S. libraries, primarily members of the Digital Library
Federation, would ship materials to be scanned and returned.
The primary long-term
objective is to capture all books in digital format. Some believe such a task is impossible. Thus, as a first step, we plan to
demonstrate its feasibility by digitizing 1 million books (less
than 1% of all books in all languages ever published) by 2005. We believe such a project has the
potential to change how education is conducted in much of the world. The project hopes to create a universal
digital library that is free to read anytime, anywhere, by anyone.
Each of the million books
is scanned. If it is in a language
for which optical character recognition software is available, the text is
converted to ASCII/Unicode format to allow full-text search to guide students,
scholars, and citizens to the relevant portions of the work. Scanner operators
create metadata, based on existing cataloging records for these books and
journals, to accompany each book.
This project enhances
research, learning, and teaching by making a critical mass of scholarly
information freely available to read online. It has been observed that the result will be like Vannevar
Bush’s Memex. In addition to the project’s
own indexes, major indexers such as Google will index the collection, and others, including
libraries participating in the project, will hyperlink to it.
A secondary objective of
this project will be to provide a test bed that will support other researchers
who are working on improved scanning techniques, improved optical character
recognition, and improved indexing.
The corpus this project creates will be at least ten times as large as
any existing free resource.
The primary benefit is to
supplement the formal education system by making knowledge available to anyone
who can read and has access. Libraries have played a vital role in the
advancement of human society. Societal advance depends on young people having
access to books via libraries and other means. We expect that making this unique web resource available
free to everyone in the U.S. and around the world will lead to a further
democratization of access to knowledge.
Libraries are unevenly
distributed around the world and within countries. In the U.S., the NCES Survey noted that in 1996, 3,408 of
3,792 institutions of higher education had libraries holding 806.7 million volumes. The 112 largest university libraries in
the United States and Canada each have at least 1.8 million books; they are
members of the Association of Research Libraries. Massachusetts has about 25 million volumes; New York has
about 31 million volumes, and California has about 40 million volumes in their
ARL libraries (Association of Research Libraries, 1999/2000). Other states, such as North and South
Dakota, have no large libraries. A
few large public libraries have several million volumes. However, most junior colleges, high
schools, and public libraries have much smaller collections. Making this large knowledge repository
available, with the convenience of online access and the benefit of word and phrase full-text
searching, can revolutionize research at all levels of education and give a
much-needed boost at minimal cost to our national educational infrastructure.
Secondary benefit: Online search makes locating the
relevant information inside books far more reliable and much easier. Student success in finding exactly what
they seek will increase, and increased success will enhance student willingness
to perform research in this large resource. NCES reports that 84 percent of libraries around the country
are open between 60 and 80 hours a week.
This digital library would be open 24 hours a day, seven days a week,
365 days a year, for a total of 168 hours a week, over twice the time most
libraries are open. More than one
individual will be able to use the same book at the same time. Thus, popular works will never be
unavailable to others because they are checked out.
This million-book project
will produce an extensive and rich testbed for use in further textual language
processing research. It is hoped
that at least 10,000 books among the million will be available in more than one
language, providing a key testing area for problems in example-based machine
translation. In the last stage of the project, books in multiple languages will
be reviewed to ensure that this testbed feature is accomplished.
Many believe that knowledge
is now doubling every two to three years. Machine summarization,
intelligent indexing, and information mining are tools that individuals will need
to keep up with their disciplines, their businesses, and
their personal interests. This
large digitization project will support research in these areas.
The preliminary work
described below has been used to establish a protocol, to select standards to
be used, and to address issues of indexing and retrieval. Workflow and training programs to
support the larger project are being developed. Both the content and the mechanisms for using it will be
made available in open source code.
The National Science
Foundation’s 2000 ITR grant cycle provided $500,000 for equipment to begin a
large pilot. That grant will allow
the purchase of 18 Minolta book scanners to be located in India and China. Some machines have already been
deployed to begin the scanning process. Strong discounts from Minolta have
expanded the number of machines that can be purchased. Earlier pilot projects, a 100-book
scanning project and a 1000-book scanning project, that aided in the selection
of the scanners and the establishment of processes used are described more
fully below.
Chinese University
Presidents, a Ministry of Education official, and Chinese Academy of Sciences
leaders visited the U.S. to reach agreements and to form a steering
committee.
Dr. Michael Lesk and Dr.
Stephen Griffin from NSF attended the Carnegie Mellon meeting and also hosted
the Chinese delegation at the National Science Foundation. Professor PAN Yunhe,
President of Zhejiang University; Dr. GAO Wen, Deputy President of the Graduate
School, Chinese Academy of Sciences; Professor CHI Huisheng, Vice President of
Beijing University; Professor HU Dongcheng, Vice President of Tsinghua
University; Professor XU Zhong, Vice President of Fudan University; Professor
ZHANG Yibin, Assistant to the President, Nanjing University; Mr. GUO Xinli,
Vice General Director, Ministry of Education of China; Mr. CHEN Jianping, Vice
Director, State Planning Commission of China; and Dr. Ching-Chih Chen of
Simmons College attended. The
National Science Foundation funded this summit.
Indian university and
government officials are scheduled to visit on May 26, 2002,
and it is expected that similar agreements will be reached.
U.S. Digital Library
Federation members met on November 15 and 16, 2001 to work out the logistics of
selecting and transporting materials from U.S. collections under a grant from
NSF. Drs. Lesk and Griffin were
joined in Pittsburgh by representatives from OCLC, the Center for Research
Libraries, and collection development officers and other librarians from the
Library of Congress, the University of Washington, the University of California
Berkeley, Stanford University, University of Illinois, University of Chicago,
Penn State University, and the University of Pittsburgh. The Digital Library Federation’s
Executive Director also attended the meeting.
The collection development librarians discussed:
· Collection focus: to achieve a consensus on how to select the million
books to be digitized.
· Involvement of outside scholars in selection issues: to consider how
non-librarian scholars might participate in selection.
· Copyright considerations: to consider seeking permission for a set of in-copyright
“best books”, such as those in Books for College Libraries.
· Standards for the work: to review the current Digital Library Federation
standards with a view to rapid adoption.
· Registry issues: to move forward with OCLC in establishing a registry for
books selected.
· Methods of transport: to consider alternative means of transport and
return.
· Timing: to weigh the advantages of air containers and sea containers.
· Level of participation: to determine minimum levels for contributors to
the project.
· Incentives for participation: to establish means of recognition for
contributions through screen display and copies of the archives.
This meeting
will result in a plan for the selection and shipment of almost a million
books to China and India over a multiyear period and a plan for assessing the
success of the project annually.
Creating a
scalable database to support this project is the subject of a related research proposal. Drs. Christos Faloutsos, Jeffrey
Eppinger, and Natalia Allamachi are submitting a proposal to NSF to address
these issues. Their globally distributed database will appear to be a virtual
central database from any place around the world. Mirroring the database in several countries will ensure
security and availability.
The
database will house both an image file and a text file at about 10-20 megabytes
per book. The aggregate of 20
terabytes will be affordable to store because the costs of storage continue to
decline substantially. By 2010, a
terabyte of storage is expected to cost as little as $10.
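As a rough check on the storage figure above, the following Python sketch multiplies the per-book size by the number of books. The 10-20 megabyte range and the one-million-book count come from the text; the function name and the decimal conversion of one million megabytes per terabyte are assumptions made for illustration.

```python
# Assumption: decimal units, i.e. 1 terabyte = 1,000,000 megabytes.
MB_PER_TB = 1_000_000


def total_storage_tb(books: int, mb_per_book: float) -> float:
    """Return the aggregate storage, in terabytes, for a collection of books."""
    return books * mb_per_book / MB_PER_TB


# Low and high ends of the per-book estimate given in the text.
low = total_storage_tb(1_000_000, 10)
high = total_storage_tb(1_000_000, 20)
print(f"Estimated storage: {low:.0f} to {high:.0f} TB")
```

Under these assumptions the 20-terabyte aggregate corresponds to the upper end of the per-book estimate; mirrored copies would multiply the total by the number of mirror sites.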
100-book pilot: Two years
ago, we funded a pilot experiment to scan 100 books so that the practical
difficulties of a million book project could be assessed. Carnegie Mellon University Libraries
faculty and staff assisted in the pilot.
The scanner of choice was an inexpensive duplex scanner that required
the books to be disbound so that the pages could be fed through in
batches. While the economy and
speed of this technique were most attractive, several technical problems
occurred.
· The pages had to be cut on all four edges for smooth feeding. The project required the purchase of a
$10,000 guillotine to accomplish this.
The guillotine was somewhat dangerous, required in-depth training in use
and safety, slowed the process, was a public relations nightmare for the
library community, and negated the economy of the inexpensive scanner.
· Dust, an inevitable accompaniment to older books, proved to be a formidable
opponent. Dust caused frequent
jams, each of which required cleaning the scanner. Paper fixatives were employed to counteract the dust. Spraying on the fixative slowed the
project and was not entirely satisfactory.
At the end
of the first hundred books, the scanner operators and their supervisor sought
another approach.
1000-book
project. Books 200 through 1000 were scanned
using a Minolta overhead scanner.
Although this scanner was 5 times more expensive than the roll-feed, double-sided
scanner we used for the pilot, it proved to be more reliable. Books did not have to be disbound. The image processing software for curvature correction,
deskewing, despeckling, and cropping allows thick books to be scanned either
flat or in an angled cradle that reduces wear on the spine. Thorough training is required to
operate the scanner, but several different employees were successfully trained
to use it during the period of the project. The results of this 1000-book project can be viewed at www.ulib.cs.cmu.edu under 1000 book
project. This scanner and the
processes are the ones that are recommended for the million book project. The advantages of the Minolta approach
include:
· Disbinding via a guillotine is not necessary.
· Books can be reused in their original form.
· Dust, thick paper, and long books can be easily accommodated.
· Training requirements are reasonable.
· Equipment is reliable.
3. Data Production
· Bitonal images with a pixel depth of 1 bit-per-pixel were scanned at a
resolution of 600 dots per inch (dpi). Images stored as "Intel" TIFF
(Tagged Image File Format) files, with the header content specified. The
compression algorithm used is ITU (formerly CCITT) Group 4.
· TIFF version 5.0 is acceptable. Subject to testing, version 6.0 (or
later) may also be acceptable.
· Initial-capture system includes dynamic thresholding or a similar feature
to capture variability of darkness in the imprint and possibly darker (e.g.,
foxed) backgrounds from decay.
Images should be as readable as the original pages.
· "Typical" or "expected" data to be provided for most
TIFF tags (normally, the data supplied by software default settings). A
specification for the TIFF header to be produced to include scanner technical
information, filename, and other data, but to be in no way a burden on the
production service.
· Images written in sequential order, with corresponding 8.3 file names,
e.g., 00000001.tif as first image in volume sequence and 00000341.tif as 341st
image in volume sequence.
· Volumes to be provided to Million Book Project by libraries with unique
identifiers that conform to 8.3 format; images should be in directories named
with corresponding identifier (e.g., akf3435.001 as identifier for volume will
result in directory with same name, and 00000001.tif through 0000000N.tif
within that directory).
· Images and directories (as specified above) to be written by Million Book
Project to gold CD-ROM meeting agreed upon specifications, and using ISO9660
format.
· Skew to be within a specified range of degrees allowed.
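The file and directory naming convention in the list above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `image_filename` and `image_path` are hypothetical, and the validation rules (eight-digit sequence numbers, an identifier with at most eight characters before the dot and at most three after) are one reading of "8.3 format", not a specification taken from the project.

```python
from pathlib import PurePosixPath


def image_filename(sequence_number: int) -> str:
    """Return the 8.3 file name for the Nth image in a volume (1-based)."""
    if not 1 <= sequence_number <= 99_999_999:
        raise ValueError("sequence number must fit in eight digits")
    # Zero-pad to eight digits so names sort in scanning order.
    return f"{sequence_number:08d}.tif"


def image_path(volume_id: str, sequence_number: int) -> PurePosixPath:
    """Return the directory/file path for one image of a volume."""
    base, _, ext = volume_id.partition(".")
    # Assumed reading of "8.3": up to 8 characters, a dot, up to 3 characters.
    if not (1 <= len(base) <= 8 and len(ext) <= 3 and "." not in ext):
        raise ValueError("volume identifier must conform to 8.3 format")
    # The directory carries the volume identifier; images live inside it.
    return PurePosixPath(volume_id) / image_filename(sequence_number)


print(image_filename(1))
print(image_filename(341))
print(image_path("akf3435.001", 341))
```

With the identifier from the example in the list, the 341st image would be stored as akf3435.001/00000341.tif.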
4. Optical Character Recognition (OCR)
The primary function of OCR is to allow searching inside the text. Because words are often repeated, a
98% success rate will allow students and scholars to find relevant passages. In
the pilot projects, the OCR program ABBYY FineReader was run after the
scanning was completed. ABBYY FineReader was selected for its
ability to keep words intact if they were hyphenated between two pages. On
English-language texts whose print has few broken letters, the OCR accuracy of ABBYY
FineReader is about 98% of text.
We do not plan to correct the OCR output as part of this project.
More sophisticated programs with voting systems to resolve different
interpretations are available, but licenses are too expensive. Chinese and Japanese OCR programs are
also available and will be used whenever possible. Providing a testbed that
will allow for the creation of even better OCR programs is a secondary goal of
this project. Scholars may wish to run newer OCR programs over the scans and
even to correct the output.
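The argument that repetition compensates for imperfect OCR can be made concrete with a small Python sketch. It assumes, purely for illustration, that the 98% figure applies to each occurrence of a word and that recognition errors are independent between occurrences; neither assumption is stated in the text, and the function name is hypothetical.

```python
def chance_word_is_findable(accuracy: float, occurrences: int) -> float:
    """Probability that at least one occurrence of a word is recognized correctly.

    Assumes each occurrence is recognized independently with the same accuracy.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if occurrences < 1:
        raise ValueError("occurrences must be at least 1")
    # The word is missed only if every single occurrence is misrecognized.
    return 1.0 - (1.0 - accuracy) ** occurrences


for k in (1, 2, 3):
    print(k, f"{chance_word_is_findable(0.98, k):.6f}")
```

Under these assumptions, a search term that appears twice in a passage would be missed only about 4 times in 10,000, which is why uncorrected OCR output can still support full-text searching.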
5. Metadata
Digital Library Federation standards and metadata best practices will be
used throughout this project.
Bibliographic metadata for the pilot project will be derived from
existing library catalog records.
Carnegie Mellon libraries developed software that uses the standard
Z39.50 protocol to search catalogs and retrieve relevant metadata from catalog record
fields. Thus, author, title, and
publication data do not have to be rekeyed.
Another research project associated with this project will be the
creation of software that automatically creates "document structure"
metadata. This metadata allows
users to navigate through the chapters and other parts of a book
successfully. Entering such
information manually is too time-consuming for this project, but automatic
metadata creation programs can be applied subsequently.
Administrative metadata supports the maintenance and archiving of the
paper or digital objects and ensures their long-term availability by providing information
about how the files were created and stored. Administrative metadata will be maintained internally as
file descriptions in the project databases and externally as part of the
copyright permission database.
The Digital Library Federation, a supporter of this project, has several
initiatives underway that will allow commercial browsers to harvest metadata
more aggressively. The results of DLF’s metadata harvesting project will be
explored for possible application to the resources produced in this project
(www.diglib.org).
6. Quality control
The standards established for quality control are those currently
endorsed by the Digital Library Federation, whose mission includes the
establishment of best practices and the development of standards. The project must maintain a 98%
accuracy rate for the quality of images and the inclusion of all pages. Nevertheless, a process must be
developed to allow users to report missing pages and for those missing pages
to be scanned and inserted into the existing scanned text. Because the owning library will have to
pull the book, scan the pages, and transport the file, this process will be
expensive. Maintaining high quality
the first time the book is scanned will be essential. A demonstration of high quality, reliable work done on
materials currently in China and India will give U.S. libraries confidence that
their collections should be shared.
E. Content
Seeking to develop a collection of
one million digital books, the Million Book Project envisages a staged approach
as described below. The
Million Book Project will adhere to copyright law. U.S. collections will primarily include the following types
of materials.
1. Coordination of Selection
Creating one digital copy, which
can then be easily mirrored in different locations, will suffice and will
support the multiple uses an item may receive. Preliminary discussions with OCLC as a host for a registry
of scanned items are underway.
Certain key projects, such as the Making of America project, are already
represented in the OCLC database as digital books. Other large digitization projects may require some data
entry of their content in order to avoid duplication.
2. Non-copyrighted materials
Materials
published before 1920 are in the public domain and may be scanned for this
project. Several large academic libraries are considering shipping materials
from their depositories of little-used material to India and China. These materials will be scanned there
and then returned. To reduce the costs of selection, the project will probably
develop a strategy of selecting key topics and then removing large runs of
books and journals from a selected depository. Having a reasonable turnaround time will be essential to
the success of the project. A test
will be devised to understand the logistics of shipping the materials and the
impact of their absence from the home library.
The 1909 copyright law granted
copyright for 28 years. Rights
holders could then renew the copyright for another 28 years; many publishers and
authors did not exercise that renewal option. Thus, some materials published after 1922 (56 years prior to
the 1978 effective date of the 1976 act) may be out of copyright. To allow efficient
checking of these books’ status, copyright renewal records for books from these
years have been scanned and made available online at www.ulib.org. Similar records for other formats, such
as serials and audiovisual material, will also be made available as a part of
this resource.
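The term arithmetic described above can be sketched in a few lines of Python. The sketch models only the two 28-year terms of the 1909 law as summarized in the text; the function names are hypothetical, and it deliberately ignores later statutory changes and any rule about when within a year a term ends.

```python
# Term lengths as described in the text for the 1909 copyright law.
FIRST_TERM_YEARS = 28
RENEWAL_TERM_YEARS = 28


def first_term_end(publication_year: int) -> int:
    """Year in which the initial 28-year term runs out."""
    return publication_year + FIRST_TERM_YEARS


def renewed_term_end(publication_year: int) -> int:
    """Year in which the term runs out if the rights holder renewed."""
    return first_term_end(publication_year) + RENEWAL_TERM_YEARS


# A work from 1922 reaches the end of both terms in 1978, the year the
# 1976 act took effect, which is the cutoff mentioned in the text.
print(renewed_term_end(1922))
# A later work that was never renewed lapsed after its first term only.
print(first_term_end(1930))
```

This is why the scanned renewal records matter: for a book published after 1922, whether the renewal was filed decides which of the two dates applies.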
Government documents are also in
the public domain and may be included in this project. Many participating libraries are
depositories for full runs of government documents and could supply them to the
project, as could the Library of Congress. The inclusion of documents will allow for more recent
material to enter the project legally and to become available to a broader
audience and in a more accessible manner.
Many government documents are currently available in digital form. The creation of these back files would
enhance those resources.
The Chinese delegation is most
eager to have technical reports and science and technology dissertations as a
part of this project. The
producing scholar and the university have copyright interests in these formats. Gaining university permission might be
fairly straightforward. A good
faith attempt would also have to be made to win the permission of the
scholar. That could be a part of
an externally funded copyright clearance project, but no pilot has been done to
allow for an estimate of contact rate and subsequent success. If some arrangement could be made with
University Microfilms to scan dissertations of selected universities from
microfilm, which would be cheaper and easier to transport, such an initiative
might satisfy a strong desire among all participants to increase science
content.
3. Copyrighted materials
The 1998 Copyright law grants
copyright to authors for their lifetimes plus 70 years or for 95 years. Patent law, by contrast, gives 20
years. The Andrew W. Mellon Foundation’s JSTOR project
developed the concept of a moving wall that allowed the inclusion of materials
over five years old. Journal
publishers generally agreed that the economic value of that material was
greatly reduced and granted permission for its inclusion in that highly successful
project. A similar broad publisher
agreement about the point at which economic value of a print book declines is
greatly needed because books often go out of print in two or three years and
can then remain in copyright but unavailable for over 90 years.
Dr. Raj Reddy and Dr. Peter Shane,
Director of the Institute for the Study of Information, Technology and Society,
recently had a conversation with a major book publisher to explore the
possibility of taking a broad publisher approach to obtaining copyright
permissions. Certain publishers,
including the National Academy Press, have found that when they
digitized their books, sales increased because attention was focused on the
material and scholars were not yet ready to read the books online. Authors' guilds will also be contacted
to see if they would be interested in granting permissions.
Three conditions seem to be
necessary to attract publishers to the scanning of their out-of-print but
in-copyright titles:
· Publisher should receive a tax deduction for contributing
the title to this project. The tax
deduction might reflect revenues previously generated by the title.
· When a print on demand feature becomes a part of this
project, publishers should collect royalties on books printed.