The Million Book Digital Library Project

Raj Reddy and Gloriana StClair

Carnegie Mellon University

Pittsburgh, Pa. 15213

December 1, 2001

412-268-2597

rr@cmu.edu

Objective

The objective of this project is to create a free-to-read, searchable collection of one million books, primarily in the English language, available to everyone over the Internet.  This task is accomplished by scanning the books and indexing their full text.  The text file is created, where possible, through optical character recognition.  The result will be a unique resource accessible to anyone in the world 24x7x365, without regard to nationality or socioeconomic background.

 

Typical large high-school libraries house fewer than 30,000 volumes.  One million volumes is the approximate size of the combined libraries at Carnegie Mellon University.  The total number of different titles indexed in OCLC’s WorldCat is about 48 million.  One million books, therefore, is more than the holdings of any high school, equivalent to the library of a substantial university, and roughly 2% of the titles cataloged in WorldCat.

 

 Executive Summary

 

The goal of the Million Book Digital Library Project is to create a universal, free-to-read digital library containing over one million scanned books, with optical character recognition applied where possible to support full-text searching.  Such a resource will help democratize knowledge by making a unique library resource available on the web to scholars, students, and citizens around the world.  Online search allows users to locate relevant information quickly and reliably, which increases students’ willingness to undertake research and their success in it.  This 24x7x365 resource would also provide an excellent testbed for language processing research in areas such as machine translation, summarization, intelligent indexing, and information mining.

 

A portion of the content would consist of out-of-copyright, pre-1920 materials.  A “best books” feature of the project would involve requesting permission to scan titles listed in the core collection development tool Books for College Libraries.  A preliminary Carnegie Mellon University Libraries pilot suggests that 22% of its 80,000 titles might become available.  Further, when 80% of the million books are finished, scholars will be recruited to review collections in their disciplines and to select remaining books of importance.

 

Mirroring the site at several locations worldwide will protect the integrity and availability of the data. Several models for sustainability are being explored and are discussed in this report. Usability studies would also be conducted to ensure that the materials are easy to locate, navigate, and use. Appropriate metadata for navigation and management would also be created.

 

The National Science Foundation is providing funding for scanners, computers, servers, and software.  These NSF resources are leveraged almost twenty to one because China and India will contribute the necessary manpower (2,000 man-years each over a four-year period) to assist in selecting documents, developing software, and digitizing the materials.  Indigenous Chinese and Indian materials would form a portion of the content scanned, as would English-language materials already held in those countries.  In addition, U.S. libraries, primarily members of the Digital Library Federation, would ship materials to be scanned and returned.

 

II.        Technical Description

 

A.        Primary Objective

 

The primary long-term objective is to capture all books in digital format.  Some believe such a task is impossible.  As a first step, therefore, we plan to demonstrate feasibility by digitizing one million books (less than 1% of all books ever published in all languages) by 2005.  We believe such a project has the potential to change how education is conducted in much of the world.  The project hopes to create a universal digital library that is free to read anytime, anywhere, by anyone.

 

Each of the million books is scanned.  If a book is in a language for which optical character recognition software is available, the text is converted to ASCII/Unicode format to allow full-text search to guide students, scholars, and citizens to the relevant portions of the work.  Scanner operators create metadata to accompany each book, based on existing cataloging records for these books and journals.

 

This project enhances research, learning, and teaching by making a critical mass of scholarly information freely available to read online.  It has been observed that the result will be like Vannevar Bush’s Memex.  The collection will have its own indexes; in addition, major indexers such as Google will index it, and others, including libraries participating in the project, will hyperlink to it.

 

A secondary objective of this project will be to provide a test bed that will support other researchers who are working on improved scanning techniques, improved optical character recognition, and improved indexing.  The corpus this project creates will be at least ten times as large as any existing free resource.

 

B.        Primary Benefit

 

The primary benefit is to supplement the formal education system by making knowledge available to anyone who can read and has access.  Libraries have played a vital role in the advancement of human society.  Societal advance depends on young people having access to books via libraries and other means.  We expect that making this unique web resource available free to everyone in the U.S. and around the world will lead to a further democratization of access to knowledge.

 

Libraries are unevenly distributed around the world and within countries.  In the U.S., the NCES Survey noted that in 1996, 3,408 of 3,792 institutions of higher education had libraries, holding 806.7 million volumes.  The 112 largest university libraries in the United States and Canada each have at least 1.8 million books; they are members of the Association of Research Libraries (ARL).  ARL libraries in Massachusetts hold about 25 million volumes, those in New York about 31 million, and those in California about 40 million (Association of Research Libraries, 1999/2000).  Other states, such as North and South Dakota, have no large libraries.  A few large public libraries have several million volumes.  However, most junior colleges, high schools, and public libraries have much smaller collections.  Making this large knowledge repository available with the convenience of online access and the benefit of full-text word and phrase searching can revolutionize research at all levels of education and give a much-needed boost, at minimal cost, to our national educational infrastructure.

 

Secondary benefit:  Online search makes locating relevant information inside books far more reliable and much easier.  Students will more often find exactly what they seek, and that success will increase their willingness to perform research in this large resource.  NCES reports that 84 percent of libraries around the country are open between 60 and 80 hours a week.  This digital library would be open 24 hours a day, seven days a week, 365 days a year, for a total of 168 hours a week, over twice the time most libraries are open.  More than one individual will be able to use the same book at the same time, so popular works will never be checked out and unavailable to others.

 

This million-book project will produce an extensive and rich testbed for use in further textual language processing research.  It is hoped that at least 10,000 books among the million will be available in more than one language, providing a key testing area for problems in example-based machine translation.  In the last stage of the project, books in multiple languages will be reviewed to ensure that this goal is met.

 

Many believe that knowledge is now doubling every two to three years.  Machine summarization, intelligent indexing, and information mining are tools that individuals will need in order to keep up in their disciplines, in their businesses, and in their personal interests.  This large digitization project will support research in these areas.

 

C.       Status to Date

 

The preliminary work described below has been used to establish a protocol, to select standards to be used, and to address issues of indexing and retrieval.  Workflow and training programs to support the larger project are being developed.  Both the content and the mechanisms for using it will be made available in open source code.

 

The National Science Foundation’s 2000 ITR grant cycle provided $500,000 for equipment to begin a large pilot.  That grant will allow the purchase of 18 Minolta book scanners to be located in India and China.  Some machines have already been deployed to begin the scanning process. Strong discounts from Minolta have expanded the number of machines that can be purchased.  Earlier pilot projects, a 100-book scanning project and a 1000-book scanning project, that aided in the selection of the scanners and the establishment of processes used are described more fully below. 

 

Chinese University Presidents, a Ministry of Education official, and Chinese Academy of Sciences leaders visited the U.S. to reach agreements and to form a steering committee. 

Dr. Michael Lesk and Dr. Stephen Griffin from NSF attended the Carnegie Mellon meeting and also hosted the Chinese delegation at the National Science Foundation.  Professor PAN Yunhe, President of Zhejiang University; Dr. GAO Wen, Deputy President of the Graduate School, Chinese Academy of Sciences; Professor CHI Huisheng, Vice President of Beijing University; Professor HU Dongcheng, Vice President of Tsinghua University; Professor XU Zhong, Vice President of Fudan University; Professor ZHANG Yibin, Assistant to the President, Nanjing University; Mr. GUO Xinli, Vice General Director, Ministry of Education of China; Mr. CHEN Jianping, Vice Director, State Planning Commission of China; and Dr. Ching-Chih Chen of Simmons College attended.  The National Science Foundation funded this summit.

 

Indian university and government officials are scheduled to visit on May 26, 2002, and similar agreements are expected to be reached.

 

Under a grant from NSF, U.S. Digital Library Federation members met on November 15 and 16, 2001 to work out the logistics of selecting and transporting materials from U.S. collections.  Drs. Lesk and Griffin were joined in Pittsburgh by representatives from OCLC and the Center for Research Libraries, and by collection development officers and other librarians from the Library of Congress, the University of Washington, the University of California, Berkeley, Stanford University, the University of Illinois, the University of Chicago, Penn State University, and the University of Pittsburgh.  The Digital Library Federation’s Executive Director also attended the meeting.

 

       The collection development librarians discussed:

 

·          Collection focus: to achieve a consensus on how to select the million books to be digitized.

·          Involvement of outside scholars in selection issues: to consider how non-librarian scholars might participate in selection.

·          Copyright considerations: to consider seeking permission for a set of in-copyright “best books”, such as those in Books for College Libraries.

·          Standards for the work: to review the current Digital Library Federation standards with a view to rapid adoption.

·          Registry issues: to move forward with OCLC in establishing a registry for books selected.

·          Methods of transport: to consider alternative means of transport and return.

·          Timing: to weigh the advantages of air containers and sea containers.

·          Level of participation: to determine minimum levels for contributors to the project.

·          Incentives for participation: to establish means of recognition for contributions through screen display and copies of the archives.

 

This meeting will result in a plan for the selection and shipment of almost a million books to China and India over a multiyear period, and a plan for assessing the success of the project annually.

 

D.       Technical Approach

 

1.      Database creation

 

Creating a scalable database to support this project is the subject of a related research proposal.  Drs. Christos Faloutsos, Jeffrey Eppinger, and Natalia Allamachi are submitting a proposal to NSF to address these issues.  Their globally distributed database will appear as a single virtual central database from anywhere in the world.  Mirroring the database in several countries will ensure security and availability.

 

The database will house both an image file and a text file for each book, at about 10-20 megabytes per book.  The aggregate of 10 to 20 terabytes will be affordable to store because the costs of storage continue to decline substantially.  By 2010, a terabyte of storage is expected to cost as little as $10.
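
The aggregate figure follows from simple arithmetic on the per-book estimate.  A minimal sketch, assuming decimal units (1 terabyte = 1,000,000 megabytes); the function name is illustrative only and is not part of the project software:

```python
BOOKS = 1_000_000
MB_PER_TB = 1_000_000  # assumption: decimal units, 1 TB = 10^6 MB


def total_terabytes(books: int, mb_per_book: float) -> float:
    """Aggregate storage, in terabytes, for a collection of scanned books."""
    return books * mb_per_book / MB_PER_TB


# Per-book range taken from the text: 10-20 megabytes (images plus text).
low = total_terabytes(BOOKS, 10)
high = total_terabytes(BOOKS, 20)
print(f"{low:.0f}-{high:.0f} TB")  # prints "10-20 TB"
```

Each full mirror of the collection would need the same amount again, so the total across sites scales with the number of mirrors.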

 

2.      Scanning

 

100 book pilot: Two years ago, we funded a pilot experiment to scan 100 books so that the practical difficulties of a million book project could be assessed.  Carnegie Mellon University Libraries faculty and staff assisted in the pilot.  The scanner of choice was an inexpensive duplex scanner that required the books to be disbound so that the pages could be fed through in batches.  While the economy and speed of this technique were most attractive, several technical problems occurred.

 

·    The pages had to be cut on all four edges for smooth feeding.  The project required the purchase of a $10,000 guillotine to accomplish this.  The guillotine was somewhat dangerous, required in-depth training in use and safety, slowed the process, was a public relations nightmare for the library community, and negated the economy of the inexpensive scanner.

·    Dust, an inevitable accompaniment to older books, proved to be a formidable opponent.  Dust caused frequent jams, after each of which the scanner had to be cleaned.  Paper fixatives were employed to counteract the dust, but spraying on the fixative slowed the project and was not entirely satisfactory.

 

At the end of the first hundred books, the scanner operators and their supervisor sought another approach.

 

1000 book project.  Books 200 through 1000 were scanned using a Minolta overhead scanner.  Although this scanner was five times more expensive than the roll-feed, double-sided scanner used in the first pilot, it proved to be more reliable.  Books did not have to be disbound.  The image processing software for curvature correction, deskewing, despeckling, and cropping allows thick books to be scanned either flat or in an angled cradle that reduces wear on the spine.  Thorough training is required to operate the scanner, but several different employees were successfully trained to use it during the period of the project.  The results of this 1000 book project can be viewed at www.ulib.cs.cmu.edu under “1000 book project”.  This scanner and these processes are the ones recommended for the million book project.  The advantages of the Minolta approach include:

 

·  Disbinding via a guillotine is not necessary.

·  Books can be reused in their original form.

·  Dust, thick paper, and long books can be easily accommodated.

·  Training requirements are reasonable.

·  Equipment is reliable.

 

3.      Data Production

 

·    Bitonal images with a pixel depth of 1 bit per pixel were scanned at a resolution of 600 dots per inch (dpi). Images are stored as "Intel" (little-endian byte order) TIFF (Tagged Image File Format) files, with the header content specified. The compression algorithm used is ITU (formerly CCITT) Group 4.

·   TIFF version 5.0 is acceptable. Subject to testing, version 6.0 (or later) may also be acceptable.

·    Initial-capture system includes dynamic thresholding or a similar feature to capture variability of darkness in the imprint and possibly darker (e.g., foxed) backgrounds from decay.  Images should be as readable as the original pages.

·    "Typical" or "expected" data to be provided for most TIFF tags (normally, the data supplied by software default settings). A specification for the TIFF header to be produced to include scanner technical information, filename, and other data, but to be in no way a burden on the production service. 

·  Images written in sequential order, with corresponding 8.3 file names, e.g., 00000001.tif as first image in volume sequence and 00000341.tif as 341st image in volume sequence

·  Volumes to be provided to Million Book Project by libraries with unique identifiers that conform to 8.3 format; images should be in directories named with corresponding identifier (e.g., akf3435.001 as identifier for volume will result in directory with same name, and 00000001.tif through 0000000N.tif within that directory)

·  Images and directories (as specified above) to be written by Million Book Project to gold CD-ROM meeting agreed upon specifications, and using ISO9660 format.

·  Skew to be within a specified range of degrees allowed.
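
The naming rules in the list above can be sketched in a few lines.  This is an illustrative sketch only, not the project's production software: the function names are hypothetical, and the 8.3 check assumes that every name carries an extension, which is stricter than the 8.3 convention itself requires.

```python
from pathlib import PurePosixPath


def image_name(sequence: int) -> str:
    """Return the 8.3 file name for the Nth image in a volume (1-based)."""
    if not 1 <= sequence <= 99_999_999:
        raise ValueError("sequence must fit in eight digits")
    return f"{sequence:08d}.tif"


def image_path(volume_id: str, sequence: int) -> PurePosixPath:
    """Place the image in a directory named after the volume identifier."""
    return PurePosixPath(volume_id) / image_name(sequence)


def is_8_3(name: str) -> bool:
    """True if name has a 1-8 character base and a 1-3 character extension.

    Assumption: an extension is required (as in the identifiers in the text).
    """
    base, dot, ext = name.rpartition(".")
    return bool(dot) and "." not in base and 1 <= len(base) <= 8 and 1 <= len(ext) <= 3


# Example identifier taken from the text above.
print(image_path("akf3435.001", 341))  # akf3435.001/00000341.tif
```

Zero-padding to eight digits keeps the files in scan order under a plain alphabetical directory listing, which is presumably why the convention was chosen.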

 

4.      Optical Character Recognition (OCR)

 

The primary function of OCR is to allow searching inside the text.  In the pilot projects, the OCR program ABBYY FineReader was run after scanning was completed.  ABBYY FineReader was selected for its ability to keep words intact when they are hyphenated across two pages.  On English-language texts whose print has few broken letters, its accuracy is about 98% of the text.  Because words are often repeated, this 98% success rate will allow students and scholars to find relevant passages.  We do not plan to correct the OCR output as part of this project. 
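
The argument that repetition compensates for OCR errors can be made concrete with a rough model.  The sketch below is illustrative only and rests on two assumptions that the pilot did not measure: that the 98% figure is a per-character accuracy, and that recognition errors are independent of one another.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character of one word occurrence is read correctly."""
    return char_accuracy ** word_length


def findable(char_accuracy: float, word_length: int, occurrences: int) -> float:
    """Probability that at least one occurrence of the word is read correctly."""
    p_word = word_accuracy(char_accuracy, word_length)
    return 1 - (1 - p_word) ** occurrences


# An eight-letter search term that appears 1, 2, 3, or 5 times in a book.
for k in (1, 2, 3, 5):
    print(k, round(findable(0.98, 8, k), 4))
```

Under these assumptions, a single occurrence of an eight-letter word is read correctly about 85% of the time, while a word that occurs three times in a book has a better than 99% chance of being found through at least one intact occurrence.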

 

More sophisticated programs that use voting systems to resolve differing interpretations are available, but their licenses are too expensive.  Chinese and Japanese OCR programs are also available and will be used whenever possible.  Providing a testbed that will allow for the creation of even better OCR programs is a secondary goal of this project.  Scholars may wish to run newer OCR programs over the scans and even to correct the output. 

 

5.       Metadata

 

Digital Library Federation standards and metadata best practices will be used throughout this project.  Bibliographic metadata for the pilot project will be derived from existing library catalog records.  Carnegie Mellon libraries developed software that uses the standard Z39.50 protocol to search catalogs and retrieve the relevant metadata from catalog record fields.  Thus, author, title, and publication data do not have to be rekeyed.
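
Once a catalog record has been retrieved, deriving the bibliographic metadata is a matter of mapping record fields to metadata elements.  The sketch below is hypothetical and does not reproduce the Carnegie Mellon software: it assumes the record is in MARC form and has already been fetched and flattened into a dictionary keyed by field tag, and the sample record contains placeholder values only.

```python
# Assumption: MARC field tags 100 (main author), 245 (title statement),
# and 260 (publication information) carry the data of interest.
MARC_TO_METADATA = {
    "100": "author",
    "245": "title",
    "260": "publication",
}


def extract_metadata(record: dict[str, str]) -> dict[str, str]:
    """Copy the mapped fields out of a catalog record, skipping absent ones."""
    return {
        name: record[tag].strip()
        for tag, name in MARC_TO_METADATA.items()
        if tag in record
    }


# Placeholder record for illustration; not a real catalog entry.
sample_record = {
    "100": "Author, Example",
    "245": "An example title",
    "260": "Place : Publisher, 1901",
    "300": "200 p.",
}
print(extract_metadata(sample_record))
```

Fields that are not in the mapping, such as the physical description in tag 300, are simply ignored, so the mapping table is the only place that needs to change if more elements are wanted.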

 

Another research project associated with this project will be the creation of software that automatically creates "document structure" metadata.  This metadata allows users to navigate through the chapters and other parts of a book successfully.  Entering such information manually is too time consuming for this project, but automatic metadata creation programs can be utilized subsequently.

 

Administrative metadata supports the maintenance and archiving of the paper or digital objects and ensures their long-term availability by providing information about how the files were created and stored.  Administrative metadata will be maintained internally as file descriptions in the project databases and externally as part of the copyright permission database.

 

The Digital Library Federation, a supporter of this project, has several initiatives underway that will allow commercial browsers to harvest metadata more aggressively. The results of DLF’s metadata harvesting project will be explored for possible application to the resources produced in this project (www.diglib.org).

 

6.       Quality control

 

The standards established for quality control are those currently endorsed by the Digital Library Federation, whose mission includes the establishment of best practices and the development of standards.  The project must maintain a 98% accuracy rate for the quality of images and the inclusion of all pages.  Even so, some pages will be missed, so a process must be developed that allows users to report missing pages and allows those pages to be scanned and inserted into the existing scanned text.  Because the owning library will have to pull the book, scan the pages, and transport the file, this process will be expensive.  Maintaining high quality the first time the book is scanned will be essential.  A demonstration of high-quality, reliable work done on materials currently in China and India will give U.S. libraries confidence that their collections should be shared.

 

E.       Content

 

Seeking to develop a collection of one million digital books, the Million Book Project envisages a staged approach as described below.   The Million Book project will adhere to copyright law.  U.S. collections will primarily include the following types of materials.

 

1.      Coordination of Selection

 

Creating one digital copy, which can then be easily mirrored in different locations, will suffice and will support the multiple uses an item may receive.  Preliminary discussions with OCLC as a host for a registry of scanned items are underway.  Certain key projects, such as the Making of America project, are already represented in the OCLC database as digital books.  Other large digitization projects may require some data entry of their content in order to avoid duplication.

 

2.      Non-copyrighted materials

 

Materials published before 1920 are in the public domain and may be scanned for this project.  Several large academic libraries are considering shipping materials from their depositories of little-used material to India and China.  These materials will be scanned there and then returned.  To reduce the costs of selection, the project will probably develop a strategy of selecting key topics and then removing large runs of books and journals from a selected depository.  Having a reasonable turnaround time will be essential to the success of the project.  A test will be devised to understand the logistics of shipping the materials and the impact of their absence from the home library. 

 

The 1909 copyright law granted copyright for 28 years.  Rights holders could then renew the copyright for another 28 years; many publishers and authors did not exercise that renewal option.  Thus, some materials published after 1922 (56 years prior to the 1978 effective date of the 1976 act) may be out of copyright.  To allow the status of these books to be checked efficiently, copyright renewal records for books from these years have been scanned and made available online at www.ulib.org.  Similar records for other formats, such as serials and audiovisual material, will also be made available as a part of this resource.
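
The dates in this paragraph follow from the 28-year terms.  The sketch below only reproduces that arithmetic and flags books for which a lookup in the renewal records might be informative; it is a deliberately conservative illustration that assumes the first term ended before the 1976 act took effect, it ignores all later changes to the law, and its function names are hypothetical.

```python
FIRST_TERM = 28             # initial term under the 1909 law, in years
RENEWAL_TERM = 28           # optional renewal term, in years
ACT_1976_EFFECTIVE = 1978   # effective date of the 1976 act

# 1978 - 56 = 1922, the year cited in the text.
CUTOFF = ACT_1976_EFFECTIVE - (FIRST_TERM + RENEWAL_TERM)


def first_term_end(publication_year: int) -> int:
    """Year in which the initial 28-year term ran out."""
    return publication_year + FIRST_TERM


def worth_checking_renewal(publication_year: int) -> bool:
    """True if the book was published after the cutoff and its first term
    ended before 1978, so its renewal record decides whether it lapsed."""
    return (
        publication_year > CUTOFF
        and first_term_end(publication_year) < ACT_1976_EFFECTIVE
    )


print(CUTOFF)                        # 1922
print(first_term_end(1930))          # 1958
print(worth_checking_renewal(1930))  # True
```

A book flagged in this way still has to be looked up in the scanned renewal records; the sketch narrows the set of books to check and does not determine the status of any of them.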

 

Government documents are also in the public domain and may be included in this project.   Many participating libraries are depositories for full runs of government documents and could supply them to the project, as could the Library of Congress.  The inclusion of documents will allow for more recent material to enter the project legally and to become available to a broader audience and in a more accessible manner.  Many government documents are currently available in digital form.  The creation of these back files would enhance those resources.

 

The Chinese delegation is most eager to have technical reports and science and technology dissertations as a part of this project.  The producing scholar and the university have copyright interests in these formats.  Gaining university permission might be fairly straightforward.  A good faith attempt would also have to be made to win the permission of the scholar.  That could be a part of an externally funded copyright clearance project, but no pilot has been done to allow for an estimate of contact rate and subsequent success.  If some arrangement could be made with University Microfilms to scan dissertations of selected universities from microfilm, which would be cheaper and easier to transport, such an initiative might satisfy a strong desire among all participants to increase science content. 

 

3.       Copyrighted materials

 

The 1998 copyright law grants copyright to authors for their lifetimes plus 70 years or for 95 years.  Patent law, by contrast, gives 20 years.  The Andrew W. Mellon Foundation’s JSTOR project developed the concept of a moving wall that allowed the inclusion of materials over five years old.  Journal publishers generally agreed that the economic value of that material was greatly reduced and granted permission for its inclusion in this most successful project.  A similar broad publisher agreement about the point at which the economic value of a print book declines is greatly needed, because books often go out of print in two or three years and can then remain in copyright but unavailable for over 90 years.

 

Dr. Raj Reddy and Dr. Peter Shane, Director of the Institute for the Study of Information, Technology and Society, recently had a conversation with a major book publisher to explore the possibility of a broad, publisher-level approach to obtaining copyright permissions.  Certain publishers, including the National Academy Press, have found that when they digitized their books, sales increased, because attention was focused on the material and scholars were not yet ready to read books online.  Authors’ guilds will also be contacted to see whether they would be interested in granting permissions.

 

Three conditions seem to be necessary to attract publishers to the scanning of their out of print but in copyright titles:

 

·  Publishers should receive a tax deduction for contributing a title to this project.  The tax deduction might reflect revenues previously generated by the title.

·  When a print on demand feature becomes a part of this project, publishers should collect royalties on books printed.