Book scanning

From Wikipedia, the free encyclopedia


Book scanning is the process of converting physical books into digital images or electronic books (e-books) via image scanning. Scanning is far less time-intensive than re-typing the text, which was generally the only option before scanning became feasible.

Once a book has been digitally scanned, the images are available for rapid distribution, reproduction, and on-screen reading. Such book images are commonly stored in the DjVu, Portable Document Format (PDF), or Tagged Image File Format (TIFF) formats. Optical character recognition (OCR) brings additional benefits: it converts images of book pages into a machine-processable encoding of the book's text, which dramatically reduces the storage needed for the book and allows the text to be reformatted, searched, or used as input for text processing applications such as natural language processing.
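The storage reduction mentioned above can be illustrated with rough arithmetic. The following Python sketch is only an estimate under stated assumptions: the page size (6 by 9 inches), the scan resolution (300 dpi, 8-bit grayscale, uncompressed), and the figure of 2,000 characters per page are illustrative values, not figures from this article, and the function names are hypothetical.

```python
def uncompressed_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size in bytes of one uncompressed raster scan of a page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def plain_text_bytes(chars_per_page, bytes_per_char=1):
    """Size in bytes of one page of OCR text in a single-byte encoding."""
    return chars_per_page * bytes_per_char


# Assumed, illustrative figures (not taken from the article).
image_size = uncompressed_image_bytes(6, 9, 300, 8)  # 6 x 9 inch page
text_size = plain_text_bytes(2000)                   # ~2,000 characters
ratio = image_size / text_size

print(f"image: {image_size} bytes, text: {text_size} bytes, ratio: {ratio:.0f}x")
```

Compressed image formats shrink the page images considerably, so the gap in practice is smaller than this uncompressed estimate suggests, but the text encoding still needs far less storage.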


Commercial book scanners

Sketch of a V-shaped book scanner
Sketch of a typical manual book scanner

Commercial book scanners differ from ordinary scanners: they usually consist of a high-quality digital camera flanked by light sources, mounted on a frame that leaves the book accessible so that a person or machine can turn the pages. Some models use V-shaped book cradles, which support the book's spine and center the book automatically.

The main advantage of this type of scanner is speed: it is much faster than an overhead scanner. It is also much more cost-effective, since prices for traditional overhead scanners normally start at US$10,000.

Book scanning by organizations on a large scale

Projects like Project Gutenberg, Google Book Search, and the Open Content Alliance scan books on a large scale.

One of the main challenges is the sheer volume of books to be scanned, expected to be in the tens of millions. All of these must be scanned and then made searchable online so that the public can use them as a universal library. Large organizations currently rely on three main approaches: outsourcing, scanning in-house with commercial book scanners, and scanning in-house with robotic scanning solutions.

When scanning is outsourced, books are often shipped to low-cost providers in countries such as India or China. Alternatively, for reasons of convenience, safety, and improved technology, many organizations choose to scan in-house, using either overhead scanners, which are time-consuming, or digital camera-based scanning solutions, which are substantially faster. The camera-based method is employed by the Internet Archive as well as Google. Traditional methods have included cutting off the book's spine and scanning the pages in a scanner with automatic page feeding, then rebinding the loose pages afterwards.

Once a page is scanned, its text is either entered manually or extracted via OCR, which is another major cost of book scanning projects.

Due to copyright issues, most scanned books are those that are out of copyright; however, Google Book Search is known to scan books still protected under copyright unless the publisher specifically excludes them.

Destructive scanning

For book scanning on a low budget, the least expensive method to scan a book or magazine is to cut off the binding. This converts the book or magazine into a sheaf of loose-leaf papers, which can then be loaded into a standard automatic document feeder and scanned using inexpensive and common scanning technology. While this is not a desirable solution for very old and uncommon books, it is a useful tool for book and magazine scanning where the book is not an expensive collector's item and the scanned copy is easy to replace. There are two technical difficulties with this process: the first with the cutting and the second with the scanning.

Cutting

The proper way to cut a stack of 500 to 1000 pages in one pass is with a guillotine paper cutter. This is a large steel table with a paper vise that screws down onto the stack and firmly secures it before cutting. A large sharpened steel blade moves straight down and cuts the entire length of each sheet at once. A lever on the blade permits several hundred pounds of force to be applied for a quick one-pass cut.

A clean cut through a thick stack of paper cannot be made with a traditional inexpensive sickle-shaped hinged paper cutter. These cutters are only intended for a few sheets, with up to ten sheets being the practical cutting limit. A large stack of paper applies torsional forces on the hinge, pulling the blade away from the cutting edge on the table. The cut becomes more inaccurate as the cut moves away from the hinge, and the force required to hold the blade against the cutting edge increases as the cut moves away from the hinge.

The guillotine cutting process dulls the blade over time, so it must be resharpened periodically. Coated paper such as slick magazine paper dulls the blade more quickly than plain book paper, due to the kaolinite clay coating. Additionally, cutting the binding off an entire hardcover book causes excessive wear, because the blade must pass through the cover's stiff backing material. Instead, the outer cover can be removed so that only the interior pages need to be cut.

Scanning

Once the paper is liberated from the spine, it can be scanned one sheet at a time using a traditional flatbed scanner or automatic document feeder (ADF).

Pages with a decorative riffled edging, or with edges that curve in an arc due to a non-flat binding, can be difficult to scan using an ADF. An ADF is designed to scan pages of uniform shape and size, and variably sized or shaped pages can lead to improper scanning. The riffled or curved edges can be guillotined off to render the outer edges flat and smooth before the binding is cut.

The coated paper of magazines and bound textbooks can be difficult for the rollers in an ADF to pick up and guide along the paper path. An ADF which uses a series of rollers and channels to flip sheets over may jam or misfeed when fed coated paper. Generally, there are fewer problems when the paper path is as straight as possible, with few bends and curves. The clay can also rub off the paper over time and coat the sticky pickup rollers, loosening their grip on the paper. The ADF rollers may need periodic cleaning to prevent this slipping.

Magazines can pose a bulk-scanning challenge because the stack may contain small or nonuniform sheets, such as subscription cards and fold-out pages. These need to be removed before the bulk scan begins, and are either scanned separately if they include worthwhile content or simply left out of the scan process.
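Where sheet dimensions are known (for example, measured by hand while preparing the stack), the nonuniform sheets can be picked out programmatically. The following Python sketch is a hypothetical illustration, not a feature of any particular scanner or ADF: the function name, the 5% tolerance, and the sample measurements are all assumptions.

```python
from statistics import median


def flag_nonuniform_sheets(sheets, tolerance=0.05):
    """Return the indices of sheets whose width or height differs from
    the stack's median by more than `tolerance` (a fraction; 0.05 = 5%).

    `sheets` is a list of (width, height) tuples in any consistent unit.
    """
    if not sheets:
        return []
    median_width = median(w for w, _ in sheets)
    median_height = median(h for _, h in sheets)
    flagged = []
    for index, (width, height) in enumerate(sheets):
        width_off = abs(width - median_width) > tolerance * median_width
        height_off = abs(height - median_height) > tolerance * median_height
        if width_off or height_off:
            flagged.append(index)
    return flagged


# Hypothetical measurements in inches for a small magazine stack.
stack = [
    (8.0, 10.5),
    (8.0, 10.5),
    (5.0, 3.5),    # subscription card
    (8.0, 10.5),
    (16.0, 10.5),  # fold-out page, measured unfolded
    (8.0, 10.5),
]
odd_sheets = flag_nonuniform_sheets(stack)
print(odd_sheets)
```

The median is used as the reference size because a few odd sheets in a large stack barely affect it, whereas they would skew an average.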

Non-destructive scanning

In recent years, software-driven machines and robots have been developed to scan books without unbinding them, preserving both the contents of the document and a digital photographic record of its current state. This trend is due in part to ever-improving imaging technologies, which allow a high-quality digital archive image to be captured in a reasonably short time with little or no damage to a rare or fragile book. Some high-end scanning systems use vacuum, air, and static charges to turn pages while imaging is performed automatically, usually by a high-resolution camera located over an adjustable V-shaped cradle. Images are then passed from the imaging device to editing software, which can process them further into either an archival-quality file such as TIFF or JPEG 2000, or a web-friendly output such as JPEG or PDF.
