Contact: Joachim Bauer, Senior System Engineer, METS and ALTO Editorial Board Member

ALTO Introduction

Introduction

METS offers greater opportunities to reflect complex structures than any other standard. The METAe project group therefore chose METS for its challenging task of digitizing historic books and journals (1850-1920).
While METS is well suited to describing the structure of objects, a schema for the content and layout information of each part of an object was missing. The METAe project group therefore introduced the ALTO schema, which holds not only all the text of a page but also the coordinates of every word, paragraph, text block and illustration on it. This makes it possible to fully describe and reconstruct the layout and segmentation of the digitized original page.
During the METAe project, ALTO became a valuable extension schema for METS, at least for printed materials.

METS/ALTO XML Objects in Real Life

CCS developed docWORKS/METAe as content conversion software. Scanned images are processed in several steps (pre-processing, layout analysis, OCR, structure analysis) and exported as standard XML objects based on the METS and ALTO XML schemas. From the rich METS/ALTO XML object, derivatives such as PDF, METS/TEI or METS/TXT can easily be built using XSL style sheets.

Several national and general libraries, as well as other cultural and educational institutions, already use docWORKS to digitize and preserve their books, newspapers and journals, for example:

Harvard University Library
Library of Congress
Stanford University Library
University of Texas at Austin
Royal Danish Library
National Library of Finland
National Library of Norway
National Library of the Netherlands


ALTO in NDNP

For the NDNP (National Digital Newspaper Program), the Library of Congress was looking for a METS extension schema that describes the layout and content of printed pages. ALTO was a good fit, as it had already proven itself in the digitization of books and journals in previous years. In response to NDNP-related requests, the ALTO schema was extended to cover the program's needs.
ALTO 1.1 was released and published by the Library of Congress with some adaptations to the technical requirements of NDNP.

ALTO Description

ALTO is a standardized XML format that stores the layout information and OCR-recognized text of pages of any kind of printed document, such as books, journals and newspapers. It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard): METS provides the metadata and structural information, while ALTO contains the content and physical layout information.

Each ALTO file contains a style section that lists the styles (for paragraphs and fonts) used in the file. The layout section describes what is on the page. A page is divided into several regions (print space, left margin, right margin, top margin and bottom margin), and for each region all objects detected inside it are listed.
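
The structure described above can be illustrated with a short Python sketch that walks the layout section of an ALTO file and lists the words found in each page region. The sample document is invented for illustration and is heavily simplified: real ALTO files declare an XML namespace that depends on the schema version and carry many more attributes. The names `SAMPLE_ALTO` and `list_words` are hypothetical and not part of any CCS software.

```python
import xml.etree.ElementTree as ET

# Invented, simplified sample. Real ALTO files declare a namespace and
# contain far more detail; this one only mirrors the overall structure.
SAMPLE_ALTO = """\
<alto>
  <Description>
    <MeasurementUnit>mm10</MeasurementUnit>
  </Description>
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times New Roman" FONTSIZE="10"/>
    <ParagraphStyle ID="PAR_0" ALIGN="Left"/>
  </Styles>
  <Layout>
    <Page ID="P1" HEIGHT="2970" WIDTH="2100">
      <TopMargin ID="P1_TM"/>
      <LeftMargin ID="P1_LM"/>
      <RightMargin ID="P1_RM"/>
      <BottomMargin ID="P1_BM"/>
      <PrintSpace ID="P1_PS">
        <TextBlock ID="TB1" HPOS="200" VPOS="300" WIDTH="1700" HEIGHT="60">
          <TextLine HPOS="200" VPOS="300" WIDTH="1700" HEIGHT="60">
            <String CONTENT="Hello" HPOS="200" VPOS="300" WIDTH="250" HEIGHT="60"/>
            <SP/>
            <String CONTENT="ALTO" HPOS="480" VPOS="300" WIDTH="220" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def list_words(alto_xml):
    """Return (measurement_unit, words) for a namespace-free ALTO string.

    Each word is a dict holding the region it was found in, its text and
    its position and size in the file's measurement unit.
    """
    root = ET.fromstring(alto_xml)
    unit = root.findtext("Description/MeasurementUnit")
    words = []
    for page in root.iter("Page"):
        # The direct children of a page are its regions
        # (margins and print space).
        for region in page:
            for string in region.iter("String"):
                words.append({
                    "region": region.tag,
                    "text": string.get("CONTENT"),
                    "hpos": float(string.get("HPOS")),
                    "vpos": float(string.get("VPOS")),
                    "width": float(string.get("WIDTH")),
                    "height": float(string.get("HEIGHT")),
                })
    return unit, words


unit, words = list_words(SAMPLE_ALTO)
print(unit)
for word in words:
    print(word["region"], word["text"], word["hpos"], word["vpos"])
```

For a production file, the element names would need to be qualified with the namespace declared in that file.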

Measurements in ALTO XML files are given in 1/10 mm or in 1/1200 inch. For presentation purposes one might want to create low-resolution images. To use the coordinates in the ALTO file at any image resolution, they need to be transformed into pixels.

The conversion depends on the image resolution, expressed in pixels per inch (dpi). For inch1200 values, convert to pixels as follows:
pixel = value * resolution / 1200

For 1/10 mm values (one inch equals 254 tenths of a millimetre), convert to pixels as follows:
pixel = value * resolution / 254
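
The two conversion formulas above can be wrapped in a small helper. This is a minimal Python sketch, assuming the resolution is given in pixels per inch (dpi); the function name `to_pixels` and the unit label `mm10` for 1/10 mm values are choices made for this example.

```python
# Number of measurement units per inch for each supported unit.
UNITS_PER_INCH = {
    "inch1200": 1200.0,  # 1/1200 inch
    "mm10": 254.0,       # 1/10 mm (25.4 mm per inch)
}


def to_pixels(value, unit, resolution_dpi):
    """Convert an ALTO coordinate or size to pixels.

    value          -- coordinate or size as stored in the ALTO file
    unit           -- "inch1200" or "mm10" (labels assumed for this sketch)
    resolution_dpi -- resolution of the target image in pixels per inch
    """
    try:
        per_inch = UNITS_PER_INCH[unit]
    except KeyError:
        raise ValueError(f"unsupported measurement unit: {unit!r}")
    return value * resolution_dpi / per_inch


# One inch in either unit maps to resolution_dpi pixels.
print(to_pixels(1200, "inch1200", 300))
print(to_pixels(254, "mm10", 300))
# Width of an A4 page (210 mm) on a 72 dpi preview image, rounded.
print(round(to_pixels(2100, "mm10", 72)))
```

The result is returned as a float so that the caller can decide how to round, for example when several coordinates are added before drawing.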

© 2010 • CCS Content Conversion Specialists GmbH • Weidestrasse 134 • 22083 Hamburg • T +49 40 227 130 0 • F +49 40 227 130 11 • info@content-conversion.com