Contact

Joachim Bauer

Senior System Engineer
METS and ALTO
Editorial Board Member

ALTO Technical Information

References from METS to ALTO Files 

The reference starts from an area entry, carrying the FILEID and BEGIN attributes, within the structMap.

The FILEID refers to the matching file entry within the file group.

The BEGIN attribute then points into the ALTO file itself.

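The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the docWORKS implementation: the METS fragment, the IDs (ALTO00001, P1_TB00002) and the file name are invented for the example, and the function name resolve_area_references is hypothetical. Only the element names (fileGrp, file, FLocat, structMap, fptr, area) come from the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical METS fragment; IDs and the file location are made up.
METS_SAMPLE = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp ID="ALTOGRP">
      <mets:file ID="ALTO00001" MIMETYPE="text/xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="file://./00001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="ARTICLE">
      <mets:fptr>
        <mets:area FILEID="ALTO00001" BEGIN="P1_TB00002" BETYPE="IDREF"/>
      </mets:fptr>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""

def resolve_area_references(mets_xml):
    """Return (file location, ALTO element ID) pairs for every area in the structMap."""
    root = ET.fromstring(mets_xml)

    # Step 1: index the file group by file ID.
    locations = {}
    for file_el in root.iterfind(".//mets:fileGrp/mets:file", METS_NS):
        flocat = file_el.find("mets:FLocat", METS_NS)
        locations[file_el.get("ID")] = flocat.get(XLINK_HREF) if flocat is not None else None

    # Step 2: follow FILEID to the file entry; BEGIN names the ID inside the ALTO file.
    pairs = []
    for area in root.iterfind(".//mets:structMap//mets:area", METS_NS):
        pairs.append((locations.get(area.get("FILEID")), area.get("BEGIN")))
    return pairs

print(resolve_area_references(METS_SAMPLE))
```

The returned pair tells a viewer which ALTO file to open and which element ID to look up inside it.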
Structure of ALTO Files

The ALTO file consists of three major sections:

  • Description
  • Styles
  • Layout

The Description section contains metadata about the ALTO file itself and processing information on how the file was created.
The Styles section contains the text and paragraph styles with their individual descriptions:

  • TextStyle has font descriptions
  • ParagraphStyle has paragraph descriptions, e.g. alignment information

The Layout section contains the content information. It is subdivided into Pages.
A page consists of margins and a print space, all of which are non-intersecting rectangular areas within the page area. Each of these can contain any number of objects such as lines, images, text blocks and more. A text block is divided into text lines, and those are further divided into strings and spaces.
The global structure of the ALTO file is as follows:

alto
  Description
    MeasurementUnit
    sourceImageInformation
    Processing
  Styles
    TextStyle
    TextStyle
    ParagraphStyle
    ParagraphStyle
  Layout
    Page
      TopMargin
      LeftMargin
      RightMargin
      BottomMargin
      PrintSpace

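As a rough illustration of this structure, the following Python sketch parses a skeleton file and prints its element outline. The skeleton is an assumption made for the example: it omits the XML namespace that real ALTO files usually declare, and the attribute values (style IDs, font name, the unit value mm10) are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free skeleton of an ALTO file.
ALTO_SKELETON = """
<alto>
  <Description>
    <MeasurementUnit>mm10</MeasurementUnit>
    <sourceImageInformation/>
    <Processing/>
  </Description>
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times" FONTSIZE="10"/>
    <ParagraphStyle ID="PAR_0" ALIGN="Left"/>
  </Styles>
  <Layout>
    <Page ID="P1">
      <TopMargin/><LeftMargin/><RightMargin/><BottomMargin/><PrintSpace/>
    </Page>
  </Layout>
</alto>
"""

def outline(element, depth=0):
    """Return the element tree as a list of indented tag names."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(ALTO_SKELETON)
print("\n".join(outline(root)))
```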

TextStyles

TextStyle elements have no content. The attributes are:

  • FONTFAMILY
  • FONTSIZE
  • FONTCOLOR
  • FONTWEIGHT
  • FONTSTYLE
  • FONTPITCH
  • FONTCHARSET
  • UNDERLINED

Only FONTFAMILY and FONTSIZE are required.
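
A small check for the two required attributes might look like the sketch below. The helper name missing_textstyle_attrs and the sample elements (including their ID values) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The two attributes the text above names as required.
REQUIRED_TEXTSTYLE_ATTRS = ("FONTFAMILY", "FONTSIZE")

def missing_textstyle_attrs(text_style):
    """Return the required attribute names that are absent from a TextStyle element."""
    return [name for name in REQUIRED_TEXTSTYLE_ATTRS if name not in text_style.attrib]

# Hypothetical sample elements.
complete = ET.fromstring(
    '<TextStyle ID="TXT_0" FONTFAMILY="Times New Roman" FONTSIZE="10" FONTSTYLE="italics"/>'
)
incomplete = ET.fromstring('<TextStyle ID="TXT_1" FONTFAMILY="Arial"/>')

print(missing_textstyle_attrs(complete))
print(missing_textstyle_attrs(incomplete))
```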


ParagraphStyles

Paragraph styles have no content. The attributes, each with its allowed values, are:

  • ALIGN: one of Left, Right, Center, Block
  • LEFT: numeric
  • RIGHT: numeric
  • LINESPACE: numeric
  • FIRSTLINE: numeric

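The value constraints above can be sketched as a small parser. This is an illustration under assumptions: the function name parse_paragraph_style is hypothetical, the sample element is invented, and the numeric attributes are read as floats because the text only says "numeric".

```python
import xml.etree.ElementTree as ET

ALLOWED_ALIGN = {"Left", "Right", "Center", "Block"}
NUMERIC_ATTRS = ("LEFT", "RIGHT", "LINESPACE", "FIRSTLINE")

def parse_paragraph_style(element):
    """Read a ParagraphStyle element into a dict, validating ALIGN."""
    align = element.get("ALIGN")
    if align is not None and align not in ALLOWED_ALIGN:
        raise ValueError(f"unexpected ALIGN value: {align!r}")
    style = {"ALIGN": align}
    for name in NUMERIC_ATTRS:
        raw = element.get(name)
        # Absent attributes are kept as None rather than defaulted.
        style[name] = float(raw) if raw is not None else None
    return style

# Hypothetical sample element.
sample = ET.fromstring(
    '<ParagraphStyle ID="PAR_0" ALIGN="Block" LEFT="0" LINESPACE="12" FIRSTLINE="5"/>'
)
print(parse_paragraph_style(sample))
```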

Attributes of a Page Element

  • PAGECLASS
  • STYLEREFS
  • HEIGHT
  • WIDTH
  • PHYSICAL_IMG_NR
  • PRINTED_IMG_NR
  • QUALITY (OK, Damaged, Missing)
  • POSITION (Left, Right, Foldout, Single)
  • PROCESSING (A link to processing information)


Page Areas

Each page is divided into different areas (TopMargin, LeftMargin, RightMargin, BottomMargin and PrintSpace). The margins may contain text or other objects that are not part of the main body.

The positions are given as HPOS, VPOS, WIDTH and HEIGHT.

TopMargin

The area between the top line of print and the upper edge of the leaf. It may contain a page number, a running title or a complete page header.

LeftMargin

The left margin of a page. May contain margin notes.

RightMargin

The right margin of a page. May contain margin notes.

BottomMargin

The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word.

PrintSpace

Rectangle surrounding the printed area of a page. Page number and running title are not part of the print space.

 

The position of the margins on a page is illustrated in this picture.


The Structure of Each of the Page Area (PageSpace) Elements 

The page area elements have the attributes:

  • HPOS: horizontal position of the upper left corner (1/10 mm)
  • VPOS: vertical position of the upper left corner (1/10 mm)
  • WIDTH: width (1/10 mm)
  • HEIGHT: height (1/10 mm)
  • ROTATION: rotation in degrees, as a floating-point number (optional)

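Because positions are given in tenths of a millimetre, overlaying an area on the scanned image requires a conversion to pixels. One inch is 25.4 mm, i.e. 254 units, so pixels = value / 254 × dpi. The sketch below assumes the scan resolution is known from elsewhere; the function names are hypothetical and ROTATION is ignored.

```python
MM10_PER_INCH = 254.0  # 25.4 mm per inch, expressed in 1/10 mm

def mm10_to_pixels(value_mm10, dpi):
    """Convert a length in 1/10 mm to pixels at the given resolution."""
    return round(float(value_mm10) / MM10_PER_INCH * dpi)

def area_to_pixel_box(attrs, dpi):
    """Return (left, top, right, bottom) in pixels for an area's HPOS/VPOS/WIDTH/HEIGHT."""
    hpos, vpos = float(attrs["HPOS"]), float(attrs["VPOS"])
    width, height = float(attrs["WIDTH"]), float(attrs["HEIGHT"])
    return (
        mm10_to_pixels(hpos, dpi),
        mm10_to_pixels(vpos, dpi),
        mm10_to_pixels(hpos + width, dpi),
        mm10_to_pixels(vpos + height, dpi),
    )

# Hypothetical attribute values, as they would be read from the XML (strings).
print_space = {"HPOS": "127", "VPOS": "254", "WIDTH": "1270", "HEIGHT": "508"}
print(area_to_pixel_box(print_space, dpi=300))
```

Converting the right and bottom edges from the summed values, rather than adding a converted width to a converted position, avoids accumulating two rounding errors.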
Each page area may contain any number of elements. Each element is one of the following:

  • TextBlock: a block of text
  • ComposedBlock: a block that consists of other blocks
  • Illustration: a picture or image
  • GraphicalElement: a graphic used to separate blocks, mostly a line or a rectangle

Each of them may have the following attributes:

  • ID: unique ID
  • STYLEREFS: reference to text or paragraph styles
  • HPOS: horizontal position of the upper left corner (1/10 mm)
  • VPOS: vertical position of the upper left corner (1/10 mm)
  • WIDTH: width (1/10 mm)
  • HEIGHT: height (1/10 mm)
  • ROTATION: rotation in degrees, as a floating-point number (optional)
  • IDNEXT: reference to the next element in the reading order

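The IDNEXT attribute forms a linked chain through the blocks. A minimal sketch of following that chain is shown here; the sample block IDs are invented, the function name reading_order is hypothetical, and the sketch assumes a chain starts at a block that no other block points to.

```python
import xml.etree.ElementTree as ET

# Hypothetical page area whose blocks are stored out of reading order.
SAMPLE_AREA = """
<PrintSpace>
  <TextBlock ID="TB2" IDNEXT="TB3"/>
  <TextBlock ID="TB1" IDNEXT="TB2"/>
  <TextBlock ID="TB3"/>
</PrintSpace>
"""

def reading_order(area):
    """Return block IDs in reading order by following IDNEXT links."""
    blocks = {b.get("ID"): b for b in area if b.get("ID")}
    targets = {b.get("IDNEXT") for b in blocks.values() if b.get("IDNEXT")}
    # A chain starts at any block that no other block references.
    starts = [block_id for block_id in blocks if block_id not in targets]

    order, seen = [], set()
    for start in starts:
        current = start
        # Stop at the end of the chain, at a dangling reference, or on a repeat.
        while current in blocks and current not in seen:
            seen.add(current)
            order.append(current)
            current = blocks[current].get("IDNEXT")
    return order

print(reading_order(ET.fromstring(SAMPLE_AREA)))
```

Blocks that form a closed loop with no starting point are not returned by this sketch; a production reader would need to decide how to treat them.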
If the shape of the element is not rectangular, a SHAPE element may be added.

Polygons are coded as x,y x,y … with the coordinate pairs separated by spaces.

Circles and ellipses are, although allowed in principle, not supported by docWORKS. Instead, such shapes are represented as polygons with sufficient accuracy.
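
The polygon coding can be read with a few lines of Python. The sketch below only parses the coordinate string described above; the function names and the sample coordinates are assumptions, and how the string is attached to the SHAPE element is left out.

```python
def parse_polygon(points):
    """Parse 'x,y x,y ...' into a list of (x, y) float pairs."""
    pairs = []
    for token in points.split():
        x_text, y_text = token.split(",")
        pairs.append((float(x_text), float(y_text)))
    return pairs

def bounding_box(pairs):
    """Return (hpos, vpos, width, height) of the rectangle enclosing the polygon."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))

# Hypothetical polygon, in the same units as HPOS/VPOS.
polygon = parse_polygon("10,10 50,10 50,40 10,40")
print(polygon)
print(bounding_box(polygon))
```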

A TextBlock is divided into lines and those are divided into strings, spaces and hyphens:

TextBlock
  TextLine
    String
    SP
    String
    SP
    ...
  TextLine
    ...

Meaning of those tags:

  • TextLine: line of text
  • String: a single word
  • SP: white space
  • HYP: hyphenation

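To show how these tags combine, here is a minimal sketch that rebuilds the plain text of each line. The sample block is invented and namespace-free, the function names are hypothetical, and rendering HYP as "-" when it carries no CONTENT attribute is an assumption of this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical text block.
SAMPLE_BLOCK = """
<TextBlock ID="TB1">
  <TextLine>
    <String CONTENT="The"/><SP/><String CONTENT="quick"/><SP/><String CONTENT="brown"/>
  </TextLine>
  <TextLine>
    <String CONTENT="fox"/><SP/><String CONTENT="jumps"/>
  </TextLine>
</TextBlock>
"""

def line_text(text_line):
    """Concatenate the strings, spaces and hyphens of one TextLine."""
    parts = []
    for child in text_line:
        if child.tag == "String":
            parts.append(child.get("CONTENT", ""))
        elif child.tag == "SP":
            parts.append(" ")
        elif child.tag == "HYP":
            # Assumption: fall back to a plain hyphen if no CONTENT is given.
            parts.append(child.get("CONTENT", "-"))
    return "".join(parts)

def block_lines(text_block):
    """Return the text of every TextLine in a TextBlock, in document order."""
    return [line_text(line) for line in text_block.iter("TextLine")]

print(block_lines(ET.fromstring(SAMPLE_BLOCK)))
```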

Additional Attributes of the Tags

TextBlock

  • language

String

  • CONTENT: string content (word)
  • SUBS_TYPE: one of the following values
      • HypPart1: the content is the first part of a hyphenated word; applies only to the last word of a line, if it is hyphenated
      • HypPart2: the content is the second part of a hyphenated word; applies only to the first word of a line, if it is hyphenated
  • SUBS_CONTENT: complete content of a hyphenated word
  • WC: word confidence, the confidence level of the OCR result for this string; a float value between 0 (unsure) and 1 (confident)
  • CC: confidence level of each character in the string; a list of numbers, one number between 0 (confident) and 9 (unsure) for each character
  • STYLEREFS: text style used for this string, if it differs from the style of the parent text block
  • STYLE: any combination of font styles (italics, bold, …)
  • ALTERNATIVE (element): any number of alternative strings to be used instead

Illustration

  • TYPE: a user-defined description of the type of the illustration
  • FILEID: a link to a separate file that contains just the illustration

ComposedBlock

  • TYPE: a user-defined description of the type of the composed block
  • FILEID: a link to a separate file that contains just the composed block

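The String attributes are typically used together when extracting searchable text. The sketch below rejoins hyphenated words through SUBS_CONTENT, flags words with a low WC, and decodes CC. The sample data and function names are invented for illustration; accepting CC either as contiguous digits or as space-separated numbers is an assumption, since the text above only says "a list of numbers".

```python
import xml.etree.ElementTree as ET

# Hypothetical text block with one word hyphenated across two lines.
SAMPLE_BLOCK = """
<TextBlock ID="TB1" language="en">
  <TextLine>
    <String CONTENT="Optical" WC="0.98" CC="0000000"/><SP/>
    <String CONTENT="recog" SUBS_TYPE="HypPart1" SUBS_CONTENT="recognition" WC="0.71" CC="00102"/>
    <HYP/>
  </TextLine>
  <TextLine>
    <String CONTENT="nition" SUBS_TYPE="HypPart2" SUBS_CONTENT="recognition" WC="0.65" CC="010030"/><SP/>
    <String CONTENT="works" WC="0.40" CC="57309"/>
  </TextLine>
</TextBlock>
"""

def words(text_block):
    """Return the words of a block, with hyphenated words rejoined."""
    result = []
    for string in text_block.iter("String"):
        subs_type = string.get("SUBS_TYPE")
        if subs_type == "HypPart1":
            # Emit the complete word once, at its first part.
            result.append(string.get("SUBS_CONTENT", string.get("CONTENT", "")))
        elif subs_type == "HypPart2":
            continue  # already emitted with HypPart1
        else:
            result.append(string.get("CONTENT", ""))
    return result

def char_confidences(cc):
    """Decode a CC value into one integer (0 = confident, 9 = unsure) per character."""
    tokens = cc.split()
    if len(tokens) <= 1:
        tokens = list(cc.strip())  # assumption: contiguous digits, one per character
    return [int(token) for token in tokens]

def low_confidence_words(text_block, threshold=0.5):
    """Return the CONTENT of strings whose WC is below the threshold."""
    return [
        s.get("CONTENT")
        for s in text_block.iter("String")
        if s.get("WC") is not None and float(s.get("WC")) < threshold
    ]

block = ET.fromstring(SAMPLE_BLOCK)
print(words(block))
print(low_confidence_words(block))
print(char_confidences("00102"))
```

The threshold of 0.5 is arbitrary; a suitable cut-off depends on the OCR engine and the material.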
© 2010 • CCS Content Conversion Specialists GmbH • Weidestrasse 134 • 22083 Hamburg • T +49 40 227 130 0 • F +49 40 227 130 11 • info@content-conversion.com