File : unicode.ads
-----------------------------------------------------------------------
-- XML/Ada - An XML suite for Ada95 --
-- --
-- Copyright (C) 2001 --
-- ACT-Europe --
-- Author: Emmanuel Briot --
-- --
-- This library is free software; you can redistribute it and/or --
-- modify it under the terms of the GNU General Public --
-- License as published by the Free Software Foundation; either --
-- version 2 of the License, or (at your option) any later version. --
-- --
-- This library is distributed in the hope that it will be useful, --
-- but WITHOUT ANY WARRANTY; without even the implied warranty of --
-- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU --
-- General Public License for more details. --
-- --
-- You should have received a copy of the GNU General Public --
-- License along with this library; if not, write to the --
-- Free Software Foundation, Inc., 59 Temple Place - Suite 330, --
-- Boston, MA 02111-1307, USA. --
-- --
-- As a special exception, if other files instantiate generics from --
-- this unit, or you link this unit with other files to produce an --
-- executable, this unit does not by itself cause the resulting --
-- executable to be covered by the GNU General Public License. This --
-- exception does not however invalidate any other reasons why the --
-- executable file might be covered by the GNU Public License. --
-----------------------------------------------------------------------
-- This package provides support for wide-characters in Unicode/Iso 10646
-- encoding.
-- A series of child packages are given to convert from any encoding to
-- Unicode.
-- It also supports several transformation format (ie serialization of
-- these characters to files), like UTF8, UTF16,...
-- Vocabulary used in this package: This is only a small extract of
-- documents found at http://www.unicode.org/unicode/reports/tr17
--
-- Repertoire
-- ==========
-- Set of abstract characters to be encoded, normally a familiar alphabet or
-- symbol set.
-- Unicode is one such repertoire, although an open one. New entries are
-- added to it, but none is ever deleted from it.
-- Internally, this package converts all characters to entries in the Unicode
-- repertoire
--
-- Glyphs
-- ======
-- A particular image which represents a character or part of a character. For
-- instance, a given character might have a slightly different aspect in
-- different fonts.
-- Note that a single glyph can correspond to a sequence of characters, or a
-- single character to a sequence of glyphs.
-- This package doesn't deal at all with glyphs, this is left to the end-user
-- application
--
-- Subsets
-- =======
-- Unicode is intended to be a universal repertoire, with all possible
-- characters. Most applications will only support a subset of it, given the
-- complexity of some scripts.
-- The Unicode standad includes a set of internal catalogs, called
-- collections. Several child packages exist to support these collections.
--
-- Coded character sets (packages Unicode.CCS.*)
-- ====================
-- Mapping from a set of abstract characters to the set of non-negative
-- integers
-- The integer associated with a character is called "code point", and the
-- character is called "encoded character"
-- Examples of these are: ISO/8859-1, JIS X 0208, ...
--
-- Character naming (packages Unicode.Names.*)
-- ================
-- A unique name is assigned to each abstract character, so that it is
-- possible to get the same character no matter what repertoire is used.
--
-- Character Encoding Forms
-- ========================
-- Mapping from the set of integers used in a Coded Character Set to the set
-- of sequences of code units.
-- A "code unit" is integer occupying a specified binary width in a computer
-- architecture
-- Examples of fixed-width encoding forms: 7-bit, 8-bit, EBCDIC
-- Examples of variable-width encoding forms: Utf-8, Utf-16,...
--
-- Character Encoding Scheme (packages Unicode.CES.*)
-- =========================
-- Mapping of code units into serialized byte sequences. It also takes into
-- account the byte-order serialization.
-- As a summary, converting a file containing latin-1 characters coded on
-- 8 bits to a Utf8 latin2 file, the following steps are involved:
--
-- Latin1 string (contains bytes associated with code points in Latin1)
-- | "use Unicode.CES.Basic_8bit.To_Utf32"
-- v
-- Utf32 latin1 string (contains code points in Latin1)
-- | "Convert argument to To_Utf32 should be
-- v Unicode.CCS.Iso_8859_1.Convert"
-- Utf32 Unicode string (contains code points in Unicode)
-- | "use Unicode.CES.Utf8.From_Utf32"
-- v
-- Utf8 Unicode string (contains code points in Unicode)
-- | "Convert argument to From_Utf32 should be
-- v Unicode.CCS.Iso_8859_2.Convert"
-- Utf8 Latin2 string (contains code points in Latin2)
-- In the package below, all the functions Is_* are based on values defined
-- in the XML standard.
-- Several child packages are provided, that support different encoding
-- forms, and can all convert from and to Utf32, which thus behaves as the
-- reference.
package Unicode is
type Unicode_Char is mod 2**32;
-- A code point associated with a given character, taken in the Unicode
-- repertoire.
-- Note that by design, the first 127 entries are taken in the ASCII set
-- and are fully compatible. You can thus easily compare this with
-- constant characters by using Character'Pos ('.')
function Is_White_Space (Char : Unicode_Char) return Boolean;
-- Return True if Char is a space character, ie a space, horizontal tab,
-- line feed or carriage return.
function Is_Letter (Char : Unicode_Char) return Boolean;
-- True if Char is a letter.
function Is_Base_Char (Char : Unicode_Char) return Boolean;
-- True if Char is a base character.
function Is_Digit (Char : Unicode_Char) return Boolean;
-- True if Char is a digit.
function Is_Combining_Char (Char : Unicode_Char) return Boolean;
-- True if Char is a combining character (ie a character that
-- applies to the preceding character to change its meaning, like
-- accents in latin-1).
function Is_Extender (Char : Unicode_Char) return Boolean;
-- True if Char is an extender character.
function Is_Ideographic (Char : Unicode_Char) return Boolean;
-- True if Char is an ideographic character (asian languages).
function To_Unicode (C : Character) return Unicode_Char;
-- Convert from Ada Character encoding (extended ASCII) to Unicode
-- character.
private
pragma Inline (Is_Ideographic);
pragma Inline (Is_Letter);
pragma Inline (Is_White_Space);
pragma Inline (To_Unicode);
end Unicode;