UNICODE Doku-Datenbank

Date: Mon, 1 Dec 1997 11:35:07 METDST
From: "Bernhard Eversberg" <EV _at__ buch.biblio.etc.tu-bs.de>
Subject: UNICODE Doku-Datenbank

In the course of our concordance work we have built a database of character
codes based on the UNICODE standard. This database is being made freely
available.
We are also sending the following contribution to English-language lists,
which is why the text is in English.
Instructions for obtaining and installing the database are at the end.


UNIC : Unicode Reference Database of Character Encodings               971126
--------------------------------------------------------      UB Braunschweig


In the course of developing multiplatform and multipurpose database software
at Braunschweig University Library, we frequently encounter character conver-
sion tasks. Since there is an unmistakable trend towards UNICODE, we made an
effort to set up a universal reference database of encodings. From now on,
this should let us produce new conversion tables as needed, much more
quickly and reliably than before, since the database was derived from
the latest versions of official tables.
We are making the whole thing freely available to everybody who can see a
use for it.

The database serves two functions:

1. Browsing and searching in new ways within UNICODE and other code lists

2. Instantaneous production of conversion lists between any two of
   40 different code lists, including SGML, USMARC, EBCDIC etc.

UNIC contains records for some 6,600 characters, that is, all non-ideographic
characters currently covered by UNICODE. In a similar endeavor, a database
for the CJK scripts was set up at the Bodleian Library in Oxford by David
Helliwell. Since both databases operate under "allegro", our library data-
base system, we can make them both available on our FTP server (see below).
The CJK database, however, will take some more time since it needs updating.

There are 9 indexes:
1 : Names of characters (official UNICODE names, old and new versions)
    example: LATIN SMALL LETTER U WITH CARON
2 : Letters, Ligatures, Digits (names of letters only, old and new versions)
    example: U WITH CARON   (old: U HACEK)
3 : Keyword index  (all words occurring in the UNICODE names)
    example: every single word of LATIN SMALL LETTER U WITH CARON
4 : Directionality, Decomposition types  (technical details of UNICODE)
5 : Number values and digits (of those codes designating digits and numbers)
6 : Comments (just for convenience)
7 : General category  (a classification of symbols)
    example: 
8 : Related (equivalent) characters (of upper/lower equivalents)
9 : Unicode 2-Byte codes (hexadecimal)
    example: 01d4= LATIN SMALL LETTER U WITH CARON
    Lists of other code tables (see below)

These indexes can be browsed up and down like with any "allegro" database.

Example from index 2 : Letters, Ligatures, Digits 
    ....
    1  turned t
    1  turned v
    1  turned w
    1  turned y
   25  two
    1  two bar
    1  two full stop
    1  two period
    1  two with stroke
  105  u                  [i.e. there are 105 forms of the letter u!]
    2  u acute
    1  u bar
    2  u breve
    2  u circumflex
    2  u diaeresis
    2  u diaeresis acute
    2  u diaeresis grave
    2  u diaeresis hacek
    2  u diaeresis macron
    2  u double acute
    1  u final form
    2  u grave
    2  u hacek         -->  [record display: see below]
    2  u horn
    1  u isolated form
    2  u macron
    2  u ogonek
    2  u ring
    2  u tilde
    2  u with acute
    2  u with breve
    2  u with caron  
    2  u with circumflex
    2  u with circumflex below
    4  u with diaeresis
    2  u with diaeresis and acute
    2  u with diaeresis and caron
    2  u with diaeresis and grave
    2  u with diaeresis and macron
    2  u with diaeresis below
    2  u with dot below
    4  u with double acute
    2  u with double grave
    2  u with grave
    1  u with hamza above
    1  u with hamza above isolated form
    2  u with hook above
    2  u with horn
    2  u with horn and acute
    2  u with horn and dot below
    2  u with horn and grave
    2  u with horn and hook above
    2  u with horn and tilde
    2  u with inverted breve
    4  u with macron
    2  u with macron and diaeresis
    2  u with ogonek
    2  u with ring above          ... see display of this below
    2  u with tilde
    2  u with tilde and acute
    2  u with tilde below
    ....
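
An index of this kind can be approximated outside "allegro". The following is
only a rough sketch in Python, not the way UNIC itself builds index 2: it
assumes the character names shipped with Python's unicodedata module (a newer
UNICODE version than the one UNIC was built from), looks only at Latin letters
in the range 0000-024F, and simply strips the script and case words from each
name. The function name letter_index is made up for this illustration, and
the counts will not match the ones listed above.

```python
import unicodedata
from collections import Counter

# Assumption: only these two name prefixes are stripped; UNIC covers all scripts.
PREFIXES = ("LATIN SMALL LETTER ", "LATIN CAPITAL LETTER ")

def letter_index(first=0x0000, last=0x024F):
    """Count characters per letter name, ignoring script and case."""
    counts = Counter()
    for cp in range(first, last + 1):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        for prefix in PREFIXES:
            if name.startswith(prefix):
                counts[name[len(prefix):].lower()] += 1
                break
    return counts

index2 = letter_index()
for entry in ("u with caron", "u with ring above"):
    # Same layout as the index display: count, then entry.
    print(f"{index2[entry]:5d}  {entry}")
```

Each entry gets one count for the capital and one for the small letter, which
is why most Latin entries in the listing show a 2.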


The 2nd entry under "u with ring above" is this record, shown
here in the full display:


---------------------------------------------------------------------
|          UNICODE    016F    1 111          SGML  uring (ISOlat2)  |
|     Basic Letter    U                                             |
|             NAME    LATIN SMALL LETTER U WITH RING ABOVE          |
|                                                                   |
|General Category     Letter, Lowercase                             |
|                                                                   |
|Char.Decomposition   0075= LATIN SMALL LETTER U                    |
|                     030a= COMBINING RING ABOVE                    |
|                                                                   |
|  Old Unicode Name   LATIN SMALL LETTER U RING                     |
|  UpperCaseEquival   016E                                          |
|                                                                   |
|PC 852  latin2       133                                           |
|Win 1250  E.Eur.     249                                           |
|ISO 8859-2 latin2    249                                           |
|allegro-OSTWEST      158                                           |
---------------------------------------------------------------------
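
Most fields of such a record can be cross-checked against other sources. As a
hedged sketch (this is not how UNIC stores its data), the Python fragment
below collects comparable fields for the same character from Python's
unicodedata module and its built-in codecs. The function name describe, the
field labels and the choice of the codecs cp852, cp1250 and iso8859_2 as
stand-ins for the three tables are assumptions made for this example; the
SGML and "allegro" OSTWEST columns have no counterpart here.

```python
import unicodedata

# Assumption: these Python codecs correspond to the tables named in the record.
CODE_TABLES = {
    "PC 852": "cp852",
    "Win 1250": "cp1250",
    "ISO 8859-2": "iso8859_2",
}

def describe(char):
    """Collect UNIC-style fields for one character."""
    upper = char.upper()
    record = {
        "unicode": f"{ord(char):04X}",
        "name": unicodedata.name(char),
        "general_category": unicodedata.category(char),
        "decomposition": unicodedata.decomposition(char),
        # Some characters uppercase to more than one character; skip those.
        "uppercase": f"{ord(upper):04X}" if len(upper) == 1 else None,
    }
    for label, codec in CODE_TABLES.items():
        try:
            record[label] = char.encode(codec)[0]  # decimal code in that table
        except UnicodeEncodeError:
            record[label] = None  # character not present in this table
    return record

record = describe("\u016f")
for field, value in record.items():
    print(f"{field:>20}  {value}")
```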


How to use UNIC
---------------
Every "allegro" user will at once know how to use this database. We supply
a few directions for use aimed at those who don't have this knowledge:

1. Index browsing
The [Esc] key brings up the menu from which you choose the index you want to
see. Then, you just type the first few letters of the entry you are looking
for, then [Enter]. The cursor keys move the arrow ==> up and down the lists,
as you would expect. Press [Enter] to see the full records, then use 
[Cursor down] and [Cursor up] to move up and down between the full records
in the sequence of the index.
Or: use [Shift+F8] in the index to get an expanded index display, from which
you then choose the exact record you want to see.


2. Result sets
If, for example, you want to look at all occurrences of the small letter u,
you do this:
-- build a result set, using this sequence of actions:
   Alt+3         (go directly to index 3, the word index)
   [Backspace]   (if you already had a result set, to start a new one)
   u [Enter]     (go to the entries under "u")
   +             (make the entries under "u" your result set: 122 entries)
   small [Enter] (go to the entries under "small": 985 entries)
   +             (Boolean AND combination with your result set: 45 entries)
                 
   In the lower right corner, you now see  "... results"

-- look at the results: from the index display, press
   [Shift+F9]    (produces a brief result set listing)
   [Up/Down]     (position the arrow)
   [Enter]       (see the full record)
   [Up/Down]     (move between the full displays of the result records)
   [Esc]         (return to normal index display)
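
For readers without "allegro" at hand, the logic of this search can be sketched
in a few lines of Python. This is only an illustration of a keyword index
combined by Boolean AND, not the "allegro" implementation. It assumes the
names in Python's unicodedata module and a small code range (0020-024F), so
the counts differ from the ones quoted above; build_keyword_index is a
hypothetical helper name.

```python
import unicodedata
from collections import defaultdict

def build_keyword_index(first=0x0020, last=0x024F):
    """Map every word of a character name to the set of code points using it."""
    index = defaultdict(set)
    for cp in range(first, last + 1):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        for word in name.replace("-", " ").split():
            index[word.lower()].add(cp)
    return index

index3 = build_keyword_index()

result_set = index3["u"]                   # "+" on the entry "u"
result_set = result_set & index3["small"]  # "+" on "small": Boolean AND

# Brief listing of the first few results, in code order.
for cp in sorted(result_set)[:5]:
    print(f"{cp:04X}= {unicodedata.name(chr(cp))}")
```

The intersection keeps only the characters whose names contain both words,
which is what the second "+" does to the result set.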


List of character tables contained in UNIC database
---------------------------------------------------
The lists can be found in numerical order in index 9 under the 
following prefixes:
(Go to index 9, enter "m 850" for example)

m   MS-DOS PC code tables
    437  us
    850  latin1
    852  latin2 
    863  canada
    865  nordic
    861  iceland
    860  portug
    855  cyrill
    866  cyrillRuss
    869  greek
    857  turk
    862  hebrew
    864  arabic


w   MS-Windows code tables
    1252  Latin
    1250  Latin2
    1251  Cyrill
    1257  Baltic
    1253  Greek
    1254  Turkish
    1255  Hebrew
    1256  Arabic
    1258  Vietnam

i   ISO 8859 tables
    -1  Latin1: IDENTICAL with UNICODE x00-xFF !
    -2  Latin2
    -3  Latin3
    -4  Latin4
    -5  Cyrillic
    -6  Arabic
    -7  Greek
    -8  Hebrew
    -9  Latin5
    ddb MAB2 Codes of Deutsche Bibliothek Frankfurt (ISO 5426)

z   EBCDIC tables
     037  USCanada
     500  International
     875  Greek
    1026  Latin5Turkish
    ddb   MAB1  Deutsche Bibliothek's old code table


u   USMARC codes     
 
p   Pica codes

o   "allegro" so-called OSTWEST code list (extended PC 437)

s   SGML (including HTML) so-called "entities"

y   Macintosh Roman
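
The remark on ISO 8859-1 in the list above (identical with UNICODE x00-xFF)
can be verified independently of UNIC. The following Python sketch relies on
Python's own codec tables, which are assumed here to agree with the official
ones; the two function names are made up for this example.

```python
def latin1_matches_unicode():
    """True if every ISO 8859-1 byte decodes to the code point of equal value."""
    return all(bytes([code]).decode("iso8859_1") == chr(code)
               for code in range(0x100))

def differing_codes(codec):
    """Byte values whose meaning in `codec` differs from ISO 8859-1."""
    diffs = []
    for code in range(0x100):
        try:
            char = bytes([code]).decode(codec)
        except UnicodeDecodeError:
            char = None  # byte value undefined in this table
        if char != chr(code):
            diffs.append(code)
    return diffs

print(latin1_matches_unicode())
print(len(differing_codes("iso8859_2")), "codes differ in ISO 8859-2")
```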


Production of conversion lists
------------------------------

1. Press [4] on any record display (not index display)

2. Select the code you want to convert from

3. Select the code you want to convert into 

4. Enter 0 or 1 as prompted, for decimal or hexadecimal output

5. Press [F4] and confirm with [y]

The commented list comes out as a file named CLIST.
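
To give an idea of what such a list contains, here is a minimal Python sketch
that derives a commented conversion list between two code tables by going
through UNICODE as the pivot. It is not the UNIC export routine, and the
layout of the real CLIST file is not reproduced: the line format (source
code, target code, character name), the function name conversion_list and the
use of Python's cp852 and cp1250 codecs are all assumptions made for this
example.

```python
import unicodedata

def conversion_list(source, target, hexadecimal=False):
    """List codes 128-255 of `source` with their equivalents in `target`."""
    fmt = "{:02X} {:02X}  {}" if hexadecimal else "{:3d} {:3d}  {}"
    rows = []
    for code in range(0x80, 0x100):
        try:
            char = bytes([code]).decode(source)   # source code -> UNICODE
            converted = char.encode(target)[0]    # UNICODE -> target code
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue  # no equivalent in one of the two tables
        name = unicodedata.name(char, "(unnamed)")
        rows.append(fmt.format(code, converted, name))
    return rows

clist = conversion_list("cp852", "cp1250")
print("\n".join(clist[:5]))
```

Characters that exist in only one of the two tables are simply left out here;
a real conversion table would need an explicit substitute for them.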


How to get it and install it
----------------------------

It is a DOS application to run on any PC or in any DOS box.
(Time permitting, we'll set up a Web page for it.)

ftp 134.169.20.1
anonymous
<your e-mail address>
bin
cd formate
get unic.exe        [it contains unic.txt]
quit

Move UNIC.EXE to any subdirectory. Start it, and it will unpack itself.

In any case, you have to have ANSI.SYS installed for proper displays.
Then start UNI.BAT.

B. Eversberg




Bernhard Eversberg
Universitaetsbibliothek, Postf. 3329, 
D-38023 Braunschweig, Germany
Tel.  +49 531 391-5026 , -5011 , FAX  -5836
e-mail  B.Eversberg _at__ tu-bs.de