UNICODE documentation database
In the course of our concordance work we have built a database of character
codes based on the UNICODE standard. This database is being made freely
available.
We are also sending the following message to English-language lists,
which is why the text is in English.
Instructions for obtaining and installing the database are at the end.
UNIC : Unicode Reference Database of Character Encodings 971126
-------------------------------------------------------- UB Braunschweig
In the course of developing multiplatform and multipurpose database software
at Braunschweig University Library, we frequently encounter character
conversion tasks. Since there is an unmistakable trend towards UNICODE, we
made an effort to set up a universal reference database of encodings. From
now on, this should let us produce new conversion tables as needed, much more
quickly and reliably than before, because the database was derived from the
latest versions of the official tables.
We are making the whole thing freely available to everybody who can see a
use for it.
The database serves two functions:
1. Browsing and searching in new ways within UNICODE and other code lists
2. Instantaneous production of conversion lists between any two of
40 different code lists, including SGML, USMARC, EBCDIC etc.
UNIC contains records for some 6,600 characters, that is, all non-ideographic
characters currently covered by UNICODE. In a similar endeavor, a database
for the CJK scripts was set up at the Bodleian Library in Oxford by David
Helliwell. Since both databases operate under "allegro", our library database
system, we can make them both available on our FTP server (see below).
The CJK database, however, will take some more time since it needs updating.
There are 9 indexes:
1 : Names of characters (official UNICODE names, old and new versions)
example: LATIN SMALL LETTER U WITH CARON
2 : Letters, Ligatures, Digits (names of letters only, old and new versions)
example: U WITH CARON (old: U HACEK)
3 : Keyword index (all words occurring in the UNICODE names)
example: every single word of LATIN SMALL LETTER U WITH CARON
4 : Directionality, Decomposition types (technical details of UNICODE)
5 : Number values and digits (of those codes designating digits and numbers)
6 : Comments (just for convenience)
7 : General category (a classification of symbols)
example:
8 : Related (equivalent) characters (of upper/lower equivalents)
9 : Unicode 2-Byte codes (hexadecimal)
example: 01d4= LATIN SMALL LETTER U WITH CARON
Lists of other code tables (see below)
These indexes can be browsed up and down as in any "allegro" database.
Example from index 2 : Letters, Ligatures, Digits
....
1 turned t
1 turned v
1 turned w
1 turned y
25 two
1 two bar
1 two full stop
1 two period
1 two with stroke
105 u [i.e. there are 105 forms of the letter u!]
2 u acute
1 u bar
2 u breve
2 u circumflex
2 u diaeresis
2 u diaeresis acute
2 u diaeresis grave
2 u diaeresis hacek
2 u diaeresis macron
2 u double acute
1 u final form
2 u grave
2 u hacek --> [record display: see below]
2 u horn
1 u isolated form
2 u macron
2 u ogonek
2 u ring
2 u tilde
2 u with acute
2 u with breve
2 u with caron
2 u with circumflex
2 u with circumflex below
4 u with diaeresis
2 u with diaeresis and acute
2 u with diaeresis and caron
2 u with diaeresis and grave
2 u with diaeresis and macron
2 u with diaeresis below
2 u with dot below
4 u with double acute
2 u with double grave
2 u with grave
1 u with hamza above
1 u with hamza above isolated form
2 u with hook above
2 u with horn
2 u with horn and acute
2 u with horn and dot below
2 u with horn and grave
2 u with horn and hook above
2 u with horn and tilde
2 u with inverted breve
4 u with macron
2 u with macron and diaeresis
2 u with ogonek
2 u with ring above ... see display of this below
2 u with tilde
2 u with tilde and acute
2 u with tilde below
....
The 2nd entry under "u with ring above" is this record, shown here in the
full display:
---------------------------------------------------------------------
| UNICODE 016F 1 111 SGML uring (ISOlat2) |
| Basic Letter U |
| NAME LATIN SMALL LETTER U WITH RING ABOVE |
| |
|General Category Letter, Lowercase |
| |
|Char.Decomposition 0075= LATIN SMALL LETTER U |
| 030a= COMBINING RING ABOVE |
| |
| Old Unicode Name LATIN SMALL LETTER U RING |
| UpperCaseEquival 016E |
| |
|PC 852 latin2 133 |
|Win 1250 E.Eur. 249 |
|ISO 8859-2 latin2 249 |
|allegro-OSTWEST 158 |
---------------------------------------------------------------------
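For readers without access to "allegro", several fields of such a record can be approximated with Python's standard library. This is only a sketch: the helper name `describe` and the choice of codecs are our own assumptions, and the values come from the Unicode data and code page tables bundled with Python, not from UNIC itself.

```python
import unicodedata

# Hypothetical helper (not part of UNIC): collects a few of the fields shown
# in the record display from Python's bundled Unicode data and codecs.
def describe(ch, codecs=("cp852", "cp1250", "iso8859_2")):
    upper = ch.upper()
    record = {
        "unicode": f"{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),            # e.g. "Ll" = Letter, Lowercase
        "decomposition": unicodedata.decomposition(ch),  # hex code points, space separated
        # Only report a single-character uppercase equivalent that differs from ch.
        "uppercase": f"{ord(upper):04X}" if len(upper) == 1 and upper != ch else None,
    }
    for codec in codecs:
        try:
            record[codec] = ch.encode(codec)[0]          # decimal code in that table
        except UnicodeEncodeError:
            record[codec] = None                         # character not in that table
    return record

if __name__ == "__main__":
    for field, value in describe("\u016f").items():
        print(f"{field:15} {value}")
```

Calling `describe("\u016f")` should give the same name, decomposition, uppercase equivalent and code page positions as the record above, which is a convenient cross-check of both sources.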
How to use UNIC
---------------
Every "allegro" user will at once know how to use this database. We supply
a few directions for use aimed at those who don't have this knowledge:
1. Index browsing
The [Esc] key brings up the menu from which you choose the index you want to
see. Then, you just type the first few letters of the entry you are looking
for, then [Enter]. The cursor keys move the arrow ==> up and down the lists,
as you would expect. Press [Enter] to see the full records, then use
[Cursor down] and [Cursor up] to move up and down between the full records
in the sequence of the index.
Or: use [Shift+F8] in the index to get an expanded index display, from which
you then choose the exact record you want to see.
2. Result sets
If, for example, you want to look at all occurrences of the small letter u,
you do this:
-- build a result set, using this sequence of actions:
Alt+3 (go directly to index 3, the word index)
[Backspace] (if you already had a result set, to start a new one)
u [Enter] (go to the entries under "u")
+ (make the entries under "u" your result set: 122 entries)
small [Enter] (go to the entries under "small": 985 entries)
+ (Boolean AND combination with your result set: 45 entries)
In the lower right corner, you now see "... results"
-- look at the results: from the index display, hit
[Shift+F9] (produces a brief result set listing)
[Up/Down] (position the arrow)
[Enter] (see the full record)
[Up/Down] (move between the full displays of the result records)
[Esc] (return to normal index display)
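The logic behind such a result set is a keyword index combined with a Boolean AND. A minimal sketch in Python, assuming the character names bundled with Python's `unicodedata` module and an arbitrarily chosen code point range (both are our assumptions, so the counts will differ from those quoted for UNIC):

```python
import unicodedata
from collections import defaultdict

def build_keyword_index(first=0x0000, last=0x2FFF):
    """Map every word of a character name to the set of code points using it."""
    index = defaultdict(set)
    for cp in range(first, last + 1):
        name = unicodedata.name(chr(cp), "")   # "" for unnamed code points
        for word in name.split():
            index[word.lower()].add(cp)
    return index

index = build_keyword_index()

# Boolean AND of the entries under "u" and under "small",
# comparable to pressing "+" on both index entries.
result = index["u"] & index["small"]

for cp in sorted(result)[:5]:
    print(f"{cp:04x}= {unicodedata.name(chr(cp))}")
```

Set intersection is used here because it expresses the AND combination directly; an OR combination would be `index["u"] | index["small"]`.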
List of character tables contained in UNIC database
---------------------------------------------------
The lists can be found in numerical order in index 9 under the
following prefixes:
(Go to index 9, enter "m 850" for example)
m MS-DOS PC code tables
437 us
850 latin1
852 latin2
863 canada
865 nordic
861 iceland
860 portug
855 cyrill
866 cyrillRuss
869 greek
857 turk
862 hebrew
864 arabic
w MS-Windows code tables
1252 Latin
1250 Latin2
1251 Cyrill
1257 Baltic
1253 Greek
1254 Turkish
1255 Hebrew
1256 Arabic
1258 Vietnam
i ISO 8859 tables
-1 Latin1: IDENTICAL with UNICODE x00-xFF !
-2 Latin2
-3 Latin3
-4 Latin4
-5 Cyrillic
-6 Arabic
-7 Greek
-8 Hebrew
-9 Latin5
ddb MAB2 Codes of Deutsche Bibliothek Frankfurt (ISO 5426)
z EBCDIC tables
037 USCanada
500 International
875 Greek
1026 Latin5Turkish
ddb MAB1 Deutsche Bibliothek's old code table
u USMARC codes
p Pica codes
o "allegro" so-called OSTWEST code list (extended PC 437)
s SGML (including HTML) so-called "entities"
y Macintosh Roman
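The remark in the list above that ISO 8859-1 is identical with UNICODE x00-xFF can be checked with a few lines of Python. This is a sketch that relies on Python's own "latin-1" and "iso8859_2" codecs, not on the UNIC tables:

```python
# Check that every ISO 8859-1 byte value decodes to the Unicode code point
# with the same number (Python's "latin-1" codec implements ISO 8859-1).
def latin1_matches_unicode():
    return all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))

# For contrast: in ISO 8859-2 the byte 0xF9 (decimal 249) is not U+00F9.
def latin2_code_point(byte_value):
    return ord(bytes([byte_value]).decode("iso8859_2"))

print(latin1_matches_unicode())
print(f"{latin2_code_point(0xF9):04X}")
```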
Production of conversion lists
------------------------------
1. Press [4] on any record display (not index display)
2. Select the code you want to convert from
3. Select the code you want to convert into
4. Enter 0 or 1 as prompted, for decimal or hexadecimal output
5. Press [F4] and confirm with [y]
The commented list comes out as a file named CLIST.
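To illustrate what such a conversion list contains, here is a rough Python equivalent. It assumes single-byte code pages known to Python's codec registry; the function names and the output layout are invented for this sketch and do not reproduce the actual CLIST format.

```python
import unicodedata

def conversion_list(src, dst):
    """Return (src_code, dst_code, name) rows for the upper half of a code page.

    Only characters present in both single-byte tables are listed.
    """
    rows = []
    for code in range(0x80, 0x100):
        try:
            ch = bytes([code]).decode(src)
            target = ch.encode(dst)
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue                      # no counterpart in one of the tables
        rows.append((code, target[0], unicodedata.name(ch, "?")))
    return rows

def format_row(src_code, dst_code, name, hexadecimal=False):
    """Render one row in decimal or hexadecimal, followed by the name as comment."""
    spec = "02X" if hexadecimal else "3d"
    return f"{src_code:{spec}} {dst_code:{spec}}  {name}"

if __name__ == "__main__":
    # PC 852 (latin2) to ISO 8859-2, decimal output
    for row in conversion_list("cp852", "iso8859_2"):
        print(format_row(*row))
```

Characters that exist only in the source table, such as the box-drawing characters of PC 852, are simply skipped here; a production table would need an explicit policy for them.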
How to get it and install it
----------------------------
It is a DOS application to run on any PC or in any DOS box.
(Time permitting, we will set up a Web page for it.)
ftp 134.169.20.1
anonymous
<your e-mail address>
bin
cd formate
get unic.exe [it contains unic.txt]
quit
Move UNIC.EXE to any subdirectory and start it; it will unpack itself.
In any case, ANSI.SYS must be installed for the displays to work properly.
Then start UNI.BAT.
B. Eversberg
Bernhard Eversberg
Universitaetsbibliothek, Postf. 3329,
D-38023 Braunschweig, Germany
Tel. +49 531 391-5026 , -5011 , FAX -5836
e-mail B.Eversberg _at__ tu-bs.de
List information at http://www.inetbib.de.