WordTagging
Last edited by Gabriel Bodard 2 years, 1 month ago
Word and other token tagging/"Tokenization"
All original (Greek and Latin) characters in the edition should be tagged with one of the following elements (modern punctuation may be left untagged; ancient punctuation should use <g>):
- w - a lexical word, not known to be a proper name etc.
- if an incomplete word, use the attribute @part
- part="I" - the initial part of a word (i.e. the end is missing or unresolvable)
- part="M" - the middle part of a word (i.e. the beginning and end are both missing or unresolvable)
- part="F" - the final part of a word (i.e. the beginning is missing or unresolvable)
- (rarely) part="Y" - an obviously incomplete word, where it is unclear whether the fragment is initial, medial or final
- You do not need to lemmatise this word, as this will be done automatically at a later stage. If you know it will be problematic (rare, fragmentary, irregular spelling, etc.), you may lemmatise manually if you wish, using <w lemma="acerbus">akerbus</w>.
- name - a personal name (including cognomina; but not "Imperator").
- an imperial cognomen such as "Sarmaticus" should be tagged as a name.
- name cannot take @part, so in the case of an incomplete name, a seg element needs to appear inside the name, with @part (as for words, above)
- placeName - a name of a place
- if a proper adjective - type="ethnic"
- for a colony name, e.g. colonia Septimia Lepcis Magna - tag "colonia" as a word, "Septimia" as a name and "Lepcis Magna" as a placeName:
i.e. <placeName ref="mentionedplace.xml#p123"><w>colonia</w> <name>Septimia</name> <placeName>Lepcis Magna</placeName></placeName>
- num - a numeral
- g - a non-alphabetic symbol such as "denarius," "leaf" or "year" (either for which no Unicode code-point exists, or which is not easy to type, or is not traditionally printed as a character in Leiden)
- abbr - an abbreviation for which we do not know the expansion; e.g. "υ(...)"
- orig - none of the above: text that we cannot resolve in any way (only if the editor has put, or would put, this word in uppercase in Leiden)
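The element list above can be illustrated with a short sketch. The text is invented for illustration and is not taken from any real inscription; the wrapping ab element, the @value attribute on num and the @type value on g are assumptions about how the project encodes these details, so check them against the project schema before copying.

```xml
<ab>
  <!-- complete lexical word -->
  <w>posuit</w>
  <!-- incomplete word: only the ending survives, so part="F" -->
  <w part="F">orum</w>
  <!-- incomplete name: name cannot take @part, so seg carries it -->
  <name><seg part="I">Aurel</seg></name>
  <!-- proper adjective derived from a place name -->
  <placeName type="ethnic">Lepcitanus</placeName>
  <!-- numeral; @value is an assumed attribute here -->
  <num value="3">III</num>
  <!-- non-alphabetic symbol; the @type value is an assumed label -->
  <g type="denarius"/>
  <!-- abbreviation whose expansion is unknown, Leiden "υ(...)" -->
  <abbr>υ</abbr>
  <!-- unresolvable letters that Leiden would print in uppercase -->
  <orig>ABC</orig>
</ab>
```

Each run of original characters ends up inside exactly one of the token elements, which is the point of the rule stated at the top of this page.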
In addition, any reference to a person (which may be made up of names and/or words/placenames, etc.) should be tagged as persName. Each persName must take one of the following types:
- attested - any person attested other than emperors, consuls, gods etc.
- ruler - a member of the imperial or ruling families (in former projects "emperor")
- divine - a god, hero, angel, personification or other divine entity
- other - mostly historical or literary figures (rarely used)
- consular - only if a consul/archon/priest cited for dating (even more rarely used)
example:
<persName type="attested">
<name type="praenomen"><expan>M<ex>arcus</ex></expan></name>
<name type="gentilicium">Iulius</name>
<name type="cognomen">Aurelianus</name>
</persName>
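A second sketch, under the same caveats, shows a persName of type "ruler" that mixes words and names. The name sequence is invented for illustration, and the @type attributes on name are left out only for brevity.

```xml
<persName type="ruler">
  <!-- "Imperator" is tagged as a word, not as a name -->
  <w>Imperator</w>
  <name>Marcus</name>
  <name>Aurelius</name>
  <name>Antoninus</name>
  <!-- an imperial cognomen is tagged as a name -->
  <name>Sarmaticus</name>
</persName>
```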