Page history last edited by Gabriel Bodard 2 years, 4 months ago

Word and other token tagging/"Tokenization"


All original (Greek and Latin) characters in the edition should be tagged with one of the following elements (modern punctuation may be left untagged; ancient punctuation should use <g>):


  • w - a lexical word, not known to be a proper name etc.
    • if an incomplete word, use the attribute @part
      • part="I" - the initial part of a word (i.e. the end is missing or unresolvable)
      • part="M" - the middle part of a word (i.e. the beginning and end are both missing or unresolvable)
      • part="F" - the final part of a word (i.e. the beginning is missing or unresolvable) 
      • (rarely) part="Y" -  an obviously incomplete word, but not sure whether it is initial/final etc.
    • You do not need to lemmatise this word, as this will be done automatically at a later stage. If you know it will be problematic (rare, fragmentary, irregular spelling, etc.) you may lemmatise manually if you wish, using <w lemma="acerbus">akerbus</w>.
  • name - a personal name (including cognomina; but not "Imperator").
    • an imperial cognomen such as "Sarmaticus" should be tagged as a name.
    • name cannot take @part, so in the case of an incomplete name, a seg element needs to appear inside the name, with @part (as for words, above)
  • placeName - a name of a place
    • if a proper adjective - type="ethnic"
    • for a colony name, e.g. colonia Septimia Lepcis Magna - tag "colonia" as a word, "Septimia" as a name and "Lepcis Magna" as a placeName:
      i.e. <placeName ref="mentionedplace.xml#p123"><w>colonia</w> <name>Septimia</name> <placeName>Lepcis Magna</placeName></placeName>
  • num - a numeral
  • g - a non-alphabetic symbol such as "denarius," "leaf" or "year" (one for which no Unicode code point exists, or which is not easy to type, or which is not traditionally printed as a character in Leiden)
  • abbr - an abbreviation for which we do not know the expansion; e.g. "υ(...)"
  • orig - none of the above: text that we cannot resolve in any way (only if the editor has put, or would put, this word in uppercase in Leiden)
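
The element list above can be sketched as a short run of tagged tokens. This is an illustration only, not project data: the wrapping <ab> element, the word fragments and the ethnic adjective are all assumptions chosen to show where each element and attribute goes.

```xml
<!-- Hypothetical tokens; only the element and attribute names come from the list above -->
<ab>
  <!-- incomplete word, beginning preserved: @part sits directly on w -->
  <w part="I">fec</w>
  <!-- manually lemmatised word with an irregular spelling -->
  <w lemma="acerbus">akerbus</w>
  <!-- incomplete name, ending preserved: @part goes on a seg inside name -->
  <name><seg part="F">lianus</seg></name>
  <!-- proper adjective derived from a place name -->
  <placeName type="ethnic">Lepcitanus</placeName>
  <!-- numeral -->
  <num>XII</num>
  <!-- abbreviation whose expansion is unknown -->
  <abbr>υ</abbr>
  <!-- unresolvable text that would be printed in uppercase in Leiden -->
  <orig>ΑΒΓ</orig>
</ab>
```

The <g> element is left out of this sketch because the list does not say which attribute identifies the symbol.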


In addition, any reference to a person (which may be made up of names and/or words/placenames, etc.) should be tagged as persName. Each persName must take one of the following types:


  • attested - any person attested other than emperors, consuls, gods etc.
  • ruler - a member of the imperial or ruling families (in former projects "emperor")
  • divine - a god, hero, angel, personification or other divine entity
  • other - mostly historical or literary figures (rarely used)
  • consular - only if a consul/archon/priest cited for dating (even more rarely used)



<persName type="attested">
     <name type="praenomen"><expan>M<ex>arcus</ex></expan></name>
     <name type="gentilicium">Iulius</name>
     <name type="cognomen">Aurelianus</name>
</persName>
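
Because a persName may combine names with words or place names, a second sketch may help. The person and the ethnic adjective below are hypothetical, and treating the adjective as part of the personal reference is an assumption rather than a stated project rule.

```xml
<!-- Hypothetical reference to a person identified by a name plus an ethnic adjective -->
<persName type="attested">
     <name type="cognomen">Aurelianus</name>
     <placeName type="ethnic">Lepcitanus</placeName>
</persName>
```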

