====== Diarium Latinitatis ======

Task: computationally identify which words in a Latin text are rare or unusual.

===== Phases =====

  - [[z:wl-doc|prepare]] a wordlist (all words in the text)
  - use an existing lemmatizing service to process the wordlist
  - compare the list of lemmatized words with the primary wordlist
  - check the list of differences (example: a [[z:commentarii-hapaksi|list of strange words from Ludovik Crijević Tuberon, Commentarii de temporibus suis]])
  - check the strange words in other (non-classical) dictionaries

===== Tools =====

Preprocess by excluding frequent words; several lists exist:

  * Claude Pavur's list is [[http://www.slu.edu/colleges/AS/languages/classical/latin/tchmat/grammar/vocabulary/hif-ed2.html|here]] (18,653 wordforms)
  * James H. Dee's database is [[http://www.uic.edu/las/clas/LF_database.html|here]]
  * Anne Mahoney's 200 essential Latin words are [[http://www.bu.edu/mahoa/vocab200.html|here]] (at the time of writing)
  * a list of words in which the //-que// ending is not a conjunction is [[http://snowball.tartarus.org/otherapps/schinke/intro.html|here]] (among other useful things)

Lemmatizing services:

  * [[http://archimedes.mpiwg-berlin.mpg.de/arch/doc/dict-server.html|Archimedes]] (XML-RPC)
  * [[http://www.ilc.cnr.it/lemlat/lemlat/index.html|LemLat]]
  * [[http://api.hucompute.org/preprocessor/|PrePro2010]] (XML API)

All results require post-processing, cleaning, etc.

To compare list2 (the lemmatized words) with list1 (the primary wordlist), use [[http://unstableme.blogspot.com/2009/08/linux-comm-command-brief-tutorial.html|comm]].

===== Bash script wrapper =====

Our bash script, which serves as a wrapper and pre-processor for the Archimedes Project XML-RPC call, looks like this:
<code bash>
#!/bin/bash
# Jovanovic, 2011-10, lemmatizing words
# usage: ./comm-lemm.sh antconc-result-filename
# requires: vlist.sed, latstop.txt, rpc3.py, 11lemclean.sed, 11lemclean2.sed

# step 1: clean up the AntConc wordlist and remove the frequent Latin words.
# Delete tabs and carriage returns (ensuring Unix line endings), apply the
# cleaning commands in vlist.sed, sort, and drop every word that also appears
# in latstop.txt (a sorted list of frequent Latin words).
tr -d '\011\015' < "$1" | sed -f vlist.sed | sort | comm -23 - latstop.txt > c"$1"

# let the result of step 1 become the working file
FILE=c"$1"

# step 2: send the rarer words to the lemmatizer, clean up the results with
# two sets of sed commands (11lemclean.sed, 11lemclean2.sed),
# then sort and save the lemmata.
python rpc3.py "${FILE}" | sed -f 11lemclean.sed | sort | sed -f 11lemclean2.sed > lem2"${FILE}"

# step 3: keep only the forms which were NOT lemmatized (the candidate
# strange words). comm compares lines byte for byte, so watch for
# stray spaces and tabs!
comm -23 "${FILE}" lem2"${FILE}" > r-"${FILE}"
</code>
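The "watch for spaces and tabs" warning deserves a demonstration: ''comm'' compares lines byte for byte, so a word carrying invisible trailing whitespace never matches its clean twin. A minimal sketch of the pitfall, using made-up word lists (//amor//, //bellum//, //carmen// are illustrative, not the project's actual data):

```shell
# Work in a throwaway directory so no project files are touched.
cd "$(mktemp -d)"

# "bellum " carries a trailing space in list1 but not in list2.
printf 'amor\nbellum \ncarmen\n' > list1
printf 'bellum\ncarmen\n'        > list2

# We want only "amor" reported, but the trailing space makes
# "bellum " show up as a false positive too.
comm -23 list1 list2

# Stripping trailing whitespace first gives the intended result: just "amor".
sed 's/[[:space:]]*$//' list1 | comm -23 - list2
```

This is why the wrapper deletes tabs with ''tr'' up front: normalizing whitespace before ''comm'' runs prevents such phantom differences.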