====== Workflow for Croatian Latin -- words analysis ======
Goal: sort the words of a Latin text into those lemmatized, those ambiguously lemmatized, and those unrecognized, and store the results in a local database. Record the original form, the lemma, the stem (for large-scale queries), the text provenance, the lemmatization category (LEMMA, AMBIGUOUS, NOT RECOGNISED), and the reason a form was not recognized.
===== 0. Prepare a text =====
TEI XML encoded
===== 1. Prepare a list of words =====
Tasks / strategies:
* separate words starting with uppercase letters from the others, since uppercase signals potential names: grep '^[[:lower:]]' filename > filename-lower, and analogously grep -v for the uppercase forms
* normalize orthography for querying the service -- strip accents and lowercase: iconv -f utf8 -t ascii//TRANSLIT filename-upper | tr '[:upper:]' '[:lower:]' | perl ambig-local.pl
* keep wordform frequency data (CSV field)
* keep the original forms (CSV field)
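The list preparation above can be sketched as a short pipeline. A toy input stands in for the real token list extracted from the TEI XML, and the filenames are illustrative:

```shell
# Toy input standing in for the real token list, one wordform per line.
printf 'Amor\namor\namor\nvita\n' > words.txt

# Separate wordforms by initial case; uppercase signals potential names.
grep '^[[:lower:]]' words.txt > words-lower
grep -v '^[[:lower:]]' words.txt > words-upper

# Keep wordform frequencies together with the original forms (CSV: frequency,form).
sort words-lower | uniq -c | awk '{print $1 "," $2}' > words-lower-freq.csv

# Normalize for service queries: strip accents, lowercase.
iconv -f utf8 -t ascii//TRANSLIT words-lower | tr '[:upper:]' '[:lower:]' > words-lower-ascii
```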
===== 2. Filter lowercase words locally =====
We have built a local database of words already identified -- these need not be sent to the lemmatizing service again, which saves time.
* Filter 1: lemmatized words + ambiguously lemmatized words (Perl script [[ambig-local.pl]])
* Filter 2: unrecognized words already in the database
Check: write the results of the local checks into the CSV!
Produce a report on the total number of words identified locally.
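A minimal stand-in for the local filtering step (the real work is done by [[ambig-local.pl]] against the database): assuming the already-identified forms have been exported one per line, comm splits the candidates into locally resolved forms and forms still to be sent to the service. The toy input files are illustrative.

```shell
printf 'amor\nvita\nnovus\n' > candidates-raw.txt   # toy normalized wordforms
printf 'amor\nvita\n' > known-forms.txt             # toy export of identified forms

sort -u candidates-raw.txt > candidates.txt
sort -u known-forms.txt > known.sorted
comm -12 candidates.txt known.sorted > found-local.txt  # resolved locally
comm -23 candidates.txt known.sorted > to-service.txt   # still unknown

# Report totals: identified locally vs. left for the lemmatizing service.
echo "resolved locally: $(wc -l < found-local.txt | tr -d ' ')"
echo "to be sent:       $(wc -l < to-service.txt | tr -d ' ')"
```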
===== 3. Send lowercase words to lemmatizer service =====
Use the Bamboo Morpheus service; the expected return format is JSON.
* send a list of words (lowercase words from the text) to lemmatizing service (Perl script [[perlmorphb.pl]])
* write results to file
* convert the file to valid JSON (wrap it all in another layer): [[json2json.sh]]
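A sketch of the [[json2json.sh]] idea, assuming the service wrote one JSON record per line to a results file: wrapping the lines in an array (the "another layer") makes the file parse as a single valid JSON document. The toy service output is illustrative.

```shell
printf '{"form":"amor"}\n{"form":"vita"}\n' > morpheus-raw.txt   # toy service output
{
  echo '['
  sed '$!s/$/,/' morpheus-raw.txt   # comma after every record except the last
  echo ']'
} > morpheus.json
```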
===== 4. Analyse JSON results =====
* extract form, lemma and stem from the JSON records and write them to a CSV file (to be added to the local database)
* three separate CSVs: for identified, ambiguous, unrecognized forms
* get reports on categories
* examine the results, note any errors
The first three points are achieved by the [[jsonu3.sh]] bash script.
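For illustration, a jq-based sketch of the same extraction (the actual script is [[jsonu3.sh]]; the field names "form", "lemmas", "lemma" and "stem" are assumptions about the Morpheus JSON layout, and the toy input stands in for real service results):

```shell
cat > morpheus.json <<'EOF'
[{"form":"amor","lemmas":[{"lemma":"amor","stem":"amor"}]},
 {"form":"canis","lemmas":[{"lemma":"canis","stem":"can"},{"lemma":"canus","stem":"can"}]},
 {"form":"xyz","lemmas":[]}]
EOF

# Three CSVs: uniquely identified, ambiguous (one row per lemma), unrecognized.
jq -r '.[] | select((.lemmas|length) == 1) | [.form, .lemmas[0].lemma, .lemmas[0].stem] | @csv' morpheus.json > identified.csv
jq -r '.[] | select((.lemmas|length) > 1) | . as $r | .lemmas[] | [$r.form, .lemma, .stem] | @csv' morpheus.json > ambiguous.csv
jq -r '.[] | select((.lemmas|length) == 0) | .form' morpheus.json > unrecognized.csv

# Report per category.
wc -l identified.csv ambiguous.csv unrecognized.csv
```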
===== 5. Analyse unidentified words =====
Expected categories: non-Latin words (Greek, modern languages); numbers; abbreviations; orthographic variations; errors; names; common Latin words missing from the service database; uncommon Latin words; and any combination of the above.
* record category in CSV field (manually)
===== 6. Update local database =====
* load lemmatized and ambiguous words from CSV file into MySQL table (using a bash script: [[csv2croala_db.sh]])
* for words which were not recognized, record local provenance: from which text?
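A minimal sketch of the kind of statement [[csv2croala_db.sh]] might issue; the table and column names here are assumptions, not the actual schema:

```sql
-- Hypothetical table/column names; adjust to the real CroALa schema.
LOAD DATA LOCAL INFILE 'identified.csv'
INTO TABLE wordforms
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(form, lemma, stem)
SET category = 'LEMMA',
    provenance = 'our-text-id';
```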
===== 7. Filter uppercased words =====
Once the local database has been updated, query it for the words from our text that begin with an uppercase letter. The queries should be normalized (lowercased).
* Filter out any forms already in the database
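The normalization mirrors step 1; a sketch, with a toy uppercase list standing in for the real one:

```shell
printf 'Amor\nRoma\nAmor\n' > words-upper   # toy uppercase wordforms

# Strip accents, lowercase, deduplicate -- same normalization as the lowercase batch.
iconv -f utf8 -t ascii//TRANSLIT words-upper \
  | tr '[:upper:]' '[:lower:]' | sort -u > upper-normalized.txt

# upper-normalized.txt is then filtered against the refreshed database
# as in step 2, and the leftovers go through the service pipeline again.
```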
===== 8. Repeat steps 3--6 with uppercase wordforms =====