====== Workflow for Croatian Latin -- words analysis ======

Goal: sort the words of a Latin text into lemmatized, ambiguously lemmatized, and unrecognized, and store the results in a local database. Record the original form, the lemma, the stem (for large-scale queries), the text provenance, the lemmatization category (LEMMA, AMBIGUOUS, NOT RECOGNISED), and the reason a form was not recognized.

===== 0. Prepare a text =====

The text should be TEI XML encoded.

===== 1. Prepare a list of words =====

Tasks / strategies:

  * separate words starting with an uppercase letter from the others (uppercase signals potential names): ''%%grep '^[[:lower:]]' filename > filename-lower%%'' etc.
  * normalize orthography for querying the service -- remove accents and lowercase: ''%%iconv -f utf8 -t ascii//TRANSLIT filename-upper | tr '[:upper:]' '[:lower:]' | perl ambig-local.pl%%''
  * keep word-form frequency data (CSV field)
  * keep the original forms (CSV field)

===== 2. Filter lowercase words locally =====

We have built a local database of words already identified -- these need not be sent to the lemmatizing service (which saves time).

  * Filter 1: lemmatized and ambiguously lemmatized words (Perl script [[ambig-local.pl]])
  * Filter 2: unrecognized words already in the database

Check: write the results of the local checks to the CSV! Produce reports on the total number of words identified locally.

===== 3. Send lowercase words to the lemmatizer service =====

Use the Bamboo Morpheus service. Expected return format: JSON.

  * send the list of lowercase words from the text to the lemmatizing service (Perl script [[perlmorphb.pl]])
  * write the results to a file
  * convert the file to valid JSON (wrap it all in another layer): [[json2json.sh]]

===== 4. Analyse the JSON results =====

  * extract form, lemma, and stem from the JSON records and write them to a CSV file (to be added to the local database)
  * write three separate CSVs: for identified, ambiguous, and unrecognized forms
  * get reports on the categories
  * examine the results and note any errors

The first three points are handled by the [[jsonu3.sh]] bash script.

===== 5. Analyse unidentified words =====

Expected categories: non-Latin words (Greek, modern languages), numbers, abbreviations, orthographic variations, errors, names, common Latin words missing from the service database, uncommon Latin words -- and any combination of the above.

  * record the category in a CSV field (manually)

===== 6. Update the local database =====

  * load lemmatized and ambiguous words from the CSV file into a MySQL table (using a bash script: [[csv2croala_db.sh]])
  * for words which were not recognized, record local provenance: from which text do they come?

===== 7. Filter uppercase words =====

Once the local database is updated, query it for the words from our text beginning with an uppercase letter. Queries should be normalised (lowercased).

  * filter out any forms already in the database

===== 8. Repeat steps 3--6 with uppercase word forms =====
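The case-splitting, normalization, and frequency-counting commands of step 1 can be sketched end to end as follows. The file names and the sample word list are illustrative only, not part of the workflow:

```shell
# Minimal sketch of step 1, assuming a file "wordlist" with one word
# per line; all file names here are illustrative.
set -e
cd "$(mktemp -d)"
printf 'Roma\namor\nvirumque\namor\n' > wordlist

# Separate lowercase-initial words from the rest (uppercase = potential names)
grep '^[[:lower:]]' wordlist > wordlist-lower
grep -v '^[[:lower:]]' wordlist > wordlist-upper

# Normalize for service querying: strip accents, lowercase everything
iconv -f utf8 -t ascii//TRANSLIT wordlist-lower \
  | tr '[:upper:]' '[:lower:]' > wordlist-lower-norm

# Keep word-form frequencies alongside the original forms (CSV: count,form)
sort wordlist-lower | uniq -c | awk '{print $1 "," $2}' > wordlist-freq.csv
```

The normalized file is what gets piped on to ''ambig-local.pl'' for the local filtering of step 2, while the frequency CSV preserves the counts and original forms required by the last two bullets.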
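Step 4's split of the service response into three CSVs could look roughly like the jq sketch below. The real Morpheus response is more deeply nested than this; the record shape and the field names (''form'', ''lemmas'', ''stem'') are assumptions standing in for whatever [[jsonu3.sh]] actually parses. A form with exactly one lemma counts as identified, more than one as ambiguous, none as unrecognized:

```shell
# Hedged sketch of the JSON-to-CSV split in step 4, over an assumed,
# simplified record shape (not the real Morpheus layout).
set -e
cd "$(mktemp -d)"
cat > results.json <<'EOF'
[
  {"form": "amor",  "lemmas": [{"lemma": "amor",  "stem": "amor"}]},
  {"form": "canis", "lemmas": [{"lemma": "canis", "stem": "can"},
                               {"lemma": "cano",  "stem": "can"}]},
  {"form": "xyzzy", "lemmas": []}
]
EOF

# One CSV per category: identified, ambiguous, unrecognized
jq -r '.[] | select(.lemmas | length == 1)
       | [.form, .lemmas[0].lemma, .lemmas[0].stem] | @csv' results.json > identified.csv
jq -r '.[] | select(.lemmas | length > 1)
       | .form as $f | .lemmas[] | [$f, .lemma, .stem] | @csv' results.json > ambiguous.csv
jq -r '.[] | select(.lemmas | length == 0) | .form' results.json > unrecognized.csv

# Report counts per category
wc -l identified.csv ambiguous.csv unrecognized.csv
```

The identified and ambiguous CSVs feed the database load of step 6; the unrecognized list is what gets categorized manually in step 5.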
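The normalised lookup of step 7 can be illustrated as below. The workflow's database is MySQL (loaded by [[csv2croala_db.sh]]); SQLite stands in here only so the sketch is self-contained, and the table and column names are assumptions:

```shell
# Sketch of the step 7 lookup: an uppercase-initial form from the text
# is lowercased before being checked against the local database.
# SQLite stands in for the workflow's MySQL; schema is assumed.
set -e
cd "$(mktemp -d)"
sqlite3 croala.db <<'EOF'
CREATE TABLE wordforms (form TEXT, lemma TEXT, stem TEXT, category TEXT);
INSERT INTO wordforms VALUES ('roma', 'Roma', 'rom', 'LEMMA');
EOF

# Normalize the query, not the stored data: lower() the incoming form
form='Roma'
sqlite3 croala.db "SELECT lemma FROM wordforms WHERE form = lower('$form');"
```

Forms that come back with no match are the ones still to be sent to the service when steps 3--6 are repeated for uppercase word forms.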