Workflow for Croatian Latin word analysis

Goal: sort the words of a Latin text into three groups: lemmatized, ambiguously lemmatized, and unrecognized. Store the results in a local database, recording the original form, the lemma, the stem (for large-scale queries), the text provenance, the lemmatization category (LEMMA, AMBIGUOUS, NOT RECOGNISED), and, where applicable, the reason a form was not recognized.

0. Prepare a text

The source text is encoded in TEI XML.
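The later steps work on plain text, so the markup has to be stripped first. A minimal, XML-unaware sketch (a real pipeline would use a TEI-aware tool; the file names are illustrative):

```shell
# Crude tag stripping: replace every tag with a space, squeeze whitespace.
# Good enough for building word lists; does not handle entities or CDATA.
sed -e 's/<[^>]*>/ /g' text-tei.xml | tr -s '[:space:]' ' ' > text-plain.txt
```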

1. Prepare a list of words

Tasks / strategies:

  • separate words starting with uppercase letters from the others (an
    uppercase initial signals a potential name):
    grep '^[[:lower:]]' filename > filename-lower
    grep '^[[:upper:]]' filename > filename-upper

  • normalize orthography for querying the service – remove accents and
    lowercase:
    iconv -f utf8 -t ascii//TRANSLIT filename-upper | tr '[:upper:]' '[:lower:]' | perl ambig-local.pl
  • keep wordform frequency data (CSV field)
  • keep the original forms (CSV field)
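The tokenizing and frequency-counting bullets above can be sketched as a single pipeline (file names are illustrative):

```shell
# One word per line, then count occurrences and emit a form,frequency CSV.
tr -cs '[:alpha:]' '\n' < text-plain.txt \
  | grep -v '^$' \
  | sort \
  | uniq -c \
  | sort -rn \
  | awk '{print $2 "," $1}' > wordfreq.csv
```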

2. Filter lowercase words locally

We have built a local database of words already identified; these do not have to be sent to the lemmatizing service, which saves time.

  • Filter 1: lemmatized words + ambiguously lemmatized words (Perl script ambig-local.pl)
  • Filter 2: unrecognized words already in the database

Check: write the results of the local checks into the CSV, and produce a report on the total number of words identified locally.
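The local filtering amounts to a set difference between the text's word list and the forms already in the database. A sketch with standard tools, assuming one wordform per line (file names are illustrative):

```shell
# comm -23 prints lines present only in the first (sorted) file,
# i.e. the words not yet identified locally.
sort -u words.txt > words.sorted
sort -u known.txt > known.sorted
comm -23 words.sorted known.sorted > words-unknown.txt
```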

3. Send lowercase words to lemmatizer service

Use the Bamboo Morpheus service. Expected return format: JSON.

  • send the list of lowercase words from the text to the lemmatizing service (Perl script perlmorphb.pl)
  • write results to file
  • convert the file to valid JSON (wrap all records in an enclosing layer): json2json.sh
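The wrapping step can be as simple as joining the per-line records with commas inside a top-level array; a minimal sketch of the idea behind json2json.sh (the actual script may differ):

```shell
# results.jsonl: one JSON record per line; results.json: one valid JSON array.
printf '[%s]\n' "$(paste -sd, results.jsonl)" > results.json
```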

4. Analyse JSON results

  • extract form, lemma and stem from the JSON records and write them to a CSV file (to be added to the local database)
  • three separate CSVs: for identified, ambiguous, unrecognized forms
  • get reports on categories
  • examine the results, note any errors

The first three points are handled by the jsonu3.sh bash script.
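A crude illustration of the extraction step (the real jsonu3.sh is not shown here, and real Morpheus records are nested; the flat field names below are assumptions):

```shell
# Pull "form", "lemma" and "stem" out of one-record-per-line JSON
# and write them as CSV columns.
sed -n 's/.*"form":"\([^"]*\)".*"lemma":"\([^"]*\)".*"stem":"\([^"]*\)".*/\1,\2,\3/p' \
    results.json > lemmatized.csv
```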

5. Analyse unidentified words

Expected categories: non-Latin words (Greek, modern languages), numbers, abbreviations, orthographic variants, errors, names, common Latin words missing from the service database, uncommon Latin words – and any combination of the above.

  • record category in CSV field (manually)

6. Update local database

  • load lemmatized and ambiguously lemmatized words from the CSV file into a MySQL table (bash script: csv2croala_db.sh)
  • for words which were not recognized, record local provenance: from which text?
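A sketch of the database load; the table and column names are assumptions, not the actual CroALa schema:

```sql
LOAD DATA LOCAL INFILE 'lemmatized.csv'
INTO TABLE wordforms
FIELDS TERMINATED BY ','
(form, lemma, stem, source_text, category);
```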

7. Filter uppercased words

Once the local database has been updated, query it for the words from our text that begin with an uppercase letter. Queries should be normalized (lowercased).

  • Filter out any forms already in the database
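Since the stored forms are normalized to lowercase, the query side has to be lowercased as well. A sketch with assumed table and column names:

```sql
SELECT form, lemma, category
FROM wordforms
WHERE form = LOWER('Croatiam');
```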

8. Repeat steps 3--6 with uppercase wordforms

 
z/croala-texts-workflow.txt · Last modified: 20. 01. 2013. 15:22 by njovanov
 