
Workflow for Croatian Latin -- word analysis

Goal: sort the words of a Latin text into those lemmatized, ambiguously lemmatized, and unrecognized. Store the results in a local database. For each word, record the original form, lemma, stem (for large-scale queries), text provenance, lemmatization category (LEMMA, AMBIGUOUS, NOT RECOGNISED), and, where applicable, the reason the form was not recognized.
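A minimal sketch of how such a record could be stored, assuming SQLite as the local database. The table name `words`, the column names, and the helper functions are hypothetical; only the list of fields and the three category labels come from the goal above.

```python
import sqlite3


def create_word_table(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) `words` table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS words (
            form       TEXT NOT NULL,  -- original form as found in the text
            lemma      TEXT,           -- empty when the form is not recognised
            stem       TEXT,           -- for large-scale queries
            provenance TEXT NOT NULL,  -- which text the form comes from
            category   TEXT NOT NULL
                CHECK (category IN ('LEMMA', 'AMBIGUOUS', 'NOT RECOGNISED')),
            reason     TEXT            -- why the form was not recognised
        )
        """
    )
    conn.commit()


def add_word(conn, form, provenance, category, lemma=None, stem=None, reason=None):
    """Insert one analysed word; the CHECK constraint rejects unknown categories."""
    conn.execute(
        "INSERT INTO words (form, lemma, stem, provenance, category, reason) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (form, lemma, stem, provenance, category, reason),
    )
    conn.commit()
```

Keeping the category as a constrained text column means a typo in a label fails at insert time rather than silently creating a fourth category.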

0. Prepare a text

The text should be encoded in TEI XML.

1. Prepare a list of words

Tasks / strategies:
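One possible strategy, as a sketch only: read the text of the TEI `body`, split it into alphabetic tokens, and separate the wordforms that begin with a lowercase letter from those that begin with an uppercase letter, since the later steps treat them separately. The function names are hypothetical, and the tokenization rule (runs of letters) is an assumption that will not suit every text.

```python
import re
import xml.etree.ElementTree as ET

# Standard TEI namespace, in ElementTree's {uri}tag notation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def words_from_tei(xml_string: str) -> list[str]:
    """Return all alphabetic tokens found inside the TEI <body>."""
    root = ET.fromstring(xml_string)
    body = root.find(f".//{TEI_NS}body")
    if body is None:
        # Fall back to the whole document if no <body> is present.
        body = root
    text = " ".join(body.itertext())
    # Runs of Unicode letters; digits, underscores and punctuation are dropped.
    return re.findall(r"[^\W\d_]+", text)


def split_by_case(words):
    """Return (lowercase-initial, uppercase-initial) lists of unique wordforms."""
    lower = sorted({w for w in words if w[0].islower()})
    upper = sorted({w for w in words if w[0].isupper()})
    return lower, upper
```

Restricting extraction to `body` keeps header metadata out of the word list.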

2. Filter lowercase words locally

We have built a local database of words already identified; these do not have to be sent to a lemmatizing service, which saves time.

Check: write the results of the local checks to the CSV. Report the total number of words identified locally.
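The local check and its report could look roughly like this. It is a sketch under assumptions: a plain dictionary (`known`) stands in for the local database, and the CSV column names are hypothetical.

```python
import csv
import io


def filter_locally(words, known, out):
    """Write one CSV row per word and return (words still unknown, number found).

    `known` maps a wordform to its lemma and stands in for the local database;
    `out` is any writable text file object.
    """
    writer = csv.writer(out)
    writer.writerow(["form", "lemma", "found_locally"])
    unknown = []
    for word in words:
        if word in known:
            writer.writerow([word, known[word], "yes"])
        else:
            writer.writerow([word, "", "no"])
            unknown.append(word)
    return unknown, len(words) - len(unknown)
```

Only the words returned in the first list need to go to the lemmatizing service; the second value is the total for the report.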

3. Send lowercase words to lemmatizer service

Use the Bamboo Morpheus service. Expected return format: JSON.
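A sketch of the request loop, with the service details left open. The endpoint is not specified here, so the caller passes it in as `url_template`, a string containing `{word}`; the function names are hypothetical. Nothing below reflects the actual Morpheus API beyond the expectation that it returns JSON.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 30.0):
    """GET a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def lemmatize_words(words, url_template, fetch=fetch_json):
    """Request an analysis for each word; keep going when a request fails.

    `url_template` must contain `{word}`; its real value is an assumption
    left to the caller. `fetch` can be replaced for offline testing.
    """
    results = {}
    for word in words:
        url = url_template.format(word=quote(word))
        try:
            results[word] = fetch(url)
        except (OSError, ValueError) as err:
            # Network errors are OSError subclasses; bad JSON is a ValueError.
            results[word] = {"error": str(err)}
    return results
```

Recording failures per word, instead of aborting, lets a long run finish and makes the failed words easy to resend.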

4. Analyse JSON results

The first three points are achieved by the jsonu3.sh bash script.
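For illustration, the sorting into the three categories could be sketched as follows. This is not what jsonu3.sh does. It assumes a simplified, hypothetical response shape, `{"lemmas": [...]}`, and it assumes that "ambiguous" means more than one distinct lemma was returned; the real JSON will need its own extraction step.

```python
import json


def classify(lemmas):
    """Map a list of candidate lemmas to a lemmatization category."""
    distinct = set(lemmas)
    if not distinct:
        return "NOT RECOGNISED"
    if len(distinct) == 1:
        return "LEMMA"
    return "AMBIGUOUS"


def classify_response(raw_json: str) -> str:
    """Classify one response, assuming the hypothetical {"lemmas": [...]} shape."""
    data = json.loads(raw_json)
    return classify(data.get("lemmas", []))
```

Repeated analyses that point to the same lemma count once, so a form with several morphological readings of a single lemma is still categorised as LEMMA.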

5. Analyse unidentified words

Expected categories: non-Latin words (Greek, modern languages), numbers, abbreviations, orthographic variations, errors, names, common Latin words not in the service database, uncommon Latin words, and any combination of the above.
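Some of these categories can be guessed mechanically before manual review. The sketch below covers only numbers, Greek, and abbreviations. The function name and the returned labels are hypothetical, and the abbreviation test assumes the trailing period was kept during tokenization. Everything else falls through to manual inspection.

```python
import re
import unicodedata

# Roman numerals in standard subtractive notation; the lookahead rejects "".
ROMAN = re.compile(
    r"^(?=[ivxlcdm])m*(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)


def guess_reason(form: str) -> str:
    """Return a tentative reason why a form was not recognised."""
    if form.isdigit() or ROMAN.match(form):
        # Some numerals are also real words (e.g. "vi"); review by hand.
        return "number"
    if any(unicodedata.name(ch, "").startswith("GREEK") for ch in form):
        return "non-Latin (Greek)"
    if form.endswith("."):
        return "abbreviation"
    return "unclassified"
```

The guesses are meant as a first pass that shortens the list to inspect, not as a final classification.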

6. Update local database

7. Filter uppercased words

Once the local database is updated, query it for the words from our text that begin with an uppercase letter. Queries should be normalised (lowercased).
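A sketch of the normalised query, assuming an SQLite table named `words` with `form` and `lemma` columns (both names are hypothetical). The original uppercase form is kept as the key, so results can be traced back to the text.

```python
import sqlite3


def lookup_uppercase(conn: sqlite3.Connection, words):
    """Look up uppercase-initial words by their lowercased form.

    Returns (found, missing): `found` maps the original form to its lemma,
    `missing` lists the forms that still need the lemmatizing service.
    """
    found, missing = {}, []
    for word in words:
        if not word[:1].isupper():
            continue  # lowercase forms were handled in the earlier steps
        row = conn.execute(
            "SELECT lemma FROM words WHERE form = ?", (word.lower(),)
        ).fetchone()
        if row:
            found[word] = row[0]
        else:
            missing.append(word)
    return found, missing
```

Forms left in `missing` are likely to include proper names, which a lowercased lookup cannot resolve.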

8. Repeat steps 3--6 with uppercase wordforms