====== Workflow for Croatian Latin -- words analysis ======

Goal: sort the words of a Latin text into lemmatized, ambiguously lemmatized, and unrecognized, and store the results in a local database. Record the original form, the lemma, the stem (for large-scale queries), the text provenance, the lemmatization category (LEMMA, AMBIGUOUS, NOT RECOGNISED), and the reason a form was not recognized.

===== 0. Prepare a text =====

The text should be TEI XML encoded.

===== 1. Prepare a list of words =====

Tasks / strategies:

  * separate words starting with an uppercase letter from the others (uppercase signals potential names): ''%%grep '^[[:lower:]]' filename > filename-lower%%'' etc.
  * normalize orthography for querying the service -- remove accents and lowercase: ''%%iconv -f utf8 -t ascii//TRANSLIT filename-upper | tr '[:upper:]' '[:lower:]' | perl ambig-local.pl%%''
  * keep word-form frequency data (CSV field)
  * keep the original forms (CSV field)

===== 2. Filter lowercase words locally =====

We have built a local database of words already identified -- these need not be sent to the lemmatizing service (which saves time).

  * Filter 1: lemmatized and ambiguously lemmatized words (Perl script [[ambig-local.pl]])
  * Filter 2: unrecognized words already in the database

Check: write the results of the local checks to the CSV! Produce reports on the total number of words identified locally.

===== 3. Send lowercase words to the lemmatizer service =====

Use the Bamboo Morpheus service. Expected return format: JSON.

  * send the list of lowercase words from the text to the lemmatizing service (Perl script [[perlmorphb.pl]])
  * write the results to a file
  * convert the file to valid JSON (wrap it all in another layer): [[json2json.sh]]

===== 4. Analyse the JSON results =====

  * extract form, lemma, and stem from the JSON records and write them to a CSV file (to be added to the local database)
  * write three separate CSVs: for identified, ambiguous, and unrecognized forms
  * get reports on the categories
  * examine the results and note any errors

The first three points are handled by the [[jsonu3.sh]] bash script.

===== 5. Analyse unidentified words =====

Expected categories: non-Latin words (Greek, modern languages), numbers, abbreviations, orthographic variations, errors, names, common Latin words missing from the service database, uncommon Latin words -- and any combination of the above.

  * record the category in a CSV field (manually)

===== 6. Update the local database =====

  * load lemmatized and ambiguous words from the CSV file into a MySQL table (using a bash script: [[csv2croala_db.sh]])
  * for words which were not recognized, record local provenance: from which text do they come?

===== 7. Filter uppercase words =====

Once the local database is updated, query it for the words from our text beginning with an uppercase letter. Queries should be normalised (lowercased).

  * filter out any forms already in the database

===== 8. Repeat steps 3--6 with uppercase word forms =====
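The case-splitting, normalization, and frequency-counting commands of step 1 can be sketched end to end as follows. The file names and the sample word list are illustrative only, not part of the workflow:

```shell
# Minimal sketch of step 1, assuming a file "wordlist" with one word
# per line; all file names here are illustrative.
set -e
cd "$(mktemp -d)"
printf 'Roma\namor\nvirumque\namor\n' > wordlist

# Separate lowercase-initial words from the rest (uppercase = potential names)
grep '^[[:lower:]]' wordlist > wordlist-lower
grep -v '^[[:lower:]]' wordlist > wordlist-upper

# Normalize for service querying: strip accents, lowercase everything
iconv -f utf8 -t ascii//TRANSLIT wordlist-lower \
  | tr '[:upper:]' '[:lower:]' > wordlist-lower-norm

# Keep word-form frequencies alongside the original forms (CSV: count,form)
sort wordlist-lower | uniq -c | awk '{print $1 "," $2}' > wordlist-freq.csv
```

The normalized file is what gets piped on to ''ambig-local.pl'' for the local filtering of step 2, while the frequency CSV preserves the counts and original forms required by the last two bullets.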
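Step 4's split of the service response into three CSVs could look roughly like the jq sketch below. The real Morpheus response is more deeply nested than this; the record shape and the field names (''form'', ''lemmas'', ''stem'') are assumptions standing in for whatever [[jsonu3.sh]] actually parses. A form with exactly one lemma counts as identified, more than one as ambiguous, none as unrecognized:

```shell
# Hedged sketch of the JSON-to-CSV split in step 4, over an assumed,
# simplified record shape (not the real Morpheus layout).
set -e
cd "$(mktemp -d)"
cat > results.json <<'EOF'
[
  {"form": "amor",  "lemmas": [{"lemma": "amor",  "stem": "amor"}]},
  {"form": "canis", "lemmas": [{"lemma": "canis", "stem": "can"},
                               {"lemma": "cano",  "stem": "can"}]},
  {"form": "xyzzy", "lemmas": []}
]
EOF

# One CSV per category: identified, ambiguous, unrecognized
jq -r '.[] | select(.lemmas | length == 1)
       | [.form, .lemmas[0].lemma, .lemmas[0].stem] | @csv' results.json > identified.csv
jq -r '.[] | select(.lemmas | length > 1)
       | .form as $f | .lemmas[] | [$f, .lemma, .stem] | @csv' results.json > ambiguous.csv
jq -r '.[] | select(.lemmas | length == 0) | .form' results.json > unrecognized.csv

# Report counts per category
wc -l identified.csv ambiguous.csv unrecognized.csv
```

The identified and ambiguous CSVs feed the database load of step 6; the unrecognized list is what gets categorized manually in step 5.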
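The normalised lookup of step 7 can be illustrated as below. The workflow's database is MySQL (loaded by [[csv2croala_db.sh]]); SQLite stands in here only so the sketch is self-contained, and the table and column names are assumptions:

```shell
# Sketch of the step 7 lookup: an uppercase-initial form from the text
# is lowercased before being checked against the local database.
# SQLite stands in for the workflow's MySQL; schema is assumed.
set -e
cd "$(mktemp -d)"
sqlite3 croala.db <<'EOF'
CREATE TABLE wordforms (form TEXT, lemma TEXT, stem TEXT, category TEXT);
INSERT INTO wordforms VALUES ('roma', 'Roma', 'rom', 'LEMMA');
EOF

# Normalize the query, not the stored data: lower() the incoming form
form='Roma'
sqlite3 croala.db "SELECT lemma FROM wordforms WHERE form = lower('$form');"
```

Forms that come back with no match are the ones still to be sent to the service when steps 3--6 are repeated for uppercase word forms.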