BaseX Adventures

A lab diary of getting acquainted with the BaseX XML database.

The version in use is the official release 7.6 (2013/02/05). Queries are run through the BaseX GUI.

The documents are encoded in TEI XML. They contain mostly bibliographical and prosopographical data prepared for the Mercurius Croaticus database. The documents will be accessible in a “basex” subdirectory on the CroALa Sourceforge pages.

Further experiments use an encoded book of decisions by three administrative bodies of medieval Dubrovnik, for the years 1390-1392. The TEI XML file is available on Sourceforge (current version: dbk1390-r6.xml), as part of the CroALa Project.

Find all filenames in collection

In BaseX and XQuery, a database is a “collection”. It has a name and can contain many files. This XQuery returns the names of all files in the croala database:

let $collection := collection("croala")
for $file in $collection
return substring(document-uri($file), 8)

The substring() function is there only for pretty-printing: by starting at character 8, it strips the leading croala/ (seven characters) from each document URI, leaving just the file name.
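To make the offset concrete, here is a minimal sketch that works on plain strings instead of the database. The second URI, with a subdirectory, is a hypothetical example and not a file from the croala collection. tokenize() is shown only as a possible alternative that does not depend on the length of the database name; it is not what this diary uses.

```xquery
(: Assumption: document URIs have the form "croala/<path>", as in the result list. :)
for $uri in ("croala/mikac-obsidio.xml",
             "croala/epist/banic-j-epist.xml")  (: second URI is hypothetical :)
return (
  (: drop the first seven characters, i.e. "croala/" :)
  substring($uri, 8),
  (: keep only the last path segment, whatever precedes it :)
  tokenize($uri, "/")[last()]
)
```

For the first URI both expressions give mikac-obsidio.xml; for the second, substring() keeps the subdirectory (epist/banic-j-epist.xml) while tokenize() returns only banic-j-epist.xml.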

Result:

mikac-obsidio.xml stephanus-confirmatio-crisogoni.xml babulak-o-ode-matzek.xml zamanj-b-navis.xml nn-ianci-epitaph.xml banic-j-epist-1513-02-16.xml andreis-f-thurc-her.xml lipavic-eleg.xml berislavic-p-epist-1518-04-10.xml kordic-m-funere-lazzari.xml frankapan-b-epist-1525-02-15.xml frankapan-f-epist-1533-07-05.xml vicic-k-thien.xml marul-mar-trop.xml crijev-i-sorgo-1509.xml utjesenovic-j-epist-1545-02-28.xml vitezov-ritter-p-epist-marsil.xml krsava-j-epigram.xml ian-pan-oratio.xml nn-vekenega.xml stay-b-philos.xml andreis-f-epist-1570-02-02.xml mazuranic-a-epist.xml kunic-r-epigr.xml niger-t-epist-adr-1522.xml nn-thoma-archid.xml banic-j-epist-1513-06-04.xml vicic-k-jess.xml milasin-f-met.xml banic-j-epist.xml brodaric-s-epist-1525-11-30.xml brodaric-s-epist-1525-09-30.xml crijev-i-carm-1678.xml vrancic-a-epist-1553-08-12.xml gradic-s-palmottae-vita.xml skerle-n-ep-verh-1798-04.xml vrancic-a-c-vd.xml sisgor-g-eleg.xml dubrovnik-epist-1531-07-22.xml andronic-trag-elegia.xml banic-j-epist-1513-05-01.xml skerle-n-ep-verh-1798-07.xml baric-a-stat.xml crnkovic-p-epist.xml brodaric-s-epist-1526-02-22.xml marul-mar-epist-1477.xml skerle-n-ep-verh-1798-10-16.xml

Etc. etc.

Search several files in a collection

Problem: once the TEI files are in a collection, a plain XQuery / XPath expression such as:

//date

will produce zero results.

Solution: the correct notation has to take the namespace into account. It is:

//*:date

However, the expression:

//tei:date

won't work on its own, because the tei prefix is not yet declared: “Error: [XPST0081] No namespace declared for “tei:date”.”

UPDATE: in XQuery, a namespace is declared like this:

declare namespace tei = "http://www.tei-c.org/ns/1.0";

… at the very beginning of the XQuery program or expression (in the query prolog), that is, before the FLWOR expression.
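The three notations can be compared side by side in a small self-contained sketch. The inline TEI fragment is invented for illustration (it is not from the croala files), so the query runs without any database.

```xquery
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical one-paragraph TEI document, in the TEI namespace :)
let $doc :=
  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <text>
      <body>
        <p>Written on <date when="1513-02-16">16 February 1513</date>.</p>
      </body>
    </text>
  </TEI>
return (
  count($doc//date),      (: no namespace: matches nothing :)
  count($doc//*:date),    (: any namespace: matches the TEI date :)
  count($doc//tei:date)   (: declared prefix: matches the TEI date :)
)
```

The counts come out as 0, 1 and 1: the unprefixed name looks for a date element in no namespace, while the wildcard and the declared prefix both find the TEI element.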

Find all mentions of "Dubrovnik" as a place name in the collection

XPath, or “search” in BaseX:

//*:placeName[. = "Dubrovnik"]

We get 161 results.

Find in the collection all persons born in Dubrovnik

//*:person[*:birth/*:placeName = "Dubrovnik"]

The “*:” notation is necessary because the TEI elements are in a namespace (see above); it is not a consequence of searching in a collection as such.

Now we get 76 results. The collection contains 276 persons.

Find in the collection all persons who wrote elegies

//*:person[*:occupation[contains(.,"elegij")]]

49 results, out of 276 persons.

Find only the persons connected with Dubrovnik who wrote elegies

We use an XQuery expression:

for $eleg in //*:person[*:occupation[contains(.,"elegij")]] 
where $eleg//*:placeName[. = "Dubrovnik"] 
return $eleg

24 results, out of 49 elegiac authors.
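The same filter can probably also be written as a single path with two predicates. The sketch below compares both forms on three invented person entries (the names, ids and occupations are hypothetical, not taken from the prosopographical database).

```xquery
(: Hypothetical sample data in the TEI namespace :)
let $list :=
  <listPerson xmlns="http://www.tei-c.org/ns/1.0">
    <person xml:id="p1">
      <occupation>pisac elegija</occupation>
      <birth><placeName>Dubrovnik</placeName></birth>
    </person>
    <person xml:id="p2">
      <occupation>pisac elegija</occupation>
      <birth><placeName>Split</placeName></birth>
    </person>
    <person xml:id="p3">
      <occupation>pisac epigrama</occupation>
      <birth><placeName>Dubrovnik</placeName></birth>
    </person>
  </listPerson>

(: FLWOR form, as in the diary :)
let $flwor :=
  for $eleg in $list//*:person[*:occupation[contains(., "elegij")]]
  where $eleg//*:placeName[. = "Dubrovnik"]
  return $eleg

(: Single path with two predicates :)
let $path :=
  $list//*:person[*:occupation[contains(., "elegij")]]
                 [.//*:placeName = "Dubrovnik"]

return (data($flwor/@xml:id), data($path/@xml:id))
```

Both forms select only p1 here, since p2 was not born in Dubrovnik and p3 is not an elegist.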

Return just x results from the found set

Problem: we need to show just 10 results from a bigger set found in the prosopographical database. It does not have to be only the first 10 results; we can show, e. g., just the results from 200-209.

Solution. We use the following XQuery expressions:

for $person in subsequence(collection("prosop1")//*:person, 1, 10) return $person

“1” means “start from result 1”. “10” means “show just ten results”.

Source: the XQuery Wikibook (= Wikibooks contributors, “XQuery,” Wikibooks, The Free Textbook Project, http://en.wikibooks.org/w/index.php?title=XQuery&oldid=2361911 (accessed June 18, 2012)).

for $person in subsequence(collection("prosop1")//*:person, 200, 10) return $person

Show the 10 results from 200 onwards.

for $person in subsequence(collection("prosop1")//*:person, 200, 10) 
order by $person/*:death/*:date/@when 
return $person

Show only the 10 results from 200 onwards, sorting them by value of the “when” attribute in the “death/date” element.

for $person in subsequence(collection("prosop1")//*:person, 200, 10) 
order by $person/*:death/*:date/@when descending 
return $person

The same set, sorted in descending order.
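Note that in the queries above the slice is taken first and only those ten items are sorted. A minimal sketch on generated elements (the n attribute is made up for the demonstration) shows how subsequence() counts, and what changes if the sorting is done before slicing.

```xquery
(: 300 hypothetical person elements, numbered 1 to 300 :)
let $people := for $i in 1 to 300 return <person n="{$i}"/>

(: ten items, starting with item 200 :)
let $slice := subsequence($people, 200, 10)

(: sort the whole set first, then take the first three :)
let $sorted :=
  for $p in $people
  order by number($p/@n) descending
  return $p

return (
  count($slice),
  string($slice[1]/@n),
  string($slice[last()]/@n),
  for $p in subsequence($sorted, 1, 3) return string($p/@n)
)
```

The slice has 10 items, running from 200 to 209; sorting the whole set before slicing gives 300, 299, 298.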

Order by attribute value

Select all persName elements which have a ref attribute; sort them by value of the attribute; return the persName elements.

for $pers in //*:body//*:persName[@ref]
order by string($pers/@ref)
return $pers

Variation: order divs by month (which is expressed as @when inside date, part of each head element):

for $mon in //*:div[@ana[. = 'mensis']]
order by string($mon/*:head/*:date/@when)
return $mon/@xml:id

Even more complex: the collection contains an index of persons and a list of records. Select all matches of names in records (identified by the @ref attribute) with persons (identified by @xml:id). Order the set by the date of the record, which is found in the div element whose @ana attribute is dies, inside its head element, marked up as date with a @when attribute. (All these elements belong to the TEI XML vocabulary.)

for $pers in //*:person, $persname in //*:persName
let $date := $persname/ancestor::*:div[@ana[. = 'dies']]/*:head/*:date/@when
where $pers/@xml:id = $persname/@ref
order by $date
return (data($pers/@xml:id), data($date))

Count number of matches, group by that number

We have two types of divs, one for months (@ana='mensis'), one for days (@ana='dies'). We need to count days inside months and to group resulting counts. Furthermore, each month contains records of several types (marked by a part of @xml:id value), but we select just one type. The results are returned wrapped inside a p element.

for $mensis in //*:div[@ana[. = 'mensis']][@xml:id[contains(., 'minor')]]
let $days := count($mensis/*:div[@ana[. = 'dies']])
order by $days
return <p>{$days}</p>
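The query above sorts the counts, so equal counts end up next to each other. If actual grouping is wanted, one option is the XQuery 3.0 group by clause, which BaseX supports. This is only a sketch, not the query used in this diary: the inline data and the xml:id values are hypothetical, and the days and months attributes are invented for the output.

```xquery
xquery version "3.0";

(: Hypothetical sample: three "minor" months and one "maior" month :)
let $doc :=
  <body>
    <div ana="mensis" xml:id="minor-1390-01">
      <div ana="dies"/><div ana="dies"/>
    </div>
    <div ana="mensis" xml:id="minor-1390-02">
      <div ana="dies"/>
    </div>
    <div ana="mensis" xml:id="minor-1390-03">
      <div ana="dies"/><div ana="dies"/>
    </div>
    <div ana="mensis" xml:id="maior-1390-01">
      <div ana="dies"/>
    </div>
  </body>

for $mensis in $doc//*:div[@ana = 'mensis'][contains(@xml:id, 'minor')]
(: number of day divs directly inside this month :)
let $days := count($mensis/*:div[@ana = 'dies'])
(: after grouping, $mensis holds all months that share the same count :)
group by $days
order by $days
return <p days="{$days}" months="{count($mensis)}"/>
```

With this sample, one month has 1 day and two months have 2 days, so the query returns one p element per distinct count; the "maior" month is filtered out.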

Table rows: meetings, dates, number of votes

Meeting records are in divs (with @ana = “dies”); their head contains the date (date/@when) and the number of persons present (num/@value, which can be wrapped in a sic element when something is strange in the MS).

We want to see on which days how many people were present, and to know what type of meeting it was (the type is encoded as part of div/@n value).

The results are returned as HTML table rows (tr) and cells (td).

for $dies in //*:div[@ana[. eq 'dies']]
let $balote := $dies/*:head
let $broj := $balote/*:num[not(parent::*:date)][last()]/@value
let $dan := $balote/*:date/@when
order by number($broj) descending, $dan
return <tr>
<td type="sjednica">{data($dies/@n)}</td>
<td type="dan">{data($dan)}</td>
<td type="balote">{data($broj)}</td>
</tr>

Update files

The working collection (database) consists of several TEI XML files. They are altered elsewhere, and we want to update the BaseX database. The easiest way seems to be using several REPLACE commands:

replace dbk1390-92idx.xml /home/neven/rad/croala-r/radno/dbk1390-92idx.xml
 
...

The local name (the path of the TEI XML file inside the database) is the same as the name of the file on disk, which is given by its full path.

 
z/basex-adv.txt · Last modified: 06. 05. 2013. 19:30 by njovanov
 