Syntactic annotation – tokenisation and IDs – Corpus of Historical Low German

Token boundaries

Broadly speaking, one token corresponds to one matrix sentence (IP-MAT), including any embedded clauses.

However, note the following:

When two independent finite clauses are conjoined, the two clauses are treated as separate tokens:

(IP-MAT (NP-OB1 (DPDS Dat))     ← first independent clause as a token
        (VVFIN beleueden) 
        (NP-SBJ (PPER se)
                (DIN alle))
        (PP (APPR myt)
            (NP (NA willen)))
)

(IP-MAT (KON und)               ← second independent clause as a token
        (NP-SBJ *con*)      
        (VVFIN scheiden)
        (PP (APPR van)
            (NP (PPER em)))
)
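The one-tree-per-token convention can be checked mechanically. The following is a minimal sketch, assuming that each token is stored as one top-level bracketed tree and that word forms never contain parentheses; the function name `split_tokens` is hypothetical and not part of any corpus tooling.

```python
from typing import List


def split_tokens(text: str) -> List[str]:
    """Return each top-level bracketed tree in `text` as one string."""
    tokens: List[str] = []
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level tree (token) begins here
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError(f"unmatched ')' at offset {i}")
            depth -= 1
            if depth == 0:
                tokens.append(text[start:i + 1])
    if depth != 0:
        raise ValueError("unclosed '(' at end of input")
    return tokens


# The two conjoined clauses from the example above, without the arrow notes.
SAMPLE = """
(IP-MAT (NP-OB1 (DPDS Dat))
        (VVFIN beleueden)
        (NP-SBJ (PPER se)
                (DIN alle))
        (PP (APPR myt)
            (NP (NA willen)))
)

(IP-MAT (KON und)
        (NP-SBJ *con*)
        (VVFIN scheiden)
        (PP (APPR van)
            (NP (PPER em)))
)
"""

tokens = split_tokens(SAMPLE)
print(len(tokens))  # 2
```

Because the split depends only on bracket depth, the second clause is returned as its own token even though it begins with the conjunction und.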

Direct speech which constitutes an IP-MAT can sit within a higher IP-MAT introducing the speech. The direct speech matrix clause gets the extended tag -SPE:

(IP-MAT (ADVP (ADV DO))
        (VVFIN sprak)
        (NP-SBJ (PPER he))
        (IP-MAT-SPE (NP-SBJ (PPER ik))    ← IP which is direct speech
                    (PTKNEG ne)
                    (VVFIN bin))
)
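To illustrate how direct speech stays inside the token that introduces it, the sketch below parses one bracketed tree and collects the nodes labelled IP-MAT-SPE. It assumes the plain bracketed format shown above (a label followed by child nodes or a word form); `parse_tree`, `find_nodes` and `words` are hypothetical helper names.

```python
import re
from typing import List, Tuple

Node = Tuple[str, list]  # (label, children); children are Nodes or word strings


def parse_tree(text: str) -> Node:
    """Parse one bracketed tree into nested (label, children) tuples."""
    parts = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node() -> Node:
        nonlocal pos
        if pos + 1 >= len(parts) or parts[pos] != "(":
            raise ValueError("expected '(' followed by a label")
        label = parts[pos + 1]
        pos += 2
        children: list = []
        while pos < len(parts) and parts[pos] != ")":
            if parts[pos] == "(":
                children.append(read_node())
            else:
                children.append(parts[pos])  # a word form or a trace such as *con*
                pos += 1
        if pos >= len(parts):
            raise ValueError("missing ')'")
        pos += 1  # consume the closing bracket
        return (label, children)

    tree = read_node()
    if pos != len(parts):
        raise ValueError("unexpected material after the tree")
    return tree


def find_nodes(node: Node, label: str) -> List[Node]:
    """Return every node in the tree whose label equals `label`."""
    name, children = node
    found = [node] if name == label else []
    for child in children:
        if isinstance(child, tuple):
            found.extend(find_nodes(child, label))
    return found


def words(node: Node) -> List[str]:
    """Return the terminal strings under `node`, in order."""
    out: List[str] = []
    for child in node[1]:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out


TREE = """
(IP-MAT (ADVP (ADV DO))
        (VVFIN sprak)
        (NP-SBJ (PPER he))
        (IP-MAT-SPE (NP-SBJ (PPER ik))
                    (PTKNEG ne)
                    (VVFIN bin))
)
"""

tree = parse_tree(TREE)
speech = find_nodes(tree, "IP-MAT-SPE")
print(tree[0], len(speech), words(speech[0]))
```

Here the whole example is a single token whose root is IP-MAT, and the direct speech is found as one nested IP-MAT-SPE node rather than as a second token.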

In contrast to direct speech (see above), a citation which is itself an IP-MAT and is introduced by e.g. 'X writes' is treated as a separate token from the introducing clause:

(IP-MAT (NP-SBJ (NE Salustius))   ← first token
        (VVFIN scrift)
)

(IP-MAT (NP-SBJ (DDARTA de)       ← second token
                (FM Troyani))
        (VAFIN hebben)
        (NP-OB1 (NE rome))
        (VVPP ghebuwet)
)

Chapter and section headings are treated as standalone tokens. They are tagged IP-MAT if they constitute an independent finite clause, and FRAG otherwise:

(FRAG (PP (APPR Van)
          (NP (DDARTA dem)
              (NA Borchgherichte)))
)

Statements of the place and date of writing are also treated as standalone tokens and are tagged FRAG:

(FRAG (FM proximo)
      (FM libro)
      (FM de)
      (FM ciuitate)
      (FM dei)
)

(FRAG (XY 2.000dcclxxx)
      (FM Abbon)
)

Token IDs

Each token has a unique ID in the form of ID TEXT.DIALECT.GENRE.NUMBER, for example:

ID ARZNEI.WP.SCI.1, ID ARZNEI.WP.SCI.2, ID ARZNEI.WP.SCI.3 etc…
ID ENGELHUS.EP.HIS.1, ID ENGELHUS.EP.HIS.2, ID ENGELHUS.EP.HIS.3 etc…
ID STRALSUND.EE.CHART.1, ID STRALSUND.EE.CHART.2, ID STRALSUND.EE.CHART.3 etc…
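
The ID format can be split into its four components with a short helper. This is a minimal sketch, assuming that none of the TEXT, DIALECT and GENRE fields contains a full stop and that NUMBER is a plain decimal integer; `TokenID` and `parse_token_id` are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple


class TokenID(NamedTuple):
    text: str
    dialect: str
    genre: str
    number: int


# Optional leading "ID", then four dot-separated fields, the last one numeric.
_ID_PATTERN = re.compile(r"^(?:ID\s+)?([^.\s]+)\.([^.\s]+)\.([^.\s]+)\.(\d+)$")


def parse_token_id(raw: str) -> TokenID:
    """Split an ID such as 'ID ARZNEI.WP.SCI.1' into its components."""
    match = _ID_PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"not a TEXT.DIALECT.GENRE.NUMBER id: {raw!r}")
    text, dialect, genre, number = match.groups()
    return TokenID(text, dialect, genre, int(number))


print(parse_token_id("ID ENGELHUS.EP.HIS.2"))
```

Storing NUMBER as an integer means that IDs from the same text sort in token order (1, 2, 3, ... 10) rather than in string order.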