Token boundaries
Broadly speaking, one token equates to one matrix sentence (IP-MAT), including an embedded clause if present.
However, note the following:
- When two independent finite clauses are conjoined, the two clauses are treated as separate tokens:
(IP-MAT (NP-OB1 (DPDS Dat))          ← first independent clause as a token
        (VVFIN beleueden)
        (NP-SBJ (PPER se) (DIN alle))
        (PP (APPR myt) (NP (NA willen))) )
(IP-MAT (KON und)                    ← second independent clause as a token
        (NP-SBJ *con*)
        (VVFIN scheiden)
        (PP (APPR van) (NP (PPER em))) )
- Direct speech which constitutes an IP-MAT can sit within a higher IP-MAT introducing the speech. The direct speech matrix clause gets the extended tag -SPE:
(IP-MAT (ADVP (ADV DO))
        (VVFIN sprak)
        (NP-SBJ (PPER he))
        (IP-MAT-SPE (NP-SBJ (PPER ik))   ← IP which is direct speech
                    (PTKNEG ne)
                    (VVFIN bin)) )
- In contrast to direct speech (see above), a citation which is an IP-MAT and is introduced by e.g. 'X writes' is treated as a separate token from the clause introducing it:
(IP-MAT (NP-SBJ (NE Salustius))      ← first token
        (VVFIN scrift) )
(IP-MAT (NP-SBJ (DDARTA de)          ← second token
                (FM Troyani))
        (VAFIN hebben)
        (NP-OB1 (NE rome))
        (VVPP ghebuwet) )
- Chapter and section headings are treated as standalone tokens. They are tagged IP-MAT if they constitute an independent finite clause, and FRAG otherwise:
(FRAG (PP (APPR Van) (NP (DDARTA dem) (NA Borchgherichte))) )
- Places and dates given for the time of writing are also treated as standalone tokens and are tagged FRAG:
(FRAG (FM proximo) (FM libro) (FM de) (FM ciuitate) (FM dei) )
(FRAG (XY 2.000dcclxxx) (FM Abbon) )
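The token boundary conventions above mean that every token is one top-level bracketed tree, with direct speech nested inside its introducing clause. The following Python sketch illustrates how such trees could be separated and their root labels read off. It is not part of the corpus tooling: the function names `split_tokens` and `root_label` are hypothetical, and the sketch assumes bare trees with no wrapper or ID nodes and no parentheses inside word forms.

```python
import re


def split_tokens(text):
    """Split bracketed annotation text into top-level trees (one per token).

    Assumption: parentheses only ever mark tree structure.
    """
    tokens = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level tree begins here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unbalanced ')' at offset {i}")
            if depth == 0:
                tokens.append(text[start:i + 1])
    if depth != 0:
        raise ValueError("unbalanced '(' at end of input")
    return tokens


def root_label(token):
    """Return the label of the outermost node, e.g. 'IP-MAT' or 'FRAG'."""
    match = re.match(r"\(\s*([^\s()]+)", token)
    if match is None:
        raise ValueError("not a bracketed tree")
    return match.group(1)


# Citation introduced by 'X writes' (two tokens) followed by a heading (FRAG).
sample = """
(IP-MAT (NP-SBJ (NE Salustius)) (VVFIN scrift))
(IP-MAT (NP-SBJ (DDARTA de) (FM Troyani)) (VAFIN hebben) (NP-OB1 (NE rome)) (VVPP ghebuwet))
(FRAG (PP (APPR Van) (NP (DDARTA dem) (NA Borchgherichte))))
"""

# Direct speech stays inside the higher IP-MAT, so this is a single token.
speech = (
    "(IP-MAT (ADVP (ADV DO)) (VVFIN sprak) (NP-SBJ (PPER he)) "
    "(IP-MAT-SPE (NP-SBJ (PPER ik)) (PTKNEG ne) (VVFIN bin)))"
)

labels = [root_label(t) for t in split_tokens(sample)]
speech_tokens = split_tokens(speech)
```

Counting bracket depth rather than searching for the string `IP-MAT` is what keeps an embedded IP-MAT-SPE from being mistaken for a new token.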
Token IDs
Each token has a unique ID in the form of ID TEXT.DIALECT.GENRE.NUMBER, for example:
ID ARZNEI.WP.SCI.1, ID ARZNEI.WP.SCI.2, ID ARZNEI.WP.SCI.3, etc.
ID ENGELHUS.EP.HIS.1, ID ENGELHUS.EP.HIS.2, ID ENGELHUS.EP.HIS.3, etc.
ID STRALSUND.EE.CHART.1, ID STRALSUND.EE.CHART.2, ID STRALSUND.EE.CHART.3, etc.
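As an illustration of the ID scheme, the following Python sketch splits an ID into its four components. The names `TokenID` and `parse_token_id` are hypothetical, and the sketch assumes that none of the components contains a dot or whitespace and that NUMBER is a plain decimal integer, as in the examples given here.

```python
import re
from typing import NamedTuple


class TokenID(NamedTuple):
    """The four components of a token ID (TEXT.DIALECT.GENRE.NUMBER)."""
    text: str
    dialect: str
    genre: str
    number: int

    def __str__(self):
        return f"{self.text}.{self.dialect}.{self.genre}.{self.number}"


# An optional leading "ID" keyword, then four dot-separated fields.
_ID_RE = re.compile(r"^(?:ID\s+)?([^.\s]+)\.([^.\s]+)\.([^.\s]+)\.(\d+)$")


def parse_token_id(raw):
    """Parse e.g. 'ID ARZNEI.WP.SCI.1' into a TokenID."""
    match = _ID_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"not a valid token ID: {raw!r}")
    text, dialect, genre, number = match.groups()
    return TokenID(text, dialect, genre, int(number))


first = parse_token_id("ID ARZNEI.WP.SCI.1")
charter = parse_token_id("STRALSUND.EE.CHART.3")
```

Storing NUMBER as an integer lets tokens from one text be sorted in running order, so that token 10 follows token 9 rather than token 1.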