Lessico Tomistico Biculturale

 

 Index Thomisticus Treebank

 'arbor est causa proxima fructus'

(Scriptum super Sententiis, lib. 2, dist. 34, qu. 1, art. 5, expos., 7-2.7-6)

 

The Index Thomisticus (IT) Treebank is an ongoing project that forms part of the Lessico Tomistico Biculturale (LTB) project by Father Roberto Busa. The project is hosted at the Catholic University of the Sacred Heart in Milan (Italy).

 

IT is considered the pathfinder of computer science applications in the humanities; it contains the opera omnia of Thomas Aquinas (118 texts), plus works by 61 other authors related to Thomas (61 texts). It is a corpus of around 11 million tokens (150,000 types; 20,000 lemmas).

 

LTB aims to develop IT into a lexicon whose entries are all the IT lemmas. Each entry reports the morphological, syntactic and semantic uses and values of the lemma in IT.

 

IT-Treebank aims to turn IT into a treebank. Its main reference is the Prague Dependency Treebank (PDT).

Annotation at the analytical level follows the PDT Annotation Guidelines, together with guidelines written specifically for Latin, which are shared and developed with the Latin Dependency Treebank of the Perseus Project in Boston. The IT tagset is available here.

IT-Treebank data are available in both CSTS-SGML (Czech Sentence Tree Structure) and PML-XML (Prague Markup Language) formats.

At present, IT-Treebank comprises 91,350 tokens in 4,077 syntactically parsed sentences excerpted from Scriptum super Sententiis Magistri Petri Lombardi, Summa contra Gentiles and Summa Theologiae.
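As a quick sanity check on these figures, the average sentence length can be worked out from the two counts given above. The variable names in this short sketch are illustrative only.

```python
# Counts as reported for the current state of IT-Treebank
tokens = 91_350
sentences = 4_077

# Mean number of tokens per syntactically parsed sentence
avg_tokens_per_sentence = tokens / sentences
print(f"{avg_tokens_per_sentence:.1f} tokens per sentence on average")
```

This works out to roughly 22 tokens per sentence.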

 

 

Partners and People

Publications

Conferences and workshops slides 

Browsing the data of IT-Treebank

IT-Treebank can be browsed through Netgraph. Netgraph is a client-server application developed for browsing the data of PDT.

There are two ways to search the data:

 

1.    Stand-alone application: download Netgraph and request a password;

2.    Applet version of Netgraph

 

The Index Thomisticus Treebank Valency Lexicon (IT-VaLex) can be browsed on-line here. Valency is generally defined as the number of obligatory complements required by a word: these complements are usually named ‘arguments’, while the non-obligatory ones are referred to as ‘adjuncts’. Although valency can be assigned to different parts of speech (usually verbs, nouns and adjectives), scholars have mainly focused their attention on verbs, so that the notion of valency often coincides with verbal valency.

IT-VaLex is a collection of verbal lexical entries enhanced with valency and subcategorization frames. IT-VaLex is closely related to the Index Thomisticus Treebank project, since it is a corpus-based valency lexicon automatically induced from IT-TB data. The lexicon can be browsed by lexical entry, or by number and surface order of the arguments, which are linked to their lexical fillers.

In the Index Thomisticus Treebank annotation style, verbal arguments are annotated using the following tags: Sb (Subject), Obj (Object), OComp (Object Complement) and Pnom (Predicate Nominal).
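To illustrate how valency frames might be read off dependency trees annotated with these tags, here is a minimal sketch in Python. It is not the actual IT-VaLex induction procedure, which is not described on this page. The `Node` class, the `induce_frames` function and the toy annotation of the motto 'arbor est causa proxima fructus' are assumptions made for illustration; only the four argument tags come from the text above.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Argument tags named in the IT-Treebank annotation style
ARGUMENT_TAGS = {"Sb", "Obj", "OComp", "Pnom"}


@dataclass
class Node:
    """One token of a dependency tree (hypothetical, simplified structure)."""
    idx: int               # 1-based position of the token in the sentence
    form: str              # word form as it appears in the text
    lemma: str             # lemma of the word form
    afun: str              # analytical function (dependency label)
    head: int              # idx of the governing node, 0 for the root
    is_verb: bool = False  # simplified stand-in for a morphological tag


def induce_frames(sentences):
    """Map each verb lemma to a Counter of frames.

    A frame is the tuple of argument tags of the verb's direct
    dependents, listed in surface order. Dependents with any other
    tag are treated as non-arguments and ignored.
    """
    lexicon = defaultdict(Counter)
    for sentence in sentences:
        for verb in (n for n in sentence if n.is_verb):
            args = sorted(
                (n for n in sentence
                 if n.head == verb.idx and n.afun in ARGUMENT_TAGS),
                key=lambda n: n.idx,
            )
            lexicon[verb.lemma][tuple(n.afun for n in args)] += 1
    return dict(lexicon)


# Toy annotation written for this example, not taken from the treebank data
sentence = [
    Node(1, "arbor", "arbor", "Sb", head=2),
    Node(2, "est", "sum", "Pred", head=0, is_verb=True),
    Node(3, "causa", "causa", "Pnom", head=2),
    Node(4, "proxima", "proximus", "Atr", head=3),
    Node(5, "fructus", "fructus", "Atr", head=3),
]

lexicon = induce_frames([sentence])
for lemma, frames in lexicon.items():
    for frame, count in frames.items():
        print(lemma, frame, count)
```

In this toy sentence the verb 'sum' receives a single frame made of a subject followed by a predicate nominal, while the two attributes of 'causa' do not enter the frame because they depend on the noun and carry no argument tag.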

We are constantly improving our data. Please report any errors or send your comments to Marco Passarotti. For more information on IT-VaLex, see the publications page.

 

IT data (not treebanked) can be browsed through Corpus Thomisticum.

 

For internal use only (password required): Files Treebank