Corpus Italiano


Welcome to PAISÀ

On these pages we present the PAISÀ corpus, a large corpus of authentic contemporary Italian texts from the web. It was created in the context of the project PAISÀ (Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati) with the aim of providing a large resource of freely available Italian texts for language learning through the study of authentic text material.

It is a unique language resource for Italian in that it combines the following features:

  • corpus of web texts (harvested in September/October 2010)
  • composed entirely of freely available and freely distributable texts (under the Creative Commons licenses Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike)
  • size of about 250 million tokens
  • fully annotated in CoNLL format (lemmatized, POS-tagged, and annotated for syntactic dependencies)
  • automatically cleaned up and in part manually corrected (at different processing stages: retrieval of URLs, clean-up of harvested texts, and correction of annotations to adjust the annotation tools)
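
To illustrate what the CoNLL annotation mentioned above makes possible, here is a minimal Python sketch that reads CoNLL-style lines and lists the dependency relations of a sentence. It rests on assumptions not stated on this page: that each token line is tab-separated and carries at least the first eight CoNLL-X columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL), and that sentences are separated by blank lines. The sample sentence, its tags, and the names `Token`, `parse_conll` and `dependencies` are hypothetical and are not taken from the corpus or its tools.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    idx: int      # position of the token in the sentence (1-based)
    form: str     # word form as it appears in the text
    lemma: str    # lemma
    cpos: str     # coarse-grained POS tag
    pos: str      # fine-grained POS tag
    feats: str    # morphological features
    head: int     # index of the syntactic head (0 = root)
    deprel: str   # dependency relation to the head


def parse_conll(text: str) -> List[List[Token]]:
    """Split CoNLL-style text into sentences, each a list of Token."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.rstrip("\n").split("\t")
        # Skip anything that is not a token line (assumed: numeric ID
        # and HEAD, at least eight columns), e.g. metadata or markup.
        if len(cols) < 8 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        current.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                             cols[4], cols[5], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def dependencies(sentence: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (dependent form, relation, head form) triples."""
    forms = {t.idx: t.form for t in sentence}
    return [(t.form, t.deprel, forms.get(t.head, "ROOT")) for t in sentence]


# Invented sample; the tag and relation labels are illustrative only.
SAMPLE = (
    "1\tIl\til\tR\tRD\tnum=s|gen=m\t2\tdet\n"
    "2\tgatto\tgatto\tS\tS\tnum=s|gen=m\t3\tsubj\n"
    "3\tdorme\tdormire\tV\tV\tnum=s|per=3\t0\tROOT\n"
    "4\t.\t.\tF\tFS\t_\t3\tpunc\n"
)

if __name__ == "__main__":
    for sentence in parse_conll(SAMPLE):
        for dependent, relation, head in dependencies(sentence):
            print(f"{dependent} --{relation}--> {head}")
```

Reading the file one sentence at a time like this keeps memory use low, which matters for a corpus of this size; the column positions should be checked against the documentation of the downloaded corpus before relying on them.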

Even though primarily created for language learning, the PAISÀ corpus also provides a rich resource for research.

This web site will provide a learner-oriented interface for online access to the corpus. The interface will offer different modes of access, ranging from precompiled searches to fully flexible search options for constructing complex queries, in order to serve different user groups. It is a work in progress and is continuously updated.

In addition, you will find information on the project PAISÀ, details on the construction of the corpus, and the full corpus for download.

Funding for the project PAISÀ is provided by the Ministero dell'Istruzione, dell'Università e della Ricerca (MIUR), by means of the program Fondo per gli Investimenti della Ricerca di Base (FIRB).