Welcome to PAISÀ

On these pages we present PAISÀ, a large corpus of authentic contemporary Italian texts from the web. It was created within the PAISÀ project (Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati) with the aim of providing a large, freely available resource of Italian texts for language learning through the study of authentic text materials.

It is a unique language resource for Italian in that it combines the following features:

corpus of web texts (harvested in September/October 2010)
composed entirely of freely available and freely distributable texts (under Creative Commons licenses: Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike)
size of about 250 million tokens
fully annotated in CoNLL format (lemmatized, POS-tagged, and annotated for syntactic dependencies)
automatically cleaned up and in part manually corrected (at different processing stages: retrieval of URLs, clean-up of harvested texts, and correction of annotations to adjust the annotation tools)
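
For readers who download the corpus, the following minimal Python sketch illustrates how CoNLL-style annotation can be read sentence by sentence. It assumes tab-separated token lines, blank lines between sentences, and a CoNLL-X-like column order (ID, form, lemma, coarse POS, fine POS, features, head, dependency relation); the actual column layout of the PAISÀ files may differ and should be checked against the corpus documentation. The names `COLUMNS` and `read_conll` are hypothetical, and the sample sentence and its tags are hand-written for illustration, not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, List

# Assumed column names (CoNLL-X style); the real PAISÀ layout may differ.
COLUMNS = ["id", "form", "lemma", "cpostag", "postag", "feats", "head", "deprel"]


def read_conll(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Skip comment lines (assumed convention).
            continue
        fields = line.split("\t")
        sentence.append(dict(zip(COLUMNS, fields)))
    if sentence:
        # Emit the last sentence if the input has no trailing blank line.
        yield sentence


# Hand-written illustration only; tags are invented, not PAISÀ output.
SAMPLE = (
    "1\tIl\til\tR\tRD\tnum=s|gen=m\t2\tdet\n"
    "2\tgatto\tgatto\tS\tS\tnum=s|gen=m\t3\tsubj\n"
    "3\tdorme\tdormire\tV\tV\tnum=s|per=3\t0\tROOT\n"
    "\n"
)

sentences = list(read_conll(SAMPLE.splitlines()))
lemmas = [token["lemma"] for token in sentences[0]]
print(lemmas)
```

Reading the corpus as a stream of sentences, rather than loading it whole, keeps memory use low, which matters for a resource of this size.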

Although it was created primarily for language learning, the PAISÀ corpus is also a rich resource for research.

This web site will provide a learner-oriented interface for online access to the corpus. The interface will offer different modes of access, ranging from precompiled searches to fully flexible search options for constructing complex queries, in order to serve different user groups. It is a work in progress and is continuously updated.

In addition, you will find information on the PAISÀ project, details on the construction of the corpus, and the full corpus for download.

Funding for the PAISÀ project is provided by the Ministero dell'Istruzione, dell'Università e della Ricerca (MIUR) through the program Fondo per gli Investimenti della Ricerca di Base (FIRB).