A statistical study of the WPT05 crawl of the Portuguese Web

David Batista, Mário Silva

Abstract: This article presents a statistical study of WPT05, a text corpus built from a crawl of the Portuguese Web performed in 2005. The corpus is a valuable resource for researchers in Natural Language Processing (NLP) and is one of the largest publicly available collections of European Portuguese text. We provide a statistical analysis of its contents, covering the languages identified, the top-level domains of the crawled URLs, and term frequency and size. We also present an analysis of an n-gram collection extracted from the Portuguese documents in the corpus. We analyze the occurrence of first names, surnames and geographic names in the corpus. Because some toponyms are named after people, we show the overlap between Portuguese personal names and the names of geographic entities corresponding to places in Portugal.

Index Terms: web corpus, resources, Portuguese.
