Sometimes these categories overlap, notably in the case of topical categories, as a text can be relevant to more than one topic. Occasionally, text collections have temporal structure, news collections being the most common example.

For convenience, the corpus methods accept a single fileid or a list of fileids. Similarly, we can specify the words or sentences we want in terms of files or categories.

The graph in fig-inaugural used "word offset" as one of the axes; this is the numerical index of the word in the corpus, counting from the first word of the first address. However, the corpus is actually a collection of 55 texts, one for each presidential address.
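The selection conventions described above can be sketched with a toy in-memory corpus. This is a minimal illustration, not NLTK's implementation: the file names, categories, word lists, and the helper functions `as_list`, `select_fileids`, `words`, and `start_offsets` are all hypothetical. It shows a selector that accepts either a single fileid or a list, selection by overlapping categories, and how a "word offset" arises when files are glued end to end.

```python
# Toy stand-in for a categorized corpus. All names and data are hypothetical.
FILES = {
    "a.txt": ["markets", "rose", "today"],
    "b.txt": ["the", "vote", "moved", "markets"],
    "c.txt": ["a", "quiet", "election"],
}

# Categories may overlap: "b.txt" is relevant to two topics.
FILE_CATEGORIES = {
    "a.txt": ["finance"],
    "b.txt": ["finance", "politics"],
    "c.txt": ["politics"],
}


def as_list(value):
    """Accept a single string or an iterable of strings; always return a list."""
    return [value] if isinstance(value, str) else list(value)


def select_fileids(fileids=None, categories=None):
    """Resolve a fileid/category selection to a list of fileids."""
    if fileids is not None:
        return as_list(fileids)
    if categories is not None:
        wanted = set(as_list(categories))
        # Keep every file that belongs to at least one wanted category.
        return [f for f in FILES if wanted & set(FILE_CATEGORIES[f])]
    return list(FILES)  # no restriction: the whole corpus


def words(fileids=None, categories=None):
    """Return the words of the selected files, concatenated in file order."""
    result = []
    for f in select_fileids(fileids, categories):
        result.extend(FILES[f])
    return result


def start_offsets(fileids=None, categories=None):
    """Map each selected file to the word offset at which it begins
    when the selected files are treated as one long text."""
    offsets, position = {}, 0
    for f in select_fileids(fileids, categories):
        offsets[f] = position
        position += len(FILES[f])
    return offsets


if __name__ == "__main__":
    print(words("a.txt"))                 # a single fileid
    print(words(["a.txt", "c.txt"]))      # a list of fileids
    print(words(categories="politics"))   # selection by category
    print(start_offsets())                # where each file starts
```

NLTK's own categorized corpus readers follow the same calling shape, for example `brown.words(fileids=...)` and `brown.words(categories=...)`; the sketch only imitates that interface on made-up data.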

This corpus contains text from 500 sources, and the sources have been categorized by genre. Next, we need to obtain counts for each genre of interest.

The simplest kind of corpus lacks any structure: it is just a collection of texts. Often, texts are grouped into categories that might correspond to genre, source, author, language, etc.

We examined some small text collections in Chapter 1, such as the speeches known as the US Presidential Inaugural Addresses. This particular corpus actually contains dozens of individual texts, one per address, but for convenience we glued them end-to-end and treated them as a single text. We also used various pre-defined texts.

This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score).
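The three statistics just described can be sketched as a small function. This is an illustrative version under stated assumptions, not necessarily the program the text refers to: the function name `text_statistics` and the sample data are hypothetical, and the inputs are assumed to be non-empty. It takes the raw string, the flat list of word tokens, and the list of sentences (each a list of tokens).

```python
def text_statistics(raw, words, sents):
    """Return (average word length, average sentence length, lexical diversity).

    raw   -- the text as a single string
    words -- flat list of word tokens
    sents -- list of sentences, each a list of tokens
    Assumes all three inputs are non-empty.
    """
    # Case-folded vocabulary, so "The" and "the" count as one item.
    vocab = {w.lower() for w in words}

    # Characters per token. len(raw) includes whitespace, so this
    # overestimates the true average word length.
    avg_word_len = round(len(raw) / len(words))

    # Tokens per sentence.
    avg_sent_len = round(len(words) / len(sents))

    # Average number of times each vocabulary item occurs.
    lexical_diversity = round(len(words) / len(vocab))

    return avg_word_len, avg_sent_len, lexical_diversity


if __name__ == "__main__":
    # Hypothetical sample data, tokenized by hand.
    raw = "The cat sat. The cat ran."
    sents = [["The", "cat", "sat", "."], ["The", "cat", "ran", "."]]
    words = [token for sent in sents for token in sent]
    print(text_statistics(raw, words, sents))
```

With an NLTK corpus reader, the three arguments would typically come from calls such as `gutenberg.raw(fileid)`, `gutenberg.words(fileid)`, and `gutenberg.sents(fileid)`, looping over `gutenberg.fileids()`.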

