Ancient text corpora are the entire collection of texts from the period of ancient history, defined in this article as the period from the beginning of writing up to 300 AD. These corpora are important for the study of literature, history, linguistics, and other fields, and are a fundamental component of the world's cultural heritage.
Chinese, Latin, and Greek are examples of ancient languages with significant text corpora, although much of this material is known to us via transmission (frequently via medieval manuscript copies) rather than in its original form. These texts, both transmitted and original, provide valuable insights into the history and culture of different regions of the world, and have been studied for centuries by scholars and researchers. Other ancient texts, particularly stone inscriptions and papyrus scrolls, have been published following archaeological research, notably the cuneiform corpus of c. 10 million words and the c. 5 million words in ancient Egyptian.
Through advances in technology and digitization, ancient text corpora are more accessible than ever before. Tools such as the Perseus Digital Library and the Digital Corpus of Sanskrit[1] have made it easier for researchers to access and analyze these texts.
Two types of ancient texts are known to modern scholars – those that have only survived in younger manuscripts, but whose great age is undisputed (this applies to the bulk of the Chinese, Brahmi, Greek, Latin, Hebrew and Avestan tradition), and those known from original inscriptions, papyri and other manuscripts.[2]
Counting the words in each corpus presents significant methodological challenges. In principle, every occurrence of a word in a text is counted separately, but where a literary text survives in several parallel transmissions, only one transmission is taken into account. The Book of the Dead and the Coffin Texts are therefore included only once in the figure given for Egyptian, and Greek and Latin literary works should likewise be counted according to a single manuscript. By contrast, tomb inscriptions, royal inscriptions and economic documents in certain ancient languages often follow a more or less identical form, but this is not treated as a purely "parallel tradition", so each such document is counted. Attached prepositions are counted as separate words, but the definite article in Hebrew, Aramaic and Greek is not: it has no equivalent in most languages, so counting it would significantly affect the comparability of the numbers.[2]
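The counting rules described above can be illustrated with a minimal Python sketch. It is not the procedure used by the cited source: the function name `count_words`, the data layout (each work mapped to a list of already tokenized witnesses) and the English toy tokens are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List


def count_words(
    works: Dict[str, List[List[str]]],
    skipped_forms: FrozenSet[str] = frozenset(),
) -> int:
    """Count word occurrences across a corpus (hypothetical helper).

    `works` maps a work identifier to its witnesses (parallel transmissions),
    each witness being a list of tokens. Only the first witness of each work
    is counted. Tokens listed in `skipped_forms` (for example, forms of a
    definite article) are left out of the total.
    """
    total = 0
    for witnesses in works.values():
        if not witnesses:
            continue
        reference = witnesses[0]  # one transmission per work
        total += sum(1 for token in reference if token not in skipped_forms)
    return total


# Toy data, not drawn from any real corpus.
works = {
    # One literary work preserved in two parallel copies: counted once.
    "hymn": [
        ["the", "king", "built", "the", "temple"],
        ["the", "king", "built", "the", "temple"],
    ],
    # Two formulaic but separate documents: each one is counted.
    "receipt_1": [["ten", "sheep", "received"]],
    "receipt_2": [["ten", "sheep", "received"]],
}

all_tokens = count_words(works)
without_article = count_words(works, frozenset({"the"}))
```

In this toy corpus the second copy of the hymn adds nothing to the total, while the two near-identical receipts both contribute, mirroring the distinction between parallel transmission and formulaic documents.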
| Script | Language | Dates used | Number of texts prior to 300 AD | Words prior to 300 AD (archaeological) | Words prior to 300 AD (transmission) | Words prior to 300 AD (total) | Ref. |
|---|---|---|---|---|---|---|---|
| Egyptian hieroglyphs / Hieratic | Egyptian | | | 5,000,000 | none | 5,000,000 | [3][4] |
| Demotic | | | | 1,000,000 | none | 1,000,000 | [5] |
| Greek (Ancient Greek literature, New Testament, Church Fathers, etc.) | | | | | | 57,000,000 | [6][7] |
| Latin | | | | | | 10,000,000 | [8][7] |
| Cuneiform | Akkadian | | 144,000[9] | 9,900,000[9] | none | 9,900,000 | [10] |
| | Sumerian | | 102,300[11] | 3,076,000[11] | none | 3,076,000 | [12] |
| | Hurrian | | | 12,500 | none | 12,500 | [13] |
| | Urartian | | 400 | 10,000 | none | 10,000 | |
| | Hittite | | | 700,000 | none | 700,000 | [14] |
| | Hattic | | | 500 | none | 500 | [15] |
| | Cuneiform Luwian | | | 3,000 | none | 3,000 | [16] |
| | Elamite | | 2,087 | 100,000 | none | 100,000 | [17] |
| | Protoelamic | | 1,435 | 20,000 | none | 20,000 | [18] |
| | Eblaite | | 16,000 | 300,000 | none | 300,000 | [19] |
| | Amorite | | 7,000 | 11,600 | none | 11,600 | [20] |
| | Ugaritic | | | 40,000 | none | 40,000 | [21] |
| | Old Persian | | | 7,000 | 100,000 | 107,000 | [22] |
| Canaanite and Aramaic | Ancient Hebrew (inc. Hebrew Bible) | | | 35,000 | 265,000 | 300,000 | [23][24] |
| | Aramaic (ancient, imperial, biblical, Hasmonean, Nabataean, Palmyrenean) | | | | | 100,000 | [25] |
| | Phoenician/Punic | | | 10,000 | 68[26] | | [27][28][29] |
| Old South Arabian | | | 10,500 | 112,500 | none | 112,500 | [30][31] |
| Etruscan | | | | 25,000 | | 25,000 | [32][33] |
There are a significant number of ancient micro-corpus languages, each attested by only a small body of text. Estimating the total number of attested ancient languages may be as difficult as estimating their corpus size. For example, Greek and Latin sources hand down an enormous number of foreign-language glosses, the reliability of which is not always certain.[59]
See also: Archival science and Conservation and restoration of cultural property
Preserving and maintaining ancient text corpora presents several challenges, including physical conservation, translation, and digitization. Many ancient texts have been lost over time, and those that survive may be damaged or fragmented. Translating ancient languages and scripts requires specialized expertise, and digitizing texts can be time-consuming and resource-intensive.
Main article: Corpus linguistics
The field of corpus linguistics studies language as expressed in text corpora. This includes the analysis of word frequency, collocations, grammar, and semantics. Ancient text corpora provide a valuable resource for corpus linguistics research, enabling scholars to explore the evolution of language and culture over time.
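Two of the analyses named above, word frequency and collocation, can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch rather than a description of any particular project's tooling: the helper names `word_frequencies` and `bigram_counts` are hypothetical, the tokens are toy data, and adjacent word pairs are used here as the simplest possible stand-in for collocations.

```python
from collections import Counter
from typing import List, Tuple


def word_frequencies(tokens: List[str]) -> "Counter[str]":
    """Count how often each word form occurs (hypothetical helper)."""
    return Counter(tokens)


def bigram_counts(tokens: List[str]) -> "Counter[Tuple[str, str]]":
    """Count adjacent word pairs, a crude proxy for collocations."""
    return Counter(zip(tokens, tokens[1:]))


# Toy token sequence, not drawn from any real corpus.
tokens = ["king", "of", "kings", "king", "of", "lands"]

frequencies = word_frequencies(tokens)
pairs = bigram_counts(tokens)

# The most frequent word forms and the most frequent adjacent pair.
top_words = frequencies.most_common(2)
top_pair = pairs.most_common(1)[0]
```

Real corpus work would normally add lemmatization and a statistical association measure before treating a frequent pair as a collocation; raw counts are only the starting point.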