Project Gutenberg
A portrait of literary history through 79,491 digitized works
The Project Gutenberg collection contains 79,491 works by 25,942 authors, written in 119 languages and spanning subjects from ancient philosophy to pulp science fiction. This analysis traces the contours of this remarkable archive.
The Shape of Literary Production
The archive reveals a striking pattern: authors born in the 1860s dominate the collection, contributing 9,430 works—more than any other decade. This peak reflects the intersection of Victorian-era productivity, the public domain threshold, and Gutenberg's digitization priorities.
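The decade tally behind a claim like this amounts to flooring each author's birth year to its decade and summing works. Below is a minimal Python sketch of that step, assuming nothing about how the original analysis was coded. Its input is only the ten authors listed in the productivity table later in this piece, not the full archive, and the name `works_by_birth_decade` is illustrative.

```python
from collections import Counter

# (author, birth year, works) for the ten authors in the productivity table.
# This is a small sample, NOT the full 79,491-work archive.
SAMPLE = [
    ("Shakespeare", 1564, 334),
    ("Dickens", 1812, 197),
    ("Twain", 1835, 251),
    ("Bulwer-Lytton", 1803, 226),
    ("Balzac", 1799, 159),
    ("Ebers", 1837, 177),
    ("London", 1876, 114),
    ("Stevenson", 1850, 114),
    ("Dumas", 1802, 165),
    ("Verne", 1828, 177),
]


def works_by_birth_decade(rows):
    """Sum works per decade of author birth (e.g. 1835 -> 1830)."""
    totals = Counter()
    for _author, birth_year, works in rows:
        totals[birth_year // 10 * 10] += works
    return totals


totals = works_by_birth_decade(SAMPLE)
peak_decade, peak_works = totals.most_common(1)[0]
print(f"Peak decade in this sample: {peak_decade}s ({peak_works} works)")
```

On this ten-author sample the peak falls in the 1830s (Twain and Ebers together), which says nothing about the archive-wide 1860s peak; the sketch only shows the binning step.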
The Dominance of English
English accounts for 76% of all works—an overwhelming majority that reflects both the project's American origins and the global reach of English-language publishing. Yet beneath this dominance lies unexpected variety.
| Language | Works |
|---|---|
| English | 60,693 |
| French | 3,973 |
| Finnish | 3,313 |
| German | 2,324 |
| Italian | 1,056 |
| Dutch | 1,046 |
| Spanish | 885 |
| Portuguese | 647 |
Finnish ranks third—a striking overrepresentation for a language spoken by 5 million people, explained by Finland's active digitization community and the richness of its 19th-century literary tradition.
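The percentage shares can be reproduced from the counts above. This short Python sketch assumes the 76% figure divides by the 79,491 works in the collection; the original analysis may have used a different denominator, and the helper name `share_percent` is illustrative.

```python
TOTAL_WORKS = 79_491  # collection size stated in the text (assumed denominator)

# Works per language, copied from the table above.
WORKS_BY_LANGUAGE = {
    "English": 60_693,
    "French": 3_973,
    "Finnish": 3_313,
    "German": 2_324,
    "Italian": 1_056,
    "Dutch": 1_046,
    "Spanish": 885,
    "Portuguese": 647,
}


def share_percent(count, total=TOTAL_WORKS):
    """Return count as a percentage of total."""
    return 100 * count / total


shares = {lang: share_percent(n) for lang, n in WORKS_BY_LANGUAGE.items()}
for lang, pct in shares.items():
    print(f"{lang:<12}{pct:5.1f}%")
```

Under that assumption English comes out at about 76%, matching the figure in the text, while Finnish sits a little above 4%.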
Fiction and Non-Fiction by Language
Languages differ markedly in their fiction-to-nonfiction ratios. Finnish and Hungarian collections are predominantly fiction (61% and 60% respectively), while Portuguese leans heavily toward non-fiction (only 22% fiction).
The Emergence of Genres
Genres can be dated by the average birth year of their authors. Historical fiction (avg. author born 1833) represents the oldest tradition; science fiction (1906) is distinctly modern, its authors born 73 years later on average.
The Science Fiction Explosion
Science fiction's growth is dramatic. Authors born in the 1910s contributed 816 works—up from just 30 by those born in the 1850s. The genre essentially crystallized in the early 20th century.
Literary Productivity
Shakespeare leads both in absolute output (334 works) and in works per year of life (6.42), but the per-year measure reorders the authors behind him: Dickens (197 works) edges past Twain (251), and Jack London, dead at 40, produced 2.85 works per year with one of the smallest totals in the table.
| Author | Works | Life | Per year |
|---|---|---|---|
| Shakespeare | 334 | 1564–1616 | 6.42 |
| Dickens | 197 | 1812–1870 | 3.40 |
| Twain | 251 | 1835–1910 | 3.35 |
| Bulwer-Lytton | 226 | 1803–1873 | 3.23 |
| Balzac | 159 | 1799–1850 | 3.12 |
| Ebers | 177 | 1837–1898 | 2.90 |
| London | 114 | 1876–1916 | 2.85 |
| Stevenson | 114 | 1850–1894 | 2.59 |
| Dumas | 165 | 1802–1870 | 2.43 |
| Verne | 177 | 1828–1905 | 2.30 |
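The "Per year" column appears to be works divided by lifespan, with lifespan taken as death year minus birth year. A minimal Python sketch under that assumption, using only the rows of the table above (the function name `works_per_year` is illustrative):

```python
# (author, works, birth year, death year), copied from the table above.
AUTHORS = [
    ("Shakespeare", 334, 1564, 1616),
    ("Dickens", 197, 1812, 1870),
    ("Twain", 251, 1835, 1910),
    ("Bulwer-Lytton", 226, 1803, 1873),
    ("Balzac", 159, 1799, 1850),
    ("Ebers", 177, 1837, 1898),
    ("London", 114, 1876, 1916),
    ("Stevenson", 114, 1850, 1894),
    ("Dumas", 165, 1802, 1870),
    ("Verne", 177, 1828, 1905),
]


def works_per_year(works, born, died):
    """Works divided by lifespan in years (death year minus birth year)."""
    return works / (died - born)


rates = {name: works_per_year(w, b, d) for name, w, b, d in AUTHORS}

# Print authors from highest to lowest rate, rounded to two decimals.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14}{rate:5.2f}")
```

Because the divisor is the whole lifespan rather than the writing career, the measure favors authors who died young; dividing by active years instead would need data this table does not include.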
Brief Lives, Lasting Words
Some of literature's most enduring voices were silenced early. Wilhelm Hauff died at 25, yet left 26 works. Stephen Crane lived 29 years; Shelley and Robert E. Howard just 30. Byron produced 32 works before dying of a fever at 36.
Author Longevity Over Time
Authors' lifespans have increased steadily. Writers born in the 1500s lived a median of 63 years; by the 1900s, this had risen to 77. The modal lifespan cluster is 70–79 years (5,228 authors), with surprisingly many reaching their 90s.
The Library of Congress View
Using Library of Congress classification codes, American literature (PS) narrowly leads English literature (PR). Juvenile fiction (PZ) claims third place—a reminder that children's literature was a massive Victorian industry.
What They Wrote About
Science fiction leads all subject headings with 3,208 works—a testament to Gutenberg's pulp-era acquisitions. Short stories, adventure, and detective fiction follow. The archive is, above all, a treasury of popular fiction.
This analysis draws on four linked datasets: metadata for 79,491 works, biographical records for 26,077 authors, 76,205 language assignments, and 255,312 subject classifications. The data reflects not just literary history but the history of digitization itself—what we chose to preserve, and when.
Data: TidyTuesday 2025-06-03 / Project Gutenberg