America’s Public Bible is unusual, at least in comparison to conventional historical scholarship, as a work of history that is based on machine learning and data analysis and that presents itself as an interactive scholarly work. Within the field of digital history, innovations in research technique or form have often spurred methodological discussion, and this project will be no different. However, the methodological discussion below is not simply about programming, statistics, and web development. Rather, I hope to show how those kinds of techniques are only a part of a broader humanistic methodology, which I am going to call disciplined serendipity. Accordingly, this discussion of methodology has three parts: how the biblical quotations were identified and analyzed; why this website takes the form that it does; and how these kinds of digital history approaches amount to a method for learning about the past.1
The newspaper corpora
The starting place for this project is Chronicling America: Historic American Newspapers. Created by the Library of Congress, this project makes available over 15 million newspaper pages, the vast majority of which date from the 1830s to the 1920s. The Library of Congress makes the data publicly available for both conventional and computational historical research.
For this project, I downloaded the entirety of the Chronicling America collection that was available at the time: 14,985,326 newspaper pages, to be exact. (The National Digital Newspaper Program adds pages continually, and so not everything that is in Chronicling America currently has been used for this site.) Chronicling America offers well-documented APIs: the newspaper metadata comes from the metadata API, and the OCR plain text of the newspaper pages was downloaded via the OCR Bulk Data API.
An additional corpus of newspaper articles provides confirmation of the methods used on Chronicling America. Gale’s Nineteenth Century U.S. Newspapers (hereafter NCNP) contains 19.6 million articles from the nineteenth-century United States. In some ways the contents of this collection overlap with Chronicling America, but it also contains distinct sources. Just as important, the texts are segmented into articles rather than pages, and they were independently converted to plain text via optical character recognition (OCR). This collection was licensed by the George Mason University libraries for text analysis, and trend lines of rates of quotations from NCNP appear throughout the site. However, given the limitations of this license, which preclude reproducing the context of the quotations, this corpus is included mostly as a point of comparison with the trend lines from Chronicling America.
Finally, although it is not reproduced on this site, the machine-learning models below were also run on a corpus of Civil War-era sermons gathered by Jimmy Byrd for his book, A Holy Baptism of Fire and Blood: The Bible and the American Civil War. The full process by which quotations were identified in that corpus is explained in an appendix to the book.2 However, the fact that a meaningful set of quotations was identified from a very different corpus, containing a different kind of text altogether, demonstrates the success of the machine-learning model in identifying biblical quotations.
Given those corpora, I created a machine-learning model that predicts whether a particular verse from the Bible was quoted or alluded to on a particular newspaper page, in the case of Chronicling America, or in a particular article, in the case of NCNP. Below is a brief, mostly non-technical explanation of how this works. However, I have included certain statistical details, and the code for the machine-learning model as well as the exploratory notebooks for the model are available in this project’s GitHub repository.
How the quotations were identified
The task of identifying the quotations had essentially two steps: measuring certain aspects of the text which indicate that a biblical quotation might be present, and then distinguishing between potential and genuine matches. Those two tasks could be called feature extraction and prediction.
To begin, each verse in the Bible is turned into tokens, or n-grams. Take, for instance, the first verse of the Bible: “In the beginning God created the heaven and the earth” (Genesis 1:1 KJV). This verse would be turned into tokens ranging from three to five words long, skipping stop words (such as an, of, the), which convey little meaning. So the text of Genesis 1:1 becomes tokens like these:
"god created heaven"
"beginning god created heaven earth"
"beginning god created"
"created heaven earth"
"beginning god created heaven"
"god created heaven earth"
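This tokenization can be sketched in Python. (The project itself is implemented in R; the stop-word list here is a minimal, hypothetical one for illustration.) Run on Genesis 1:1, it yields exactly the six tokens listed above.

```python
def tokenize(text, min_n=3, max_n=5,
             stop_words=frozenset({"in", "the", "and", "a", "an", "of"})):
    """Lowercase a text, strip punctuation, drop stop words, and emit
    every n-gram of the remaining words from min_n to max_n words long."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    words = [w for w in words if w and w not in stop_words]
    tokens = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens

print(tokenize("In the beginning God created the heaven and the earth"))
```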
The tokens that were created for this project used the King James (or Authorized) Version, the American Standard Version, the Revised Version, and the Douay-Rheims Version, including the Apocrypha as well as the Old and New Testaments in most instances.3 Below I will describe how these different versions were distinguished from one another.
These tokens then were used to create a document-term matrix, where the rows are the Bible verses, the columns are the tokens, and the cells indicate how many times that token appears in that verse. For instance, a subset of the Bible matrix might look like this:
| | beginning god created | god created heaven | without form void |
| --- | --- | --- | --- |
| Genesis 1:1 (KJV) | 1 | 1 | 0 |
| Genesis 1:2 (KJV) | 0 | 0 | 1 |
| Genesis 1:3 (KJV) | 0 | 0 | 0 |
Then each newspaper page or article is turned into tokens using the exact same function that was used for tokenizing the Bible. This creates a second matrix where the rows are newspaper pages, the columns are three-to-five-word tokens from the Bible, and the cells indicate how many times that string of words from the Bible is found on the newspaper page. Although the newspapers include vastly more possible tokens than are found in just the Bible, this method restricts the number of columns in the newspaper document-term matrix to the same as in the Bible matrix.
For instance, a sample of the newspaper matrix might look like this:

| | beginning god created | god created heaven | without form void |
| --- | --- | --- | --- |
| page_A | 1 | 1 | 0 |
| page_B | 0 | 0 | 1 |
| page_C | 0 | 0 | 0 |
Because the two matrices share a dimension, the Bible matrix can be multiplied by the transpose of the newspaper matrix. The result is a matrix with Bible verses in the rows and newspaper pages in the columns. The numbers in the cells of the matrix indicate how many tokens from that verse were found on that newspaper page. So in the sample matrix below, page_A shares two tokens with Genesis 1:1 and page_B shares one token with Genesis 1:2, indicating that those verses might appear on those pages. Of course the vast majority of cells in the resulting matrix are zeros.
| | page_A | page_B | page_C |
| --- | --- | --- | --- |
| Genesis 1:1 (KJV) | 2 | 0 | 0 |
| Genesis 1:2 (KJV) | 0 | 1 | 0 |
| Genesis 1:3 (KJV) | 0 | 0 | 0 |
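Using toy versions of the matrices above, the multiplication can be sketched with NumPy. (This is an illustration only; the project’s actual implementation is in R.)

```python
import numpy as np

# Bible document-term matrix: rows are verses, columns are shared tokens
# ("beginning god created", "god created heaven", "without form void").
bible = np.array([
    [1, 1, 0],  # Genesis 1:1 (KJV)
    [0, 0, 1],  # Genesis 1:2 (KJV)
    [0, 0, 0],  # Genesis 1:3 (KJV)
])

# Newspaper document-term matrix: rows are pages, same token columns.
pages = np.array([
    [1, 1, 0],  # page_A
    [0, 0, 1],  # page_B
    [0, 0, 0],  # page_C
])

# Multiplying the Bible matrix by the transpose of the newspaper matrix
# yields a verses-by-pages matrix of shared token counts: page_A shares
# two tokens with Genesis 1:1, and page_B shares one with Genesis 1:2.
shared = bible @ pages.T
print(shared)
```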
The multiplication of these document-term matrices is the primary means of finding potential matches, but the token count is only one of the features of potential matches that I measured. I tried four different features in training the model. Each can be matched to a common-sense understanding of what a quotation looks like on a newspaper page.
- Token count: the number of tokens from a particular verse that appear on a particular newspaper page. The more words from a Bible verse that are present on a newspaper page, the more likely that newspaper quotes that verse.
- TF-IDF: Not every token contains the same amount of information about whether a newspaper page contains a particular biblical verse. For instance, the phrase “went into the city” could be a quotation from a dozen or more Bible verses, but it might just as well be any English sentence. But the phrase “through a glass, darkly” is obviously a reference to 1 Corinthians 13:12. By weighting the matching tokens according to their term frequency-inverse document frequency, more significant terms count for more in determining a match.
- Proportion: Bible verses vary in length, from just two words (“Jesus wept” [John 11:35] and “Rejoice evermore” [1 Thessalonians 5:16]) to the longest, Esther 8:9, which has ninety words in the King James Version. (In fact, Esther 8:9 appears in several Chronicling America newspapers as the punchline of a joke about a “boy who boasted of his wonderful memory.”4) This feature measures what proportion of the entire verse is found on the page.
- Runs test: Where the matching tokens appear on the page is as important as how many matches there are. If the tokens appear widely scattered across the page, then they are likely to be just random matches to unimportant phrases. If the tokens are all clustered right next to each other (perhaps with a few gaps for incorrect OCR), then they are likely to be a quotation from the verse. This feature uses a statistical test to determine whether the sequence of matches (called a “run”) is random or not.
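The first three features can be sketched as follows. (An illustrative simplification in Python: the `verse_doc_freq` weighting here stands in for a full TF-IDF computation, and the more expensive runs test is omitted.)

```python
import math

def match_features(verse_tokens, page_tokens, verse_doc_freq, n_verses):
    """Measure three features of a candidate verse/page match."""
    matches = [t for t in verse_tokens if t in page_tokens]
    # Token count: how many of the verse's tokens appear on the page.
    token_count = len(matches)
    # TF-IDF-style weight: rare tokens (those appearing in few Bible
    # verses) count for more than commonplace phrases.
    tfidf = sum(math.log(n_verses / verse_doc_freq[t]) for t in matches)
    # Proportion: how much of the verse's token set is present on the page.
    proportion = token_count / len(verse_tokens) if verse_tokens else 0.0
    return token_count, tfidf, proportion

# A single rare phrase like "Quench not the Spirit" yields one match
# with a high weight, even though the token count is low.
print(match_features(["quench not spirit"], {"quench not spirit"},
                     {"quench not spirit": 1}, 100))
```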
After having created a technique for extracting these features from newspapers, I ran the feature extraction on a random selection of thousands of newspaper pages. The goal was to create some sample data knowing that it contained many potential quotations, only some of which were genuine quotations.
After measuring the potential matches, we need a means of distinguishing between accurate matches and false positives. This is a difficult problem because of the way that the Bible was quoted in newspapers (or indeed, used more generally). If we were looking only for complete quotations, then we would look for candidates with many matching tokens, or where a high proportion of the matching verse is present on the page. But quotations can often be highly compressed. A single unusual phrase (“Quench not the Spirit” [1 Thessalonians 5:19] or “Remember Lot’s wife” [Luke 17:32] or “She hath done what she could” [Mark 14:8] or “The Lord called Samuel” [1 Samuel 3:6]) may be enough to identify a quotation, while even a half dozen commonplace matching phrases might not indicate one. Then too, allusions sometimes function by changing the actual words while retaining the syntax or cadence of a familiar verse.
Rather than specify arbitrary thresholds, a more accurate approach is to teach an algorithm to distinguish between quotations and noise by showing it what many genuine matches and false positives look like. (Hence the term machine learning.) After taking a sample of potential matches, I labeled a couple of thousand of them as genuine quotations or noise, and that data was separated into training and testing sets. Labeled data makes it possible to observe patterns in the features that have been measured. The chart below, for instance, shows that genuine matches tend to have a much higher token count and a much higher TF-IDF score. But it is not possible to draw a single line on either chart which cleanly distinguishes between all genuine matches and all false positives.
I then used that data to train and test an array of machine-learning models, varying the techniques for setting model parameters and, especially, the combination of features used. Each model takes the predictors mentioned above and assigns a candidate match a class (“quotation” or “noise”) and a probability that the classification is correct. While I evaluated a number of models, including random forests, support vector machines, neural networks, and ensembles of other models, a comparatively simple logistic classifier had the best performance and the most understandable characteristics.5
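A logistic classifier of this kind can be sketched with scikit-learn. (The project’s model was trained in R on hand-labeled data; the features below are synthetic, generated only to illustrate the shape of the task.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 500

# Synthetic candidates: genuine quotations tend to have higher token
# counts, TF-IDF scores, and proportions than noisy matches.
noise = np.column_stack([rng.poisson(2, n), rng.gamma(2.0, 1.0, n),
                         rng.uniform(0.0, 0.2, n)])
genuine = np.column_stack([rng.poisson(10, n), rng.gamma(8.0, 2.0, n),
                           rng.uniform(0.3, 1.0, n)])
X = np.vstack([noise, genuine])
y = np.array([0] * n + [1] * n)  # 0 = noise, 1 = quotation

# The fitted classifier returns both a class and a probability
# that the classification is correct.
model = LogisticRegression(max_iter=1000).fit(X, y)

candidate = [[12, 15.0, 0.6]]  # high token count, TF-IDF, and proportion
print(model.predict(candidate), model.predict_proba(candidate)[0, 1])
```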
To determine which of the models I trained had the best performance, I measured the area under the receiver operating characteristic curve. The idea there is simple: the best classifier is the one that maximizes the number of genuine matches while minimizing the number of false positives. Ultimately I chose a model which used the token count, TF-IDF, and proportion predictors. This model had slightly less predictive performance than one which included the runs test as well, but the runs test was far more computationally expensive to measure than the other three predictors, and eliminating it was well worth the very tiny dip in predictive ability.
Having selected the best model, an additional question arose. When the machine-learning model judges whether a potential quotation is genuine, it returns a probability for that judgment: a high probability means there is almost certainly a quotation; a low probability means it is almost certainly noise. The question is where to set the threshold for distinguishing between quotations and noise. This is a tradeoff between precision and recall, or between sensitivity and specificity, to use the technical terms. In everyday language, the tradeoff is between finding as many genuine matches as possible and keeping out spurious ones: casting a wider net keeps more genuine quotations but also returns more false positives. In the prototype version of this site, I used a very high threshold (a probability of 0.9). The intention was to eliminate as many false positives as possible, but the downside was that it left many genuine matches undetected. For the current version of the site, I used Youden’s J statistic, which is intended to find the threshold that balances sensitivity and specificity. Using that statistic, I settled on a probability threshold of 0.58. Note that the tables of quotations on this site display the probabilities for quotations as “lower,” “medium,” or “higher.” This design choice is intended to draw users’ attention to the probabilistic nature of finding quotations.
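The J statistic (Youden’s J) picks the cutoff that maximizes sensitivity + specificity − 1. A sketch, using toy labels and probabilities rather than the project’s actual data:

```python
def youden_threshold(y_true, y_prob):
    """Pick the probability cutoff that maximizes Youden's J statistic,
    J = sensitivity + specificity - 1."""
    best_j, best_t = -1.0, 0.5
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t

# Toy example: noise clusters at low probabilities, quotations at high ones.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
probs = [0.05, 0.1, 0.2, 0.3, 0.4, 0.55, 0.6, 0.7, 0.85, 0.95]
print(youden_threshold(labels, probs))
```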
Finally, I measured the performance of the model. The performance was measured using testing data which had been held back from the training data, so the model could not have learned from this smaller set of labeled data. It thus represents a genuine test of the model’s accuracy. One way of representing the model’s results is with a confusion matrix. A confusion matrix shows how many potential quotations in the testing dataset were predicted to be genuine or noise, and how many of those predictions were correct or incorrect. Below is the confusion matrix for the testing data for the model used for this site.
There are a number of measures of a prediction model’s accuracy. Below are the most relevant ones as measured on the testing data for the model used on this site.
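A confusion matrix and the accuracy measures derived from it can be sketched as follows. (The counts here are hypothetical, not the actual test results for this site.)

```python
def confusion_and_metrics(y_true, y_pred):
    """Tabulate a 2x2 confusion matrix and the measures derived from it."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "confusion": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # of predicted quotations, share genuine
        "recall": tp / (tp + fn),     # of genuine quotations, share found
    }

# Hypothetical test-set labels and predictions.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(confusion_and_metrics(actual, predicted))
```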
Having trained the model, the next task was to run it across all of the newspaper pages in Chronicling America and NCNP. I did this using George Mason University’s high-performance computing cluster. The result was millions of predicted quotations, stored in a PostgreSQL database. Exports of these derivative datasets are available for download from this site.
These quotations underwent a series of four further steps to clean them.
First, because of the way that the prediction model worked, if a verse was on a newspaper page, the model would return quotations from all of the versions of the English Bible that I was using. It is possible, of course, that a newspaper page could quote the same verse in different versions. Sometimes, for instance, newspapers published detailed comparisons between the King James Version and newly translated English versions.6 However, it is a safe assumption that in the vast majority of cases the newspaper quoted only a single version. The question then becomes, which version was quoted? I first eliminated any anachronisms, based on the date of the newspaper issue: a newspaper could not have quoted a version of the Bible that did not yet exist. Then, because the King James Version remained dominant throughout the period under study, I assumed it was the version quoted unless the probability predicted for a different version was higher. I only ever kept one version of a verse on the same page.
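That version-selection rule can be sketched as follows. (The actual implementation is in R; the version dates below are approximate simplifications — the Revised Version, for example, appeared in stages between 1881 and 1885 — and the probabilities are hypothetical.)

```python
# Approximate first-publication dates for each translation (a hypothetical
# simplification for illustration).
VERSION_DATES = {"KJV": 1611, "Douay-Rheims": 1610, "RV": 1885, "ASV": 1901}

def choose_version(probabilities, issue_year):
    """Keep one version per verse per page: drop versions that did not yet
    exist, then prefer the KJV unless another version scores higher."""
    viable = {v: p for v, p in probabilities.items()
              if VERSION_DATES[v] <= issue_year}
    if not viable:
        return None
    best = max(viable, key=viable.get)
    if "KJV" in viable and viable[best] <= viable["KJV"]:
        return "KJV"
    return best

print(choose_version({"KJV": 0.7, "RV": 0.75}, 1890))  # RV scores higher
print(choose_version({"KJV": 0.7, "ASV": 0.9}, 1880))  # ASV anachronistic in 1880
```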
Second, many verses in the Bible are similar to one another. A good example is “Suffer little children to come unto me,” which appears in somewhat different versions in Luke 18:16, Matthew 19:14, and Mark 10:14. That particular example is drawn from the synoptic Gospels, which are full of examples of intertextual borrowing. But the New Testament also quotes or alludes to the Old Testament frequently, and parts of the Old Testament are reproduced across different books of the Bible, among other examples of borrowings. Thus, the model would sometimes identify a newspaper page as quoting, to use the example above, Luke 18:16, Matthew 19:14, and Mark 10:14, when in fact only one of them had been quoted. In such cases, I have deduplicated these quotations. For example, all three of the texts mentioned are referred to as Luke 18:16, and only one occurrence per page or article is counted.
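The deduplication step can be sketched with a lookup table of parallel passages. (The table here contains only the example above; the actual set of intertextual borrowings is much larger.)

```python
# Hypothetical table mapping parallel passages to one canonical verse.
PARALLELS = {
    "Matthew 19:14": "Luke 18:16",
    "Mark 10:14": "Luke 18:16",
}

def deduplicate(quotations):
    """Collapse quotations of parallel passages to a canonical reference,
    counted only once per page."""
    seen = set()
    result = []
    for page, verse in quotations:
        canonical = PARALLELS.get(verse, verse)
        if (page, canonical) not in seen:
            seen.add((page, canonical))
            result.append((page, canonical))
    return result

print(deduplicate([("page_1", "Matthew 19:14"),
                   ("page_1", "Mark 10:14"),
                   ("page_2", "Luke 18:16")]))
# → [('page_1', 'Luke 18:16'), ('page_2', 'Luke 18:16')]
```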
Third, some verses of the Bible do not have significant content or use perfectly routine phrases, and so they are extremely unlikely to have been the subject of quotation. For example, a Bible verse might describe someone as entering a city; that identical phrase could be used in a thousand ways in a newspaper without any of them being a quotation of or allusion to the Bible. Another common source of false positives was biblical lists (“first, second, etc.”). In such instances, I simply removed the matches as neither accurate nor relevant.
Finally, there is the problem of optical character recognition. The method that I have created uses the plain-text versions of the newspaper pages from Chronicling America, which were created via OCR.7 The quality of the OCR text in Chronicling America (and, to a much lesser extent, in NCNP) is uneven. Some newspaper pages have excellent, nearly word-perfect OCR; others have OCR that is completely unusable; many, of course, fall somewhere in between. I used a very simple measure of OCR quality for each newspaper page: in essence, the percentage of words on the page that can be found in a dictionary of known English words. I then chose a cutoff which eliminated newspaper pages where the quality of the OCR was so poor that the algorithm had no real chance of finding a quotation. I eliminated those pages from consideration in two senses: they were not counted when computing the rate of quotations for verses, and any quotations detected on them were discarded.
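The OCR-quality measure can be sketched as a dictionary-word fraction. (The miniature dictionary and the cutoff value here are hypothetical; the real measure uses a full English word list.)

```python
def ocr_quality(page_text, dictionary):
    """Estimate OCR quality as the share of words on the page that
    appear in a dictionary of known English words."""
    words = [w.strip(".,;:!?\"'").lower() for w in page_text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(1 for w in words if w in dictionary) / len(words)

# Hypothetical miniature dictionary for illustration.
dictionary = {"the", "lord", "called", "samuel", "and", "he", "answered"}
good = ocr_quality("The Lord called Samuel and he answered", dictionary)
bad = ocr_quality("Tbe L0rd cal1ed Samnel", dictionary)  # garbled OCR
print(good, bad)  # pages below some cutoff would be dropped entirely
```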
Trend lines and individual quotations
This site takes advantage of the dataset created by the machine-learning model in several different ways. Each of those ways can be thought of as a process of aggregation from individual instances of a quotation to the trend.
In several places on this site, individual instances of quotations are made available to the users. One place where that happens is in the gallery of quotations. These quotations are the most interesting quotations that I have found in the course of my research. Examples that are interesting and (if the demeanor of serious scholarship will permit me to say so) fun are the primary criteria for inclusion in this collection. However, there is a purpose to including these quotations in that way. They show something of the range of possibilities for biblical quotations. With millions of quotations, it is not possible to examine each one, so this gallery gives a sense of the quotations which are at the base of this project. And each of these is, in itself, a text or primary source that could be of scholarly interest even apart from the broader trends.
The second place where one can find individual instances of quotations is in the tables that appear in the verse viewers. These tables contain all the predicted quotations for a given verse from Chronicling America.8 They are organized chronologically and include the date of the newspaper and a link to where you can see the quotation in context at Chronicling America. These tables allow users to see the quotations in their context. Using them in this way is the basis of much of the writing on this site.
The other form in which quotations are presented on this site is the trend lines, which show the rate of quotations over time. It is important to show the rate of quotation, rather than the simple number of quotations detected, because the number of newspaper pages in the corpus grows over time; using the rate normalizes the data and makes change over time visible. It is thus possible to see not just how often a verse was quoted, but when it rose and fell in popularity. I caution users against reading a great deal into the overall magnitude of the rate: it is really the change over time that is significant. Change over time, however, is a fundamental scholarly primitive, especially for historians.9
In the verse viewer, I show the trend lines from 1836 to 1922, because there are many fewer newspapers in the corpus before and after those dates. (The 1922 endpoint is especially significant, since it marks the date before which copyright has expired.) Otherwise, even a few instances of a quotation can have a disproportionate influence on the rate.10 Another thing to note about these charts is that they show a centered rolling average over five years, so that the trends are less spiky.
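The smoothing can be sketched as a centered five-year rolling average of the quotation rate. (Toy data only; at the endpoints the window is truncated.)

```python
def rolling_rate(quotations_per_year, pages_per_year, window=5):
    """Quotations per page for each year, smoothed with a centered
    rolling average; the window is truncated at the endpoints."""
    years = sorted(pages_per_year)
    rates = {y: quotations_per_year.get(y, 0) / pages_per_year[y]
             for y in years}
    half = window // 2
    smoothed = {}
    for i, y in enumerate(years):
        span = years[max(0, i - half): i + half + 1]
        smoothed[y] = sum(rates[s] for s in span) / len(span)
    return smoothed

# Toy data: a burst of quotations in 1902 against a constant page count.
pages = {y: 100 for y in range(1900, 1905)}
quotes = {1902: 10}
print(rolling_rate(quotes, pages))
```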
Finally, a different kind of aggregation is the availability of the datasets themselves. This site contains downloads of the key datasets that went into the creation of this site, as well as the data that is created by it. This includes the training data used to create the prediction model. It also includes the cleaned-up dataset of quotations predicted, showing which verses appeared on which pages of Chronicling America and NCNP.
The code for this project is available in a single GitHub repository. The `bin/` directory contains the script that finds the quotations, as well as the trained machine-learning model (`prediction-model.rds`, 7.8 MB) and a payload containing a document-term matrix of the Bible and functions for tokenizing the newspaper pages (`bible.rda`, 119 MB). The `notebooks/` directory contains computation created along the way. The `website/` directory contains the code for the interactive visualizations on this website. Only one piece of the project is not contained within this repository: the data is served by RRCHNM’s general-purpose Data API, the source code for which is available in a separate repository.
Although the discussion of methods so far has focused primarily on the computational and technological, it is also possible to talk about the methods behind this project in a more humanistic way to understand how computational methods drawn from other disciplines can be used within humanities disciplines. In other words, how does one use computational methods to make a historical interpretation?
The abundance of digitized sources has already transformed historical research.11 As Lara Putnam has pointed out in her article, “The Transnational and the Text Searchable,” searching digitized collections is now a basic scholarly practice. Putnam argues that the ability to search for sources without the constraints of the national archive has allowed new angles of vision on transnational history, because “transnational approaches among historians did not become commonplace until technology radically reduced the cost of discovering information about people, places, and processes outside the borders of one’s prior knowledge.” Yet as Putnam points out, “Digital search makes possible radically more decontextualized research” and makes it possible to find examples of what we are looking for without a sense of its significance. To deal with this problem of context, Putnam observes that “computational tools can discipline our term-searching if we ask them to. By measuring proximity and comparing frequencies, topic modeling [or other text analysis methods, we might add] can balance easy hits with evidence of other topics more prevalent in those sources.”12
America’s Public Bible can be thought of as an interface (this website) on top of a search tool (the prediction model). Keyword search, after all, is itself an algorithmic process, albeit one that has become so common as to be unremarkable. The machine-learning prediction model behind this site is only slightly more exotic and arguably less complex than keyword searching. In essence, America’s Public Bible is a search for quotations.
But it is also an interface, designed with humanistic research principles in mind. The project creates serendipitous findings through computational history by surfacing sources that would otherwise go unnoticed. The project disciplines those searches by setting the results in a much broader chronological context in which the typical and the exceptional can be identified. This disciplined serendipity constitutes a method of approaching the past, and in this case for approaching a past relevant to history, religious studies, and other humanistic disciplines.
The ability to move between the trend line of a verse’s quotations and their locations in the actual primary sources is how the site enables disciplined serendipity. The serendipity lies in how the site surfaces hundreds of thousands of instances of quotations that the user can readily browse. This may be a subjective judgment, to be sure, but this has been the most fun project that I have ever worked on, because I am constantly surprised by the quotation finder—and not just that it works at all! For example, how could I have known to look for the moment when a Democratic newspaper thought Samuel Tilden had been elected in the disputed presidential race of 1876 and plastered the banner “The Lord called Samuel” across the paper (figure 5)? Biblical jokes are another frequent category that I did not expect.13 I take this as a sign that the method truly is serendipitous.
The second context is the place of the text on the newspaper page itself (see figure 4 above). This context allows the scholar to understand how the Bible verse was used. The Bible was a common yet contested text, and the fact that a verse was quoted does not show the meaning of that quotation. Take the trend for John 15:13 (“Greater love hath no man than this, that a man lay down his life for his friends”). This verse exploded in popularity around World War I, and looking at how the verse was used in specific newspapers confirms that it was popular because of obituaries. Investigating earlier uses of the verse shows that it was not associated with the military in any significant way until the Great War. It was more likely to be used to memorialize medical personnel who died taking care of people infected with cholera or yellow fever.
In other words, the contribution of this site in terms of method is not just to bring computational searching to historians. Arguably, historians already do that all the time through keyword searching. Rather it is to try a particularly useful kind of prediction (quotation identification) and to build on top of it a disciplined interface. That interface turns up unusual sources, but also contextualizes them both chronologically and on the page. Those contexts in turn enable historical inquiry, both as expressed on this site, and—I hope—also for its users.
For an earlier discussion of this project, please see Lincoln Mullen, “The Making of America’s Public Bible: Computational Text Analysis in Religious History,” in Introduction to Digital Humanities: Research Methods for the Study of Religion, edited by Christopher D. Cantwell and Kristian Petersen (DeGruyter, 2021), 31–52, https://doi.org/10.1515/9783110573022-003. Some of this discussion is taken from that chapter. ↩︎
James P. Byrd, A Holy Baptism of Fire and Blood: The Bible and the American Civil War (Oxford University Press, 2021), 303–308. ↩︎
I also attempted to use the Book of Mormon, Doctrine and Covenants, and the Pearl of Great Price, all of which are sacred scriptures for the Latter-day Saints. However, these texts frequently quote from or allude to the King James Version and they use language which is highly similar to the Bible. It was thus very difficult to distinguish computationally between the Bible and the Book of Mormon. These texts were also quoted very infrequently, as far as I can tell, in non-LDS newspapers. The resulting noise made the overall algorithm less effective, and I regretfully had to drop those texts from consideration. ↩︎
The prototype version of this site used a more complicated neural network as its classifier. ↩︎
See Peter J. Thuesen, In Discordance with the Scriptures: American Protestant Battles Over Translating the Bible (Oxford University Press, 1999). ↩︎
Again, licensing restrictions prevent reproducing the context of quotations from NCNP. ↩︎
John Unsworth, “Scholarly Primitives: What Methods Do Humanities Researchers Have in Common, and How Might Our Tools Reflect This?” (Humanities Computing: Formal Methods, Experimental Practice, King’s College, London, 2000). ↩︎
Call this the batting average phenomenon: it is easy to bat 1.000 when having only one plate appearance. ↩︎
Lara Putnam, “The Transnational and the Text-Searchable: Digitized Sources and the Shadows They Cast,” American Historical Review 121, no. 2 (2016): 377–402, https://doi.org/10.1093/ahr/121.2.377. Quotations at pp. 383, 392. See also Tim Hitchcock, “Confronting the Digital: or How Academic History Writing Lost the Plot,” Cultural and Social History 10, no. 1 (2013): 9–23, https://doi.org/10.2752/147800413X13515292098070; Ted Underwood, “Theorizing Research Practices We Forgot to Theorize Twenty Years Ago,” Representations 127, no. 1 (2014): 64–72, https://doi.org/10.1525/rep.2014.127.1.64. ↩︎
“Our own inability to get the joke is an indication of the distance that separates us from the workers of preindustrial Europe… When you realize that you are not getting something—a joke, a proverb, a ceremony—that is particularly meaningful to the natives, you can see where to grasp a foreign system of meaning in order to unravel it.” Robert Darnton, The Great Cat Massacre: And Other Episodes in French Cultural History (Basic Books, 1984), 77–78. ↩︎