Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2024.07.03 edition

Newswire articles, death penalty status by country, beach replenishment, hurricane forecast accuracy, and UK film stats.

Historical newswire articles. Emily Silcock et al. have created Newswire, a dataset of 2.7 million newswire articles published in the US between 1878 and 1977. To build it, they extracted 138 million articles from scans of newspapers’ front pages and then used machine learning to group those coming “from the same underlying newswire source article, in the presence of significant abridgement and noise.” For each detected newswire item, the dataset lists the newspapers that carried it, dates published, the text of a representative version, its extracted byline, dispatch location, people mentioned in the text, general topic, and more. Previously: American Stories (DIP 2023.09.13), a dataset of historical newspaper articles — also from Melissa Dell’s research group. [h/t Robin Sloan]

Death penalty status by country. The Comparative Death Penalty Database, compiled by Carsten Anckar and Thomas Denk, tracks the status of capital punishment in 206 independent countries annually from 1800 to 2022. It places each observation into one of five categories, indicating whether the death penalty is: (a) fully abolished, (b) abolished “for ordinary crimes only,” (c) abolished “for ordinary crimes only but where at least one execution has occurred in the last 10 years,” (d) de facto abolished, or (e) still in use. Previously: The Death Penalty Information Center’s database of US executions (DIP 2019.05.15); data on death sentences from The Intercept (DIP 2019.12.11) and from Brandon L. Garrett (DIP 2018.08.01).

Beach replenishment. The Program for the Study of Developed Shorelines at Western Carolina University maintains a database of 2,500+ beach-replenishing efforts since the 1920s. The project is “a 25-year research and data collection effort that, to the best of our knowledge, represents the most comprehensive compilation of beach nourishment history in the United States.” For each sand-adding “episode,” the dataset indicates its location, year completed, sand volume, length of shoreline treated, total cost, primary type of funding source (private, federal, state, etc.), and justification (shore protection, navigation, emergency dune construction, etc.). As seen in: “Sand Dollars,” by CBS News Investigations.

Hurricane forecast accuracy. The National Hurricane Center says it “receives frequent inquiries on the accuracy and skill of its forecasts and of the computer models available to it.” To help answer those questions, the agency publishes a series of regularly updated verification reports, as well as a database quantifying its forecast errors. For each official projection since 1970, the database compares each storm’s predicted location and wind speed to those attributes’ ultimate values. As seen in: “The Social Value of Hurricane Forecasts,” a study by Renato Molina and Ivan Rudik.

UK film stats. The British Film Institute publishes a variety of statistical reports, including spreadsheets of weekend box office figures. Those spreadsheets cover each weekend’s 15 highest-grossing films, all UK-originated films, and other newly released films; they list each film’s title, country of origin, distributor, cinema count, weekend gross, total gross to date, and more. [h/t Gina Acosta Gutiérrez]