Data Is Plural — 2024.04.03 edition

Power outages, European Parliament activity, candid animals, LLM data provenance, and deaths in plague-era London.

Power outages. Christa Brelsford et al. have compiled county-level estimates of the number of US customers experiencing power outages at 15-minute intervals from 2014 to 2023. The records come from Oak Ridge National Laboratory’s restricted-access Environment for Analysis of Geo-Located Energy Information (EAGLE-I), a “platform created to monitor electric utility customer outages from data gathered from public sources.” The data’s coverage has increased over time; by 2022, it represented 92% of customers in the 50 states, DC, and Puerto Rico. “The remaining 8% of customers belong to utilities which do not report outage information publicly in near-real time in a format that is currently accessible to EAGLE-I parsers,” the authors write. “These are most typically small, rural, municipal utilities which lack robust information technology infrastructure.”

European Parliament activity. Parltrack keeps tabs on 4,000+ active and prior members of the European Parliament, 23,000+ policy dossiers, 39,000+ votes, and much more. The project, launched in 2011, scrapes data from various official websites and links it together — so that you can see, for example, any given member’s dossiers, committee roles, and activities such as plenary speeches and proposed legislative amendments. Its bulk datasets are updated daily and include details beyond what the online interfaces offer. [h/t Stefan Marsiske]

Candid animals. The Labeled Information Library of Alexandria data repository is “intended as a resource for both machine learning (ML) researchers and those that want to harness ML for biology and conservation.” Its datasets include millions of images, mostly captured by motion-triggered cameras. Its North American Camera Trap Images dataset, for instance, “contains 3.7M camera trap images from five locations across the United States, with labels for 28 animal categories, primarily at the species level.” Read more: “Machine learning to classify animal species in camera trap images: Applications in ecology.” [h/t Corin Faife]

LLM data provenance. The Data Provenance Initiative “is a multi-disciplinary volunteer effort to improve transparency, documentation, and responsible use of training datasets for AI.” Its first release, the Data Provenance Collection, catalogs dozens of corpora used for fine-tuning large language models, as well as their component datasets’ names, task categories, known sources, licensing, various text metrics, and more. Related: Yang Liu et al.’s “Datasets for Large Language Models: A Comprehensive Survey,” accompanied by semi-structured descriptions of hundreds of training and evaluation datasets. [h/t u/cavedave]

Deaths in plague-era London. Death by Numbers, also known as the Bills of Mortality Project, aims to transcribe ~8,000 official weekly tallies of deaths in London published in the 1600s and 1700s. Initially focused on plague deaths, the reports expanded to “dozens of other causes of death, such as childbirth, measles, syphilis, and suicide, ensuring their continued publication for decades after the final outbreak of plague in England.” The project’s data are available to browse online, to download, and via API. [h/t Derek M. Jones + Cody Winchester]