Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2024.09.11 edition

The National Health and Nutrition Examination Survey, monthly crime trends, source code, Italian tax-to-charity allocations, and snakes.

Health and nutrition. Since 1999, CDC has been continuously fielding its National Health and Nutrition Examination Survey, interviewing and testing approximately 5,000 people in 15 different counties each year. The survey combines “demographic, socioeconomic, dietary, and health-related questions” with an “examination component” involving “medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.” Its public-access data files provide anonymized, respondent-level records and are currently available for surveys conducted through March 2020. As seen in: Catherine McDonough et al.’s dataset and interactive dashboard “exploring factors associated with prediabetes and diabetes mellitus among youth in the United States.”

Monthly crime trends. The Real-Time Crime Index, launched last week by a team of crime-data analysts, presents a “sample of reported crime data from hundreds of law enforcement agencies nationwide which mimics national crime trends with as little lag and the most accuracy possible.” Framed as a supplement to the FBI’s slow-to-update official statistics, the project provides monthly and rolling 12-month totals of reported crimes (using the FBI’s UCR Part I offense categories) for the nation, for individual cities, and for groups of cities by population size. You can download the data and see the sources for each of the 300+ local agencies in the national sample. Read more: “The Real-Time Crime Index Shows Declining Crime in 2024,” from project co-leader Jeff Asher’s newsletter.
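
For readers unfamiliar with the term, a rolling 12-month total is the sum of the most recent 12 monthly counts, recomputed each month. The Python sketch below shows one way to calculate it. The function name and the counts are invented for illustration, and this is not necessarily how the project itself computes its figures.

```python
from collections import deque


def rolling_12_month_totals(monthly_counts):
    """Return the trailing 12-month sum for each month.

    Months with fewer than 12 months of history get None.
    """
    window = deque(maxlen=12)  # holds at most the latest 12 counts
    running = 0
    totals = []
    for count in monthly_counts:
        if len(window) == 12:
            # The oldest month is about to fall out of the window.
            running -= window[0]
        window.append(count)
        running += count
        totals.append(running if len(window) == 12 else None)
    return totals


# Hypothetical monthly counts for 13 consecutive months (not real data).
counts = [10, 12, 9, 11, 13, 8, 10, 12, 9, 11, 13, 8, 7]
totals = rolling_12_month_totals(counts)
print(totals[-2:])
```

Each new month adds its count and drops the count from 12 months earlier, so the total smooths out seasonal swings that can make month-to-month comparisons misleading.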

Source code. Software Heritage, a nonprofit initiative collaborating with UNESCO, maintains “the largest public collection of source code in existence”: an archive tracking 20 billion source files and 4 billion code-commits from 317 million projects from a range of public software hosts (GitHub, GitLab, Bitbucket, npm, et cetera). Its Graph Dataset, which provides access to the archive’s content and internal relationships, is available via bulk downloads and APIs. [h/t Derek M. Jones]

Italian tax-to-charity allocations. Italy’s “five per thousand” program allows taxpayers to allocate 0.5% of their income tax to certain nonprofits, research institutions, and other social-benefit organizations. The country’s Ministry of Economy and Finance has published information about 2022’s beneficiaries, but initially did so only via PDFs. Earlier this year, the Liberiamoli tutti! initiative converted those PDFs into structured data that list each recipient organization’s name, tax ID, category, region, province, municipality, number of taxpayers choosing it, and amount of money allocated. The ministry has since added structured files of its own.

Snakes. SnakeDB — created by Sascha Steinhoff “after [he] accidentally stepped into a snake in South-East Asia” — provides downloadable data on the maximum size, fang position, pupil shape, mode of reproduction, and toxicity of thousands of species, drawn from a broad range of sources. As seen in: Oleksandra Oskyrko et al.’s ReptTraits database.