Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2024.07.24 edition

News homepages, human rights scores, commercial zones, national park species, and Australia shipwrecks.

News homepages, archived. Since launching in March 2022, homepages.news has archived millions of screenshots, performance audits, robots.txt files, accessibility trees, and hyperlink lists from the homepages of 1,100+ news sites. The open-source project, run by journalist Ben Welsh, provides bulk data for each of those assets. The screenshots themselves are stored on the Internet Archive; you can also view the latest screenshots from all the sites on one page. To date, the publications span 32 countries and 17 languages. Related: Welsh and volunteer Alex Garcia are using the robots.txt data to track which sites block OpenAI, Google AI, and Common Crawl — findings that have been cited widely.
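As a rough illustration of the kind of robots.txt check described above, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It is not the project's actual method. The crawler tokens (`GPTBot`, `Google-Extended`, `CCBot`), the `blocked_crawlers` helper, and the sample file are assumptions made for illustration, and the check covers only the site root.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens for the three crawlers named above;
# the project may track a different or longer list.
AI_CRAWLERS = {
    "OpenAI": "GPTBot",
    "Google AI": "Google-Extended",
    "Common Crawl": "CCBot",
}


def blocked_crawlers(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Return {crawler name: True if robots.txt disallows it at `url`}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        name: not parser.can_fetch(agent, url)
        for name, agent in AI_CRAWLERS.items()
    }


# Hypothetical robots.txt text, not taken from any real site.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    for name, blocked in blocked_crawlers(sample).items():
        print(f"{name}: {'blocked' if blocked else 'allowed'}")
```

Running the same function over each archived robots.txt file, grouped by site and date, would give a simple picture of which publishers block which crawlers over time.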

Human rights scores. The CIRIGHTS project aims “to create numerical measures for every internationally recognized human right for all countries of the world.” The team has developed a detailed guide to scoring each government’s record on dozens of such rights, including freedom of religion, women’s political rights, freedom from extrajudicial killings, the right to a fair trial, and “reasonable limits” on working hours. For each year from 1981 to 2021, the project’s scorers have rated each country on each right, generally on a three-point scale, based on information in the US State Department’s Country Reports on Human Rights Practices, Amnesty International’s annual reports, and similar sources. The resulting dataset includes those scores, as well as several summary metrics.

Commercial zones. Byeonghwa Jeong et al. have constructed a dataset estimating the geographic boundaries of 23,000+ commercial zones in 69 metro areas in the US and Canada. To build it, they used data on retail and office locations from OpenStreetMap, and on job density from the US Census Bureau’s Longitudinal Employer-Household Dynamics program (DIP 2021.05.26) and Statistics Canada. For each detected commercial zone, the dataset provides its outline, total area, a score of its relative concentration (on which the zone comprising most of Manhattan scored the highest), its metropolitan statistical area (MSA), and the street at its centroid.

National park species. The National Park Service’s NPSpecies portal “documents our knowledge about the occurrence and status of species” on the agency’s lands. For each NPS-managed area, you can download a list of the species, their scientific and common names, occurrence status (present, probably present, unconfirmed), nativeness, conservation status, and more. Related: Noting that “many of the observations in NPSpecies remain unverified and the lists are often outdated,” Benjamin J. LaFrance et al. have created an updated dataset for amphibian species, which they checked against other sources and verified with regional experts.

Australia shipwrecks. The Western Australian Museum hosts a range of datasets, including details concerning 1,600+ local shipwrecks and 30,000+ artifacts recovered from them. The shipwreck dataset lists each ship’s builder, construction materials, owner, cargo, wreck location, date wrecked, known deaths, date found, and more. Previously: Ancient shipwrecks (DIP 2024.07.10). [h/t Kristin Milton]