Data Is Plural

... is a weekly newsletter of useful/curious datasets.

FAQ » What Makes A Dataset Good For DIP?

First published: 2022.08.22

Last updated: 2022.09.15

Status: Rough draft

When evaluating datasets for potential inclusion in the newsletter, what do I look for? These are the main criteria I have in mind … while understanding that few datasets will meet them all:

The dataset is freely, publicly available.

A reader should encounter as few barriers (e.g., account registration) to direct access as possible. (If your raw dataset contains sensitive information, consider creating a public-use version that anyone can access.)

The dataset is fully downloadable in bulk and/or via an API.

Exceptions include datasets that are simple enough to fit in a single, non-paginated HTML table. If a dataset is particularly large or elaborate, bonus points for also providing simpler, smaller versions for people who might not need it all.
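In practice, "available via an API" usually means readers assemble the full dataset by looping over paginated responses. Here's a minimal sketch of that pattern in Python; the records, page size, and fetch_page function are all made up for illustration (a real version would make HTTP requests against the provider's actual endpoint):

```python
# Pretend server-side data for a hypothetical paginated API.
RECORDS = [{"id": i} for i in range(1, 8)]
PAGE_SIZE = 3

def fetch_page(page):
    """Simulate one page of an API response (stand-in for a real HTTP call)."""
    start = (page - 1) * PAGE_SIZE
    chunk = RECORDS[start:start + PAGE_SIZE]
    return {"results": chunk, "has_more": start + PAGE_SIZE < len(RECORDS)}

def download_all():
    """Collect every record by walking the pages in order."""
    records, page = [], 1
    while True:
        response = fetch_page(page)
        records.extend(response["results"])
        if not response["has_more"]:
            break
        page += 1
    return records

all_records = download_all()  # 7 records gathered across 3 pages
```

A single bulk-download file spares readers this loop entirely, which is why offering both is ideal.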

The dataset is well documented.

Public documentation should, ideally, define the project’s focus, describe the data collection process, explain each variable, list known issues, and acknowledge the dataset’s limitations. The documentation should remain in sync with the dataset itself — not contradicting what a reader will see when examining the data.

The dataset is accurate.

I can’t fully fact-check every dataset I examine, but a basic review shouldn’t reveal any glaring errors.

The dataset uses open, standard formats.

Three cheers for CSV, TSV, JSON, GeoJSON, SQLite, and friends. Excel files are widely supported by free software programs, but how they're organized matters. Bonus points for tidy data.
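To illustrate the "tidy data" bonus: a tidy dataset puts one observation per row and one variable per column. Here's a minimal sketch, using only Python's standard library and made-up country/year figures, that reshapes a "wide" table (one column per year) into a tidy one:

```python
import csv
import io

# A small "wide" table: one row per country, one column per year.
# (The figures are invented for illustration.)
wide_csv = """country,2019,2020
Freedonia,10,12
Sylvania,7,9
"""

def melt(rows, id_column):
    """Reshape wide rows into tidy (long) rows: one observation per row."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_column:
                continue  # keep the identifier; turn each year column into a row
            tidy.append({id_column: row[id_column], "year": key, "value": value})
    return tidy

rows = list(csv.DictReader(io.StringIO(wide_csv)))
tidy_rows = melt(rows, "country")
# Each tidy row now pairs one country with one year and one value, e.g.:
# {"country": "Freedonia", "year": "2019", "value": "10"}
```

Tidy layouts like the output above are easier to filter, join, and aggregate with standard tools, which is why they earn the bonus points.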

The dataset specifies its terms of use.

Who can use the data? Under what circumstances? Can they republish it? How should people using the dataset attribute it? The documentation should clearly explain such constraints and/or use a standard license (such as those developed by Creative Commons or Open Data Commons).

The dataset has its own, unique, directly-linkable URL.

It should be easy to point readers to the dataset and its documentation.

The dataset is original.

If it overlaps with preexisting datasets, I recommend providing a clear explanation of how it differs from (and/or improves upon) those others.

The dataset is relatively fresh.

I like featuring explicitly historical datasets, but I shy away from contemporary datasets that have gone stale. (E.g., a dataset in 2022 that includes data only for 2002–2017.) Bonus points for stating your plans/schedule for updating it.

The dataset is sufficiently detailed.

Some datasets are inherently simple. But, generally speaking, the more detail you can (accurately) provide the public, the better. And although statistical aggregations make sense in certain cases, I tend to favor datasets that represent their real-world observations more directly.

The dataset is interesting.

This criterion is the squishiest of all. What makes a dataset interesting? Perhaps it is interesting just because it exists — a dataset that makes you say to yourself, “Huh, I never would have guessed people were working on this.” More often, it’s useful in some immediate way — a dataset that I can imagine readers exploring for work, for the public good, and/or for fun. But that just raises another slew of questions, doesn’t it?

I’d much rather receive more submissions than fewer. So, regardless of whether your dataset hits these marks, don’t hesitate to let me know about it. How? Send me an email at jsvine@gmail.com. Thanks!