Expanding on our Iffy.news/Veri.FYI session at WikiCredCon (February 15, 2025). Please consider supporting the new Domain-Name Property Proposal in Wikidata that emerged from this conference…
Wikimedia is a wonderful source of information about the credibility of news sources. But it’s not a source of credibility data. It’s not a credibility tool. Even though it could be.
What is.
[Image: Existing Wikipedia Infobox for the Alliance Times-Herald with little data]
What could be.
[Image: Proposed Wikipedia Infobox for the Alliance Times-Herald with data imported from external datasets]
That latter, proposed Infobox is full of good credibility indicators: a press association membership, inclusion in vetted news-media databases, a local address (distinguishing it from fake-local pink-slime sites), and early founding and domain registration dates.
I have all that data and more for about 14K US, mostly English news-outlets, harvested from dozens of databases and APIs. What I don’t have is a good way to get it into Wikidata. (Or any way at all to get it into Wikipedia.)
For that to happen, Wikimedia would need to play nice with external datasets. And Wikidata would need to share its data with English Wikipedia’s Infoboxes (as it does in other languages).
Right now, though, there’s no good way to exactly match a paper in my dataset with its Wikidata item, a.k.a. QID.
[Image: Wikidata search for The Bulletin returns many results]
Name ain’t gonna cut it.
Newspaper names are far from unique. Papers are often known by multiple names: Bend Bulletin or The Bend Bulletin or usually just The Bulletin. Or the name only Wikimedia uses: The Bulletin (Bend).
Nine papers in my dataset are named The Bulletin. None are the one in Bend. However, only one paper on the entire planet is identified by this domain name: bendbulletin.com.
The problem is…
Wikidata knows no domain
Almost all the external media datasets use the domain name to identify a news outlet. But Wikidata has no property for domain name. It has no place in its entire database to store an item’s domain name.
We could enter the domain as an alias. That makes it easily located by computers and humans.
[Image: Adding domain name as an Alias makes searches quicker and easier for computer and humans]
More than 2K news-outlets in Wikidata have their domain as an alias. But Wikidata editors tell me that domain shouldn’t be an alias. You may ask me, as many Wikidatasticians have: Why not get the domain from an item’s ‘official website’ URL (P856)?
That approach has many problems (outlined below). But the main reason domain name should be its own separate value: So Wikidata can quickly and easily share with other databases, like those with credibility-related data.
This SPARQL query for a domain name to exactly match an alias took 0.14 seconds and returned a single result. Just what we need.
[Image: SPARQL query for domain name as an alias returns 1 result in 0.1 second]
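For readers who can't see the screenshot, an exact-match alias lookup might look roughly like the sketch below. It only builds the SPARQL string in Python; the function name `alias_query` and the English language tag are assumptions for illustration, not necessarily the query run in the session. It relies on `skos:altLabel`, the predicate Wikidata uses for aliases.

```python
def alias_query(domain: str, lang: str = "en") -> str:
    """Build a SPARQL query for items whose alias is exactly `domain`."""
    # Escape backslashes and double quotes so the value is a safe string literal.
    literal = domain.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item WHERE {\n"
        # Exact literal match: no FILTER and no string scanning needed.
        f'  ?item skos:altLabel "{literal}"@{lang} .\n'
        "}"
    )


print(alias_query("bendbulletin.com"))
```

Because the alias is matched as a whole literal, the query service can answer it as a direct lookup rather than inspecting every value, which is consistent with the sub-second timing reported above.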
A query for that domain name within an ‘official website’ URL took 31.8 seconds (230 times longer) and found multiple results. Not what we need, especially when programmatically trying to match thousands of news outlets.
[Image: SPARQL query for domain name within website URL returns multiple results in 32 seconds]
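A substring search against ‘official website’ (P856) could be sketched as below. Again this only builds the query string, and the function name `website_contains_query` is made up for illustration; the screenshot's query may differ in detail. The second half uses hypothetical URLs (not taken from Wikidata) to show why a substring match can return more than one hit.

```python
def website_contains_query(domain: str) -> str:
    """Build a SPARQL query for items whose P856 URL contains `domain`."""
    literal = domain.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?url WHERE {\n"
        "  ?item wdt:P856 ?url .\n"
        # The FILTER has to be evaluated against every official-website value.
        f'  FILTER(CONTAINS(STR(?url), "{literal}"))\n'
        "}"
    )


print(website_contains_query("bendbulletin.com"))

# Hypothetical URLs, for illustration only: a substring test cannot tell
# the outlet's own site from another page that merely mentions the domain.
hypothetical_urls = [
    "https://www.bendbulletin.com/",
    "https://example.org/archive?source=bendbulletin.com",
]
matches = [u for u in hypothetical_urls if "bendbulletin.com" in u]
print(len(matches))
```

The design difference is the point: the alias query names one exact value, while this one asks the service to test a string condition on every URL it holds.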
I was able to find QIDs for about 10K news-outlets. That required running lots of laborious searches and scripts on CSVs and spreadsheets using multiple tools (OpenRefine, QuickStatements, Wikibase-CLI, and Wikipedia and Wikidata Tools).
I then uploaded several boatloads of data from external datasets into Wikidata (including press associations, Media Bias/Fact Check ratings, addresses, place of publication, and updated URLs) — but only for the mostly English, US outlets I was working with. That complicated process wouldn’t be practical for all countries and all languages. Nor would it be needed if Wikidata stored domain names as data.
Taming Wikidata’s Wild West classes
One more request: It’d be nice if all news media in Wikidata were classified as ‘news media’ (Q1193236) or one of its subclasses. That would make searches faster and more complete. I found newspapers that weren’t an instance of ‘newspaper’ (Q11032) and news classes that weren’t subclasses of news media (e.g., ‘news program’ and ‘news magazine’). I fixed those, but there are likely more stragglers out there.
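If every outlet sat somewhere under ‘news media’, one property path could collect them all. The sketch below builds such a query in Python; the function name `news_media_query` and the `limit` parameter are illustrative assumptions, while `wdt:P31` (instance of) and `wdt:P279` (subclass of) are the standard Wikidata properties.

```python
NEWS_MEDIA = "Q1193236"  # 'news media', the class cited in the text


def news_media_query(root_qid: str = NEWS_MEDIA, limit: int = 100) -> str:
    """Build a SPARQL query for instances of `root_qid` or any of its subclasses."""
    # Accept only well-formed QIDs such as "Q11032".
    if not (root_qid.startswith("Q") and root_qid[1:].isdigit()):
        raise ValueError(f"not a QID: {root_qid!r}")
    return (
        "SELECT ?item WHERE {\n"
        # P31 once, then zero or more P279 hops up to the root class.
        f"  ?item wdt:P31/wdt:P279* wd:{root_qid} .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


print(news_media_query())
```

An item whose class is not linked into that subclass tree is simply invisible to this query, which is why the stragglers matter.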
Here’s a Wikigraph of ‘news media’ and its hundreds of subclasses:
[Image: Wikigraph data visualization for news media’s many subclasses]
News-outlet categories are a sprawling, free-form Wild West of classifications. So it’d also be nice if the ‘news media’ hierarchy were better organized, like this Metaphacts viz:
[Image: Metaphacts data visualization for news media’s main subclasses]
Advice/Asks
Summing up, we’re asking Wikidata editors to:
- Make domain-name searches quick & easy.
- Populate Wikipedia (en) Infoboxes with Wikidata.
- Classify all news media under ‘news media’.
Thanks for making it this far. A couple sheets that might help your Wikidata work:
- SpiffyNews-2024: US reliable sources with QIDs and domain names.
- US Place FIPS/Wikidata QIDs: QIDs of US cities and states.
Domain rocks, URLs roll
Addendum: More reasons why getting the domain name from a URL isn’t optimal, for API searches or information architecture:
- You can get an item’s place name from its address. But no one’s suggesting that ‘city’ shouldn’t be a separate Wikidata value. Domain name, like place name, is the atomic unit of data, from which URLs (or addresses) are derived.
- Domains are regulated and registered in the DNS. URLs aren’t.
- Domains stay the same. URLs can change (e.g., http to https).
- URLs are absent from many items and removed from others on blocklists.
[Image: Wikidata item with official website listed as No Value]
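To make the points above concrete, here is a minimal Python sketch of deriving a domain from a URL with the standard library. The helper `naive_domain` is hypothetical, and the URL variants are illustrative rather than taken from Wikidata. It shows that several URL spellings collapse to one domain, and that the derivation fails outright when the stored value has no scheme or is missing.

```python
from urllib.parse import urlsplit


def naive_domain(url: str) -> str:
    """Return the URL's hostname, lowercased, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""  # hostname is None if there is no '//' part
    return host[4:] if host.startswith("www.") else host


# Illustrative variants of one outlet's URL.
urls = [
    "http://www.bendbulletin.com",
    "https://www.bendbulletin.com/",
    "https://bendbulletin.com/news/",
]
print({naive_domain(u) for u in urls})  # {'bendbulletin.com'}

# A bare domain with no scheme yields nothing to extract.
print(repr(naive_domain("bendbulletin.com")))  # ''
```

Note that this sketch strips only `www.`; reducing other subdomains to a registered domain would need something like the Public Suffix List, which is one more step a stored domain value would make unnecessary.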