Using a "lens test" to escape a data void

OK, it's essentially just a little personal A/B test, but I like this metaphor so don't judge

Jun 10, 2021

One problem searchers have is that sometimes specifying that you are checking whether something is a hoax or misinformation really helps the SERP, pulling up a better set of results. Other times, adding a term like “true” or “false” to your search could push you deeper down the data void hole. But without knowing the status of your claim, it’s hard to know which situation you’re in.

This used to be much, much more of a problem than it is now, to be honest. Back when I wrote my open textbook on web literacy, here were the results for a 9/11 hoax search:

Results for search ‘was 9/11 a hoax’ circa early 2017. It’s a trash fire, tbh.

In the past several years, things have gotten much better. How do I know this? Because a big part of my work is coming up with prompts for students that do things like force them to deal with data voids. And in 2017, this was trivially easy. Play around with a search or two, load bias into the terms, and watch the Holocaust disappear from history, or discover secret plans for “white genocide”. Nowadays, not so much. Here’s that hoax search now:

It also used to be pretty easy to bias a search. Search for “are we eating too much protein” and the answer used to be, yeah, you’re eating too much. Search for “are we eating too little protein” and voilà, turned out we’d been eating too little. Nowadays, however, this just isn’t the case. Both searches tell us, yeah, we’re probably eating more than we need:

Two protein searches, more or less the same results.

Here the first set of results (“too much”) starts with an article from Harvard Health saying we’re probably eating too much and then goes to a BBC article saying the same. The second (“too little”) includes the same two articles, as well as a very reputable article from WebMD that says protein deficiency is rare in the U.S. but here are some symptoms to look for.
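If you want to put a rough number on "more or less the same results," one option is a simple set overlap. This is only a minimal sketch: the function name `jaccard_overlap` is my own invention for illustration, and the lists contain just the three sources named above, not the full result pages.

```python
def jaccard_overlap(results_a, results_b):
    """Share of distinct sources that show up in both result lists (0.0 to 1.0)."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        # Two empty lists are trivially identical.
        return 1.0
    return len(a & b) / len(a | b)


# Only the sources named in the post, not the complete SERPs.
too_much = ["Harvard Health", "BBC"]
too_little = ["Harvard Health", "BBC", "WebMD"]

overlap = jaccard_overlap(too_much, too_little)
print(f"{overlap:.2f}")  # prints 0.67
```

A high overlap between the two phrasings suggests the wording isn't steering the results much. A low overlap is a hint, not proof, that term bias is in play.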

When I say it’s harder to find examples, I don’t mean a little harder. It’s harder by several orders of magnitude. I don’t know why, but the pattern seems here to stay. (My theory is that some of this may be due to the newer Deep Rank/BERT algorithm taking things less literally, but I think there was also an update after the Holocaust debacle.)

Do I still find data voids? Yes, for very niche stuff. Take this example:

Decalcify your pineal gland? Is that something I should be doing? I do a search:

Now the thing is that I know the sort of neighborhood I’m in here. I see “supplements”, I see “detox”, I see “avoid fluoride”. I mean, it even has “sungazing” for Pete’s sake (don’t ask). But the way I know I’ve entered a neighborhood potentially rife with misinfo is based on a lot of background that your students won’t have. (Trust me, I’ve run things like this, and students who haven’t learned SIFT don’t get the vibe from these results.)

So we’re stuck with a dilemma here. If we tell students, oh, add “misinformation” or “fact-check” to the keywords, then they have the potential to bias search results that might have been otherwise good. (Again, not nearly the problem it was in 2017, but term bias is still a concern). But if they enter a void like this, sometimes the best way out is to explicitly look for articles calling bullshit on this claim.

To solve this problem for students, I’ve developed an activity around data voids I am currently calling a “lens test”, based on an eye doctor metaphor. It’s admittedly just a bit of personalized A/B comparison, but I like the analogy. The idea is this: when you have to make a change to your search terms that *might* add bias, but seems necessary, don’t just change and keep going forward. Instead, explicitly compare the two sets of results based on the heuristic “which set of terms returns the sort of sources I would expect for an issue of this nature and importance?”
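For the programmers in the audience, the lens test can be loosely sketched in Python. Big caveats: the real test is a human judgment call, not a score, and everything in this sketch is an assumption made up for illustration. The function names (`expected_share`, `lens_test`), the idea of reducing "sources I would expect" to a fixed set, and the `.example` domains are all hypothetical placeholders, not real search results.

```python
def expected_share(results, expected_sources):
    """Fraction of results that come from sources you would expect to see."""
    if not results:
        return 0.0
    hits = sum(1 for source in results if source in expected_sources)
    return hits / len(results)


def lens_test(results_a, results_b, expected_sources):
    """Compare two result sets; return 'A', 'B', or 'tie' ("better one, or better two?")."""
    share_a = expected_share(results_a, expected_sources)
    share_b = expected_share(results_b, expected_sources)
    if share_a > share_b:
        return "A"
    if share_b > share_a:
        return "B"
    return "tie"


# Hypothetical placeholder data, invented purely for illustration.
expected = {"medical-reference.example", "fact-checker.example"}
lens_a = ["supplement-shop.example", "detox-blog.example", "medical-reference.example"]
lens_b = ["fact-checker.example", "medical-reference.example", "detox-blog.example"]

print(lens_test(lens_a, lens_b, expected))  # prints B
```

The point of the sketch is the shape of the activity: run both searches, hold them side by side, and pick the lens that shows more of the sources you'd expect. It isn't meant to suggest students should automate this.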

I walk through it below. It’s a straight run-through, about 15 minutes of talking through various issues. If you jump in at the 1:30 mark you’ll save some time, and when I start ranting about misinterpretations of NIH/PubMed you can probably stop there, unless you’re curious, and save yourself the final 5 minutes. I’d love to get nice polished versions of these done at some point, but for now this is what we have. :)

The End(s) of Argument
