Interesting People mailing list archives

Re Chronicle of Higher Education: Google and the Misinformed Public


From: "Dave Farber" <farber () gmail com>
Date: Wed, 18 Jan 2017 18:13:16 -0500




Begin forwarded message:

From: Chuck McManis <chuck.mcmanis () gmail com>
Date: January 18, 2017 at 5:34:59 PM EST
To: Dave Farber <dave () farber net>
Cc: ip <ip () listbox com>
Subject: Re: [IP] Re Chronicle of Higher Education: Google and the Misinformed Public

It sounds like there's a market opportunity here for a search engine that explicitly provides context for search 
results: credibility, fact checking, bias (not as a value judgement), research articles vs. journalism reporting on 
them, etc. Could also incorporate some form of crowd sourcing, etc.

Fortunately this experiment has been done at least once. The search engine Blekko was founded by Rich Skrenta and his 
friends from Topix; I joined as VP of operations (and later of operations and engineering) in 2010, about six months 
before the product officially launched in November of that year. The search engine was predicated on the idea that much 
of the content of the web was useless, and that the original search engine mission of surfacing pages you would not 
otherwise find had flipped over to one of finding the useful pages amid all the fluff. That concept was implemented as 
a curated index built from web sites that had been identified by humans (content editors) and generally validated by 
search traffic (click-throughs). 
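
To make the curation idea concrete, here is a minimal sketch of how such an index could gate results: only pages from 
editor-approved hosts are eligible to rank, and a host stays eligible only while click-through traffic keeps validating 
it. The host names, thresholds, and helpers below are illustrative assumptions, not Blekko's actual internals.

    # Illustrative sketch only; hosts, thresholds, and data layout are assumptions.
    curated_hosts = {"nih.gov", "ieee.org", "loc.gov"}        # chosen by human editors
    clickthrough_rate = {"nih.gov": 0.31, "ieee.org": 0.22, "loc.gov": 0.18}

    def eligible(host, min_ctr=0.05):
        # A host ranks only if an editor approved it and searchers actually click it.
        return host in curated_hosts and clickthrough_rate.get(host, 0.0) >= min_ctr

    def curated_results(candidates):
        # candidates: list of (host, url, relevance_score) tuples for one query
        ranked = sorted(candidates, key=lambda c: -c[2])
        return [url for (host, url, _score) in ranked if eligible(host)]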

What had been observed was that Google had created an attractive nuisance in the early 2000s: a page carrying AdSense 
ads that sat on the first page of results for a given topic would deliver thousands of dollars a month in passive 
income, and the same page could provide additional income through affiliate links (sending referral cash back to the 
referral network). Blekko defined such pages, which were simply placeholders to drive advertising traffic, as web 
"spam," in much the same vein as email spam. By 2015 Google had an estimated index size of 10 trillion documents; by 
analyzing over 200 billion pages, Blekko estimated that there were no more than 100 billion web documents with actual 
content. That same year Doug Smith presented a paper he and I had authored at ICSC, and the two of us further refined 
the research to use training data from known good web pages to begin automating the growth of the index with 
high-quality documents.
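
For a concrete picture of that "train on known good pages" step, a minimal sketch built on off-the-shelf scikit-learn 
might look like the following. It is a hedged illustration only; the features, thresholds, and placeholder data are 
assumptions, not the method from the ICSC paper.

    # Sketch under assumptions: a binary text classifier trained on known-good
    # curated pages vs. known web-spam pages, used to nominate newly crawled pages.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    good_pages = ["full text of an editor-approved page ...", "another good page ..."]
    spam_pages = ["full text of an ad-farm placeholder page ...", "another spam page ..."]

    vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
    X = vectorizer.fit_transform(good_pages + spam_pages)
    y = [1] * len(good_pages) + [0] * len(spam_pages)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    def nominate_for_index(page_text, threshold=0.9):
        # Only auto-admit pages the classifier is very confident are real content.
        prob_good = model.predict_proba(vectorizer.transform([page_text]))[0, 1]
        return prob_good >= threshold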

As a product idea it resonated strongly with anyone who used search as part of their job. Blekko was beloved by 
reference librarians around the world and had a tremendous following among attorneys and journalists who used it for 
research, as well as students researching term papers. As a business model it was less successful. Specifically, the 
only search 'intent' that makes money is commercial intent. Blekko was unable to pursue subscription access to the 
index, and because we did not index all content, the engine would do poorly on topics that were not curated, or "long 
tail" topics [1]. The company had a "3 card monte" game built into the interface that showed results from Blekko, Bing, 
and Google (the only three US-based indexes available; today we're back to only two). The user was asked to pick the 
column with the "best" results, and the picks showed a consistent pattern across contested searches, long tail 
searches, and topic searches. A "contested" search was one where a lot of people were attempting to game the algorithm 
(search for "no fee credit card" some time to see a good example); on those, Blekko consistently 'won' because our 
index wasn't influenced by those folks, so we only returned good pages. In "long tail" searches Google generally won; 
it has a really, really big index. And in topic searches we would typically tie with Bing (and beat Google), or win 
outright if the topic was one of our curated topics.
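
A bare-bones version of that blind comparison is easy to picture: fetch the top results from each engine, shuffle the 
columns so the user can't tell which engine produced which, and tally the engine behind whichever column the user 
picks. The sketch below assumes hypothetical fetch_results and record_vote helpers; it is not the interface Blekko 
shipped.

    # Illustrative sketch; fetch_results() and record_vote() are assumed helpers.
    import random

    def run_blind_comparison(query, fetch_results, record_vote):
        engines = ["blekko", "bing", "google"]
        columns = {e: fetch_results(e, query)[:10] for e in engines}

        order = engines[:]
        random.shuffle(order)              # hide which column came from which engine

        for i, engine in enumerate(order):
            print(f"--- Column {i + 1} ---")
            for title in columns[engine]:
                print("   ", title)

        pick = int(input("Which column has the best results (1-3)? ")) - 1
        record_vote(query, order[pick])    # tally a 'win' for the engine picked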

In March of 2015 IBM bought the assets of Blekko and made it part of the Watson Group, where its crawler continues to 
go out and collect web pages, but now does so in service of building data sets for Watson rather than a search engine.

What I learned was that it costs about $7.5M/year, grossed up, to operate a 5 billion page index with enough bandwidth 
to serve on the order of 10M queries/day, and that you can't make a profit at that scale unless you build up your own 
ad network (which Blekko never did). People love quality search results, but they aren't actually willing to pay any 
money for them. (Or, conversely, they are willing to put up with the queries that return nothing but junk on Google for 
the sake of the ones that work well.) Digital advertising networks are filled with people who make boiler-room sellers 
of sub-prime mortgages look like angels.  
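
The arithmetic behind that conclusion is easy to check with the figures quoted above, treating them as rough, 
order-of-magnitude numbers:

    # Back-of-the-envelope check using the figures quoted above (assumed rough).
    annual_cost = 7_500_000                    # $/year, grossed up, ~5B page index
    queries_per_year = 10_000_000 * 365        # ~10M queries/day

    cost_per_query = annual_cost / queries_per_year
    print(f"cost per query: ${cost_per_query:.4f}")    # ~$0.002, about 0.2 cents

    # Breaking even on ads alone needs average revenue per query above that figure,
    # which is hard to reach without owning the ad network and the commercial-intent
    # queries that actually monetize.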

--Chuck
 
[1] A "long tail" topic is one for which only a handful of web pages exist and it is not referenced widely in the 
existing web

On Wed, Jan 18, 2017 at 11:33 AM, Dave Farber <farber () gmail com> wrote:



Begin forwarded message:

From: Thomas Leavitt <thomleavitt () gmail com>
Date: January 18, 2017 at 2:26:30 PM EST
To: Dave Farber <dave () farber net>
Subject: Re: [IP] Chronicle of Higher Education: Google and the Misinformed Public

Dave,

It sounds like there's a market opportunity here for a search engine that explicitly provides context for search 
results: credibility, fact checking, bias (not as a value judgement), research articles vs. journalism reporting on 
them, etc. Could also incorporate some form of crowd sourcing, etc.

It would be an interesting technical challenge to make this applicable across a broad range of searches, and of course 
there's the business case (or lack thereof) and going up against Google to consider. On the other hand, it seems like 
there's a real need for genuine innovation in the space, and there are some obvious candidates that would likely be 
interested in executing a buyout of a successful implementation before the company went to market.

Regards,
Thomas Leavitt

On Jan 17, 2017 10:13 AM, "Dave Farber" <farber () gmail com> wrote:



Begin forwarded message:

From: Lauren Weinstein <lauren () vortex com>
Date: January 17, 2017 at 11:20:06 AM EST
To: nnsquad () nnsquad org
Subject: [ NNSquad ] Chronicle of Higher Education: Google and the Misinformed Public


Chronicle of Higher Education: Google and the Misinformed Public

http://www.chronicle.com/article/Googlethe-Misinformed/238868

     Digital media platforms like Google and Facebook may disavow
   responsibility for the results of their algorithms, but they
   can have tremendous -- and disturbing -- social effects.
   Racist and sexist bias, misinformation, and profiling are
   frequently unnoticed byproducts of those algorithms. And
   unlike public institutions (like the library), Google and
   Facebook have no transparent curation process by which the
   public can judge the credibility or legitimacy of the
   information they propagate.  That misinformation can be
   debilitating for a democracy -- and in some instances deadly
   for its citizens.

- - -

--Lauren--
REPORT Fake News Here! - https://factsquad.com
CRUSHING the Internet Liars - https://vortex.com/crush-net-liars






