Interesting People mailing list archives

Best of Luck, Jimmy Wales ...

From: David Farber <dave () farber net>
Date: Tue, 19 Jun 2007 15:15:28 -0400


Begin forwarded message:

From: Randall <rvh40 () insightbb com>
Date: June 19, 2007 3:02:38 PM EDT
To: johnmacsgroup () yahoogroups com
Cc: David Farber <dave () farber net>
Subject: Best of Luck, Jimmy Wales ...

Open-source search engine gangs up on Google
http://www.newscientisttech.com/article.ns?id=mg19426066.500&print=true

* 30 May 2007
* Paul Marks

IF HOPE alone could spawn a world-class search engine, Google would
be dead by now. In reality it's going to take more than faith to
topple the search giant, which has pioneered cutting-edge technology
and grabbed a "mindshare" that secured it a place in the Oxford
English Dictionary. Yet despite the sizeable hurdles ahead, a
rebellious group of engineers is hoping to do just that.

Led by Wikipedia's co-founder Jimmy Wales, hundreds of software
engineers - ranging from fledgling teenage coders to retired,
respected software gurus - are combining in an unlikely attempt to
overturn Google's domination of the search market. Their weapon? The
transparency provided by open source software.

The idea underpinning their search engine - dubbed Wikia Search
after Wales's umbrella company Wikia - is that its search algorithm,
which determines which web pages appear top of the lists of links it
serves up, will be made public. Wikia's search engineers think this
will elicit the trust of users in a way that Google, which keeps its
algorithm a closely guarded secret, never will. Open source search
results will also be more relevant, as the algorithm will
continually be tweaked by its users, keeping it up to date with new
technologies as they are deployed, Wales says. The Wikia Search team
believes this process of continual improvement will also make it
better than Google at dodging the efforts of the spammers who
constantly try to "game" Google's search algorithms to put their own
nefarious web pages top of the list of search results (see "A spark
for spam, or an end to it?").

Google is the top search engine today thanks to an innovative way of
determining which pages are the most relevant to a web user's query
pioneered by its founders, Sergey Brin and Larry Page, back in the
1990s. Yahoo and Microsoft followed with similar algorithms to rank
pages (New Scientist, 20 November 2004, p 23). These algorithms form
the heart of each company's intellectual property and so are kept
secret. But that, Wales told New Scientist earlier this year, is
their Achilles' heel, because it means no one knows why search
results appear in the order they do.

Last month, for instance, Google upgraded its algorithm to serve up
links to images, news, video, music and books, as well as web pages,
in a single search results page, saving users the trouble of having
to search under different headings. The company is keeping quiet
about how it does this too.

Faced with that silence, people rightfully question the quality of
search results, says Jeremie Miller, Wikia's technology chief, who
is based in San Mateo, California. Some ask whether Google's
algorithm skews results towards its advertising clients, which
earned the company more than $10 billion in 2006. Google denies
this, but equally, the secrecy means it is difficult to prove
otherwise. Similar criticism can be levelled at other search
engines. Last year several companies filed lawsuits against Google
and Yahoo alleging that the companies unfairly skew their search
results (New Scientist, 19 August 2006, p 24).

Politicians are worried too. "European governments have been getting
concerned about the competition aspects of search engines,
particularly as Google has become so dominant," says Ian Brown, an
electronic privacy expert at University College London. "They think
there should be much more transparency with search algorithms."

Web surfers may wish to turn to Wikia Search for another reason: it
is vowing not to record the terms people search for. Google, Yahoo
and Microsoft store this data as they say it helps them improve
their technology, but there are concerns that it could be used more
intrusively.

Wikia Search still has a long way to go before it becomes reality.
Though the discussion forums on the project's website
(search.wikia.com) and its associated email list have been up and
running since January, and are brimming with ideas about better ways
of running a search engine, no clear way forward has yet been
decided. What has emerged is that the code will probably incorporate
the best elements of two existing open source search programs,
neither of which is ready for prime time. One, called Lucene, is a
library for building and searching text indexes; the other, called
Nutch, is a web crawler and search engine built on top of Lucene.

Google says it welcomes the competition. "We're just really excited
when a new development comes to the space because it is good for
everybody," says Jon Steinback of Google.

To take on Google, Yahoo, Microsoft and the rest, Wales and his
coterie of coders face some tough challenges. One is a lack of cash
to buy a fleet of global data centres. Today's search engines create
lists containing the contents of billions of web pages, known as
indexes, and store them on tens of thousands of servers around the
globe. The exact number is another trade secret, but there is no
doubt that maintaining and powering them is hugely expensive.

Wikia Search is already considering one solution. Rather than
investing in data centres, it might store its index on a distributed
computing "grid" made up of thousands of volunteers' home PCs and
servers connected via the internet. The model for this is the
SETI@home screen saver, which divvies up data from a radio telescope
among volunteers' home PCs. Each computer would hold a small part of
Wikia Search's index and handle search requests relevant to that
part.
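
As a rough illustration of the idea of each computer holding part of
the index, here is a minimal Python sketch. It is not Wikia Search's
design, which the article says is still undecided. The node count,
the hash-based assignment of terms to machines and the function names
are all assumptions made for this example.

```python
from collections import defaultdict
from zlib import crc32

NUM_NODES = 4  # hypothetical number of volunteer machines


def node_for(term: str) -> int:
    """Map a search term to the node assumed to hold its part of the index."""
    return crc32(term.encode("utf-8")) % NUM_NODES


def build_shards(pages: dict[str, str]) -> list[dict[str, set[str]]]:
    """Split an inverted index (term -> URLs) across NUM_NODES shards."""
    shards = [defaultdict(set) for _ in range(NUM_NODES)]
    for url, text in pages.items():
        for term in set(text.lower().split()):
            shards[node_for(term)][term].add(url)
    return shards


def search(shards: list[dict[str, set[str]]], term: str) -> set[str]:
    """Send the query only to the node responsible for this term."""
    term = term.lower()
    return set(shards[node_for(term)].get(term, set()))


if __name__ == "__main__":
    pages = {
        "a.example": "open source search",
        "b.example": "closed search engine",
    }
    shards = build_shards(pages)
    print(search(shards, "search"))
```

Each term lives on exactly one node here, so a single-word query
touches only one machine. That is also why a switched-off machine is
a problem: its terms become unsearchable unless copies exist
elsewhere.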

This strategy brings a bunch of problems of its own, though. What do
you do when individual machines are switched off? And how do you
stop spammers posing as Wikia volunteers and flooding the index with
nefarious web pages?

Miller is confident these problems can be overcome. Video
distribution networks that use BitTorrent software also store
material on users' machines and can continue to function even when
some are switched off by spreading copies of the data across a
number of machines. Google itself shows the distributed approach
works, says Brown. Using clusters of desktop-class PCs, it deploys
clever distributed algorithms to shunt search data between them.
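
Why spreading copies helps can be shown with a little arithmetic. The
Python sketch below assumes, purely for illustration, that each
volunteer machine is online independently with the same probability.
Real home PCs will not behave so neatly, and the 50 per cent figure
is made up.

```python
def shard_availability(p_online: float, copies: int) -> float:
    """Chance that at least one of `copies` independent machines is online.

    Assumes every machine is online with probability `p_online`,
    independently of the others (an illustrative simplification).
    """
    return 1.0 - (1.0 - p_online) ** copies


if __name__ == "__main__":
    # Hypothetical case: each machine is switched on half the time.
    for copies in (1, 2, 3, 5):
        print(copies, round(shard_availability(0.5, copies), 3))
```

Under these assumptions a piece of the index held on one machine is
reachable half the time, while three copies raise that to 87.5 per
cent.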

Can Wikia Search's creators win the day? Clearly they are spirited.
"Kill and destroy Google," jokes one contributor. "Let's drive a
stake through the evil dragon's heart." In the end it may come down
to how much users value transparency. "Search needs to be part of
the internet's infrastructure, not the domain of commercial giants,"
says Miller. "Google is an advertising service."

A spark for spam, or an end to it?

Going open source should ensure the ordering of a search engine's
results cannot be secretly bent to its owner's whim. But will it
make the results any less prone to manipulation by spammers?

Search engine spam has plagued Google's results since the company
was founded. One way spammers initially "gamed" Google's search
algorithm, which ranks pages more highly if lots of other pages
contain links pointing to them, was to put up spoof web pages
crammed with links pointing to their own sites. As Google got wise
to this, spammers got more sophisticated and the two sides are now
locked in an arms race. Spammers deduce how Google's algorithm works
from observing how it seems to rank pages, and then devise their own
technologies to take advantage of the algorithm and propel their
pages to the top. Meanwhile Google has to constantly modify its
algorithm to dodge these tricks.
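
The spoof-page trick is easy to see in a toy model. The Python sketch
below ranks pages purely by how many other pages link to them. That
is a deliberate simplification of the idea described above, not
Google's actual algorithm, which is secret. All site names are
invented.

```python
from collections import Counter


def inbound_link_counts(links: dict[str, list[str]]) -> Counter:
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts


# A tiny hypothetical web: page -> pages it links to.
web = {
    "news.example": ["shop.example", "blog.example"],
    "blog.example": ["news.example"],
    "shop.example": ["news.example"],
    "spam.example": [],
}
honest = inbound_link_counts(web)

# A spammer puts up five spoof pages that all link to spam.example.
farm = {f"spoof{i}.example": ["spam.example"] for i in range(5)}
gamed = inbound_link_counts({**web, **farm})

if __name__ == "__main__":
    print(honest.most_common(1))
    print(gamed.most_common(1))
```

Before the spoof pages appear, news.example has the most inbound
links; afterwards spam.example does. A ranking that only counts links
has no way to tell the difference, which is why the search engine has
to keep changing its rules.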

Now Wikia, a company co-founded by Wikipedia pioneer Jimmy Wales,
plans to build an open source rival to Google that will publicise
the way its algorithm ranks results. Ben Laurie, an
open source programmer based in London, says that this will make it
easier for spammers to game the algorithms. Instead of having to
guess at how an algorithm works, as they do now, they will simply be
able to peek inside the software to come up with ways to manipulate
it. "By publishing its search algorithm, it's going to be pretty
obvious to spammers how to get to the top of the search hits,
risking a huge spamfest," Laurie says. "Some genius might come up
with algorithms that, despite being published, are resistant to
that. But it strikes me as unlikely."

The Wikia Search team, however, expect spam-resistant algorithms of
that kind to emerge. They hope their algorithms will be more
responsive than Google's to new spam techniques because of the vast
number of volunteers' brains that will be thrown at the problem.

Danny Sullivan of the news site searchengineland.com thinks that
Wikia Search will turn its army of volunteers to finding ways to
block spammers in the same way that Wikipedia handles vandalism in
its articles using an army of human editors. "I think they might
come up with some novel technology to let humans shape or refine
search results," he says.


My Original Writing blog: http://itgotworse.blogsource.com


