Thanks
for talking to us, Rich. Can you tell us a bit about
your background, and what you're up to now?
My background is in engineering. I started my career
working on Unix operating system
internals, first at Commodore-Amiga (they actually
had a version of Unix
which could run on the Amiga), and then at Unix
System Labs. In 1995 I went
to work for Sun Microsystems on network security
and encryption products.
In 1998 a group of us formed GnuHoo/NewHoo, and
it was quickly bought by Netscape. We
ran engineering for Netscape Search for a few years,
and subsequently managed
AOL Music (including Spinner and Winamp) and finally
AOL Shopping.
In 2002 we left AOL as a team and set up shop as
a startup again to play around with crawling and AI technology. We came
up with the idea to build
a news site which would track relevant news for
every
conceivable subject
and place in the world, updated from the fullest
set
of information available
on the Internet.
Our site Topix.net launched early last year. We have
a news page for
every city in the US, every country in the world,
and thousands of
subjects, all culled from over 10,000 sources. There's
a lot of text
analysis software behind what we do, and an AI system
that uses a
massive knowledge base to make sure the categorizations
are relevant.
We had a great first year, inking deals with Ask
Jeeves, Infospace,
and most recently CitySearch. Traffic has been good
and we just
became profitable last month.
With
so many sources, you'd have your work cut out for
you in terms of categorization.
Can you talk a bit more about how the text analysis
and AI
is achieved? What challenges do you face?
The first step of categorization is to recognize "named
entity references". If
we're looking for news for our Janet Jackson channel,
is this really
the Janet Jackson who is the pop star, or is it
one of the other 10,000
people with that name?
The second step is what we
call about-vs-mention
discrimination. Janet Jackson is mentioned in many
articles that
aren't
actually about her. We want to separate those
out
from the articles
which would be editorially appropriate for her news
channel.
Our
system won't be complete until it performs
as well as a human editor. It's pretty good now,
with over 99% accuracy, but we'll be continually
improving the system to increase the coverage
and relevance of the news channels.
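As a rough illustration of the about-vs-mention step described above, here is a minimal sketch. It is not Topix.net's system, which the interview does not detail; the function name `about_score`, the three signals (name in headline, mention density, early first mention), the weights, and the 0.5 threshold are all hypothetical choices made for the example.

```python
import re

def about_score(title: str, body: str, name: str) -> float:
    """Toy score in [0, 1]; higher suggests the article is *about* `name`.

    Hypothetical heuristic: the signals and weights are illustrative only.
    """
    name_l = name.lower()
    words = re.findall(r"[a-z']+", body.lower())
    if not words:
        return 0.0
    body_l = " ".join(words)

    # Signal 1: the name appears in the headline.
    in_title = name_l in title.lower()

    # Signal 2: mentions per 100 words, scaled so 5 per 100 saturates at 1.0.
    mentions = body_l.count(name_l)
    density = min(1.0, (mentions / (len(words) / 100.0)) / 5.0)

    # Signal 3: the first mention falls in the first 20% of the body.
    first = body_l.find(name_l)
    early = first != -1 and first < 0.2 * len(body_l)

    return 0.5 * in_title + 0.3 * density + 0.2 * early


def is_about(title: str, body: str, name: str, threshold: float = 0.5) -> bool:
    """Treat an article as 'about' the name when the toy score reaches the threshold."""
    return about_score(title, body, name) >= threshold
```

A headline such as "Janet Jackson announces tour" with the name in the opening sentence scores high, while a story on advertising prices that names her once near the end scores low. A real system would also need the first step, deciding which Janet Jackson is meant, which this sketch does not attempt.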
Many of the large newspaper companies
lock up content behind subscription
forms, while the same content is often distributed
elsewhere
without impediment. How do you see news services
developing on the web in future?
Are the large, incumbent news businesses going to
be increasingly challenged by stealthy, smaller
outfits and/or automated aggregation services?
Registration gates cause the majority of our user
complaints.
Because of this, the Topix.net robo-editor will
do its best to find
a related story to link to that is not behind a
reg gate. So if the
same story comes out in two places, but one is behind
a reg gate and
the other is not, Topix.net will link to the story
that won't impede
users clicking on it with a form.
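The preference just described can be sketched in a few lines. This is only an illustration under stated assumptions: the `StoryLink` record, its `behind_reg_gate` flag, and the `pick_link` function are hypothetical names, and the sketch assumes the copies of a story have already been grouped together and flagged by some earlier step.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class StoryLink:
    url: str
    source: str
    behind_reg_gate: bool  # assumed to be known from an earlier crawl step

def pick_link(candidates: Sequence[StoryLink]) -> StoryLink:
    """Choose which copy of a story to link to.

    Prefers a copy that is not behind a registration gate. min() returns the
    first minimal item, so the original ordering breaks ties, and a gated
    copy is still returned when no ungated copy exists.
    """
    return min(candidates, key=lambda link: link.behind_reg_gate)


copies = [
    StoryLink("http://example.com/gated/story", "Paper A", True),
    StoryLink("http://example.org/open/story", "Paper B", False),
]
print(pick_link(copies).source)
```

Falling back to a gated copy rather than dropping the story is a design choice of this sketch; the interview only says the robo-editor does its best to find an ungated one.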
We're experiencing an explosion in the number of
news outlets.
Beyond the newspaper web sites, TV stations, news
radio stations,
and online magazines, there are increasing amounts
of information
being provided by corporate and government entities,
as well as
weblog authors and other non-journalist writers.
To scan this
massive amount of incremental information available
each day for
items which can be personally relevant is a big
job, but one amenable
to computer automation. At a high level Topix.net's
mission is to
read everything new on the Internet every 30 minutes
and let you
know about new, relevant information that's of interest
to you --
whether that interest is based on a local city,
a hobby interest,
a business sector, or some other content channel.
I think there's a big opportunity for existing
media organizations
to take advantage of the current online trends,
and add incremental
value and profit to their delivery of content online.
The exact
nature of these business models is still sorting
itself out but some
online operations are doing very well, and we've
talked with some
very forward-looking folks in traditional media
businesses.
I'd like to talk a little more about your
background, particularly in
relation to DMOZ. An article published in 1998 stated
that you created GnuHoo out of frustration with Yahoo!, in particular,
the slowness with which the
Yahoo! directory added and updated sites back then.
Can you tell us a
bit about those times? How do you think DMOZ has
progressed since you left to pursue other ventures?
Dmoz has continued to grow since we left AOL. The
Open Directory
Project vastly exceeded our original goals -- we
thought if we could
get 1,000 editors and 1 million sites, the directory would be a
would be a
huge success. It achieved these goals and has fulfilled
its mission
of becoming the largest human-edited directory of
the web. But the
web moved on, and while directories were very interesting
in the mid-'90s,
keyword search has eclipsed them as the main
way consumers find information on the Internet.
This is in part due to the growth of the web. When
the web is small --
say 30 million documents -- a directory is a great
way to organize
and find sites. This was Yahoo's strength in the
early days. But as
the web grows from 30 million sites to 5-10 billion,
directories,
even very large ones, can't keep up. Dmoz has 4M
sites and over
600,000 categories. This is almost too large to
be useful; one
can't easily click around browsing through a 600,000
page directory.
At the same time, since the directory, even with
as many as 4M sites,
contains only 0.1% of the Internet or less, it can't
be as big as it
needs to be to cover the content available on the
Internet.
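The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted above (4M sites, 600,000 categories, a web of 5-10 billion documents); the function name `coverage_percent` is just a label for the example.

```python
def coverage_percent(directory_sites: int, web_size: int) -> float:
    """Share of the web listed in the directory, as a percentage."""
    return 100.0 * directory_sites / web_size

DIRECTORY_SITES = 4_000_000
CATEGORIES = 600_000

for web_size in (5_000_000_000, 10_000_000_000):
    pct = coverage_percent(DIRECTORY_SITES, web_size)
    print(f"{web_size:,} documents -> {pct:.2f}% coverage")

# Average number of listed sites per category page.
print(f"{DIRECTORY_SITES / CATEGORIES:.1f} sites per category")
```

Both web-size estimates put coverage under 0.1%, and the category count works out to fewer than seven sites per category on average.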
DMOZ has often been credited as a source of
citation data, which algorithmic
search engines may use in their calculations. We
spoke a little about how a
mix of human and machine input might produce the
most desirable search
results. Can you talk a little about how you think
human factors can be best
used to create meaningful search results today?
AOL Search is doing interesting work in this area
(e.g.,
http://battellemedia.com/archives/001199.php). I
think the future direction of making search results
more relevant will be based on a better understanding
of what users are searching for, and matching this
up with the right kind of websites. This will require
search
engines to understand more of the "semantics" of
the web site -- e.g.,
is this an ecommerce site, a review site, a weblog,
or ... If it's an
ecommerce site, is it a big retailer, or a small
site? How long have
they been in business? What's their Better Business
Bureau rating?
This is going to be a combination of algorithms,
as well as human input. But the human editorial
layer may look very different than directories like
Yahoo's and Dmoz.
Agreed. The search services are changing
shape. For example, I've heard rumours that one
major search service might be looking to turn search
into a platform. How do you see the search world
developing over the next few years?
Search is definitely a platform. Microsoft Word
has a spell checker
built in, but it seems to not know about many common
terms I use.
Beyond that even, can it correct misspellings of
'Skrenta'? Google
can do both, using the world's largest document
collection and some
fancy algorithms.
The platform that search can provide also has monetization
built in.
Imagine shareware that doesn't ask you to send $5
to a PO Box, but
can channel relevant advertising into a web-enabled
application.
Search is a first step to full utilization of a
world-sized corpus of
encyclopedic information, combined with the full
value that community
participation in the content & commerce process
can provide.
Many thanks, Rich. More interviews coming up soon
:)