search engine optimisation and marketing news | searchengineblog.com
 

Rich Skrenta Interview


Rich Skrenta was the co-founder of NewHoo (now DMOZ), has worked with Netscape and AOL, and now runs the news service Topix.net. I caught up with Rich for a chat.

Thanks for talking to us, Rich. Can you tell us a bit about your background, and what you're up to now?

My background is in engineering. I started my career working on Unix operating system internals, first at Commodore-Amiga (they actually had a version of Unix which could run on the Amiga), and then at Unix System Labs. In 1995 I went to work for Sun Microsystems on network security and encryption products.

In 1998 a group of us formed GnuHoo/NewHoo, and it was quickly bought by Netscape. We ran engineering for Netscape Search for a few years, and subsequently managed AOL Music (including Spinner and Winamp) and finally AOL Shopping.

In 2002 we left AOL as a team and set up shop as a startup again to play around with crawling and AI technology. We came up with the idea to build a news site which would track relevant news for every conceivable subject and place in the world, updated from the fullest set of information available on the Internet.

Our site Topix.net launched early last year. We have a news page for
every city in the US, every country in the world, and thousands of
subjects, all culled from over 10,000 sources. There's a lot of text
analysis software behind what we do, and an AI system that uses a
massive knowledge base to make sure the categorizations are relevant.

We had a great first year, inking deals with Ask Jeeves, Infospace,
and most recently CitySearch. Traffic has been good and we just
became profitable last month.

With so many sources, you'd have your work cut out for you in terms of categorization. Can you talk a bit more about how the text analysis and AI are achieved? What challenges do you face?

The first step of categorization is to recognize "named entity references". If we're looking for news for our Janet Jackson channel, is this really the Janet Jackson who is the pop star, or is it one of the other 10,000 people with that name?

The second step is what we call about-vs-mention discrimination. Janet Jackson is mentioned in many articles that aren't actually about her. We want to separate those out from the articles which would be editorially appropriate for her news channel.

Our system won't be complete until it performs as well as a human editor. It's pretty good now, with over 99% accuracy, but we'll be continually improving the system to increase the coverage and relevance of the news channels.
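
To make the two steps above concrete, here is a toy sketch in Python. It is not Topix.net's system, which Rich describes only at a high level. The context-word table, the thresholds, and every function name are assumptions made up for illustration; a real system would rely on a much larger knowledge base and better text analysis.

```python
import re

# Hypothetical context words for one entity (assumption, not Topix.net data).
ENTITY_CONTEXT = {
    "Janet Jackson": {"singer", "album", "pop", "tour", "music"},
}


def is_same_entity(name, text, min_hits=2):
    """Step 1: guess whether `name` refers to the known entity by counting
    how many of its context words appear in the article."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & ENTITY_CONTEXT[name]) >= min_hits


def is_about(name, headline, body, min_share=0.02):
    """Step 2: treat the article as 'about' the entity if the name is in the
    headline, or if mentions make up a large enough share of the body words."""
    if name.lower() in headline.lower():
        return True
    total_words = max(len(body.split()), 1)
    mentions = body.lower().count(name.lower())
    return mentions / total_words >= min_share


def belongs_on_channel(name, headline, body):
    """An article goes on the channel only if it passes both steps."""
    text = headline + " " + body
    return is_same_entity(name, text) and is_about(name, headline, body)
```

An article with her name in the headline and words like "album" and "tour" in the body passes both checks. A long sports story that names her once passes the first check but fails the second, and a story about a city council member with the same name fails the first.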

Many of the large newspaper companies lock up content behind subscription forms, while the same content is often distributed elsewhere without impediment. How do you see news services developing on the web in future? Are the large, incumbent news businesses going to be increasingly challenged by stealthy, smaller outfits and/or automated aggregation services?

Registration gates cause the majority of our user complaints.
Because of this, the Topix.net robo-editor will do its best to find
a related story to link to that is not behind a reg gate. So if the
same story comes out in two places, but one is behind a reg gate and
the other is not, Topix.net will link to the story that won't put a
form in front of users who click on it.
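
The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Story` record, its `reg_gate` flag, and the `pick_link` helper are hypothetical names, and how a crawler would detect a registration gate in the first place is not covered.

```python
from dataclasses import dataclass


@dataclass
class Story:
    url: str
    reg_gate: bool  # hypothetical flag: True if a registration form blocks the page


def pick_link(copies):
    """Given copies of the same story, prefer the first one without a reg gate.

    Falls back to the first gated copy when every copy is gated, and returns
    None when there are no copies at all.
    """
    if not copies:
        return None
    for story in copies:
        if not story.reg_gate:
            return story
    return copies[0]
```

For example, given a gated copy at one URL and an open copy at another, `pick_link` returns the open one regardless of the order in which they were found.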

We're experiencing an explosion in the number of news outlets.
Beyond the newspaper web sites, TV stations, news radio stations,
and online magazines, there are increasing amounts of information
being provided by corporate and government entities, as well as
weblog authors and other non-journalist writers. To scan this
massive amount of incremental information available each day for
items which can be personally relevant is a big job, but one amenable
to computer automation. At a high level Topix.net's mission is to
read everything new on the Internet every 30 minutes and let you
know about new, relevant information that's of interest to you --
whether that interest is based on a local city, a hobby interest,
a business sector, or some other content channel.

I think there's a big opportunity for existing media organizations
to take advantage of the current online trends, and add incremental
value and profit to their delivery of content online. The exact
nature of these business models is still sorting itself out but some
online operations are doing very well, and we've talked with some
very forward-looking folks in traditional media businesses.

I'd like to talk a little more about your background, particularly in relation to DMOZ. An article published in 1998 stated that you created GnuHoo out of frustration with Yahoo!, in particular the slowness with which the Yahoo! directory added and updated sites back then. Can you tell us a bit about those times? How do you think DMOZ has progressed since you left to pursue other ventures?

Dmoz has continued to grow since we left AOL. The open directory
project vastly exceeded our original goals -- we thought if we could
get 1,000 editors and 1 million sites, the directory would be a
huge success. It achieved these goals and has fulfilled its mission
of becoming the largest human-edited directory of the web. But the
web moved on, and while directories were very interesting in the
mid-'90s, keyword search has eclipsed them as the main way consumers find information on the Internet.

This is in part due to the growth of the web. When the web is small --
say 30 million documents -- a directory is a great way to organize
and find sites. This was Yahoo's strength in the early days. But as
the web grows from 30 million sites to 5-10 billion, directories,
even very large ones, can't keep up. Dmoz has 4M sites and over
600,000 categories. This is almost too large to be useful; one
can't easily click around browsing through a 600,000 page directory.
At the same time, since the directory, even with as many as 4M sites,
contains only 0.1% of the Internet or less, it can't be as big as it
needs to be to cover the content available on the Internet.
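
The arithmetic behind that coverage figure can be checked quickly. The sketch below uses only the numbers quoted in the answer (4M sites, 600,000 categories, a web of 5-10 billion); the function name is made up for illustration, and it treats the directory's "sites" and the web's size as directly comparable counts, which is a simplification.

```python
def coverage_percent(directory_sites, web_size):
    """Share of the web that the directory lists, as a percentage."""
    return 100.0 * directory_sites / web_size


DMOZ_SITES = 4_000_000
DMOZ_CATEGORIES = 600_000

# Coverage at the two ends of the quoted 5-10 billion range.
coverage_at_5b = coverage_percent(DMOZ_SITES, 5_000_000_000)
coverage_at_10b = coverage_percent(DMOZ_SITES, 10_000_000_000)

# Average number of listed sites per category.
sites_per_category = DMOZ_SITES / DMOZ_CATEGORIES

print(f"{coverage_at_5b:.2f}% to {coverage_at_10b:.2f}% coverage")
print(f"about {sites_per_category:.1f} sites per category")
```

Both ends of the range come out under 0.1%, consistent with the "or less" in the answer, and the average category holds fewer than seven sites.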

DMOZ has often been credited as a source of citation data, which algorithmic search engines may use in their calculations. We spoke a little about how a mix of human and machine input might produce the most desirable search results. Can you talk a little about how you think human factors can be best used to create meaningful search results today?

AOL Search is doing interesting work in this area (e.g.,
http://battellemedia.com/archives/001199.php). I think the future direction of making search results more relevant will be based on a better understanding of what users are searching for, and matching this up with the right kind of websites. This will require search
engines to understand more of the "semantics" of the web site -- e.g.,
is this an ecommerce site, a review site, a weblog, or ... If it's an
ecommerce site, is it a big retailer, or a small site? How long have
they been in business? What's their Better Business Bureau rating?

This is going to be a combination of algorithms and human input. But the human editorial layer may look very different from directories like Yahoo's and Dmoz.

Agreed. The search services are changing shape. For example, I've heard rumours that one major search service might be looking to turn search into a platform. How do you see the search world developing over the next few years?

Search is definitely a platform. Microsoft Word has a spell checker
built in, but it seems to not know about many common terms I use.
Beyond that even, can it correct misspellings of 'Skrenta'? Google
can do both, using the world's largest document collection and some
fancy algorithms.
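
As a rough illustration of why a large document collection helps with spelling, here is a toy corrector in Python. It is not Google's method, which the interview does not describe. It is a common textbook simplification: count the words in a corpus, then replace an unknown word with the most frequent known word that is one edit away. The function names and the tiny sample corpus are assumptions for illustration.

```python
import re
from collections import Counter


def build_counts(corpus_text):
    """Count how often each lowercase word appears in the corpus."""
    return Counter(re.findall(r"[a-z]+", corpus_text.lower()))


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the word itself if known, else the most frequent known neighbor."""
    word = word.lower()
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word


# A tiny stand-in corpus; the point is that names appear in it at all.
counts = build_counts("Rich Skrenta founded Topix. Skrenta also co-founded NewHoo.")
suggestion = correct("Skenta", counts)
```

A fixed dictionary has no entry for a surname, but any corpus that mentions the name often enough can suggest it, which is the advantage Rich is pointing at.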

The platform that search can provide also has monetization built in.
Imagine shareware that doesn't ask you to send $5 to a PO Box, but
can channel relevant advertising into a web-enabled application.
Search is a first step to full utilization of a world-sized corpus of
encyclopedic information, combined with the full value that community
participation in the content & commerce process can provide.

Many thanks, Rich. More interviews coming up soon :)


 

Rich is currently working on:

Topix.net - 150,000 news channels, from Autos to your ZIP code.



© Peter Da Vanzo 2002-2004 All Rights Reserved