Poll: Crawler flexibility?

Poll: Crawler flexibility?

kangas
Dear nutch-user readers,

I have a question for everyone here: Is the current Nutch crawler  
(Fetcher/Fetcher2) flexible enough for your needs?
If not, what would you like to see it do?

I'm asking because, last week, I suggested that the Nutch crawler  
could be much more useful to many people if it was structured more as  
a "crawler construction toolkit". But I realize that my comments  
could seem like sour grapes unless there's some plan for moving  
forward. So, I thought I'd just ask everybody what you think and  
tally the results.

What kind of crawls would you like to do that aren't supported? I'll  
start with some nonstandard crawls I've done:

1) Outlinks-only crawl: crawl a specific website, keep only the  
outlinks from articles (, etc)
2) Crawl into CGIs w/o infinite crawl -- via crawl-depth filter (see the sketch after this list)
3) Plug in a "feature detector" (address, date, brand-name, etc) and  
use this signal to guide the crawl

4) .... (fill in your own here!)
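
A minimal sketch of one way to do #2 as a URLFilter plugin follows. Note
that it caps URL *path* depth rather than true link depth from the seed
(which would need extra state, e.g. a depth counter carried along in crawl
metadata), and the class and property names are made up for illustration:

// Hedged sketch: caps URL path depth so CGI-style URLs can be crawled
// without descending forever. Assumes the Nutch 0.9-era URLFilter
// extension point; "mycrawl.max.path.depth" is a hypothetical property.
package org.example.nutch;

import java.net.MalformedURLException;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class PathDepthURLFilter implements URLFilter {

  private Configuration conf;
  private int maxDepth = 5;          // default cap on path segments

  // Return the URL unchanged to accept it, or null to reject it.
  public String filter(String urlString) {
    try {
      URL url = new URL(urlString);
      int depth = 0;
      for (String seg : url.getPath().split("/")) {
        if (seg.length() > 0) depth++;
      }
      return (depth <= maxDepth) ? urlString : null;
    } catch (MalformedURLException e) {
      return null;                   // drop unparsable URLs
    }
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    this.maxDepth = conf.getInt("mycrawl.max.path.depth", 5);
  }

  public Configuration getConf() {
    return conf;
  }
}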

--
Matt Kangas / [hidden email]



Re: Poll: Crawler flexibility?

searchfresco
Hi Matt

I read through your comments last week and didn't have time to reply, but
I wanted to share some thoughts.

> Crawling in the same manner as Google is probably a disaster for any
> startup. Whole-web crawling is quite tricky & expensive, and Google
> has done such a good job already here that, once your crawl succeeds,
> how do you provide results that are noticeably better than Google's?
> Failure to differentiate your product is also a quick path to death
> for a startup.
 

I think at some level you are ignoring human intervention in this
process. A lot of what you are trying to achieve crawler-wise is doable
via tools already built into Nutch, e.g. prune, URL filters, etc. Do
you realize how much human intervention is involved in maintaining
Google's indexes? Assume you could assign an editor to each million-page
segment and allow pruning, adding, etc., and you are probably close to
what Google does. There is no way any crawler can automagically
deliver a high-quality index; there are just too many variables out
there in the wild.

> If you bet on Nutch as your foundation but cannot build a
> differentiated product quickly, you'll be screwed, and you will drop
> out of the Nutch community and move on. Nutch will lose a
> possibly-valuable contributor.

That is missing the point, IMO, as Nutch can be used to create search
indexes that deliver result sets nearly identical to Google's, so
it's not a quality issue. I see the failure of SE startups as more of a
marketing issue. Have you considered Google's weak points: click fraud,
hostility towards privacy, etc.? Can you do better in those areas?

Solely blaming or relying on tech/Nutch to compete in SE land is a "fish
bowl" perspective.

Just my 2 cents.

John



Re: Poll: Crawler flexibility?

Tranquil
In reply to this post by kangas
Hi Matt,

I've posted some messages on the nutch-dev & nutch-user mailing lists about
downloading content by using the fetcher.

If you ask me what I'm looking for in the fetcher, it's:


   1. Real-time (or very fast) on-demand scanning of a given URL (or URLs) to
   a required depth, for the purpose of extracting data from it (this is
   different from wget in that it can handle redirects and all sorts of other
   issues in web crawling) - maybe design a daemon that will accept requests
   on the fly (see the sketch after this list).
   2. Configuration that will enable Nutch to download specific content types
   without the need to write a specific plugin for each extension.
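
A rough sketch of the on-demand, single-URL fetch from point 1, driving
Nutch's protocol plugins directly (this assumes a Nutch 0.9-era API; a real
daemon would wrap something like this behind a socket or HTTP endpoint, and
following redirects would still be up to the caller):

// Hedged sketch: fetch one URL through whatever protocol plugin Nutch
// would normally use for it, outside of a full crawl cycle.
package org.example.nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolFactory;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.util.NutchConfiguration;

public class OnDemandFetch {
  public static void main(String[] args) throws Exception {
    String url = args[0];

    // Loads nutch-default.xml / nutch-site.xml, so protocol plugins,
    // agent name, timeouts, etc. are configured the usual way.
    Configuration conf = NutchConfiguration.create();

    // Pick the protocol plugin (http, file, ...) registered for this URL.
    Protocol protocol = new ProtocolFactory(conf).getProtocol(url);
    ProtocolOutput out =
        protocol.getProtocolOutput(new Text(url), new CrawlDatum());

    System.out.println("status: " + out.getStatus());
    Content content = out.getContent();
    if (content != null) {
      System.out.println("type:   " + content.getContentType());
      System.out.println("length: " + content.getContent().length);
    }
  }
}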

I've mentioned before that my purpose is solely fetching pages and not
indexing/searching the results... so I'd rather see an optimized fetcher.

Eyal.




--
Eyal Edri

RE: Poll: Crawler flexibility?

Howie Wang
In reply to this post by searchfresco
> Do you realize how much human intervention is involved in maintaining
> Google's indexes? Assume you could assign an editor to each million-page
> segment and allow pruning, adding, etc., and you are probably close to
> what Google does.

Interesting. Do you have a link that describes what they're doing as far
as manual intervention goes?

Howie



Re: Poll: Crawler flexibility?

Marcin Okraszewski-3
In reply to this post by kangas
Seems that my use case is pretty similar. What I want is to crawl all pages on some sites, but index just those which contain features, as you call them. In my case probably only 10% of pages contain the feature, so keeping / indexing all pages would be a waste of space and time. Still, I need to fetch the pages without features, because otherwise I won't be able to reach the pages with features.

For now, during parsing I write out a file of URLs that contain features. Then I run mergeseg with a URL filter which loads the file from the previous step and accepts only URLs from that file. Finally, I index the filtered segment.
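
A minimal sketch of the kind of whitelist URL filter described above,
assuming the Nutch 0.9-era URLFilter extension point; the class name and
the "feature.urls.file" property are made up for illustration (in a
distributed setup the file would need to be readable on every node, which
is part of what breaks when a second machine is added):

// Hedged sketch: accept only URLs listed in a file produced during parsing.
package org.example.nutch;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class FeatureWhitelistFilter implements URLFilter {

  private Configuration conf;
  private final Set<String> allowed = new HashSet<String>();

  // Return the URL unchanged if it is on the whitelist, null otherwise.
  public String filter(String urlString) {
    return allowed.contains(urlString) ? urlString : null;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    String file = conf.get("feature.urls.file");
    if (file == null) return;
    try {
      BufferedReader in = new BufferedReader(new FileReader(file));
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() > 0) allowed.add(line);
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not load whitelist " + file, e);
    }
  }

  public Configuration getConf() {
    return conf;
  }
}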

I don't like this solution because it won't work properly when I add a second computer. I'm thinking of extending mergeseg so that it filters pages based on metadata. I'll need to ask some questions on the list to do it, but so far I haven't had time for this. But it seems you have solved some of these problems. Maybe you could contribute or share some code / ideas on how to do it in the best way.

Regards,
Marcin



Re: Poll: Crawler flexibility?

Tim_G
You could write a new MapReduce job that takes a directory of Nutch
segments and outputs a single segment containing only the data that has
the features you're looking for.  Basically, it would just be a copy of
the mergeseg job that only calls output.collect if a feature was
found.

Then you would be left with a concise segment of only the data you
want, which you could then index.
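
A rough sketch of such a job, assuming the old Hadoop "mapred" API of that
era and assuming the parser recorded the detected feature as a
parse-metadata flag named "hasFeature" (that key, the class name, and the
paths are hypothetical). A real version would mirror SegmentMerger and
carry the other segment subdirectories (content, crawl_fetch, parse_text,
...) along as well; this one filters only parse_data:

// Hedged sketch: map-only job that copies a segment's parse_data,
// keeping only the pages the parser flagged as having the feature.
package org.example.nutch;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.util.NutchConfiguration;

public class FeatureSegmentFilter extends MapReduceBase
    implements Mapper<Text, ParseData, Text, ParseData> {

  // Emit a record only when the parser set the (hypothetical) flag.
  public void map(Text url, ParseData parseData,
                  OutputCollector<Text, ParseData> output, Reporter reporter)
      throws IOException {
    if ("true".equals(parseData.getParseMeta().get("hasFeature"))) {
      output.collect(url, parseData);
    }
  }

  public static void main(String[] args) throws IOException {
    // args[0] = existing segment dir, args[1] = filtered output segment dir
    JobConf job =
        new JobConf(NutchConfiguration.create(), FeatureSegmentFilter.class);
    job.setJobName("filter-feature-segment");

    FileInputFormat.addInputPath(job, new Path(args[0], ParseData.DIR_NAME));
    FileOutputFormat.setOutputPath(job, new Path(args[1], ParseData.DIR_NAME));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(SequenceFileOutputFormat.class);
    job.setMapperClass(FeatureSegmentFilter.class);
    job.setNumReduceTasks(0);               // map-only filter
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(ParseData.class);

    JobClient.runJob(job);
  }
}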


RE: Poll: Crawler flexibility?

Tsengtan A Shuy
I came back from a long break, because my Ubuntu OS no longer works
and I need to reinstall the Windows OS.
I have one question below:
Does Nutch use a database to store the keywords and file names (or website
names)?
Any help will be much appreciated.

Adam Shuy, President
ePacific Web Design & Hosting
www.epacificweb.com
TEL: (408)272-6946


Re: Poll: Crawler flexibility?

Sebastian Steinmetz
In reply to this post by kangas
Hey there,

I'm quite new to the Nutch scene and have been reading the list for only
about 2 weeks or so.

At the moment I've got the following problem: we want to crawl all
pages in scope, but save only the ones with a special feature. So
I think your 3rd proposal would be really useful for us. Maybe there
is an easy way to achieve what we are trying to do, but there is no
documentation about this (or I haven't found any).

Some people might have had similar problems and maybe already
solved them. It would be great if you could share your experiences
and maybe some code fragments (maybe even on the wiki, so that new
people who are not yet reading the list can find them).

so long,
        Sebastian Steinmetz

