nutch - functionality..


nutch - functionality..

aaaaa
hi...

i might be a little out of my league.. but here goes...

i'm in need of an app to crawl through sections of sites and return pieces
of information. i'm not looking to do any indexing, just to return raw
html/text...

however, i need the ability to set certain criteria to help define which
pages actually get returned...

a given crawling process would normally start at some URL and iteratively
fetch the files underneath it. nutch does this, as well as providing some
additional functionality.

i need more functionality....

in particular, i'd like to be able to modify the way nutch handles forms,
and links/queries on a given page.

i'd like to be able to:

for forms:
 allow the app to handle POST/GET forms
 allow the app to select (implement/ignore) given
  elements within a form
 track the FORM(s) for a given URL/page/level of the crawl

for links:
 allow the app to either include/exclude a given link
  for a given page/URL via regex parsing or list of
  URLs
 allow the app to handle querystring data, i.e.
  to include/exclude the URL+query based on regex
  parsing or simple text comparison

data extraction:
 ability to do xpath/regex extraction based on the DOM
 permit multiple xpath/regex functions to be run on a
  given page
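(outside nutch, the link include/exclude and regex-extraction items above can be sketched with nothing but grep/sed; all file names, patterns, and sample data here are invented for illustration:)

```shell
# candidate links pulled from a crawled page (invented sample data)
cat > links.txt <<'EOF'
http://example.com/products/widget
http://example.com/login?next=/cart
http://example.com/products/gadget?sort=price
EOF

# include only /products/ URLs, exclude anything carrying a querystring
grep '/products/' links.txt | grep -v '?' > keep.txt
cat keep.txt

# regex extraction of one field from fetched HTML
cat > page.html <<'EOF'
<html><body><span class="price">$9.99</span></body></html>
EOF
grep -o '<span class="price">[^<]*' page.html | sed 's/.*>//'
```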


this kind of functionality would allow the 'nutch' process to be relatively
selective about crawling through a site and extracting the required
information....

any thoughts/comments/ideas/etc. regarding this process?

if i shouldn't use nutch, are there any suggestions as to what app i should
use?

thanks

-bruce




RE: nutch - functionality..

Fuad Efendi
Nutch is plugin-based, similar to Eclipse. You can extend Nutch's
functionality; just browse the src/plugin/parse-html source folder as a
sample. You can modify the Java code so that it handles 'POST' from forms
(Outlink class instances). (I am well familiar with v0.7.1; the new version
of Nutch is significantly richer.) The parse-html plugin is the easiest
starting point.

I don't see any reason why a search engine should return a list of pages
found via form POSTs; the <A href="...">PageFound</A> links on a Nutch
search-results screen have no way to perform a POST.

There is only one case where POST matters: the response may provide new
Outlink instances, such as the response from the 'Search' page of an
e-commerce site. And most probably such 'second-level' outlinks are
reachable via GET; a sample is the 'Search' page with POST on any
e-commerce site...
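(A plain-shell illustration, not the Nutch API: the extra information a modified parse-html plugin would have to surface for forms, beyond the href outlinks it already records. The sample HTML is invented.)

```shell
# invented sample page with one form and one ordinary link
cat > form.html <<'EOF'
<form action="/search" method="POST">
<input name="q" value="civic">
</form>
<a href="/next?page=2">next</a>
EOF

# method + action for each form -- data a plain outlink does not carry
grep -o '<form[^>]*>' form.html | \
  sed 's/.*action="\([^"]*\)".*method="\([^"]*\)".*/\2 \1/'

# the href outlinks that parse-html already extracts today
grep -o 'href="[^"]*"' form.html | sed 's/^href="//;s/"$//'
```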





RE: nutch - functionality..

aaaaa
hi fuad,

it looks like you're viewing what i'm trying to do as though it's for a
search engine... it's not...

i'm looking to create a crawler to extract specific information. as such, i
need to emulate some of the functions of a crawler, and i also need to
implement functionality that's apparently not in the usual spider/crawler.
being able to selectively iterate/follow through forms (GET/POST) in a
recursive manner is a requirement, as is being able to selectively define
which form elements i'm going to use when i do the crawling....

of course this approach is only possible because i have casual knowledge of
the structure of the site prior to crawling it...

-bruce





RE: nutch - functionality..

Fuad Efendi
Bruce,

I had a similar problem a year ago... I needed very specific crawling and
data mining (for a very specific business case). I decided to use a
database, and I was able to rewrite everything within a week (thanks to the
Nutch developers!).

My first approach was to modify the parse-html plugin: it writes 'path' and
'query params' directly to a database, along with some specific 'tokens'
such as product name, price, etc.

What I found:
- the performance of a database (such as Oracle or Postgres) is the main
bottleneck
- you need to 'mine' everything in-memory and minimize file read/write
operations (minimize HDD I/O, and use pure Java)

I had some (maybe useful) ideas:
- using statistics (how many anchors with similar text point at the same
page during a period of time), define the 'category' of info, such as
'product category', 'subcategory', 'manufacturer'
- define a 'dynamic' crawl, e.g. frequently re-crawl 'frequently-queried'
pages
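(The anchor-statistics idea can be sketched in a couple of lines of shell; the input format and data are invented, and a real implementation would read anchors out of the WebDB instead:)

```shell
# invented "target-url anchor-text" pairs collected during a crawl
cat > anchors.txt <<'EOF'
/products/1 Widget
/products/2 Gadget
/products/1 Widget
EOF

# count identical (url, anchor) pairs; high counts suggest the anchor
# text really is the page's name/category
sort anchors.txt | uniq -c | sort -rn
```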

I think the existing Nutch is very 'generic', with a lot of plugins such as
'parse-mpg', 'parse-pdf', ... It repeats the logic/functionality of
Google... 'Anchor text is a true subject of a page!' - Google bombing...

So, if it is a 'data mining' engine, I believe just creating an additional
plugin for Nutch is not enough: you have to define additional classes
('Outlink' has no notion of 'query parameters', etc.), and you need to
define a datastore; the existing WebDB interface is not enough... You would
need to rewrite Nutch... And there are no suitable 'extension points' for
this...


If you need just HTML crawl/mining - focus on it...





Will pay for someone to help

Honda-Search Administrator
I'm having a difficult time configuring nutch to behave the way I want it to
behave.

In a nutshell here is my situation:

I crawl a number of forums that relate to Hondas every night for posts.  The
purpose of my website is to be a search engine for all of the forums at
once.

I have a base set of URLs in the WebDB right now.  Every day I write a file
of URLs (which I place in urls/inject.txt) that I want nutch to inject into
the database to crawl.  I do NOT want to recrawl other URLs.  I only want to
crawl/recrawl the URLs in my list.

Can you help me configure nutch (or help with the correct scripts, crons,
etc.) to do this?  I've tried without success.

I am running nutch 0.7.2 and am totally confused with what to do next.  It
seems to me to be a simple fix, but I can't figure it out.

As I mentioned I will pay if someone can set me up.  I've run the crawl a
number of times now and i just keep on screwing things up.

Matt


Re: Will pay for someone to help

Thomas Delnoij-3
Matt,

AFAIK Nutch does not support fetching arbitrary fetch lists out of the box.

There is a tool in JIRA that supports this, though:
http://issues.apache.org/jira/browse/NUTCH-68

- Thomas



Re: Will pay for someone to help

Honda-Search Administrator
Thomas,

I was under the impression that I could inject a list of URLs using:

 bin/nutch inject database -urlfile filewithurls.txt

Is this not true?

Matt


Re: Will pay for someone to help

Thomas Delnoij-3
You can, but injecting != fetching.
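To make the distinction concrete, here is the shape of a full 0.7 cycle as a command outline (paths are examples; it obviously needs a Nutch 0.7 install to run). Inject only records URLs in the WebDB; nothing is downloaded until a fetchlist is generated and fetched:

```shell
# inject only adds the URLs to the WebDB -- no pages are downloaded yet
bin/nutch inject db -urlfile urls/inject.txt
# generate a fetchlist segment from WebDB entries that are due
bin/nutch generate db segments
# pick the newest segment and actually fetch it
s=`ls -d segments/2* | tail -n 1`
bin/nutch fetch $s
# fold the fetch results back into the WebDB, then index
bin/nutch updatedb db $s
bin/nutch index $s
```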


Re: Will pay for someone to help

Honda-Search Administrator
In reply to this post by Thomas Delnoij-3
Of course the other option is to treat this like any other search engine and
just let nutch "do its thing", knowing that everything will eventually get
indexed.

Matt



Re: Will pay for someone to help

Thomas Delnoij-3
Right, that is probably best for a start.

It really also depends on your hardware/network setup; if you have enough
resources to crawl/index your whole network in less than the default
refetch interval, you can consider decreasing that property.

There is also work going on related to a so-called adaptive refetch
interval, i.e. resources that change frequently are refetched more often
than those that don't. But this is not yet in trunk.

Rgrds, Thomas


RE: Will pay for someone to help

Howie Wang
In reply to this post by Honda-Search Administrator
I haven't tried this (or even thought it through much),
but it seems an easy way to achieve this would be to set
the db.default.fetch.interval to an arbitrarily large number
(maybe 36500 days, or 100 years). All pages that are
fetched will not be re-fetched for 100 years. So only
newly injected pages will be fetched when you re-crawl.
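If that works for your setup, the override is a single property in nutch-site.xml (a sketch; double-check the property name and description against the nutch-default.xml that ships with your version):

```xml
<!-- make already-fetched pages "not due" for ~100 years, so a re-crawl
     only touches newly injected URLs -->
<property>
  <name>db.default.fetch.interval</name>
  <value>36500</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property>
```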

I can't remember if there's code in 0.7 that intelligently
resets the fetch time according to some algorithm.
If there is, this might not work without some code modifications.
The place to look is probably in UpdateDatabaseTool.java,
in the pageContentsChanged and pageContentsUnchanged
methods. You might have to change the calls to
setNextFetchTime according to your needs.

Howie




Re: Will pay for someone to help

Honda-Search Administrator
In reply to this post by Thomas Delnoij-3
Thomas,

It appears to me that this is exactly what I need.  I can create a fetchlist
of the URLs I need to crawl and then fetch them, and essentially not worry
about the older entries unless they are modified.

Two questions:

First of all, will this re-fetch old documents already in the database?  In
my case, if a forum topic is updated it would be put into the flat URL
list.  Would it be refetched with this tool?

Secondly, can anyone point me in the direction of how to properly set this
up?  As I mentioned in another post, I'm lost when it comes to Java.  I want
to be able to compile this and use it, but the last thing I want to do is
screw anything up.

Matt



Re: Will pay for someone to help

Roberto Monge
I do something similar on a daily basis with nutch 0.7.  I look in the
DATE_DIR folder for new files to index and pass them to nutch via the
fetch_new.txt file.  Here is the daily indexing script I use (since the
files are local, I replace the root with my webserver's base directory):

# turn today's local files into a URL list
find /${DATE_DIR} -name '*.txt' > out.txt
# rewrite the leading / as the web root
sed -e 's@/@http://<myserver.com>/@' < out.txt > fetch_new.txt
nutch inject db -urlfile ./fetch_new.txt   # add new URLs to the WebDB
nutch generate db segments                 # build a fetchlist segment
s=`ls -d segments/2* | tail -n 1`          # newest segment directory
echo Segment is $s
nutch fetch $s                             # download the pages
echo Done Fetching
nutch updatedb db $s                       # record fetch results in the WebDB
nutch analyze db 2                         # two passes of link analysis
nutch index $s                             # index the new segment
nutch dedup segments tmpfile               # remove duplicate documents

I have the refetch time set high.

<property>
  <name>db.default.fetch.interval</name>
  <value>120</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property>
Every month or so I do a segment merge:
nutch mergesegs -dir segments -i -ds

And lately I've been deleting the /contents folder in the segments
directory, since I don't need the cached version of the files (I have them
on the local filesystem); this helps save disk space.  (In 0.8-dev it's a
property option.)

Roberto


RE: Will pay for someone to help

HUYLEBROECK Jeremy RD-ILAB-SSF-2
In reply to this post by Honda-Search Administrator
Hey Thomas,
Do you have any pointer to that work?
Thanks

-----Original Message-----

There is also work going on related to a so-called adaptive refetch
interval, i.e. resources that change frequently are refetched more often
than those that don't. But this is not yet in trunk.


RE: Will pay for someone to help

Thomas Delnoij-3
In reply to this post by Honda-Search Administrator
Adaptive Refetch Interval Patch: http://issues.apache.org/jira/browse/NUTCH-61

(Thanks to Andrzej)

Rgrds. Thomas


