nutch-default.xml configuration

nutch-default.xml configuration

Lourival Júnior
Hi all!

I have a question about the nutch-default.xml configuration file. There is a
parameter, db.default.fetch.interval, that is set by default to 30. It means
that pages from the webdb are recrawled every 30 days
<http://www.mail-archive.com/nutch-user@.../msg02058.html>. I want to know
whether "recrawled" here means an automatic recrawl, or whether I have to
execute some shell script before this period to make updates to my WebDB
possible.

I really want to know this because so far I have not actually obtained an
update.

Thanks a lot!

--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [hidden email]
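
For readers who want to check which value is actually configured, here is a minimal sketch that reads a property from a Nutch-style configuration file. It assumes the usual layout in which the <name> and <value> elements each sit on their own line. get_nutch_property is a hypothetical helper, not a Nutch command, and the awk matching is a shortcut, not a real XML parser.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print the <value> that follows
# a given <name> in a Nutch-style configuration file.
# Assumes <name> and <value> are each on their own line.
get_nutch_property() {
  conf_file=$1
  prop_name=$2
  awk -v name="$prop_name" '
    # Literal match on the property name, so dots are not treated as regex.
    index($0, "<name>" name "</name>") { found = 1; next }
    # First <value> line after the matching name: strip the tags and print.
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$conf_file"
}

# Example (typical path, adjust to your installation):
#   get_nutch_property conf/nutch-default.xml db.default.fetch.interval
```

Site-specific overrides normally go in conf/nutch-site.xml rather than in nutch-default.xml, so the same helper can be pointed at that file to see whether the default has been overridden.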

Re: nutch-default.xml configuration

Nuther
Hi, Lourival.


You wrote on 12 June 2006, 19:33:15:

> Hi all!

> I have a question about the nutch-default.xml configuration file. There is a
> parameter, db.default.fetch.interval, that is set by default to 30. It means
> that pages from the webdb are recrawled every 30 days
> <http://www.mail-archive.com/nutch-user@.../msg02058.html>. I want to know
> whether "recrawled" here means an automatic recrawl, or whether I have to
> execute some shell script before this period to make updates to my WebDB
> possible.

> I really want to know this because so far I have not actually obtained an
> update.

> Thanks a lot!


You have to recrawl the db manually.


--
Regards,
 Dima                          mailto:[hidden email]


Re: nutch-default.xml configuration

Stefan Groschupf-2
In reply to this post by Lourival Júnior
Hi Lourival,

This means all pages older than 30 days are potential candidates for
a fetch list that is created by the segment generation process.

Stefan
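
The selection rule described above can be sketched as a simple date comparison. This is an illustration only, not Nutch's actual code: is_due is a hypothetical function, times are Unix epoch seconds, and the optional fourth argument mimics the -adddays option used by the recrawl script later in this thread, on the assumption that it shifts "now" forward by that many days.

```shell
#!/bin/sh
# Illustration only (not Nutch code): decide whether a page is a candidate
# for the next fetch list.
# Usage: is_due LAST_FETCH_EPOCH INTERVAL_DAYS NOW_EPOCH [ADDDAYS]
is_due() {
  last_fetch=$1
  interval_days=$2
  now=$3
  adddays=${4:-0}
  day=86400                                         # seconds per day
  next_fetch=$(( last_fetch + interval_days * day ))
  effective_now=$(( now + adddays * day ))          # assumed -adddays effect
  [ "$effective_now" -ge "$next_fetch" ]
}

# Example: a page fetched 10 days ago, with a 30-day interval.
now=$(date +%s)
fetched=$(( now - 10 * 86400 ))
if is_due "$fetched" 30 "$now"; then echo "due"; else echo "not due yet"; fi
if is_due "$fetched" 30 "$now" 25; then echo "due with adddays=25"; fi
```

Under these assumptions the page is not selected on a plain run, but it becomes a candidate once adddays pushes the effective date past the 30-day mark. Either way, nothing is fetched until someone runs the generate step.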




Re: nutch-default.xml configuration

Lourival Júnior
In reply to this post by Nuther
OK. So, do you have any solution for doing this job automatically? I have a shell
script, but I can't tell yet whether it really works.

Sorry if I'm being redundant. I'm learning about this tool and I have a lot of
questions :).

Thanks!

On 6/12/06, Dima Mazmanov <[hidden email]> wrote:

>
> Hi,Lourival.
>
>
> You have to recrawl db manually.


--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [hidden email]
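
One way to tell whether a recrawl run did anything is to compare the newest segment directory before and after the run. This is a minimal sketch, assuming segment directories sort so that the last one is the newest (the recrawl script later in this thread relies on the same ls -d ... | tail idiom). newest_segment and segment_added are hypothetical helpers, and the paths in the example are placeholders.

```shell
#!/bin/sh
# Hypothetical helpers (not Nutch commands).

# Print the last segment directory in sorted order, or nothing if none exist.
newest_segment() {
  ls -d "$1"/* 2>/dev/null | tail -n 1
}

# Succeed if the newest segment now differs from the one recorded earlier.
segment_added() {
  segments_dir=$1
  before=$2
  after=$(newest_segment "$segments_dir")
  [ -n "$after" ] && [ "$after" != "$before" ]
}

# Example (placeholder paths):
#   before=$(newest_segment crawl/segments)
#   ./recrawl crawl 1
#   segment_added crawl/segments "$before" && echo "a new segment was created"
```

A new segment only shows that a fetch list was generated. Whether the search index reflects the changed pages still has to be checked separately.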

Re: nutch-default.xml configuration

Stefan Groschupf-2
> OK. So, do you have any solution for doing this job automatically? I have a
> shell script, but I can't tell yet whether it really works.

Shell scripts are the best solution.

> Sorry if I'm being redundant. I'm learning about this tool and I have a lot
> of questions :).

No problem, but the nutch-user mailing list would be a better place
to ask such questions.
Thanks!
Stefan
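
For unattended runs, a script like this is commonly started from cron. The following is a sketch, not something taken from this thread: the lock directory, the log path and the crontab line are placeholders, and run_locked is a hypothetical helper whose only job is to keep two runs from overlapping.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command unless a previous run is still active.
# mkdir is atomic, so the lock directory doubles as a simple mutex.
run_locked() {
  lock_dir=$1
  shift
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "could not take lock (previous run still active?), skipping" >&2
    return 1
  fi
  "$@"
  status=$?
  rmdir "$lock_dir"
  return "$status"
}

# Example crontab entry (placeholder paths), weekly on Sunday at 02:00,
# assuming this file is saved as an executable wrapper script:
#   0 2 * * 0  /path/to/recrawl-wrapper.sh >> /path/to/recrawl.log 2>&1
# where the wrapper ends with a call such as:
#   run_locked /tmp/nutch-recrawl.lock /path/to/recrawl /path/to/crawl
```

A lock directory is used here because a crawl can take longer than the gap between two scheduled runs.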


Re[2]: nutch-default.xml configuration

Nuther
In reply to this post by Lourival Júnior
Hi, Lourival.

What kind of shell script do you have?
You wrote on 12 June 2006, 19:51:06:

> OK. So, do you have any solution for doing this job automatically? I have a shell
> script, but I can't tell yet whether it really works.

> Sorry if I'm being redundant. I'm learning about this tool and I have a lot of
> questions :).

> Thanks!

--
Regards,
 Dima                          mailto:[hidden email]


Re: Re[2]: nutch-default.xml configuration

Lourival Júnior
Let me explain the problem. I have this shell script:

#!/bin/bash
# A simple script to run a Nutch re-crawl
if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

# Optional arguments: depth defaults to 5, adddays to 0
depth=${2:-5}
adddays=${3:-0}

webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth; i++))
do
  bin/nutch generate "$webdb_dir" "$segments_dir" -adddays "$adddays"
  # The segment just generated sorts last
  segment=$(ls -d "$segments_dir"/* | tail -n 1)
  bin/nutch fetch "$segment"
  bin/nutch updatedb "$webdb_dir" "$segment"
done

# Update segments
mkdir tmp
bin/nutch updatesegs "$webdb_dir" "$segments_dir" tmp
rm -R tmp

# Index the segments fetched in this run
for segment in $(ls -d "$segments_dir"/* | tail -n "$depth")
do
  bin/nutch index "$segment"
done

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup "$segments_dir" bogus

# Merge indexes
ls -d "$segments_dir"/* | xargs bin/nutch merge "$index_dir"

I got it from this web site
<http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html>.
I want to update a web page that was crawled with N links and now has M,
where M > N or M < N. It's a simple example, with a small set of files
linked from this page, but in a production environment it's very important.

I hope I am being clear. I'm Brazilian and I'm improving my English :).

Again, thanks a lot!

On 6/12/06, Dima Mazmanov <[hidden email]> wrote:

>
> Hi,Lourival.
>
> What kind of shell script do you have?

--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [hidden email]

Re[4]: nutch-default.xml configuration

Nuther
Hi, Lourival.

OK, after the first indexing you must merge the segments,
and if you want to reindex your db, you have to delete the segments which
are older than a predefined date, in your case 30 days.
This is my solution; if someone has a better one, please share your
experience!


--
Regards,
 Dima                          mailto:[hidden email]
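
The suggestion above, dropping segments older than the fetch interval, could be scripted along these lines. This is a sketch under assumptions: it judges age by each directory's modification time via find -mtime rather than by anything Nutch records, it relies on the -mindepth and -maxdepth options found in GNU and BSD find, and list_old_segments is a hypothetical helper. It only lists candidates so they can be reviewed before anything is deleted.

```shell
#!/bin/sh
# Hypothetical helper: list segment directories whose modification time is
# more than MAX_AGE_DAYS days old (default 30, matching the fetch interval
# discussed in this thread).
list_old_segments() {
  segments_dir=$1
  max_age_days=${2:-30}
  find "$segments_dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$max_age_days"
}

# Example (placeholder path): review the list first, then delete.
#   list_old_segments crawl/segments 30
#   list_old_segments crawl/segments 30 | while IFS= read -r seg; do rm -r "$seg"; done
```

Listing and deleting are kept as separate steps on purpose, since removing a segment that the current index still depends on would be hard to undo.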