Recrawl script for 0.8.0 completed...


mholt
Thanks for putting up with all the messages to the list... Here is the
recrawl script for 0.8.0 if anyone is interested.
        Matt
-------------------------------

#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at
# http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
# Modified by Matthew Holt

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi


# EDIT THIS - Set this to the location of your Nutch servlet container.
nutch_dir=/usr/local/apache-tomcat-5.5.17/webapps/nutch/

# No need to edit anything past this line #
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

# Update segments
bin/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
#ls -d $segments_dir/* | tail -$depth | xargs
bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*

# De-duplicate indexes
bin/nutch dedup $new_indexes

# Merge indexes
bin/nutch merge $index_dir $new_indexes

# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml

# Clean up
rm -rf $new_indexes
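
Run it from the top of your Nutch install (it calls bin/nutch by a
relative path). A typical invocation, per the usage line above, would be:

./recrawl crawl 5 0

i.e. crawl directory "crawl", depth 5, adddays 0.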


Re: Recrawl script for 0.8.0 completed...

Lourival Júnior
Hi Matt!

In the article at
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
you said the re-crawl script has a problem with updating the live search
index. In my tests with Nutch 0.7.2, when I run the script the index
cannot be updated because Tomcat has loaded it into memory. Could you
suggest a modification to this script, or to the NutchBean, that allows
the index to be updated without restarting Tomcat? (Currently I run
net stop "Apache Tomcat" before updating the index...)

Thanks

On 7/21/06, Matthew Holt <[hidden email]> wrote:

> [...]


--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [hidden email]

Re: Recrawl script for 0.8.0 completed...

Renaud Richardet-3
Hi Matt and Lourival,

Matt, thank you for the recrawl script. Any plans to commit it to trunk?

Lourival, here is the part of the script that "reloads Tomcat". It's not
the cleanest approach, but it should work:
# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml
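
If the touch doesn't trigger a reload in your setup, another option is
Tomcat's manager webapp, assuming it is deployed and you have a manager
user configured (the credentials and context path below are placeholders):

curl -u admin:secret "http://localhost:8080/manager/reload?path=/nutch"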

HTH,
Renaud


Lourival Júnior wrote:

> [...]

--
Renaud Richardet
COO America
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195                     mobile +1 617 230 9112
renaud.richardet <at> wyona.com              http://www.wyona.com


Re: Recrawl script for 0.8.0 completed...

mholt
Renaud Richardet wrote:

> [...]
I'll commit it to trunk; I just have to modify it a little first so users
don't have to edit the Tomcat location in the file and can pass it on the
command line instead. Kind of busy at work right now, so I'll follow up
later regarding the commit.
Matt

Please help... this has to be simple (re: mergesegs)

Honda-Search Administrator
I'm running a few commands every week to keep my Nutch crawl clean, but
I'm a bit confused about whether I'm doing it right.

I merge the segments using the following command:

bin/nutch mergesegs -dir crawl/segments/ -i -ds

This should index the new segment and delete the old ones, which it does.

After this, what do I do? Should I update the database? Merge indexes?
I've done this before, and after I updated the database (updatedb) it told
me I had over 260k records, while my segment only has around 80k records.

Please help; I want to be sure I'm not doing something wrong.

Matt
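
P.S. Based on the 0.8 recrawl script earlier in this thread, I'd guess the
remaining steps after the segment work look something like this (just my
reading of that script, untested here):

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/newindexes crawl/crawldb crawl/linkdb crawl/segments/*
bin/nutch dedup crawl/newindexes
bin/nutch merge crawl/index crawl/newindexes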


Re: Recrawl script for 0.8.0 completed...

Lourival Júnior
Hi Renaud!

I'm a newbie with shell scripts, and I know stopping the Tomcat service is
not the best way to do this. The problem is that when I run the re-crawl
script with Tomcat started, I get this error:

060721 132224 merging segment indexes to: crawl-legislacao2\index
Exception in thread "main" java.io.IOException: Cannot delete _0.f0
        at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
        at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)

So, I'm looking for a way to re-crawl my pages without this error and
without restarting Tomcat. Could you suggest one?

Thanks a lot!

On 7/21/06, Renaud Richardet <[hidden email]> wrote:

> [...]


--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [hidden email]

Re: Recrawl script for 0.8.0 completed...

mholt
Lourival Júnior wrote:

> [...]
Try this updated script, and tell me exactly what command you use to call
it. Then let me know the error message.

Matt


#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at
# http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
# Modified by Matthew Holt

if [ -n "$1" ]
then
  nutch_dir=$1
else
  echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
  echo "servlet_path - Path of the nutch servlet (i.e.
/usr/local/tomcat/webapps/ROOT)"
  echo "crawl_dir - Name of the directory the crawl is located in."
  echo "[depth] - The link depth from the root page that should be crawled."
  echo "[adddays] - Advance the clock # of days for fetchlist generation."
  exit 1
fi

if [ -n "$2" ]
then
  crawl_dir=$2
else
  echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
  echo "servlet_path - Path of the nutch servlet (i.e.
/usr/local/tomcat/webapps/ROOT)"
  echo "crawl_dir - Name of the directory the crawl is located in."
  echo "[depth] - The link depth from the root page that should be crawled."
  echo "[adddays] - Advance the clock # of days for fetchlist generation."
  exit 1
fi

if [ -n "$3" ]
then
  depth=$3
else
  depth=5
fi

if [ -n "$4" ]
then
  adddays=$4
else
  adddays=0
fi

# Only change if your crawl subdirectories are named something different
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

# Update segments
bin/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
#ls -d $segments_dir/* | tail -$depth | xargs
bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*

# De-duplicate indexes
bin/nutch dedup $new_indexes

# Merge indexes
bin/nutch merge $index_dir $new_indexes

# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml

# Clean up
rm -rf $new_indexes
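
With the new arguments, a typical call (using the example path from the
usage text above; adjust to your layout) would be:

./recrawl /usr/local/tomcat/webapps/ROOT crawl 5 0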


Re: Recrawl script for 0.8.0 completed...

Lourival Júnior
I think it won't work for me because I'm using Nutch version 0.7.2.
This is the script I actually use (comments translated from the original
Portuguese):

#!/bin/bash

# A simple script to run a Nutch re-crawl
# Script source:
# http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

#{

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi

webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index

# Stop the Tomcat service
#net stop "Apache Tomcat"

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
  echo
  echo "Fim do ciclo $i."
  echo
done

# Update segments
echo
echo "Atualizando os Segmentos..."
echo
mkdir tmp
bin/nutch updatesegs $webdb_dir $segments_dir tmp
rm -R tmp

# Index segments
echo "Indexando os segmentos..."
echo
for segment in `ls -d $segments_dir/* | tail -$depth`
do
  bin/nutch index $segment
done

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $segments_dir bogus

# Merge indexes
#echo "Unindo os segmentos..."
#echo
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir

chmod 777 -R $index_dir

# Start the Tomcat service
#net start "Apache Tomcat"

echo "Fim."

#} > recrawl.log 2>&1

As you suggested, I used the touch command instead of stopping Tomcat.
However, I get the error posted in my previous message. I'm running Nutch
on Windows under Cygwin, and I only avoid the error when I stop Tomcat.
I use this command to call the script:

./recrawl crawl-legislacao 1

Could you give me some more guidance?

Thanks a lot!

On 7/21/06, Matthew Holt <[hidden email]> wrote:

> [...]


--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [hidden email]

Re: Recrawl script for 0.8.0 completed...

mholt
Lourival Júnior wrote:

> [...]
Oh yeah, you're right, the one I sent out was for 0.8. You should just
be able to put this at the end of your script:

# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml

and fill in the appropriate path, of course. For example, the end of the
script might look like this (with the example Tomcat path from my 0.8
script; adjust to your install):
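
# Tell Tomcat to reload index
nutch_dir=/usr/local/apache-tomcat-5.5.17/webapps/nutch
touch $nutch_dir/WEB-INF/web.xml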
Good luck,
matt

Re: Recrawl script for 0.8.0 completed...

Lourival Júnior
OK. However, a few minutes ago I ran the script exactly as you said, and I
still get this error:

Exception in thread "main" java.io.IOException: Cannot delete _0.f0
        at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
        at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)

I don't know for sure, but I think it occurs because Nutch tries to delete
a file that Tomcat has loaded into memory, giving an access error. Any idea?
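
The only workaround I can think of (just a sketch, using the variable
names from your script; I have not got it working) would be to merge into
a fresh directory and swap it in, so the merge never deletes a file Tomcat
has open:

# Merge into a brand-new directory instead of the live index
bin/nutch merge $index_dir.new $new_indexes
# Swap the old index out and the new one in, then tell Tomcat to reload
mv $index_dir $index_dir.old
mv $index_dir.new $index_dir
touch $nutch_dir/WEB-INF/web.xml
rm -rf $index_dir.old

But I suspect Windows would refuse the rename too while Tomcat holds the
old files open.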

On 7/21/06, Matthew Holt <[hidden email]> wrote:

> [...]



--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [hidden email]

Hadoop and Recrawl

info-1247
Hi list,
I tried to use this script with Hadoop, but it doesn't work. I tried
replacing ls with bin/hadoop dfs -ls, but the script still fails because
it relies on ls -d, not plain ls.
Can someone help me?
Best regards,
Roberto Navoni

-----Original message-----
From: Matthew Holt [mailto:[hidden email]]
Sent: Friday, July 21, 2006 6:58 PM
To: [hidden email]
Subject: Re: Recrawl script for 0.8.0 completed...

[snip - message quoted in full earlier in the thread]






Re: Hadoop and Recrawl

Renaud Richardet-3
Hi Roberto,

Did you try http://wiki.apache.org/nutch/IntranetRecrawl (thanks to
Matthew Holt)?
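
If the wiki version doesn't fit, one possible substitution for the
script's ls -d line is to parse the DFS listing instead. A sketch only:
the output format of dfs -ls differs between Hadoop versions, so the
parsing may need adjusting, and it relies on segments being named by
timestamp (all digits):

segment=`bin/hadoop dfs -ls $segments_dir | grep -o "$segments_dir/[0-9]*" | sort | tail -1`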

HTH,
Renaud


Info wrote:

> [...]

--
Renaud Richardet
COO America
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195                     mobile +1 617 230 9112
renaud.richardet <at> wyona.com              http://www.wyona.com


R: Hadoop and Recrawl

info-1247


-----Messaggio originale-----
Da: Renaud Richardet [mailto:[hidden email]]
Inviato: venerdì 21 luglio 2006 22.24
A: [hidden email]
Oggetto: Re: Hadoop and Recrawl

Hi Roberto,

Did you try http://wiki.apache.org/nutch/IntranetRecrawl (thanks to
Matthew Holt)

HTH,
Renaud


Info wrote:

> Hi List
> I try to use this script with hadoop but don't work.
> I try to change ls with bin/hadoop dfs -ls
> But the script don't work because is ls -d and don't ls only.
> Someone can help me
> Best Regards
> Roberto Navoni
>
> -----Messaggio originale-----
> Da: Matthew Holt [mailto:[hidden email]]
> Inviato: venerdì 21 luglio 2006 18.58
> A: [hidden email]
> Oggetto: Re: Recrawl script for 0.8.0 completed...
>
> Lourival Júnior wrote:
>  
>> I thing it wont work with me because i'm using the Nutch version 0.7.2.
>> Actually I use this script (some comments are in Portuguese):
>>
>> #!/bin/bash
>>
>> # A simple script to run a Nutch re-crawl
>> # Fonte do script:
>> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>
>> #{
>>
>> if [ -n "$1" ]
>> then
>>  crawl_dir=$1
>> else
>>  echo "Usage: recrawl crawl_dir [depth] [adddays]"
>>  exit 1
>> fi
>>
>> if [ -n "$2" ]
>> then
>>  depth=$2
>> else
>>  depth=5
>> fi
>>
>> if [ -n "$3" ]
>> then
>>  adddays=$3
>> else
>>  adddays=0
>> fi
>>
>> webdb_dir=$crawl_dir/db
>> segments_dir=$crawl_dir/segments
>> index_dir=$crawl_dir/index
>>
>> #Para o serviço do TomCat
>> #net stop "Apache Tomcat"
>>
>> # The generate/fetch/update cycle
>> for ((i=1; i <= depth ; i++))
>> do
>>  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>>  segment=`ls -d $segments_dir/* | tail -1`
>>  bin/nutch fetch $segment
>>  bin/nutch updatedb $webdb_dir $segment
>>  echo
>>  echo "Fim do ciclo $i."
>>  echo
>> done
>>
>> # Update segments
>> echo
>> echo "Atualizando os Segmentos..."
>> echo
>> mkdir tmp
>> bin/nutch updatesegs $webdb_dir $segments_dir tmp
>> rm -R tmp
>>
>> # Index segments
>> echo "Indexando os segmentos..."
>> echo
>> for segment in `ls -d $segments_dir/* | tail -$depth`
>> do
>>  bin/nutch index $segment
>> done
>>
>> # De-duplicate indexes
>> # "bogus" argument is ignored but needed due to
>> # a bug in the number of args expected
>> bin/nutch dedup $segments_dir bogus
>>
>> # Merge indexes
>> #echo "Unindo os segmentos..."
>> #echo
>> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>>
>> chmod 777 -R $index_dir
>>
>> #Inicia o serviço do TomCat
>> #net start "Apache Tomcat"
>>
>> echo "Fim."
>>
>> #} > recrawl.log 2>&1
>>
>> How you suggested I used the touch command instead stops the tomcat.
>> However
>> I get that error posted in previous message. I'm running nutch in windows
>> plataform with cygwin. I only get no errors when I stops the tomcat. I
>> use
>> this command to call the script:
>>
>> ./recrawl crawl-legislacao 1
>>
>> Could you give me more clarifications?
>>
>> Thanks a lot!
>>
>> On 7/21/06, Matthew Holt <[hidden email]> wrote:
>>    
>>> Lourival Júnior wrote:
>>>      
>>>> Hi Renaud!
>>>>
>>>> I'm newbie with shell scripts and I know stops tomcat service is
>>>>        
>>> not the
>>>      
>>>> better way to do this. The problem is, when a run the re-crawl script
>>>> with
>>>> tomcat started I get this error:
>>>>
>>>> 060721 132224 merging segment indexes to: crawl-legislacao2\index
>>>> Exception in thread "main" java.io.IOException: Cannot delete _0.f0
>>>>        at
>>>> org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
>>>>        at
>>>>        
>>> org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
>>>      
>>>>        at
>>>> org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java
>>>> :141)
>>>>        at
>>>> org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
>>>>        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java
>>>>        
>>> :92)
>>>      
>>>>        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java
>>>>        
>>> :160)
>>>      
>>>> So, I want another way to re-crawl my pages without this error and
>>>> without
>>>> restarting the tomcat. Could you suggest one?
>>>>
>>>> Thanks a lot!
>>>>
>>>>
>>>>        
>>> Try this updated script and tell me what command exactly you run to call
>>> the script. Let me know the error message then.
>>>
>>> Matt
>>>
>>>
>>> #!/bin/bash
>>>
>>> # Nutch recrawl script.
>>> # Based on 0.7.2 script at
>>>
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

>>>      
>
>  
>>> # Modified by Matthew Holt
>>>
>>> if [ -n "$1" ]
>>> then
>>>   nutch_dir=$1
>>> else
>>>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>>>   echo "servlet_path - Path of the nutch servlet (i.e.
>>> /usr/local/tomcat/webapps/ROOT)"
>>>   echo "crawl_dir - Name of the directory the crawl is located in."
>>>   echo "[depth] - The link depth from the root page that should be
>>> crawled."
>>>   echo "[adddays] - Advance the clock # of days for fetchlist
>>> generation."
>>>   exit 1
>>> fi
>>>
>>> if [ -n "$2" ]
>>> then
>>>   crawl_dir=$2
>>> else
>>>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>>>   echo "servlet_path - Path of the nutch servlet (i.e.
>>> /usr/local/tomcat/webapps/ROOT)"
>>>   echo "crawl_dir - Name of the directory the crawl is located in."
>>>   echo "[depth] - The link depth from the root page that should be
>>> crawled."
>>>   echo "[adddays] - Advance the clock # of days for fetchlist
>>> generation."
>>>   exit 1
>>> fi
>>>
>>> if [ -n "$3" ]
>>> then
>>>   depth=$3
>>> else
>>>   depth=5
>>> fi
>>>
>>> if [ -n "$4" ]
>>> then
>>>   adddays=$4
>>> else
>>>   adddays=0
>>> fi
>>>
>>> # Only change if your crawl subdirectories are named something different
>>> webdb_dir=$crawl_dir/crawldb
>>> segments_dir=$crawl_dir/segments
>>> linkdb_dir=$crawl_dir/linkdb
>>> index_dir=$crawl_dir/index
>>>
>>> # The generate/fetch/update cycle
>>> for ((i=1; i <= depth ; i++))
>>> do
>>>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>>>   segment=`ls -d $segments_dir/* | tail -1`
>>>   bin/nutch fetch $segment
>>>   bin/nutch updatedb $webdb_dir $segment
>>> done
>>>
>>> # Update segments
>>> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
>>>
>>> # Index segments
>>> new_indexes=$crawl_dir/newindexes
>>> #ls -d $segments_dir/* | tail -$depth | xargs
>>> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
>>>
>>> # De-duplicate indexes
>>> bin/nutch dedup $new_indexes
>>>
>>> # Merge indexes
>>> bin/nutch merge $index_dir $new_indexes
>>>
>>> # Tell Tomcat to reload index
>>> touch $nutch_dir/WEB-INF/web.xml
>>>
>>> # Clean up
>>> rm -rf $new_indexes
>>>
>>>
>>>      
>>    
> Oh yeah, you're right, the one I sent out was for 0.8. You should just
> be able to put this at the end of your script:
>
> # Tell Tomcat to reload index
> touch $nutch_dir/WEB-INF/web.xml
>
> and fill in the appropriate path, of course.
> Good luck,
> Matt
>
>
>
>  

--
Renaud Richardet
COO America
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195                     mobile +1 617 230 9112
renaud.richardet <at> wyona.com              http://www.wyona.com





Reply | Threaded
Open this post in threaded view
|

HELP ME PLEASE R: Hadoop and Nutch 0.8

info-1247
Hi Renaud,
I tried that link, but it doesn't solve my problem.
The problem is that the nutch-0.8 nightly build doesn't use the Linux file
system but the Hadoop file system.
So when the script tries to find the name of a segment, it runs ls -d on
the local Linux file system.
Instead, I need a script that uses the Hadoop DFS, because my experimental
project requires it. I have 4 Linux servers where I installed Nutch, and I
use Hadoop to get a distributed file system.

My first problem is merging the index.
My second problem is that when I try to connect to the slave servers by
ssh, they ask me for the password. I have seen the online Nutch + Hadoop
tutorial...

Is there someone who can help me?
Best regards,
Roberto Navoni
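
For the password prompts, a likely fix (the standard approach the Hadoop
start scripts assume -- not spelled out in this thread, and the slave host
name below is just a placeholder) is passwordless ssh from the master to
each slave:

# On the master, as the user that runs bin/start-all.sh:
ssh-keygen -t rsa -P ""    # accept the default key file

# Append the public key on each slave (repeat per slave host):
cat ~/.ssh/id_rsa.pub | ssh root@LSearchDev02 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

# This should now log in without asking for a password:
ssh root@LSearchDev02 hostname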

-----Original message-----
From: Info [mailto:[hidden email]]
Sent: Saturday, July 22, 2006 10:09 AM
To: [hidden email]
Subject: R: Hadoop and Recrawl



-----Original message-----
From: Renaud Richardet [mailto:[hidden email]]
Sent: Friday, July 21, 2006 10:24 PM
To: [hidden email]
Subject: Re: Hadoop and Recrawl

Hi Roberto,

Did you try http://wiki.apache.org/nutch/IntranetRecrawl (thanks to
Matthew Holt)?

HTH,
Renaud


Info wrote:

> Hi list,
> I tried to use this script with Hadoop, but it doesn't work.
> I tried to change ls to bin/hadoop dfs -ls,
> but the script still fails because it uses ls -d, not plain ls.
> Can someone help me?
> Best regards,
> Roberto Navoni
>
> -----Original message-----
> From: Matthew Holt [mailto:[hidden email]]
> Sent: Friday, July 21, 2006 6:58 PM
> To: [hidden email]
> Subject: Re: Recrawl script for 0.8.0 completed...
>
> Lourival Júnior wrote:
>  
>> I think it won't work for me because I'm using Nutch version 0.7.2.
>> Actually I use this script (comments translated from the original Portuguese):
>>
>> #!/bin/bash
>>
>> # A simple script to run a Nutch re-crawl
>> # Script source:
>> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>
>> #{
>>
>> if [ -n "$1" ]
>> then
>>  crawl_dir=$1
>> else
>>  echo "Usage: recrawl crawl_dir [depth] [adddays]"
>>  exit 1
>> fi
>>
>> if [ -n "$2" ]
>> then
>>  depth=$2
>> else
>>  depth=5
>> fi
>>
>> if [ -n "$3" ]
>> then
>>  adddays=$3
>> else
>>  adddays=0
>> fi
>>
>> webdb_dir=$crawl_dir/db
>> segments_dir=$crawl_dir/segments
>> index_dir=$crawl_dir/index
>>
>> # Stop the Tomcat service
>> #net stop "Apache Tomcat"
>>
>> # The generate/fetch/update cycle
>> for ((i=1; i <= depth ; i++))
>> do
>>  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>>  segment=`ls -d $segments_dir/* | tail -1`
>>  bin/nutch fetch $segment
>>  bin/nutch updatedb $webdb_dir $segment
>>  echo
>>  echo "Fim do ciclo $i."
>>  echo
>> done
>>
>> # Update segments
>> echo
>> echo "Atualizando os Segmentos..."
>> echo
>> mkdir tmp
>> bin/nutch updatesegs $webdb_dir $segments_dir tmp
>> rm -R tmp
>>
>> # Index segments
>> echo "Indexando os segmentos..."
>> echo
>> for segment in `ls -d $segments_dir/* | tail -$depth`
>> do
>>  bin/nutch index $segment
>> done
>>
>> # De-duplicate indexes
>> # "bogus" argument is ignored but needed due to
>> # a bug in the number of args expected
>> bin/nutch dedup $segments_dir bogus
>>
>> # Merge indexes
>> #echo "Unindo os segmentos..."
>> #echo
>> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>>
>> chmod 777 -R $index_dir
>>
>> # Start the Tomcat service
>> #net start "Apache Tomcat"
>>
>> echo "Fim."
>>
>> #} > recrawl.log 2>&1
>>
>> As you suggested, I used the touch command instead of stopping Tomcat.
>> However, I still get the error posted in my previous message. I'm running
>> Nutch on the Windows platform with Cygwin, and I only get no errors when
>> I stop Tomcat. I use this command to call the script:
>>
>> ./recrawl crawl-legislacao 1
>>
>> Could you give me more clarification?
>>
>> Thanks a lot!
>>
>> On 7/21/06, Matthew Holt <[hidden email]> wrote:
>>    
>>> Lourival Júnior wrote:
>>>      
>>>> Hi Renaud!
>>>>
>>>> I'm a newbie with shell scripts, and I know stopping the Tomcat service
>>>> is not the best way to do this. The problem is, when I run the re-crawl
>>>> script with Tomcat started I get this error:
>>>>
>>>> 060721 132224 merging segment indexes to: crawl-legislacao2\index
>>>> Exception in thread "main" java.io.IOException: Cannot delete _0.f0
>>>>        at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
>>>>        at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
>>>>        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
>>>>        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
>>>>        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
>>>>        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
>>>>
>>>> So, I want another way to re-crawl my pages without this error and
>>>> without restarting Tomcat. Could you suggest one?
>>>>
>>>> Thanks a lot!
>>>>
>>>>
>>>>        
>>> Try this updated script and tell me exactly what command you run to call
>>> the script. Let me know the error message then.
>>>
>>> Matt
>>>
>>>
>>> #!/bin/bash
>>>
>>> # Nutch recrawl script.
>>> # Based on 0.7.2 script at
>>> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>> # Modified by Matthew Holt
>>>
>>> if [ -n "$1" ]
>>> then
>>>   nutch_dir=$1
>>> else
>>>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>>>   echo "servlet_path - Path of the nutch servlet (i.e.
>>> /usr/local/tomcat/webapps/ROOT)"
>>>   echo "crawl_dir - Name of the directory the crawl is located in."
>>>   echo "[depth] - The link depth from the root page that should be
>>> crawled."
>>>   echo "[adddays] - Advance the clock # of days for fetchlist
>>> generation."
>>>   exit 1
>>> fi
>>>
>>> if [ -n "$2" ]
>>> then
>>>   crawl_dir=$2
>>> else
>>>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>>>   echo "servlet_path - Path of the nutch servlet (i.e.
>>> /usr/local/tomcat/webapps/ROOT)"
>>>   echo "crawl_dir - Name of the directory the crawl is located in."
>>>   echo "[depth] - The link depth from the root page that should be
>>> crawled."
>>>   echo "[adddays] - Advance the clock # of days for fetchlist
>>> generation."
>>>   exit 1
>>> fi
>>>
>>> if [ -n "$3" ]
>>> then
>>>   depth=$3
>>> else
>>>   depth=5
>>> fi
>>>
>>> if [ -n "$4" ]
>>> then
>>>   adddays=$4
>>> else
>>>   adddays=0
>>> fi
>>>
>>> # Only change if your crawl subdirectories are named something different
>>> webdb_dir=$crawl_dir/crawldb
>>> segments_dir=$crawl_dir/segments
>>> linkdb_dir=$crawl_dir/linkdb
>>> index_dir=$crawl_dir/index
>>>
>>> # The generate/fetch/update cycle
>>> for ((i=1; i <= depth ; i++))
>>> do
>>>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>>>   segment=`ls -d $segments_dir/* | tail -1`
>>>   bin/nutch fetch $segment
>>>   bin/nutch updatedb $webdb_dir $segment
>>> done
>>>
>>> # Update segments
>>> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
>>>
>>> # Index segments
>>> new_indexes=$crawl_dir/newindexes
>>> #ls -d $segments_dir/* | tail -$depth | xargs
>>> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
>>>
>>> # De-duplicate indexes
>>> bin/nutch dedup $new_indexes
>>>
>>> # Merge indexes
>>> bin/nutch merge $index_dir $new_indexes
>>>
>>> # Tell Tomcat to reload index
>>> touch $nutch_dir/WEB-INF/web.xml
>>>
>>> # Clean up
>>> rm -rf $new_indexes
>>>
>>>
>>>      
>>    
> Oh yeah, you're right, the one I sent out was for 0.8. You should just
> be able to put this at the end of your script:
>
> # Tell Tomcat to reload index
> touch $nutch_dir/WEB-INF/web.xml
>
> and fill in the appropriate path, of course.
> Good luck,
> Matt
>
>
>
>  

--
Renaud Richardet
COO America
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195                     mobile +1 617 230 9112
renaud.richardet <at> wyona.com              http://www.wyona.com








Reply | Threaded
Open this post in threaded view
|

This is my tutorial for Hadoop + Nutch 0.8; I'm looking for a recrawl script tutorial for Nutch + Hadoop

info-1247
In reply to this post by info-1247
Tutorial Nutch 0.8 and Hadoop

This tutorial is derived from the Hadoop + Nutch tutorial and other 0.8
tutorials found on the wiki site and on Google, and it "works fine!!!"
Now I am working on a recrawl tutorial.


#Format the hadoop namenode


root@LSearchDev01:/nutch/search# bin/hadoop namenode -format
Re-format filesystem in /nutch/filesystem/name ? (Y or N) Y
Formatted /nutch/filesystem/name


#Start Hadoop

root@LSearchDev01:/nutch/search# bin/start-all.sh
namenode running as process 16789.
root@lsearchdev01's password:
jobtracker running as process 16866.
root@lsearchdev01's password:
LSearchDev01: starting tasktracker, logging
to /nutch/search/logs/hadoop-root-tasktracker-LSearchDev01.out

#ls on hadoop file systems

root@LSearchDev01:/nutch/search#
root@LSearchDev01:/nutch/search# bin/hadoop dfs -ls
Found 0 items

# Hadoop works fine


# Use vi to add your site, in the http://www.yoursite.com format

root@LSearchDev01:/nutch/search# vi urls.txt


# Make urls directory on hadoop file system

root@LSearchDev01:/nutch/search# bin/hadoop dfs -mkdir urls

# Copy urls.txt file from linux file system to hadoop file system
root@LSearchDev01:/nutch/search# bin/hadoop dfs -copyFromLocal urls.txt
urls/urls.txt

# List the file on hadoop file system
root@LSearchDev01:/nutch/search# bin/hadoop dfs -lsr /user/root/urls
/user/root/urls <dir>
/user/root/urls/urls.txt        <r 2>   41


# If you want to delete the old urls file on the Hadoop file system
# and put up a new one, use the following commands

root@LSearchDev01:/nutch/search# bin/hadoop dfs
-rm /user/root/urls/urls.txt
Deleted /user/root/urls/urls.txt
root@LSearchDev01:/nutch/search# bin/hadoop dfs -copyFromLocal urls.txt
urls/urls.txt

# Inject the urls in urls.txt into the <crawld> database

root@LSearchDev01:/nutch/search# bin/nutch inject crawld urls

# (*) if you want to see the status of the job, go to:
http://127.0.0.1:50030


# This is the new state of your Hadoop file system now
 
root@LSearchDev01:/nutch/search# bin/hadoop dfs -lsr
/user/root/crawld       <dir>
/user/root/crawld/current       <dir>
/user/root/crawld/current/part-00000    <dir>
/user/root/crawld/current/part-00000/data       <r 2>   62
/user/root/crawld/current/part-00000/index      <r 2>   33
/user/root/crawld/current/part-00001    <dir>
/user/root/crawld/current/part-00001/data       <r 2>   62
/user/root/crawld/current/part-00001/index      <r 2>   33
/user/root/crawld/current/part-00002    <dir>
/user/root/crawld/current/part-00002/data       <r 2>   124
/user/root/crawld/current/part-00002/index      <r 2>   74
/user/root/crawld/current/part-00003    <dir>
/user/root/crawld/current/part-00003/data       <r 2>   181
/user/root/crawld/current/part-00003/index      <r 2>   74
/user/root/urls <dir>
/user/root/urls/urls.txt        <r 2>   64

# Now you can generate the fetchlist for the fetch job
root@LSearchDev01:/nutch/search# bin/nutch
generate /user/root/crawld /user/root/crawld/segments

# (*) if you want to see what are the statu of job going to:
http://127.0.0.1:50030

# This /user/root/crawld/segments/20060722130642 is the name of the
segment that you want to fetch

root@LSearchDev01:/nutch/search# bin/hadoop dfs
-ls /user/root/crawld/segments
Found 1 items
/user/root/crawld/segments/20060722130642       <dir>
root@LSearchDev01:/nutch/search#
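
# (not in the original tutorial -- a hedged convenience so you don't have
# to type the segment name by hand; it assumes the newest segment sorts
# last in the dfs -ls output, as in the listing above)
segment=`bin/hadoop dfs -ls /user/root/crawld/segments | tail -1 | awk '{print $1}'`
echo $segment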

# Fetch the sites listed in urls.txt

root@LSearchDev01:/nutch/search# bin/nutch
fetch /user/root/crawld/segments/20060722130642


# (*) if you want to see the status of the job, go to:
http://127.0.0.1:50030


# This is what is on your Hadoop file system now

root@LSearchDev01:/nutch/search# bin/hadoop dfs -lsr /user/root/crawld
<dir>
/user/root/crawld/current       <dir>
/user/root/crawld/current/part-00000    <dir>
/user/root/crawld/current/part-00000/data       <r 2>   62
/user/root/crawld/current/part-00000/index      <r 2>   33
/user/root/crawld/current/part-00001    <dir>
/user/root/crawld/current/part-00001/data       <r 2>   62
/user/root/crawld/current/part-00001/index      <r 2>   33
/user/root/crawld/current/part-00002    <dir>
/user/root/crawld/current/part-00002/data       <r 2>   124
/user/root/crawld/current/part-00002/index      <r 2>   74
/user/root/crawld/current/part-00003    <dir>
/user/root/crawld/current/part-00003/data       <r 2>   181
/user/root/crawld/current/part-00003/index      <r 2>   74
/user/root/crawld/segments      <dir>
/user/root/crawld/segments/20060722130642       <dir>
/user/root/crawld/segments/20060722130642/content       <dir>
/user/root/crawld/segments/20060722130642/content/part-00000    <dir>
/user/root/crawld/segments/20060722130642/content/part-00000/data
<r 2>  62
/user/root/crawld/segments/20060722130642/content/part-00000/index
<r 2>  33
/user/root/crawld/segments/20060722130642/content/part-00001    <dir>
/user/root/crawld/segments/20060722130642/content/part-00001/data
<r 2>  62
/user/root/crawld/segments/20060722130642/content/part-00001/index
<r 2>  33
/user/root/crawld/segments/20060722130642/content/part-00002    <dir>
/user/root/crawld/segments/20060722130642/content/part-00002/data
<r 2>  2559
/user/root/crawld/segments/20060722130642/content/part-00002/index
<r 2>  74
/user/root/crawld/segments/20060722130642/content/part-00003    <dir>
/user/root/crawld/segments/20060722130642/content/part-00003/data
<r 2>  6028
/user/root/crawld/segments/20060722130642/content/part-00003/index
<r 2>  74
/user/root/crawld/segments/20060722130642/crawl_fetch   <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000
<dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000/data
<r 2>  62
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000/index
<r 2>  33
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001
<dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001/data
<r 2>  62
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001/index
<r 2>  33
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002
<dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002/data
<r 2>  140
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002/index
<r 2>  74
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003
<dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003/data
<r 2>  213
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003/index
<r 2>  74
/user/root/crawld/segments/20060722130642/crawl_generate        <dir>
/user/root/crawld/segments/20060722130642/crawl_generate/part-00000
<r 2>  119
/user/root/crawld/segments/20060722130642/crawl_generate/part-00001
<r 2>  124
/user/root/crawld/segments/20060722130642/crawl_generate/part-00002
<r 2>  124
/user/root/crawld/segments/20060722130642/crawl_generate/part-00003
<r 2>  62
/user/root/crawld/segments/20060722130642/crawl_parse   <dir>
/user/root/crawld/segments/20060722130642/crawl_parse/part-00000
<r 2>  62
/user/root/crawld/segments/20060722130642/crawl_parse/part-00001
<r 2>  62
/user/root/crawld/segments/20060722130642/crawl_parse/part-00002
<r 2>  784
/user/root/crawld/segments/20060722130642/crawl_parse/part-00003
<r 2>  1698
/user/root/crawld/segments/20060722130642/parse_data    <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00000 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00000/data
<r 2>  61
/user/root/crawld/segments/20060722130642/parse_data/part-00000/index
<r 2>  33
/user/root/crawld/segments/20060722130642/parse_data/part-00001 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00001/data
<r 2>  61
/user/root/crawld/segments/20060722130642/parse_data/part-00001/index
<r 2>  33
/user/root/crawld/segments/20060722130642/parse_data/part-00002 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00002/data
<r 2>  839
/user/root/crawld/segments/20060722130642/parse_data/part-00002/index
<r 2>  74
/user/root/crawld/segments/20060722130642/parse_data/part-00003 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00003/data
<r 2>  1798
/user/root/crawld/segments/20060722130642/parse_data/part-00003/index
<r 2>  74
/user/root/crawld/segments/20060722130642/parse_text    <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00000 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00000/data
<r 2>  61
/user/root/crawld/segments/20060722130642/parse_text/part-00000/index
<r 2>  33
/user/root/crawld/segments/20060722130642/parse_text/part-00001 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00001/data
<r 2>  61
/user/root/crawld/segments/20060722130642/parse_text/part-00001/index
<r 2>  33
/user/root/crawld/segments/20060722130642/parse_text/part-00002 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00002/data
<r 2>  377
/user/root/crawld/segments/20060722130642/parse_text/part-00002/index
<r 2>  74
/user/root/crawld/segments/20060722130642/parse_text/part-00003 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00003/data
<r 2>  811
/user/root/crawld/segments/20060722130642/parse_text/part-00003/index
<r 2>  74
/user/root/urls <dir>
/user/root/urls/urls.txt        <r 2>   64

# Now you need to run the invertlinks job

root@LSearchDev01:/nutch/search# bin/nutch
invertlinks /user/root/crawld/linkdb /user/root/crawld/segments/20060722130642

#And at the end you need to build your index

root@LSearchDev01:/nutch/search# bin/nutch
index /user/root/crawld/indexes /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722130642

root@LSearchDev01:/nutch/search# bin/hadoop dfs -ls /user/root/crawld
Found 4 items
/user/root/crawld/current       <dir>
/user/root/crawld/indexes       <dir>
/user/root/crawld/linkdb        <dir>
/user/root/crawld/segments      <dir>
root@LSearchDev01:/nutch/search#
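
# (a hedged aside, not in the original post: the indexes built above are
# not de-duplicated yet; something like this, run before starting Tomcat,
# should take care of it)
bin/nutch dedup /user/root/crawld/indexes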

At the end of your hard work you will have these directories on your
Hadoop file system.

So you are ready to start Tomcat.
Before you start Tomcat, remember to change the path of your search
directory in the nutch-site.xml file in the webapps/ROOT/WEB-INF/classes
directory.

#This is an example of my configuration

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>LSearchDev01:9000</value>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>/user/root/crawld</value>
  </property>

</configuration>

I hope this helps someone build their first search engine on Nutch 0.8 +
Hadoop :)

Best crawling
Roberto Navoni
 
Reply | Threaded
Open this post in threaded view
|

Nutch to...Frutch

Hans Vallden
Greetings All!

I recently became interested in search technology and more
specifically Nutch. So, I'm a newbie by all standards. Don't hesitate
to treat me as one. :)

My vision would be to build a Froogle-like ecommerce search engine,
possibly using Nutch. I am wondering if anyone on this list has ever
pondered the same idea? I would be very interested in hearing
thoughts and experiences. Don't hesitate to contact me off the list,
if you feel it to be more appropriate.


--
Hans Vallden
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Hadoop and Inject and Recrawl hadoop and nutch v0.8 WORK FINE!!!!

Roberto Navoni
In reply to this post by info-1247
Tutorial Nutch 0.8 and Hadoop

This tutorial is derived from the Hadoop + Nutch tutorial and other 0.8
tutorials found on the wiki site and on Google, and it "works fine!!!"
At the end of the tutorial you will also find a recrawl tutorial and how
to rebuild the index.


#Format the hadoop namenode


root@LSearchDev01:/nutch/search# bin/hadoop namenode -format
Re-format filesystem in /nutch/filesystem/name ? (Y or N) Y
Formatted /nutch/filesystem/name


#Start Hadoop

root@LSearchDev01:/nutch/search# bin/start-all.sh
namenode running as process 16789.
root@lsearchdev01's password:
jobtracker running as process 16866.
root@lsearchdev01's password:
LSearchDev01: starting tasktracker, logging
to /nutch/search/logs/hadoop-root-tasktracker-LSearchDev01.out

#ls on hadoop file systems

root@LSearchDev01:/nutch/search#
root@LSearchDev01:/nutch/search# bin/hadoop dfs -ls
Found 0 items

# Hadoop works fine


# Use vi to add your site, in the http://www.yoursite.com format

root@LSearchDev01:/nutch/search# vi urls.txt


# Make urls directory on hadoop file system

root@LSearchDev01:/nutch/search# bin/hadoop dfs -mkdir urls

# Copy urls.txt file from linux file system to hadoop file system
root@LSearchDev01:/nutch/search# bin/hadoop dfs -copyFromLocal urls.txt
urls/urls.txt

# List the file on hadoop file system
root@LSearchDev01:/nutch/search# bin/hadoop dfs -lsr /user/root/urls
/user/root/urls <dir>
/user/root/urls/urls.txt        <r 2>   41


# If you want to delete the old urls file on the Hadoop file system
# and put up a new one, use the following commands

root@LSearchDev01:/nutch/search# bin/hadoop dfs
-rm /user/root/urls/urls.txt
Deleted /user/root/urls/urls.txt
root@LSearchDev01:/nutch/search# bin/hadoop dfs -copyFromLocal urls.txt
urls/urls.txt

# Inject the urls in urls.txt into the <crawld> database

root@LSearchDev01:/nutch/search# bin/nutch inject crawld urls

# (*) if you want to see the status of the job, go to:
http://127.0.0.1:50030


# This is the new state of your Hadoop file system now
 
root@LSearchDev01:/nutch/search# bin/hadoop dfs -lsr
/user/root/crawld       <dir>
/user/root/crawld/current       <dir>
/user/root/crawld/current/part-00000    <dir>
/user/root/crawld/current/part-00000/data       <r 2>   62
/user/root/crawld/current/part-00000/index      <r 2>   33
/user/root/crawld/current/part-00001    <dir>
/user/root/crawld/current/part-00001/data       <r 2>   62
/user/root/crawld/current/part-00001/index      <r 2>   33
/user/root/crawld/current/part-00002    <dir>
/user/root/crawld/current/part-00002/data       <r 2>   124
/user/root/crawld/current/part-00002/index      <r 2>   74
/user/root/crawld/current/part-00003    <dir>
/user/root/crawld/current/part-00003/data       <r 2>   181
/user/root/crawld/current/part-00003/index      <r 2>   74
/user/root/urls <dir>
/user/root/urls/urls.txt        <r 2>   64

# Now you can generate the fetchlist for the fetch job
root@LSearchDev01:/nutch/search# bin/nutch
generate /user/root/crawld /user/root/crawld/segments

# (*) if you want to see the status of the job, go to:
http://127.0.0.1:50030

# This /user/root/crawld/segments/20060722130642 is the name of the
segment that you want to fetch

root@LSearchDev01:/nutch/search# bin/hadoop dfs
-ls /user/root/crawld/segments
Found 1 items
/user/root/crawld/segments/20060722130642       <dir>
root@LSearchDev01:/nutch/search#

# Fetch the sites listed in urls.txt

root@LSearchDev01:/nutch/search# bin/nutch
fetch /user/root/crawld/segments/20060722130642


# (*) if you want to see the status of the job, go to:
http://127.0.0.1:50030


# This is what is on your Hadoop file system now

root@LSearchDev01:/nutch/search# bin/hadoop dfs -lsr /user/root/crawld
<dir>
/user/root/crawld/current       <dir>
/user/root/crawld/current/part-00000    <dir>
/user/root/crawld/current/part-00000/data       <r 2>   62
/user/root/crawld/current/part-00000/index      <r 2>   33
/user/root/crawld/current/part-00001    <dir>
/user/root/crawld/current/part-00001/data       <r 2>   62
/user/root/crawld/current/part-00001/index      <r 2>   33
/user/root/crawld/current/part-00002    <dir>
/user/root/crawld/current/part-00002/data       <r 2>   124
/user/root/crawld/current/part-00002/index      <r 2>   74
/user/root/crawld/current/part-00003    <dir>
/user/root/crawld/current/part-00003/data       <r 2>   181
/user/root/crawld/current/part-00003/index      <r 2>   74
/user/root/crawld/segments      <dir>
/user/root/crawld/segments/20060722130642       <dir>
/user/root/crawld/segments/20060722130642/content       <dir>
/user/root/crawld/segments/20060722130642/content/part-00000    <dir>
/user/root/crawld/segments/20060722130642/content/part-00000/data
<r 2>  62
/user/root/crawld/segments/20060722130642/content/part-00000/index
<r 2>  33
/user/root/crawld/segments/20060722130642/content/part-00001    <dir>
/user/root/crawld/segments/20060722130642/content/part-00001/data
<r 2>  62
/user/root/crawld/segments/20060722130642/content/part-00001/index
<r 2>  33
/user/root/crawld/segments/20060722130642/content/part-00002    <dir>
/user/root/crawld/segments/20060722130642/content/part-00002/data
<r 2>  2559
/user/root/crawld/segments/20060722130642/content/part-00002/index
<r 2>  74
/user/root/crawld/segments/20060722130642/content/part-00003    <dir>
/user/root/crawld/segments/20060722130642/content/part-00003/data
<r 2>  6028
/user/root/crawld/segments/20060722130642/content/part-00003/index
<r 2>  74
/user/root/crawld/segments/20060722130642/crawl_fetch   <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000
<dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000/data
<r 2>  62
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000/index
<r 2>  33
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001
<dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001/data
<r 2>  62
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001/index
<r 2>  33
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002
<dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002/data
<r 2>  140
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002/index
<r 2>  74
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003
<dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003/data
<r 2>  213
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003/index
<r 2>  74
/user/root/crawld/segments/20060722130642/crawl_generate        <dir>
/user/root/crawld/segments/20060722130642/crawl_generate/part-00000
<r 2>  119
/user/root/crawld/segments/20060722130642/crawl_generate/part-00001
<r 2>  124
/user/root/crawld/segments/20060722130642/crawl_generate/part-00002
<r 2>  124
/user/root/crawld/segments/20060722130642/crawl_generate/part-00003
<r 2>  62
/user/root/crawld/segments/20060722130642/crawl_parse   <dir>
/user/root/crawld/segments/20060722130642/crawl_parse/part-00000
<r 2>  62
/user/root/crawld/segments/20060722130642/crawl_parse/part-00001
<r 2>  62
/user/root/crawld/segments/20060722130642/crawl_parse/part-00002
<r 2>  784
/user/root/crawld/segments/20060722130642/crawl_parse/part-00003
<r 2>  1698
/user/root/crawld/segments/20060722130642/parse_data    <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00000 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00000/data
<r 2>  61
/user/root/crawld/segments/20060722130642/parse_data/part-00000/index
<r 2>  33
/user/root/crawld/segments/20060722130642/parse_data/part-00001 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00001/data
<r 2>  61
/user/root/crawld/segments/20060722130642/parse_data/part-00001/index
<r 2>  33
/user/root/crawld/segments/20060722130642/parse_data/part-00002 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00002/data
<r 2>  839
/user/root/crawld/segments/20060722130642/parse_data/part-00002/index
<r 2>  74
/user/root/crawld/segments/20060722130642/parse_data/part-00003 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00003/data
<r 2>  1798
/user/root/crawld/segments/20060722130642/parse_data/part-00003/index
<r 2>  74
/user/root/crawld/segments/20060722130642/parse_text    <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00000 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00000/data
<r 2>  61
/user/root/crawld/segments/20060722130642/parse_text/part-00000/index
<r 2>  33
/user/root/crawld/segments/20060722130642/parse_text/part-00001 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00001/data
<r 2>  61
/user/root/crawld/segments/20060722130642/parse_text/part-00001/index
<r 2>  33
/user/root/crawld/segments/20060722130642/parse_text/part-00002 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00002/data
<r 2>  377
/user/root/crawld/segments/20060722130642/parse_text/part-00002/index
<r 2>  74
/user/root/crawld/segments/20060722130642/parse_text/part-00003 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00003/data
<r 2>  811
/user/root/crawld/segments/20060722130642/parse_text/part-00003/index
<r 2>  74
/user/root/urls <dir>
/user/root/urls/urls.txt        <r 2>   64

# Now you need to run the invertlinks job

root@LSearchDev01:/nutch/search# bin/nutch
invertlinks /user/root/crawld/linkdb /user/root/crawld/segments/20060722130642

#And at the end you need to build your index

root@LSearchDev01:/nutch/search# bin/nutch
index /user/root/crawld/indexes /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722130642

root@LSearchDev01:/nutch/search# bin/hadoop dfs -ls /user/root/crawld
Found 4 items
/user/root/crawld/current       <dir>
/user/root/crawld/indexes       <dir>
/user/root/crawld/linkdb        <dir>
/user/root/crawld/segments      <dir>
root@LSearchDev01:/nutch/search#

At the end of your hard work you will have these directories on your
Hadoop file system.

So you are ready to start Tomcat.
Before you start Tomcat, remember to change the path of your search
directory in the nutch-site.xml file in the webapps/ROOT/WEB-INF/classes
directory.

#This is an example of my configuration

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>LSearchDev01:9000</value>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>/user/root/crawld</value>
  </property>

</configuration>


#RECRAWL AND NEW INJECT

# Create a new indexe0
bin/nutch
index /user/root/crawld/indexe0 /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722153133
 
# Create a new indexe1
bin/nutch
index /user/root/crawld/indexe1 /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722182213

#Dedup the new indexe0
bin/nutch dedup /user/root/crawld/indexe0

#Dedup the new indexe1
bin/nutch dedup /user/root/crawld/indexe1

#Delete the old index
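# (the original post leaves this step without a command; assuming the
# merged master index lives at /user/root/crawld/index, something like
# this should do it -- use -rmr on Hadoop versions where -rm refuses a
# directory)
bin/hadoop dfs -rm /user/root/crawld/index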


# Merge the new indexes into the master index directory

bin/nutch
merge /user/root/crawld/index /user/root/crawld/indexe0 /user/root/crawld/indexe1 ... #(and the other indexes created for the fetched segments)

# index is the standard directory in the crawld (DB) where the merged
# master index lives
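
# A sketch (not from the original post) of how the generate/fetch/update
# loop from the earlier recrawl scripts can be made DFS-aware, assuming
# the paths used in this tutorial and that the newest segment sorts last
# in the dfs -ls output:

crawl=/user/root/crawld
depth=2

for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $crawl $crawl/segments
  # newest segment = last line of the DFS listing, first column
  segment=`bin/hadoop dfs -ls $crawl/segments | tail -1 | awk '{print $1}'`
  bin/nutch fetch $segment
  bin/nutch updatedb $crawl $segment
done

# then invertlinks, index, dedup, and merge as shown above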



I hope this helps someone build their first search engine on Nutch 0.8 +
Hadoop :)

Best crawling
Roberto Navoni
Reply | Threaded
Open this post in threaded view
|

Re: Recrawl script for 0.8.0 completed...

Thomas Delnoij-3
In reply to this post by Lourival Júnior
Lourival.

I have typically seen the same issues on a cygwin/windows setup. The
only thing that worked for me was shutting down and restarting tomcat,
instead of just reloading the context. On Linux I don't have these
issues anymore.

Rgrds, Thomas
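
For what it's worth, the pattern below (a sketch, not something from this
thread) sidesteps the delete inside the Lucene merge by never merging over
the live index: build the merged index in a fresh directory, then swap it
in. On Windows the final swap may still need a brief Tomcat stop, but the
expensive merge itself happens offline. The paths are the ones from Matt's
script.

# Merge into a brand-new directory instead of the live index
new_index=$crawl_dir/index-`date +%Y%m%d%H%M%S`
bin/nutch merge $new_index $crawl_dir/newindexes

# Swap the new index in, keeping the old one as a backup
mv $crawl_dir/index $crawl_dir/index.old
mv $new_index $crawl_dir/index

# Ask Tomcat to reload the webapp so the searcher reopens the index
touch $nutch_dir/WEB-INF/web.xml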

On 7/21/06, Lourival Júnior <[hidden email]> wrote:

> OK. However, a few minutes ago I ran the script exactly as you said, and I
> still get this error:
>
> Exception in thread "main" java.io.IOException: Cannot delete _0.f0
>         at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
>         at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
>         at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
>         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
>         at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
>         at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
>
> I don't know, but I think it occurs because Nutch tries to delete some file
> that Tomcat has loaded into memory, giving a permission/access error. Any idea?
>
> On 7/21/06, Matthew Holt <[hidden email]> wrote:
> >
> > Lourival Júnior wrote:
> > > I think it won't work for me because I'm using Nutch version 0.7.2.
> > > Actually I use this script (comments translated from the original Portuguese):
> > >
> > > #!/bin/bash
> > >
> > > # A simple script to run a Nutch re-crawl
> > > # Script source:
> > > http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
> > >
> > > #{
> > >
> > > if [ -n "$1" ]
> > > then
> > >  crawl_dir=$1
> > > else
> > >  echo "Usage: recrawl crawl_dir [depth] [adddays]"
> > >  exit 1
> > > fi
> > >
> > > if [ -n "$2" ]
> > > then
> > >  depth=$2
> > > else
> > >  depth=5
> > > fi
> > >
> > > if [ -n "$3" ]
> > > then
> > >  adddays=$3
> > > else
> > >  adddays=0
> > > fi
> > >
> > > webdb_dir=$crawl_dir/db
> > > segments_dir=$crawl_dir/segments
> > > index_dir=$crawl_dir/index
> > >
> > > # Stop the Tomcat service
> > > #net stop "Apache Tomcat"
> > >
> > > # The generate/fetch/update cycle
> > > for ((i=1; i <= depth ; i++))
> > > do
> > >  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
> > >  segment=`ls -d $segments_dir/* | tail -1`
> > >  bin/nutch fetch $segment
> > >  bin/nutch updatedb $webdb_dir $segment
> > >  echo
> > >  echo "Fim do ciclo $i."
> > >  echo
> > > done
> > >
> > > # Update segments
> > > echo
> > > echo "Atualizando os Segmentos..."
> > > echo
> > > mkdir tmp
> > > bin/nutch updatesegs $webdb_dir $segments_dir tmp
> > > rm -R tmp
> > >
> > > # Index segments
> > > echo "Indexando os segmentos..."
> > > echo
> > > for segment in `ls -d $segments_dir/* | tail -$depth`
> > > do
> > >  bin/nutch index $segment
> > > done
> > >
> > > # De-duplicate indexes
> > > # "bogus" argument is ignored but needed due to
> > > # a bug in the number of args expected
> > > bin/nutch dedup $segments_dir bogus
> > >
> > > # Merge indexes
> > > #echo "Unindo os segmentos..."
> > > #echo
> > > ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
> > >
> > > chmod 777 -R $index_dir
> > >
> > > # Start the Tomcat service
> > > #net start "Apache Tomcat"
> > >
> > > echo "Fim."
> > >
> > > #} > recrawl.log 2>&1
> > >
> > > As you suggested, I used the touch command instead of stopping Tomcat.
> > > However, I still get the error posted in my previous message. I'm
> > > running Nutch on the Windows platform with Cygwin, and I only get no
> > > errors when I stop Tomcat. I use this command to call the script:
> > >
> > > ./recrawl crawl-legislacao 1
> > >
> > > Could you give me more clarification?
> > >
> > > Thanks a lot!
> > >
> > > On 7/21/06, Matthew Holt <[hidden email]> wrote:
> > >>
> > >> Lourival Júnior wrote:
> > >> > Hi Renaud!
> > >> >
> > >> > I'm a newbie with shell scripts, and I know stopping the Tomcat
> > >> > service is not the best way to do this. The problem is, when I run
> > >> > the re-crawl script with Tomcat started I get this error:
> > >> >
> > >> > 060721 132224 merging segment indexes to: crawl-legislacao2\index
> > >> > Exception in thread "main" java.io.IOException: Cannot delete _0.f0
> > >> >        at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
> > >> >        at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
> > >> >        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
> > >> >        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
> > >> >        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
> > >> >        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
> > >> >
> > >> > So, I want another way to re-crawl my pages without this error and
> > >> > without restarting Tomcat. Could you suggest one?
> > >> > Thanks a lot!
> > >> >
> > >> >
> > >> Try this updated script and tell me exactly what command you run to
> > >> call the script. Let me know the error message then.
> > >>
> > >> Matt
> > >>
> > >>
> > >> #!/bin/bash
> > >>
> > >> # Nutch recrawl script.
> > >> # Based on 0.7.2 script at
> > >> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
> > >> # Modified by Matthew Holt
> > >>
> > >> if [ -n "$1" ]
> > >> then
> > >>   nutch_dir=$1
> > >> else
> > >>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
> > >>   echo "servlet_path - Path of the nutch servlet (i.e.
> > >> /usr/local/tomcat/webapps/ROOT)"
> > >>   echo "crawl_dir - Name of the directory the crawl is located in."
> > >>   echo "[depth] - The link depth from the root page that should be
> > >> crawled."
> > >>   echo "[adddays] - Advance the clock # of days for fetchlist
> > >> generation."
> > >>   exit 1
> > >> fi
> > >>
> > >> if [ -n "$2" ]
> > >> then
> > >>   crawl_dir=$2
> > >> else
> > >>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
> > >>   echo "servlet_path - Path of the nutch servlet (i.e.
> > >> /usr/local/tomcat/webapps/ROOT)"
> > >>   echo "crawl_dir - Name of the directory the crawl is located in."
> > >>   echo "[depth] - The link depth from the root page that should be
> > >> crawled."
> > >>   echo "[adddays] - Advance the clock # of days for fetchlist
> > >> generation."
> > >>   exit 1
> > >> fi
> > >>
> > >> if [ -n "$3" ]
> > >> then
> > >>   depth=$3
> > >> else
> > >>   depth=5
> > >> fi
> > >>
> > >> if [ -n "$4" ]
> > >> then
> > >>   adddays=$4
> > >> else
> > >>   adddays=0
> > >> fi
> > >>
> > >> # Only change if your crawl subdirectories are named something different
> > >> webdb_dir=$crawl_dir/crawldb
> > >> segments_dir=$crawl_dir/segments
> > >> linkdb_dir=$crawl_dir/linkdb
> > >> index_dir=$crawl_dir/index
> > >>
> > >> # The generate/fetch/update cycle
> > >> for ((i=1; i <= depth ; i++))
> > >> do
> > >>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
> > >>   segment=`ls -d $segments_dir/* | tail -1`
> > >>   bin/nutch fetch $segment
> > >>   bin/nutch updatedb $webdb_dir $segment
> > >> done
> > >>
> > >> # Update segments
> > >> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
> > >>
> > >> # Index segments
> > >> new_indexes=$crawl_dir/newindexes
> > >> #ls -d $segments_dir/* | tail -$depth | xargs
> > >> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
> > >>
> > >> # De-duplicate indexes
> > >> bin/nutch dedup $new_indexes
> > >>
> > >> # Merge indexes
> > >> bin/nutch merge $index_dir $new_indexes
> > >>
> > >> # Tell Tomcat to reload index
> > >> touch $nutch_dir/WEB-INF/web.xml
> > >>
> > >> # Clean up
> > >> rm -rf $new_indexes
> > >>
> > >>
> > >
> > >
> > > Oh yeah, you're right, the one I sent out was for 0.8. You should just
> > > be able to put this at the end of your script:
> >
> > # Tell Tomcat to reload index
> > touch $nutch_dir/WEB-INF/web.xml
> >
> > > and fill in the appropriate path, of course.
> > > Good luck,
> > > Matt
> >
>
>
>
> --
> Lourival Junior
> Universidade Federal do Pará
> Curso de Bacharelado em Sistemas de Informação
> http://www.ufpa.br/cbsi
> Msn: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Recrawl script for 0.8.0 completed...

Lourival Júnior
Do you mean that this error occurs only on Windows? I haven't tested on
Linux yet. Does anyone have a solution for this problem on Windows/Tomcat?

On 7/25/06, Thomas Delnoij <[hidden email]> wrote:

>
> Lourival.
>
> I have typically seen the same issues on a cygwin/windows setup. The
> only thing that worked for me was shutting down and restarting tomcat,
> instead of just reloading the context. On Linux I don't have these
> issues anymore.
>
> Rgrds, Thomas
>
> On 7/21/06, Lourival Júnior <[hidden email]> wrote:
> > OK. However, a few minutes ago I ran the script exactly as you said, and
> > I still get this error:
> >
> > Exception in thread "main" java.io.IOException: Cannot delete _0.f0
> >         at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
> >         at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
> >         at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
> >         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
> >         at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
> >         at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
> >
> > I don't know, but I think it occurs because Nutch tries to delete some
> > file that Tomcat has loaded into memory, giving a permission/access
> > error. Any idea?
> >
> > On 7/21/06, Matthew Holt <[hidden email]> wrote:
> > >
> > > Lourival Júnior wrote:
> > > > I think it won't work for me because I'm using Nutch version 0.7.2.
> > > > Actually I use this script (comments translated from the original
> > > > Portuguese):
> > > >
> > > > #!/bin/bash
> > > >
> > > > # A simple script to run a Nutch re-crawl
> > > > # Script source:
> > > > http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
> > > >
> > > > #{
> > > >
> > > > if [ -n "$1" ]
> > > > then
> > > >  crawl_dir=$1
> > > > else
> > > >  echo "Usage: recrawl crawl_dir [depth] [adddays]"
> > > >  exit 1
> > > > fi
> > > >
> > > > if [ -n "$2" ]
> > > > then
> > > >  depth=$2
> > > > else
> > > >  depth=5
> > > > fi
> > > >
> > > > if [ -n "$3" ]
> > > > then
> > > >  adddays=$3
> > > > else
> > > >  adddays=0
> > > > fi
> > > >
> > > > webdb_dir=$crawl_dir/db
> > > > segments_dir=$crawl_dir/segments
> > > > index_dir=$crawl_dir/index
> > > >
> > > > # Stop the Tomcat service
> > > > #net stop "Apache Tomcat"
> > > >
> > > > # The generate/fetch/update cycle
> > > > for ((i=1; i <= depth ; i++))
> > > > do
> > > >  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
> > > >  segment=`ls -d $segments_dir/* | tail -1`
> > > >  bin/nutch fetch $segment
> > > >  bin/nutch updatedb $webdb_dir $segment
> > > >  echo
> > > >  echo "Fim do ciclo $i."
> > > >  echo
> > > > done
> > > >
> > > > # Update segments
> > > > echo
> > > > echo "Atualizando os Segmentos..."
> > > > echo
> > > > mkdir tmp
> > > > bin/nutch updatesegs $webdb_dir $segments_dir tmp
> > > > rm -R tmp
> > > >
> > > > # Index segments
> > > > echo "Indexando os segmentos..."
> > > > echo
> > > > for segment in `ls -d $segments_dir/* | tail -$depth`
> > > > do
> > > >  bin/nutch index $segment
> > > > done
> > > >
> > > > # De-duplicate indexes
> > > > # "bogus" argument is ignored but needed due to
> > > > # a bug in the number of args expected
> > > > bin/nutch dedup $segments_dir bogus
> > > >
> > > > # Merge indexes
> > > > #echo "Unindo os segmentos..."
> > > > #echo
> > > > ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
> > > >
> > > > chmod 777 -R $index_dir
> > > >
> > > > # Start the Tomcat service
> > > > #net start "Apache Tomcat"
> > > >
> > > > echo "Fim."
> > > >
> > > > #} > recrawl.log 2>&1
> > > >
> > > > As you suggested, I used the touch command instead of stopping Tomcat.
> > > > However, I still get the error posted in my previous message. I'm
> > > > running Nutch on the Windows platform with Cygwin, and I only get no
> > > > errors when I stop Tomcat. I use this command to call the script:
> > > >
> > > > ./recrawl crawl-legislacao 1
> > > >
> > > > Could you give me more clarification?
> > > >
> > > > Thanks a lot!
> > > >
> > > > On 7/21/06, Matthew Holt <[hidden email]> wrote:
> > > >>
> > > >> Lourival Júnior wrote:
> > > >> > Hi Renaud!
> > > >> >
> > > >> > I'm a newbie with shell scripts, and I know stopping the Tomcat
> > > >> > service is not the best way to do this. The problem is, when I run
> > > >> > the re-crawl script with Tomcat started I get this error:
> > > >> >
> > > >> > 060721 132224 merging segment indexes to: crawl-legislacao2\index
> > > >> > Exception in thread "main" java.io.IOException: Cannot delete _0.f0
> > > >> >        at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
> > > >> >        at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
> > > >> >        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
> > > >> >        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
> > > >> >        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
> > > >> >        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
> > > >> >
> > > >> > So, I want another way to re-crawl my pages without this error and
> > > >> > without restarting Tomcat. Could you suggest one?
> > > >> >
> > > >> > Thanks a lot!
> > > >> >
> > > >> >
> > > >> Try this updated script and tell me exactly what command you run
> > > >> to call the script. Let me know the error message then.
> > > >>
> > > >> Matt
> > > >>
> > > >>
> > > >> #!/bin/bash
> > > >>
> > > >> # Nutch recrawl script.
> > > >> # Based on 0.7.2 script at
> > > >> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
> > > >> # Modified by Matthew Holt
> > > >>
> > > >> if [ -n "$1" ]
> > > >> then
> > > >>   nutch_dir=$1
> > > >> else
> > > >>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
> > > >>   echo "servlet_path - Path of the nutch servlet (i.e.
> > > >> /usr/local/tomcat/webapps/ROOT)"
> > > >>   echo "crawl_dir - Name of the directory the crawl is located in."
> > > >>   echo "[depth] - The link depth from the root page that should be
> > > >> crawled."
> > > >>   echo "[adddays] - Advance the clock # of days for fetchlist
> > > >> generation."
> > > >>   exit 1
> > > >> fi
> > > >>
> > > >> if [ -n "$2" ]
> > > >> then
> > > >>   crawl_dir=$2
> > > >> else
> > > >>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
> > > >>   echo "servlet_path - Path of the nutch servlet (i.e.
> > > >> /usr/local/tomcat/webapps/ROOT)"
> > > >>   echo "crawl_dir - Name of the directory the crawl is located in."
> > > >>   echo "[depth] - The link depth from the root page that should be
> > > >> crawled."
> > > >>   echo "[adddays] - Advance the clock # of days for fetchlist
> > > >> generation."
> > > >>   exit 1
> > > >> fi
> > > >>
> > > >> if [ -n "$3" ]
> > > >> then
> > > >>   depth=$3
> > > >> else
> > > >>   depth=5
> > > >> fi
> > > >>
> > > >> if [ -n "$4" ]
> > > >> then
> > > >>   adddays=$4
> > > >> else
> > > >>   adddays=0
> > > >> fi
> > > >>
> > > >> # Only change if your crawl subdirectories are named something different
> > > >> webdb_dir=$crawl_dir/crawldb
> > > >> segments_dir=$crawl_dir/segments
> > > >> linkdb_dir=$crawl_dir/linkdb
> > > >> index_dir=$crawl_dir/index
> > > >>
> > > >> # The generate/fetch/update cycle
> > > >> for ((i=1; i <= depth ; i++))
> > > >> do
> > > >>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
> > > >>   segment=`ls -d $segments_dir/* | tail -1`
> > > >>   bin/nutch fetch $segment
> > > >>   bin/nutch updatedb $webdb_dir $segment
> > > >> done
> > > >>
> > > >> # Update segments
> > > >> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
> > > >>
> > > >> # Index segments
> > > >> new_indexes=$crawl_dir/newindexes
> > > >> #ls -d $segments_dir/* | tail -$depth | xargs
> > > >> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
> > > >>
> > > >> # De-duplicate indexes
> > > >> bin/nutch dedup $new_indexes
> > > >>
> > > >> # Merge indexes
> > > >> bin/nutch merge $index_dir $new_indexes
> > > >>
> > > >> # Tell Tomcat to reload index
> > > >> touch $nutch_dir/WEB-INF/web.xml
> > > >>
> > > >> # Clean up
> > > >> rm -rf $new_indexes
> > > >>
> > > >>
> > > >
> > > >
> > > Oh yeah, you're right, the one I sent out was for 0.8. You should just
> > > be able to put this at the end of your script:
> > >
> > > # Tell Tomcat to reload index
> > > touch $nutch_dir/WEB-INF/web.xml
> > >
> > > and fill in the appropriate path, of course.
> > > Good luck,
> > > Matt
> > >
> >
> >
> >
> > --
> > Lourival Junior
> > Universidade Federal do Pará
> > Curso de Bacharelado em Sistemas de Informação
> > http://www.ufpa.br/cbsi
> > Msn: [hidden email]
> >
> >
>



--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [hidden email]