Crawling the local file system with Nutch - Document-

Crawling the local file system with Nutch - Document-

Vertical Search
Nutchians,
I have tried to document the sequence of steps to adopt Nutch to crawl and
search the local file system on a Windows machine.
I have been able to do it successfully using Nutch 0.8-dev.
The configuration is as follows:
*Inspiron 630m, Intel(R) Pentium(R) M Processor 760 (2GHz/2MB Cache/533MHz), Genuine Windows XP
Professional*
*If someone can review it, it will be very helpful.*

Crawling the local filesystem with Nutch
Platform: Microsoft Windows / Nutch 0.8-dev
For a Linux version, please refer to
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
That link did help me get it off the ground.

I have been working on adopting Nutch in a vertical domain. All of a sudden,
I was asked to develop a proof of concept
to adopt Nutch to crawl and search the local file system.
Initially I did face some problems, but some mail archives helped me
proceed further.
The intention is to provide an overview of the steps to crawl local file systems
and search them through the browser.

I downloaded the Nutch nightly from
1. Create an environment variable such as "NUTCH_HOME". (Not mandatory, but
it helps.)
2. Extract the downloaded nightly build. (Don't build yet.)
3. Create a folder --> c:/LocalSearch --> and copy the following folders and
libraries into it:
 1. bin/
 2. conf/
 3. *.job, *.jar and *.war files
 4. urls/ <URLs folder>
 5. plugins/ folder
4. Modify nutch-site.xml to point at the plugins folder (see the sketch after the example below).
5. Modify nutch-site.xml to set the plugin includes. An example is as
follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<nutch-conf>
<property>
<name>plugin.includes</name>
<value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
</property>
<property>
<name>file.content.limit</name> <value>-1</value>
</property>
</nutch-conf>
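
Step 4 above mentions pointing Nutch at the copied plugins folder. If the plugins
do not sit next to the directory you run from, a property along these lines can be
added inside the same <nutch-conf> block; the exact value is an assumption based on
the c:/LocalSearch layout above:

<property>
<name>plugin.folders</name>
<value>c:/LocalSearch/plugins</value>
</property>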

6. Modify crawl-urlfilter.txt
Remember, we have to crawl the local file system, so we modify the
entries as follows:

#skip http:, ftp:, & mailto: urls
##-^(file|ftp|mailto):

-^(http|ftp|mailto):

#skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

#skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

#accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

#accept anything else
+.*

7. urls folder
Create a file listing all the URLs to be crawled, with entries as below, and
save the file under the urls folder.

The directories should be in "file://" format. Example entries are as
follows:

file://c:/resumes/word
file://c:/resumes/pdf

#file:///data/readings/semanticweb/

Nutch recognises that the third line does not contain a valid file URL and
skips it.

8. Ignore the parent directories. As suggested by the link above for the Linux
flavor of the local FS crawl, I modified the code in
org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(
java.io.File f).

I changed the following line:

this.content = list2html(f.listFiles(), path, "/".equals(path) ? false :
true);

to

this.content = list2html(f.listFiles(), path, false);

and recompiled. (The third argument controls whether the generated directory
listing includes a link back to the parent directory; forcing it to false keeps
the crawler from walking up out of the seed directories.)

9. Compile the changes. I just compiled the whole source code base; it did not
take more than 2 minutes.
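
For reference, a minimal build sequence, assuming the standard Ant build that
ships with the Nutch source (target names may differ slightly between nightlies):

cd $NUTCH_HOME
ant        # compile the changed classes
ant job    # rebuild the .job file copied in step 3
ant war    # rebuild the .war file deployed to Tomcat in step 11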

10. Crawl the file system.
    On my desktop, I have a shortcut to "cygdrive"; double-click it, then:
    pwd
    cd ../../cygdrive/c/$NUTCH_HOME

    Execute:
    bin/nutch crawl urls -dir c:/localfs/database

Voila, that is it. After 20 minutes, the files were indexed, merged, and all
done.
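
If the defaults need tuning, the crawl command also accepts depth and thread
options; a sketch using the standard crawl tool options (the values are just
examples, not recommendations):

bin/nutch crawl urls -dir c:/localfs/database -depth 5 -threads 4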

11. Extracted the nutch-0.8-dev.war file to the <TOMCAT_HOME>/webapps/ROOT
folder.

Opened the nutch-site.xml there (under WEB-INF/classes in the extracted webapp)
and added the following snippet to point the searcher at the crawl folder:
<property>
  <name>searcher.dir</name>
  <value>c:/localfs/database</value>
  <description>
  Path to root of crawl.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>

12. Searching locally was a bit slow, so I changed the hosts.ini file to map the
machine name to localhost. That sped up search considerably.
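
For example, an entry along these lines in the hosts file maps the machine name
to the loopback address (MYMACHINE is a placeholder for the actual machine name):

127.0.0.1    localhost    MYMACHINE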

13. Modified search.jsp and the cached servlet so that Word and PDF documents
open seamlessly when the user requests them.
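
The gist of that change is to serve the stored bytes with their original content
type instead of forcing text/html. A minimal sketch of the idea as a servlet
helper; the method name and the way the bytes and content type are obtained are
assumptions, not the actual Nutch 0.8 cached.jsp code:

// Hedged sketch: hand the cached document back with its stored content type
// so the browser opens Word/PDF in the matching application.
private void writeCachedContent(javax.servlet.http.HttpServletResponse response,
                                byte[] bytes, String contentType)
    throws java.io.IOException {
  if (contentType == null || contentType.length() == 0) {
    contentType = "application/octet-stream";   // safe fallback for unknown types
  }
  response.setContentType(contentType);         // e.g. application/msword, application/pdf
  response.setContentLength(bytes.length);
  response.getOutputStream().write(bytes);      // raw bytes fetched from the segment
  response.getOutputStream().flush();
}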


I hope this helps folks who are trying to adopt Nutch for local file system search.
Personally, I believe corporations should adopt Nutch rather than buying a Google
appliance :)

Re: Crawling the local file system with Nutch - Document-

吴志敏
Thanks for your idea!
But I have a question:
how do I modify search.jsp and the cached servlet to view Word and PDF
documents seamlessly, as demanded by the user?





--
www.babatu.com

Re: Crawling the local file system with Nutch - Document-

sudhendra seshachala
I just modified search.jsp. Basically, I set the content type based on the
document type being queried. The rest is handled by the protocol and the browser.

I can send the code if you would like.

Thanks




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


               

Re: Crawling the local file system with Nutch - Document-

吴志敏
Hi Sudhendra Seshachala,
thanks so much for your code.
Yes, I would like it.





--
www.babatu.com