FW: Indexing Files on Local File System

FW: Indexing Files on Local File System

Manu Warikoo
Hi,
 
I am running Nutch 0.9 and am attempting to use it to index files on my local file system, without much luck. I believe I have configured things correctly; however, no files are being indexed and no errors are being reported. Note that I have looked through the various posts on this topic on the mailing list and tried various variations on the configuration.
 
I am providing details of my configuration and log files below. I would appreciate any insight people might have.
Best,
mw
 
Details:
OS: Windows Vista (note: I have turned off Defender and the firewall)
Command: bin/nutch crawl urls -dir crawl_results -depth 4 -topN 500 >& logs/crawl.log
The urls file contains only:
```````````````````````````````````````````````````
file:///C:/MyData/

```````````````````````````````````````````````````
nutch-site.xml
`````````````````````````````````````
<configuration>
<property>
 <name>http.agent.url</name>
 <value></value>
 <description>none</description>
</property>
<property>
 <name>http.agent.email</name>
 <value>none</value>
 <description></description>
</property>

<property>
<name>plugin.includes</name>
<value>protocol-file|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>file.content.limit</name> <value>-1</value>
</property>
</configuration>
```````````````````````````````````````````````````
crawl-urlfilter.txt
```````````````````````````````````````````````````
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
# -^(file|ftp|mailto):
# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# -.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
# -.
# get everything else
+^file:///C:/MyData/*
-.*
```````````````````````````````````````````````````



Re: FW: Indexing Files on Local File System

Srinivas Gokavarapu
Hi,
You should change the URL to file://C:/MyData/ and also, in crawl-urlfilter.txt, change the file: line to:
+^file://C:/MyData/*
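
Whichever form you use, note that the seed URL in the urls file and the '+' rule in crawl-urlfilter.txt must agree on the number of slashes: the regex filter applies its rules in order, the first match decides, and a seed that only reaches the final -.* rule is silently dropped. Below is a minimal standalone sketch of that first-match-wins behaviour, using plain java.util.regex with the rules copied from the configuration above (this is not Nutch code, just an illustration):

`````````````````````````````````````
import java.util.regex.Pattern;

// Sketch: evaluate a seed URL against crawl-urlfilter rules the way
// Nutch's regex filter does -- in file order, first match wins, and
// URLs matching no rule are ignored.
public class FilterCheck {
    public static void main(String[] args) {
        String[][] rules = {
            {"-", "^(http|ftp|mailto):"},
            {"+", "^file:///C:/MyData/*"},  // three-slash form from the urls file
            {"-", ".*"},                    // skip everything else
        };
        for (String seed : new String[] {"file:///C:/MyData/", "file://C:/MyData/"}) {
            String verdict = "ignored (no rule matched)";
            for (String[] rule : rules) {
                if (Pattern.compile(rule[1]).matcher(seed).find()) {
                    verdict = rule[0].equals("+") ? "accepted" : "rejected";
                    break;
                }
            }
            System.out.println(seed + " -> " + verdict);
        }
    }
}
`````````````````````````````````````

With the rules as posted, file:///C:/MyData/ is accepted and file://C:/MyData/ falls through to -.* and is rejected; switching the '+' rule to the two-slash form inverts that. So change the seed URL and the filter rule together, not just one of them.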

On Thu, Sep 25, 2008 at 11:42 PM, Manu Warikoo <[hidden email]> wrote:

> […]

RE: Indexing Files on Local File System

Manu Warikoo

Hi,
Thanks for responding. I just tried the changes you suggested, but no change: the log files look exactly the same, except that the directory reference now comes up with only two slashes.
Any other possible things to try?
mw

> […]

Re: Indexing Files on Local File System

Kevin MacDonald
Manu,
The only way I was able to figure out why Nutch was not crawling URLs that I expected it to crawl was by digging into the code and adding extra logging lines. I suggest you look at org.apache.nutch.fetcher.Fetcher.run() to get an idea of what it is doing, and also at Fetcher.handleRedirect(). Put a whole bunch of extra logging lines in that file to figure out whether a filter or a normalizer is stripping out URLs that you want crawled. You can also try disabling all normalizers by adding something like the following to your nutch-site.xml. Note that I stripped out just about everything; you might want to strip out only the normalizers. See the original settings in nutch-default.xml.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html|js)|scoring-opic</value>
</property>
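
As a lighter-weight alternative to patching Fetcher, you can also run a URL through the configured normalizer and filter chains directly with a small standalone probe. The sketch below is written against the 0.9 API from memory (URLNormalizers, URLFilters, and NutchConfiguration are the class names as I recall them, so double-check against your source tree); compile it with the Nutch and Hadoop jars on the classpath and run it with the conf directory on the classpath so your nutch-site.xml is picked up:

`````````````````````````````````````
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.util.NutchConfiguration;

// Sketch: feed a seed URL through the same normalizer and filter
// plugins the crawl uses, printing what survives each stage.
public class UrlProbe {
    public static void main(String[] args) throws Exception {
        String url = args.length > 0 ? args[0] : "file:///C:/MyData/";
        Configuration conf = NutchConfiguration.create();  // loads nutch-default.xml / nutch-site.xml

        URLNormalizers normalizers =
            new URLNormalizers(conf, URLNormalizers.SCOPE_DEFAULT);
        String normalized = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT);
        System.out.println("after normalizers: " + normalized);

        String filtered = new URLFilters(conf).filter(normalized);
        System.out.println("after filters:     " + filtered);  // null means rejected
    }
}
`````````````````````````````````````

If the URL survives both stages, the next thing I would check is whether the injector ever put it into the crawldb: after a run, bin/nutch readdb crawl_results/crawldb -stats should report a non-zero URL count.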


On Thu, Sep 25, 2008 at 1:53 PM, Manu Warikoo <[hidden email]> wrote:

> […]

Re: Indexing Files on Local File System

Srinivas Gokavarapu
Hi,
Check this link for crawling the local filesystem with Nutch:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
Follow the steps on that site and check once more.

On Fri, Sep 26, 2008 at 3:24 AM, Kevin MacDonald <[hidden email]> wrote:

> […]