New to Nutch, a few questions


Nes Yarug
Hi all,

I'm new to Nutch and have a few questions that I hope to get some answers
to. Thanks in advance for any replies.

I want to use Nutch to index a web site I'm maintaining. I've followed the
tutorial for intranet crawling and used a list of links (17420 links to 8710
pages, each page has two unique links) from my site to crawl initially. The
command I used was:

bin/nutch crawl urls -dir crawl -depth 20 -topN 100

The crawl completed, but when I tested the search it was clear that a lot of
pages had not been indexed. As I understand the following statistics, it
only indexed 1527 of 21378 pages:

CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     21378
retry 0:        20878
retry 1:        487
retry 2:        10
retry 3:        3
min score:      0.014
avg score:      84.405266
max score:      37106.03
status 1 (DB_unfetched):        19848
status 2 (DB_fetched):  1527
status 3 (DB_gone):     3
CrawlDb statistics: done


Now my questions:

1) Will Nutch automatically continue to index the rest of the URLs even
though the initial crawl has finished (through an internal scheduler of some
sort)?

2) All of my site's pages currently exist in two languages (each page has
exactly two language versions, and the lang attribute on the html tag of each
page contains the language identifier). When searching, is there a way to
only return pages in a specific language? I know the Nutch UI is localised,
but it will still return pages in English if my UI language is German, for
example. I want it to return German pages only (<html lang="de">) when
searching through the German UI. Is that possible?

Many thanks,
Nes

Re: New to Nutch, a few questions

Dennis Kubes


Nes Yarug wrote:

> Hi all,
>
> I'm new to Nutch and I have a few questions that I hope to get some answers
> on. Thanks in advance for any replies.
>
> I want to use Nutch to index a web site I'm maintaining. I've followed the
> tutorial for intranet crawling and used a list of links (17420 links to
> 8710
> pages, each page has two unique links) from my site to crawl initially. The
> command I used was:
>
> bin/nutch crawl urls -dir crawl -depth 20 -topN 100

Here you are using topN. This limits each fetch round to the top 100 URLs,
so only 100 pages are fetched at each depth. You probably also don't need a
depth of 20. Starting from your homepage, what is the greatest number of
clicks it would take to reach any page on your site? That should be your
depth. If you remove topN, I think you will be able to get all of your pages.
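
To make the arithmetic behind this concrete, here is a rough sketch in Java. It assumes, as described above, that each depth level fetches at most topN URLs. The class and method names are made up for illustration and are not part of Nutch.

```java
public class CrawlBudget {
    // Upper bound on pages fetched by one crawl run, assuming every
    // round (depth level) fetches at most topN URLs.
    static long maxFetched(int depth, int topN) {
        return (long) depth * topN;
    }

    // Rounds needed to work through totalUrls at topN per round
    // (ceiling division).
    static long roundsNeeded(long totalUrls, int topN) {
        return (totalUrls + topN - 1) / topN;
    }

    public static void main(String[] args) {
        System.out.println(maxFetched(20, 100));      // 2000
        System.out.println(roundsNeeded(21378, 100)); // 214
    }
}
```

With -depth 20 -topN 100 the ceiling is 2000 fetches, which is in line with the 1527 fetched pages in the statistics, and well short of the 21378 URLs known to the CrawlDb.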

>
> The crawl completed, but I'm sure that when I was testing the search it has
> not indexed a lot of pages. What I understand from the following command it
> only indexed 1527 of 21378 pages:
>
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:     21378
> retry 0:        20878
> retry 1:        487
> retry 2:        10
> retry 3:        3
> min score:      0.014
> avg score:      84.405266
> max score:      37106.03
> status 1 (DB_unfetched):        19848
> status 2 (DB_fetched):  1527
> status 3 (DB_gone):     3
> CrawlDb statistics: done
>
>
> Now my questions:
>
> 1) Will Nutch automatically continue to index the rest of the URLs even
> though te initial crawl finished (through some internal scheduler of some
> sorts)?

Not with topN set like that, no. You could also change it from 100
to, say, 5000, but I still think that wouldn't get all the pages. It is
better to leave it off, especially if you are only indexing a single site.
>
> 2) All of my site's pages at the moment are contained in two languages
> (each
> page has exactly two languages, the lang attribute on the html tag of each
> page contains the language identifier). When searching, is there a way to
> only return pages in a specific language? I know the Nutch UI is localised,
> but it will still return pages in english if my UI language is German for
> example. I want it to return German pages only (<html lang="de">) when
> searching through the German UI. Is that possible?

I believe the lang attribute is put in as a field during indexing
(depends on your settings but I believe this is default) and then you
can add a required field to the query in the search.jsp for the language
like this:

query.addRequiredTerm("en", "lang"); // substitute language for en
>
> Many thanks,
> Nes
>

Dennis Kubes

Re: New to Nutch, a few questions

Renaud Richardet-3-2
In reply to this post by Nes Yarug
Nes Yarug wrote:

> Hi all,
>
> I'm new to Nutch and I have a few questions that I hope to get some
> answers
> on. Thanks in advance for any replies.
>
> I want to use Nutch to index a web site I'm maintaining. I've followed
> the
> tutorial for intranet crawling and used a list of links (17420 links
> to 8710
> pages, each page has two unique links) from my site to crawl initially.
Actually, you don't need to provide a full list of links to Nutch. You
can let it discover links as it crawls your site, and constrain them
using crawl-urlfilter.txt and regex-urlfilter.txt.

> The
> command I used was:
>
> bin/nutch crawl urls -dir crawl -depth 20 -topN 100
>
> The crawl completed, but I'm sure that when I was testing the search
> it has
> not indexed a lot of pages. What I understand from the following
> command it
> only indexed 1527 of 21378 pages:
>
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:     21378
> retry 0:        20878
> retry 1:        487
> retry 2:        10
> retry 3:        3
> min score:      0.014
> avg score:      84.405266
> max score:      37106.03
> status 1 (DB_unfetched):        19848
> status 2 (DB_fetched):  1527
> status 3 (DB_gone):     3
> CrawlDb statistics: done
>
>
> Now my questions:
>
> 1) Will Nutch automatically continue to index the rest of the URLs even
> though te initial crawl finished (through some internal scheduler of some
> sorts)?
You will need to refetch, or better: increase the depth, until "all your
pages" are fetched.

>
> 2) All of my site's pages at the moment are contained in two languages
> (each
> page has exactly two languages, the lang attribute on the html tag of
> each
> page contains the language identifier). When searching, is there a way to
> only return pages in a specific language? I know the Nutch UI is
> localised,
> but it will still return pages in english if my UI language is German for
> example. I want it to return German pages only (<html lang="de">) when
> searching through the German UI. Is that possible?
Try using "lang:" in your query; I'm not sure it's working, though...
From the javadoc: "LanguageQueryFilter.java should handle "lang:"
query clauses, causing them to search the "lang" field indexed by
LanguageIdentifier" (see also LanguageIndexingFilter.java).

HTH,
Renaud


--
renaud richardet                           +1 617 230 9112
renaud <at> oslutions.com         http://www.oslutions.com


Re: New to Nutch, a few questions

Nes Yarug
Thank you everyone for your replies.

I have implemented the recrawl script from
http://wiki.apache.org/nutch/IntranetRecrawl and it has now been running for
over 12 hours, so I guess it will index many more pages.

That leaves the question about language-specific search. I have tried adding
the lang: clause to my search query by appending lang:en, but that does not
return any results (as if lang:en had become part of the actual query text).
The URL then looks like this:
search.jsp?query=help+lang%3Aen&hitsPerPage=10&lang=en

Has anyone used a language-specific search before? Do I need to add a new
(hidden) input field on the search form to specify the language instead of
appending it to the query? That would be my preference anyway, as I want the
language-specific search to be transparent to the user.
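
One way to keep the language filter invisible to the user is to add the clause on the server side before the query string is parsed. The following is only a sketch of that idea: LangClause and withLanguage are hypothetical names, not part of Nutch, and it assumes the lang: clause is actually honoured once the relevant plugins are active. In search.jsp the locale could come from request.getLocale().

```java
import java.util.Locale;

public class LangClause {
    // Appends a lang:<code> clause for the UI locale unless the user
    // already supplied one. Hypothetical helper, not part of Nutch.
    static String withLanguage(String userQuery, Locale uiLocale) {
        String q = userQuery == null ? "" : userQuery.trim();
        if (q.contains("lang:")) {
            return q; // respect an explicit choice made by the user
        }
        String clause = "lang:" + uiLocale.getLanguage();
        return q.isEmpty() ? clause : q + " " + clause;
    }

    public static void main(String[] args) {
        System.out.println(withLanguage("help", Locale.GERMAN)); // help lang:de
    }
}
```

The alternative suggested earlier in the thread, calling query.addRequiredTerm on the parsed query, achieves the same thing without touching the query string.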

Again, many thanks for any replies,
Nes

On 1/30/07, Renaud Richardet <[hidden email]> wrote:

>
> Nes Yarug wrote:
> > Hi all,
> >
> > I'm new to Nutch and I have a few questions that I hope to get some
> > answers
> > on. Thanks in advance for any replies.
> >
> > I want to use Nutch to index a web site I'm maintaining. I've followed
> > the
> > tutorial for intranet crawling and used a list of links (17420 links
> > to 8710
> > pages, each page has two unique links) from my site to crawl initially.
> Actually, you don't need to provide a full list of links to Nutch. You
> can let it discover links as it crawl your site, and constrain them
> using crawl-urlfilter.txt and regex-urlfilter.txt
> > The
> > command I used was:
> >
> > bin/nutch crawl urls -dir crawl -depth 20 -topN 100
> >
> > The crawl completed, but I'm sure that when I was testing the search
> > it has
> > not indexed a lot of pages. What I understand from the following
> > command it
> > only indexed 1527 of 21378 pages:
> >
> > CrawlDb statistics start: crawl/crawldb
> > Statistics for CrawlDb: crawl/crawldb
> > TOTAL urls:     21378
> > retry 0:        20878
> > retry 1:        487
> > retry 2:        10
> > retry 3:        3
> > min score:      0.014
> > avg score:      84.405266
> > max score:      37106.03
> > status 1 (DB_unfetched):        19848
> > status 2 (DB_fetched):  1527
> > status 3 (DB_gone):     3
> > CrawlDb statistics: done
> >
> >
> > Now my questions:
> >
> > 1) Will Nutch automatically continue to index the rest of the URLs even
> > though te initial crawl finished (through some internal scheduler of
> some
> > sorts)?
> You will need to refetch, or better: increase the depth, until "all your
> pages" are fetched.
> >
> > 2) All of my site's pages at the moment are contained in two languages
> > (each
> > page has exactly two languages, the lang attribute on the html tag of
> > each
> > page contains the language identifier). When searching, is there a way
> to
> > only return pages in a specific language? I know the Nutch UI is
> > localised,
> > but it will still return pages in english if my UI language is German
> for
> > example. I want it to return German pages only (<html lang="de">) when
> > searching through the German UI. Is that possible?
> try using "lang:" in your query, I'm not sure it's working, though...
> From the javadoc: "LanguageQueryFilter.java should handles "lang:"
> query clauses, causing them to search the "lang" field indexed by
> LanguageIdentifier" (see also LanguageIndexingFilter.java).
>
> HTH,
> Renaud
>
>
> --
> renaud richardet                           +1 617 230 9112
> renaud <at> oslutions.com         http://www.oslutions.com
>
>

Re: New to Nutch, a few questions

Zaheed Haque
If you haven't yet, you need to activate the index-more and
query-more plugins in nutch-site.xml.

You can also check the "explain" link on the search results page; you
will see that "lang" is missing if you haven't activated the index-more
and query-more plugins.

Cheers

On 1/31/07, Nes Yarug <[hidden email]> wrote:

> Thank you everyone for your replies.
>
> I have implemented the recrawl script from
> http://wiki.apache.org/nutch/IntranetRecrawl and that is still running for
> over 12 hours so I guess that  would index much more pages.
>
> Leaves the question about language specific search. I have tried adding the
> lang: clause to my search query by appending lang:en but that is not
> returning any results (as if lang:en would become part of the actual query).
> The url then looks like this: search.jsp
> ?query=help+lang%3Aen&hitsPerPage=10&lang=en
>
> Anyone has used a language specific search before, do I need to add a new
> (hidden) input field on the search form to specifiy the language instead of
> appending it to the query? That would be my preference anyway, as I want the
> language specific search to be transparant to he user.
>
> Again, many thanks for any replies,
> Nes
>

Re: New to Nutch, a few questions

Nes Yarug
I have explicitly activated those plugins. Could you tell me how to do that,
with an example? I looked through conf/nutch-default.xml and couldn't find
any references to it. I'm using 0.8.1, by the way. I guess they are enabled
in the build, as default.properties lists them:

#
# Indexing Filter Plugins
#
plugins.index=\
   org.apache.nutch.indexer.basic*:\
   org.apache.nutch.indexer.more*

#
# Query Filter Plugins
#
plugins.query=\
   org.apache.nutch.searcher.basic*:\
   org.apache.nutch.searcher.more*:\
   org.apache.nutch.searcher.site*:\
   org.apache.nutch.searcher.url*

Many thanks,
Nes

On 1/31/07, Zaheed Haque <[hidden email]> wrote:

>
> Unless you haven't yet.. You need to activate index-more and
> query-more plugin in nutch-site.xml
>
> You can also check the "explan link"  from the search results page and
> you will see "lang" is missing if you haven't activated the index-more
> and query-more plugin..
>
> Cheers
>

Re: New to Nutch, a few questions

Nes Yarug
Oops, my previous post should read "I have NOT explicitly activated those
plugins".

On 1/31/07, Nes Yarug <[hidden email]> wrote:

>
> I have explicitely activated those plugins. Could you tell me how to do
> that with an example as I looked through conf/nutch-default.xml and
> couldn't find any references to it. I'm using 0.8.1 by the way. They are
> enabled in the build I guess as default.properties is listing them:
>
> #
> # Indexing Filter Plugins
> #
> plugins.index=\
>    org.apache.nutch.indexer.basic*:\
>    org.apache.nutch.indexer.more*
>
> #
> # Query Filter Plugins
> #
> plugins.query=\
>    org.apache.nutch.searcher.basic*:\
>    org.apache.nutch.searcher.more*:\
>    org.apache.nutch.searcher.site*:\
>    org.apache.nutch.searcher.url*
>
> Many thanks,
> Nes
>

Re: New to Nutch, a few questions

Renaud Richardet-3-2
As Zaheed pointed out, "You need to activate index-more and query-more
plugin in nutch-site.xml"

So, copy the "plugin.includes" entry from nutch-default.xml, add
index-more and query-lang, and insert it in your nutch-site.xml. You
should have something like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)|query-(basic|site|url|lang)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
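
Since the value is a regular expression matched against plugin directory names, it can help to check which names it actually accepts. The sketch below uses plain java.util.regex. It assumes a plugin is included only when its whole directory name matches, and the candidate names are just examples.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // The <value> from the property above, copied verbatim.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)"
        + "|query-(basic|site|url|lang)|summary-basic|scoring-opic");

    // Assumption: a plugin is included only if its entire directory
    // name matches the expression.
    static boolean included(String pluginDir) {
        return INCLUDES.matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"index-more", "query-lang", "query-more", "parse-pdf"};
        for (String name : candidates) {
            System.out.println(name + " -> " + included(name));
        }
    }
}
```

As written, the value accepts index-more and query-lang but not query-more; to include that one as well, the group would need to read query-(basic|more|site|url|lang). It is also worth confirming that every name corresponds to a directory under plugins/ in your installation. In some Nutch releases the language filters ship in a directory named language-identifier, but whether that applies to 0.8.1 is an assumption to verify.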

HTH,
Renaud


Nes Yarug wrote:

> Oops, my previous post should read "I have NOT explicitely activated
> those
> plugins"
>


--
renaud richardet                           +1 617 230 9112
renaud <at> oslutions.com         http://www.oslutions.com
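
For reference, the CrawlDb statistics quoted above are consistent with the way the crawl command was invoked. Below is a rough sketch of the arithmetic in Python, assuming -topN caps the number of URLs fetched in each round and -depth is the number of rounds. The variable and function names are made up for illustration and are not part of Nutch.

```python
import math

# Figures copied from the CrawlDb statistics quoted in this thread.
total_urls = 21378
fetched = 1527
unfetched = 19848
gone = 3

# Options from the crawl command: -depth 20 -topN 100.
depth = 20
top_n = 100

# Assumption: each round fetches at most top_n URLs, so depth * top_n
# is an upper bound on the pages fetched by the whole crawl.
max_fetches = depth * top_n


def rounds_needed(remaining: int, per_round: int) -> int:
    """Minimum number of rounds to cover `remaining` URLs at `per_round` each."""
    return math.ceil(remaining / per_round)


if __name__ == "__main__":
    # The three status counts should add up to the TOTAL urls line.
    print("status counts add up:", fetched + unfetched + gone == total_urls)
    print("upper bound on fetched pages:", max_fetches)
    print("rounds still needed at topN=100:", rounds_needed(unfetched, top_n))
    print("rounds still needed at topN=1000:", rounds_needed(unfetched, 1000))
```

Under that assumption the 1527 fetched pages sit below the ceiling imposed by -depth 20 -topN 100, so covering the remaining 19848 URLs would take more rounds or a larger -topN.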


Re: New to Nutch, a few questions

Nes Yarug
Okay, thanks for that. I have updated my configuration and I will now
re-index the site. I'll let you know how it goes.

Many thanks,
Nes
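
A quick way to sanity-check an edited plugin.includes value before re-indexing is to test plugin names against it. This is only a sketch in Python: it assumes Nutch requires the whole plugin id to match the expression, and is_included is a hypothetical helper, not a Nutch API. The value is the one Renaud suggested; whether each plugin name actually exists in a given 0.8.1 install is not checked here.

```python
import re

# plugin.includes value copied from the property quoted in this thread.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)"
    "|query-(basic|site|url|lang)|summary-basic|scoring-opic"
)


def is_included(plugin_id: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin id matches the expression.

    Assumption: the match must cover the entire id, not just a prefix.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # Plugin names mentioned in the thread.
    for name in ("index-more", "query-lang", "query-more"):
        print(name, "included" if is_included(name) else "excluded")
```

Note that this value lists query-lang but not query-more, so a check like this makes it easy to spot which of the plugins discussed in the thread the expression actually covers.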

On 1/31/07, Renaud Richardet <[hidden email]> wrote:

>
> As Zaheed pointed out, "You need to activate index-more and query-more
> plugin in nutch-site.xml"
>
> So, copy the "plugin.includes" entry from nutch-default.xml, add
> index-more and query-lang, and insert it in your nutch-site.xml. You
> should have something like this:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)|query-(basic|site|url|lang)|summary-basic|scoring-opic</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
>
> HTH,
> Renaud
>