googled for ever and still can't figure it out


Andrew MacKay
Hi

Hoping for some help to get sitemap.xml crawling working.
I'm using this command to crawl (Nutch 1.18):

$NUTCH_HOME/bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch \
  --sitemaps-from-hostdb always -s $NUTCH_HOME/urls/ $NUTCH_HOME/Crawl 10

If this flag is used: --sitemaps-from-hostdb always
this error occurs:

Generator: number of items rejected during selection:
Generator:    201  SCHEDULE_REJECTED
Generator: 0 records selected for fetching, exiting ...

Without this flag present it crawls the site without issue.

In nutch-default.xml I set the fetch interval to 2 seconds from the default of 30 days:

  <property>
    <name>db.fetch.interval.default</name>
    <value>2</value>
  </property>

I also don't understand why the crawldb is automatically deleted after each
crawl, so I cannot run any commands against URLs that were not crawled.

Any help is appreciated.

--

Andrew MacKay


Re: googled for ever and still can't figure it out

Sebastian Nagel
Hi Andrew,

 > if this flag is used --sitemaps-from-hostdb always

Do the crawled hosts announce the sitemap in their robots.txt?
If not, do the sitemap URLs follow the pattern
   http://example.com/sitemap.xml ?

See https://cwiki.apache.org/confluence/display/NUTCH/SitemapFeature
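
For reference, a sitemap announcement in robots.txt looks roughly like
this (example.com is just a placeholder for your host):

   User-agent: *
   Sitemap: http://example.com/sitemap.xml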

If this is not the case, you need to put the URLs pointing to the
sitemaps into a separate list and call bin/crawl with the
option `-sm <sitemap_dir>`.
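
A rough sketch of that, with a sitemap_urls/ directory and seed file name
of my own choosing (adjust to your setup):

   mkdir -p $NUTCH_HOME/sitemap_urls
   echo "http://example.com/sitemap.xml" > $NUTCH_HOME/sitemap_urls/sitemap.txt

   $NUTCH_HOME/bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch \
     -s $NUTCH_HOME/urls/ -sm $NUTCH_HOME/sitemap_urls/ $NUTCH_HOME/Crawl 10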

 > nutch-default.xml set the interval to 2 seconds from default 30 days.

Ok for one day or even a few hours. But why "2 seconds"?
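
Also note that local overrides usually go into conf/nutch-site.xml rather
than editing nutch-default.xml. A minimal sketch with a one-day interval
(the value is in seconds, so 86400):

   <property>
     <name>db.fetch.interval.default</name>
     <value>86400</value>
     <description>Re-fetch pages after one day.</description>
   </property>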


 > I also don't understand why the crawldb is automatically deleted

The crawldb isn't removed but updated after each cycle by
- moving the previous version from "current/" to "old/"
- placing the updated version in "current/"
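
You can verify that yourself at any time with the readdb tool; a quick
sketch, assuming the crawldb lives under the $NUTCH_HOME/Crawl directory
from your command:

   # overall counts per status (db_unfetched, db_fetched, ...)
   $NUTCH_HOME/bin/nutch readdb $NUTCH_HOME/Crawl/crawldb -stats

   # dump the entries to plain text for closer inspection
   $NUTCH_HOME/bin/nutch readdb $NUTCH_HOME/Crawl/crawldb -dump crawldb_dump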


If in doubt, and because bugs are always possible, could you share the logs
from the SitemapProcessor?


Best,
Sebastian
