Crawl Command Question

Crawl Command Question

Dave Beckstrom-2
Hi Everyone,

Reading the help for the nutch crawl script, I have a question.  If I run
the crawl script without the -i parameter, does that mean the crawl will
run and complete without updating SOLR?  I need to crawl pages without
updating SOLR.  Then I'll use solrindex to push the crawled content into
SOLR later, when I'm ready.



Usage: crawl [-i|--index] [-D "key=value"] [-s <Seed Dir>] <Crawl Dir> <Num Rounds>
-i|--index Indexes crawl results into a configured indexer
-D... A Java property to pass to Nutch calls
-s <Seed Dir> Directory in which to look for a seeds file
<Crawl Dir> Directory where the crawl/link/segments dirs are saved
<Num Rounds> The number of rounds to run this crawl for
     Example: bin/crawl -i -s urls/ TestCrawl/  2

--
Fig Leaf Software is now Collective FLS, Inc.

https://www.collectivefls.com/



Re: Crawl Command Question

Sebastian Nagel-2
Hi Dave,

> the crawl script without the -i parameter, does that mean the crawl will
> run and complete without updating SOLR?

Yes.

> Then I'll use solrindex to push the crawled content into
> SOLR later, when I'm ready.

Better to call "index": the "solrindex" command is deprecated and,
in fact, just calls IndexingJob, the same as "index".

Of course, you need to pass all unindexed segments to the
index command or call "index" iteratively.
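
The iterative variant could be sketched roughly as follows. This is only an illustration, not part of Nutch: the names NUTCH and index_segments are made up for the example, and it assumes the crawldb/linkdb/segments layout that bin/crawl writes under the crawl directory. It also indexes every segment it finds; keeping track of which segments were already indexed is left to you.

```shell
#!/usr/bin/env bash
# Sketch only: NUTCH and index_segments are illustrative names, not part of Nutch.
# Assumes the layout written by bin/crawl:
#   <Crawl Dir>/crawldb, <Crawl Dir>/linkdb, <Crawl Dir>/segments/<timestamp>
NUTCH="${NUTCH:-bin/nutch}"

# Step 1 (run separately): crawl without -i so nothing is sent to the indexer.
#   bin/crawl -s urls/ TestCrawl/ 2

# Step 2: call "index" once per segment found under the crawl directory.
index_segments() {
  local crawl_dir="${1%/}"   # drop a trailing slash, e.g. "TestCrawl/" -> "TestCrawl"
  local segment
  for segment in "$crawl_dir"/segments/*/; do
    [ -d "$segment" ] || continue   # glob did not match: no segments yet
    "$NUTCH" index "$crawl_dir/crawldb" -linkdb "$crawl_dir/linkdb" "${segment%/}" || return 1
  done
}

# Usage: index_segments TestCrawl/
```

The indexer settings (for example the Solr URL) are taken from the Nutch configuration as usual, so the loop itself carries no Solr-specific arguments. If your Nutch version's "index" usage lists a -dir <segments> option, a single call with the whole segments directory should work in place of the loop.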

Best,
Sebastian

On 19.10.19 23:05, Dave Beckstrom wrote:

> Hi Everyone,
>
> Reading the help for the nutch crawl script, I have a question.  If I run
> the crawl script without the -i parameter, does that mean the crawl will
> run and complete without updating SOLR?  I need to crawl pages without
> updating SOLR.  Then I'll use solrindex to push the crawled content into
> SOLR later, when I'm ready.
>
>
>
> Usage: crawl [-i|--index] [-D "key=value"] [-s <Seed Dir>] <Crawl Dir> <Num Rounds>
> -i|--index Indexes crawl results into a configured indexer
> -D... A Java property to pass to Nutch calls
> -s <Seed Dir> Directory in which to look for a seeds file
> <Crawl Dir> Directory where the crawl/link/segments dirs are saved
> <Num Rounds> The number of rounds to run this crawl for
>      Example: bin/crawl -i -s urls/ TestCrawl/  2
>