Doesn't seem to be indexing

raycrawford
I think I have Nutch set up right (Nutch 1.13 and Solr 6.6.0).  When I try
to crawl stuff and send it to Solr, it doesn't seem to be getting any
content.  Here's the code I'm using to get web content and push it to Solr:

mkdir -p /opt/nutch/urls
echo 'http://www.with-impact.com' > /opt/nutch/urls/seed.txt
vi /opt/nutch/conf/regex-urlfilter.txt
# +.
export JAVA_HOME='/etc/alternatives/jre_1.8.0'
/opt/solr/bin/solr create -c nutch_solr_data_core
/opt/nutch/bin/nutch inject crawl/crawldb urls/seed.txt
cd /opt/nutch
/opt/nutch/bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d /opt/nutch/crawl/segments/2* | tail -1`
/opt/nutch/bin/nutch fetch $s1
/opt/nutch/bin/nutch parse $s1
/opt/nutch/bin/nutch updatedb crawl/crawldb $s1
/opt/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments
/opt/nutch/bin/nutch solrindex http://localhost:8983/solr/nutch_solr_data_core crawl/crawldb/ -linkdb crawl/linkdb/ $s1


Am I missing a step?

I wouldn't mind using Nutch 2, but I didn't see a good tutorial for Nutch
2/Solr 6 integration.  Can anyone point me to one?

Thanks!

Re: Doesn't seem to be indexing

Michael Chen
Hi Ray,

Not sure about 1.x, but for 2.x, setting up Solr 6.6 is the same as in the 1.x tutorial. The only catch is that the schema.xml shipped with 1.x has not been updated for Solr 6.6.0, so Solr will report some errors when you first start it up, but you can easily find solutions on Stack Overflow...

Also, you might need to delete managed-schema after updating schema.xml so that Solr re-reads the configs.
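
A rough sketch of that schema refresh, in case it helps. The helper name refresh_core_schema is made up for illustration, and both directory paths in the example call are assumptions about a default install (the core's conf directory in particular may live elsewhere on your machine), so adjust them before running anything:

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch or Solr): copy Nutch's schema.xml
# into a Solr core's conf directory, then remove the generated managed-schema
# so Solr picks up schema.xml the next time the core is loaded.
refresh_core_schema() {
    nutch_conf=$1   # directory holding Nutch's schema.xml (assumed: /opt/nutch/conf)
    core_conf=$2    # the Solr core's conf directory (assumed location, see below)

    if [ ! -f "$nutch_conf/schema.xml" ]; then
        echo "no schema.xml found in $nutch_conf" >&2
        return 1
    fi

    cp "$nutch_conf/schema.xml" "$core_conf/schema.xml" || return 1
    rm -f "$core_conf/managed-schema"
}

# Example call; both paths are assumptions, not taken from your setup:
# refresh_core_schema /opt/nutch/conf /opt/solr/server/solr/nutch_solr_data_core/conf
# Restart Solr afterwards so the core is reloaded:
# /opt/solr/bin/solr restart -p 8983
```

If Solr still complains after the restart, the messages in the Solr log should name the field types or classes it does not recognize, and those are the parts of schema.xml that need fixing.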

Hope this helps!
Michael

> On Aug 4, 2017, at 04:44, Ray Crawford <[hidden email]> wrote: