Indexing urlmeta fields into Solr 5.5.3 (Was RE: Failing to index from Nutch 1.12 to Solr 5.5.3)

Indexing urlmeta fields into Solr 5.5.3 (Was RE: Failing to index from Nutch 1.12 to Solr 5.5.3)

Chip Calhoun
We've found that the solrindex process chokes on the custom metadata fields I added to Nutch using the urlmeta plugin. A sample of the lengthy error messages:

java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/phfaws: ERROR: [doc=http://academics.wellesley.edu/lts/archives/3/3L_Astronomy.html] unknown field 'icosreposurl'
     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)

As mentioned in my previous message, I've copied my Nutch schema.xml into my Solr's conf folder, but since my Solr instance didn't already have a schema.xml file, I'm not convinced it's being read. How do I set up Solr to accept these new fields?

Chip

________________________________________
From: Chip Calhoun [[hidden email]]
Sent: Friday, February 03, 2017 11:45 AM
To: [hidden email]
Subject: Failing to index from Nutch 1.12 to Solr 5.5.3

I'm switching to a more recent Nutch/Solr setup after years of using Nutch 1.4 and Solr 3.3.0. I get no results when I index into Solr, and I can't tell where this breaks down.

I use these commands:
cd /opt/apache-nutch-1.12/runtime/local
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.121.x86_64
export NUTCH_CONF_DIR=/opt/apache-nutch-1.12/runtime/local/conf/phfaws
bin/crawl urls/phfaws crawl/phfaws 1
bin/nutch solrindex http://localhost:8983/solr/phfaws/ crawl/phfaws/crawldb -linkdb crawl/phfaws/linkdb crawl/phfaws/segments/*

I believe that Nutch is crawling properly, although I do notice that the crawl folders end up only about 25% as large as those I produced with Nutch 1.4. I suspect that the problem is in the Nutch/Solr integration. My Solr core didn't create a schema.xml; it uses a managed-schema instead. I've copied the schema.xml from my Nutch local conf into Solr, but I haven't seen anything saying I'm supposed to do more with it.


Chip Calhoun
Digital Archivist
Niels Bohr Library & Archives
American Institute of Physics
One Physics Ellipse
College Park, MD  20740
301-209-3180
https://www.aip.org/history-programs/niels-bohr-library


AW: Indexing urlmeta fields into Solr 5.5.3 (Was RE: Failing to index from Nutch 1.12 to Solr 5.5.3)

André Schild
Hello Chip,

> We've found that the solrindex process chokes on the custom metadata fields I added to Nutch using the urlmeta plugin. A sample of the lengthy error messages:
>
> java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/phfaws: ERROR: [doc=http://academics.wellesley.edu/lts/archives/3/3L_Astronomy.html] unknown field 'icosreposurl'
>      at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>
> As mentioned in my previous message, I've copied my Nutch schema.xml into my Solr's conf folder, but since my Solr instance didn't already have a schema.xml file, I'm not convinced it's being read. How do I set up Solr to accept these new fields?

Does that schema.xml file contain a definition for a field named "icosreposurl"?
If not, then you have to add it. The example schema.xml does not cover every case that is possible with Nutch.
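
For illustration, an entry along these lines in schema.xml would declare the missing field (the "string" type and the indexed/stored flags are assumptions based on the stock example schemas, so adjust them to whatever your urlmeta values need):

<!-- hypothetical definition for the custom urlmeta field named in the error -->
<field name="icosreposurl" type="string" indexed="true" stored="true"/>

Each of your other custom urlmeta fields would need a similar <field> line.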

André


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Indexing urlmeta fields into Solr 5.5.3 (Was RE: Failing to index from Nutch 1.12 to Solr 5.5.3)

Chip Calhoun
Hi André,

Yes, my schema.xml has field definitions for the 5 new fields I index using Nutch urlmeta, so I definitely need to make sure it's being read.

It looks like I'll need to scrap this Solr core and build a new one. I had created this one using:
/opt/solr/bin/solr create_core -c phfaws -d basic_configs
...and that got me a managed-schema rather than a schema.xml. Is there a way to build a core that will definitely use schema.xml?
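
In case it clarifies what I'm after: my (untested) understanding is that a core can be pointed at the classic schema.xml by editing its solrconfig.xml, roughly like this, but treat it as a sketch rather than something I've verified:

<!-- in the core's conf/solrconfig.xml: use schema.xml instead of the managed schema -->
<schemaFactory class="ClassicIndexSchemaFactory"/>

...after which I'd rename or remove the managed-schema file, drop my Nutch schema.xml into the same conf directory, and reload the core.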

Chip


________________________________________
From: André Schild [[hidden email]]
Sent: Saturday, February 04, 2017 3:26 AM
To: [hidden email]
Subject: AW: Indexing urlmeta fields into Solr 5.5.3 (Was RE: Failing to index from Nutch 1.12 to Solr 5.5.3)

Hello Chip,

> We've found that the solrindex process chokes on the custom metadata fields I added to Nutch using the urlmeta plugin. A sample of the lengthy error messages:
>
> java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/phfaws: ERROR: [doc=http://academics.wellesley.edu/lts/archives/3/3L_Astronomy.html] unknown field 'icosreposurl'
>      at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>
> As mentioned in my previous message, I've copied my Nutch schema.xml into my Solr's conf folder, but since my Solr instance didn't already have a schema.xml file, I'm not convinced it's being read. How do I set up Solr to accept these new fields?

Does that schema.xml file contain a definition for a field named "icosreposurl"?
If not, then you have to add it. The example schema.xml does not cover every case that is possible with Nutch.

André



Re: Indexing urlmeta fields into Solr 5.5.3 (Was RE: Failing to index from Nutch 1.12 to Solr 5.5.3)

Michael Coffey
You can create a core manually in the file system, in a specific place where Solr looks for cores when it starts up. I have mine in /opt/solr/server/solr. This at least works in Solr 5.4.1 (I haven't tried other versions).

The core needs a conf dir and a properties file. The properties file should contain a property that points to the actual data directory. The conf dir contains schema.xml and a bunch of other files. So, for a core named "popular", I have
/opt/solr/server/solr
    popular
        core.properties
        conf
            schema.xml
            (other files, including stopwords.txt)
        popular_data
            (if initially empty, Solr creates subdirectories here)
You may find more information by googling <solr core.properties instance directory>
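
As a concrete illustration, a minimal core.properties for that layout could look like the following (property names come from Solr's core-discovery format; the values are assumptions matching the tree above rather than a copy of a real install):

# /opt/solr/server/solr/popular/core.properties
name=popular
# points Solr at the actual data directory, relative to the core's instance dir
dataDir=popular_data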

From: Chip Calhoun <[hidden email]>
To: "[hidden email]" <[hidden email]>
Sent: Monday, February 6, 2017 7:10 AM
Subject: RE: Indexing urlmeta fields into Solr 5.5.3 (Was RE: Failing to index from Nutch 1.12 to Solr 5.5.3)

Hi André,

Yes, my schema.xml has field definitions for the 5 new fields I index using Nutch urlmeta, so I definitely need to make sure it's being read.

It looks like I'll need to scrap this Solr core and build a new one. I had created this one using:
/opt/solr/bin/solr create_core -c phfaws -d basic_configs
...and that got me a managed-schema rather than a schema.xml. Is there a way to build a core that will definitely use schema.xml?

Chip


________________________________________
From: André Schild [[hidden email]]
Sent: Saturday, February 04, 2017 3:26 AM
To: [hidden email]
Subject: AW: Indexing urlmeta fields into Solr 5.5.3 (Was RE: Failing to index from Nutch 1.12 to Solr 5.5.3)

Hello Chip,

> We've found that the solrindex process chokes on the custom metadata fields I added to Nutch using the urlmeta plugin. A sample of the lengthy error messages:
>
> java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/phfaws: ERROR: [doc=http://academics.wellesley.edu/lts/archives/3/3L_Astronomy.html] unknown field 'icosreposurl'
>      at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>
> As mentioned in my previous message, I've copied my Nutch schema.xml into my Solr's conf folder, but since my Solr instance didn't already have a schema.xml file, I'm not convinced it's being read. How do I set up Solr to accept these new fields?

Does that schema.xml file contain a definition for a field named "icosreposurl"?
If not, then you have to add it. The example schema.xml does not cover every case that is possible with Nutch.

André




   

webgraph speed

Michael Coffey
Hello nutchers!
I am trying to compute linkrank scores without spending excessive time on the task. My version of the crawl script contains the following line, which is similar to a commented-out line in the bin/crawl script in the 1.12 distribution.
__bin_nutch webgraph $commonOptions -filter -normalize -segmentDir "$CRAWL_PATH"/segments/ -webgraphdb "$CRAWL_PATH"
I notice that it specifies -segmentDir rather than -segment. Does that mean it re-computes the outlinkdb and the other graph information for every existing segment each time it handles a new one, or does it check and avoid redoing work it has already done?
If I change it to -segment "$CRAWL_PATH"/segments/$SEGMENT, will it do only what actually needs doing? The way I have it now, it spends a lot of time computing the outlinkdb.
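
For concreteness, the single-segment variant I have in mind would be something like this (a sketch only; $SEGMENT is assumed to hold the name of the segment produced in the current iteration of the crawl loop):

__bin_nutch webgraph $commonOptions -filter -normalize -segment "$CRAWL_PATH"/segments/$SEGMENT -webgraphdb "$CRAWL_PATH"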
Thanks for any light you may shed.

RE: webgraph speed

Markus Jelsma-2
Hello,

Start by disabling filtering and normalizing; that was already done in the parser. Only enable them for a single run if you have changed your filters and/or normalizers. You can use -segment to update an existing graph. By the way, is building the graph really a performance problem? What about computing the linkrank, which is much more costly?
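
As a sketch of how that could look in the crawl script (same variables as in your quoted line; the linkrank invocation mirrors the commented-out one shipped with bin/crawl, so verify the exact options against your copy of the script):

__bin_nutch webgraph $commonOptions -segment "$CRAWL_PATH"/segments/$SEGMENT -webgraphdb "$CRAWL_PATH"
__bin_nutch linkrank $commonOptions -webgraphdb "$CRAWL_PATH"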

Markus

-----Original message-----

> From:Michael Coffey <[hidden email]>
> Sent: Thursday 2nd March 2017 2:07
> To: [hidden email]
> Subject: webgraph speed
>
> Hello nutchers!
> I am trying to compute linkrank scores without spending excessive time on the task. My version of the crawl script contains the following line, which is similar to a commented-out line in the bin/crawl script in the 1.12 distribution.
> __bin_nutch webgraph $commonOptions -filter -normalize -segmentDir "$CRAWL_PATH"/segments/ -webgraphdb "$CRAWL_PATH"
> I notice that it specifies -segmentDir rather than -segment. Does that mean it re-computes the outlinkdb and the other graph information for every existing segment each time it handles a new one, or does it check and avoid redoing work it has already done?
> If I change it to -segment "$CRAWL_PATH"/segments/$SEGMENT, will it do only what actually needs doing? The way I have it now, it spends a lot of time computing the outlinkdb.
> Thanks for any light you may shed.
>