Nutch to SolR. First steps


Alex McLintock
I'm trying to send my Nutch crawl to SolR. I've "generated, fetched,
updated" several times, and I've run invertlinks.
But when I run solrindex it just sits there for ages and
doesn't seem to stress the Solr server at all.

I'm using Nutch 1.0, Sun Java 1.6, Ubuntu Linux 9.04.

/local/apps/software/nutch$ bin/nutch solrindex http://rio23:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Is there some kind of "verbose" option so that I can better see what
it is doing? I could maybe insert some extra debugging, or do I need to
run this in Eclipse?
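
One thing that might help here, as a rough sketch only: Nutch 1.0 logs through log4j, so raising the log level for the indexer package should make the hadoop log more talkative. The helper name enable_indexer_debug is made up for illustration, and the paths assume the stock layout (conf/log4j.properties and logs/hadoop.log under the Nutch install directory). How much extra output DEBUG actually produces depends on what the indexer classes log.

```shell
# Hypothetical helper: turn on DEBUG logging for the org.apache.nutch.indexer
# package by appending a standard log4j 1.x logger line to a properties file.
# Assumption: Nutch reads its logging config from conf/log4j.properties.
enable_indexer_debug() {
  props="$1"
  line='log4j.logger.org.apache.nutch.indexer=DEBUG'
  # Append only if the exact line is not already present.
  grep -qxF "$line" "$props" || printf '%s\n' "$line" >> "$props"
}

# Usage (from the Nutch install directory), then rerun solrindex:
#   enable_indexer_debug conf/log4j.properties
#   tail -f logs/hadoop.log
```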

The Java process seems to be using up most of a core's CPU time so it
seems to be doing *something*.
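
Since the process is clearly busy, a thread dump might show where the time goes. A minimal sketch, assuming the JDK tools jps and jstack that ship with Sun Java 1.6 are on the PATH; nutch_pids is just an illustrative helper name, not part of Nutch.

```shell
# Hypothetical helper: read `jps -l` output (pid, then main class) on stdin
# and print the pids whose main class lives under org.apache.nutch.
nutch_pids() {
  awk '$2 ~ /^org\.apache\.nutch\./ { print $1 }'
}

# Usage (not run here): dump the threads of every running Nutch JVM.
#   for pid in $(jps -l | nutch_pids); do jstack "$pid"; done
```

Taking two or three dumps a few seconds apart and comparing the stack of the busy thread is usually more telling than a single dump.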

This is my first Solr project, so I have proved that it is up and
running, but haven't actually added any data to it yet...

Alex
Re: Nutch to SolR. First steps

Alex McLintock
Further information to this....

I'm running on a single machine in fake clustering mode.

A tmp directory gets created, with nothing but another empty directory
inside of it.

The hadoop log file just says the same thing over and over every 30 seconds....

2009-08-11 20:20:57,803 INFO  plugin.PluginRepository - Plugins: looking in: /local/apps/software/nutch/plugins
2009-08-11 20:20:58,158 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-08-11 20:20:58,159 INFO  plugin.PluginRepository - Registered Plugins:
2009-08-11 20:20:58,159 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2009-08-11 20:20:58,159 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2009-08-11 20:20:58,159 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2009-08-11 20:20:58,159 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2009-08-11 20:20:58,159 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2009-08-11 20:20:58,160 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2009-08-11 20:20:58,160 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2009-08-11 20:20:58,160 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2009-08-11 20:20:58,160 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2009-08-11 20:20:58,160 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2009-08-11 20:20:58,160 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2009-08-11 20:20:58,160 INFO  plugin.PluginRepository -         XML Response Writer Plug-in (response-xml)
2009-08-11 20:20:58,160 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2009-08-11 20:20:58,160 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2009-08-11 20:20:58,160 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2009-08-11 20:20:58,161 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
2009-08-11 20:20:58,161 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2009-08-11 20:20:58,161 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2009-08-11 20:20:58,161 INFO  plugin.PluginRepository -         JSON Response Writer Plug-in (response-json)
2009-08-11 20:20:58,161 INFO  plugin.PluginRepository - Registered Extension-Points:
2009-08-11 20:20:58,161 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2009-08-11 20:20:58,161 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2009-08-11 20:20:58,161 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2009-08-11 20:20:58,162 INFO  plugin.PluginRepository -         Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
2009-08-11 20:20:58,162 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2009-08-11 20:20:58,162 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2009-08-11 20:20:58,162 INFO  plugin.PluginRepository -         Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2009-08-11 20:20:58,162 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2009-08-11 20:20:58,162 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2009-08-11 20:20:58,162 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2009-08-11 20:20:58,162 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2009-08-11 20:20:58,162 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2009-08-11 20:20:58,163 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-08-11 20:20:58,163 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-08-11 20:20:58,171 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-08-11 20:20:58,202 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter



Is Solr output a plugin, and is it not set up above?

RE: Nutch to SolR. First steps

Brian Tingle
I don't know the answer, but I'd also check the Tomcat/J2EE container
logs to see if there are any clues. This helped me solve a problem with
nutch solrindex once. Also, I think the data directory for Solr should
be growing as documents are added.
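
Something like the following could be used to check whether anything has arrived. It is only a sketch: it assumes the standard Solr select handler under the URL from Alex's command, and solr_num_found is a made-up helper that pulls the numFound attribute out of the XML response. The data directory path is a placeholder, since the thread does not say where Solr keeps its index.

```shell
# Hypothetical helper: read a Solr XML query response on stdin and print
# the value of the first numFound="..." attribute it contains.
solr_num_found() {
  sed -n 's/.*numFound="\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Usage (not run here; host and port taken from the solrindex command):
#   curl -s 'http://rio23:8983/solr/select?q=*:*&rows=0' | solr_num_found
#   du -sh /path/to/solr/data    # placeholder path; rerun to see if it grows
```

A count that stays at 0 while the job runs does not prove that nothing was sent, because documents only become visible to searches after a commit.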

RE: Nutch to SolR. First steps

Davide.D'ALESSANDRO
Hi Alex,

Based on my experience, you just need to wait. In my case, the last time I indexed just 400MB of data, Solr took around 1h.
The server had 2GB of RAM and a dual processor, with no other software installed on it (Red Hat OS).

Hope it helps
Davide

Re: Nutch to SolR. First steps

Alex McLintock
OK,

I'm trying to use the SolrIndexer with Nutch 1.0 and nothing seems to
be sent to Solr.

I've put some more debug logging into the SolrIndexer and SolrWriter
classes. It seems like although the SolrWriter class is told to open()
and close() it is never told to write() anything in between.

Why would that be? Surely Nutch should be sending everything to Solr?
Is there some other kind of filtering going on? How could I find out?
Hadoop takes ages to do the "map", and then the reduce finishes quite
quickly and produces nothing...
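
One way to start narrowing this down, offered as a guess rather than a diagnosis: as far as I understand it, the indexer only emits documents for URLs that have parse data in the segment, so a segment that was fetched but never parsed would leave nothing to write. The sketch below assumes the segments are on the local filesystem with the usual layout (parse_data and parse_text subdirectories per segment); unparsed_segments is a hypothetical helper name.

```shell
# Hypothetical helper: print every segment directory under $1 that lacks
# a parse_data or parse_text subdirectory.
unparsed_segments() {
  for seg in "$1"/*; do
    [ -d "$seg" ] || continue
    if [ ! -d "$seg/parse_data" ] || [ ! -d "$seg/parse_text" ]; then
      printf '%s\n' "$seg"
    fi
  done
}

# Usage (from the Nutch install directory):
#   unparsed_segments crawl/segments
#   bin/nutch readseg -list -dir crawl/segments   # per-segment fetched/parsed counts
```

If every segment has parse data, the URL filters and the indexing filters listed in the log would be the next place to look.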

Alex