Nutch 1.15 IndexWriter -- how to explicitly choose one?

Felix von Zadow

Hello dear list!

I have a problem with the new IndexWriter mechanism in 1.15. Hopefully someone can point out to me what I should do differently.

I have a couple of test systems running different versions of a web application and there is a separate SOLR core for each of them. There is a single VM that crawls and indexes content from scratch for every test system that has been redeployed. So up until 1.14 I would simply specify the target core (solr.server.url) when calling bin/crawl. Say, today I have redeployed test_system_1, so I call bin/crawl to update the SOLR core test_system_1.

Now with 1.15 I can no longer explicitly choose a target index, so I tried the following: in index-writers.xml, I specified an IndexWriter for each of my systems/cores. To choose which IndexWriter to use, I specified an exchange for every test system in exchanges.xml. It maps the host name (unique to each test system) to the correct IndexWriter (and therefore the correct core). This leaves me with two problems, though:

1. I only ever want to index to one specific core during one crawl cycle and I already KNOW its name. However, the Exchange expressions are evaluated for every single document I'm indexing. The expression evaluates fine though, so it "works" and this being a test environment, I could live with it.

2. All IndexWriters referenced by ANY of the Exchanges must actually reference existing cores, even when only one of the IndexWriters is ever actually used. If any of the referenced cores does NOT exist, Nutch gets a 404 for the non-existing core during the indexing phase and breaks. I assume Nutch checks all referenced IndexWriters before starting indexing just to be sure they are all available.

Problem #2 is the crux for me since I can't reliably guarantee that all (unrelated) cores are available during a certain crawl (and why should I need to?).


It's possible that my design is broken or my use case uncommon. But it seems to me that I should be able to somewhat easily achieve what I could with 1.14, i.e. explicitly choose the target core for each call of bin/crawl. A solution would of course be to set up a separate crawling VM for each test system, each with a single IndexWriter. But that can't be the way to go.

Grateful for any kind of pointer towards a solution!

Felix


Re: Nutch 1.15 IndexWriter -- how to explicitly choose one?

Sebastian Nagel-2
Hi Felix,

assuming that every test crawl runs on its own, not sharing resources with other test crawls
(except the Nutch packages): you may just write a separate index-writers.xml for every test, place
it in a separate directory and point NUTCH_CONF_DIR to this directory.
This works only in local mode (assuming that the tests do not run on a Hadoop cluster).

This may look like:
 .../
 |- test1/
 |  `- conf/
 |     |- index-writers.xml
 |     `- regex-urlfilter.txt
 |- test2/
 |  `- conf/
 |     |- index-writers.xml
 ...

Now you run the test crawls with NUTCH_CONF_DIR set as an environment variable:
 NUTCH_CONF_DIR=.../test1/conf:$NUTCH_HOME/conf  $NUTCH_HOME/bin/crawl
and
 NUTCH_CONF_DIR=.../test2/conf:$NUTCH_HOME/conf  $NUTCH_HOME/bin/crawl

Configuration files are then picked first from test1/conf/ (resp. test2/conf/) and, if not
found there, from $NUTCH_HOME/conf or from the class path.
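
To make the lookup order concrete, here is a small sketch that mimics "the first directory containing the file wins" for a colon-separated list such as NUTCH_CONF_DIR. It only illustrates the behaviour described above and is not how Nutch itself resolves files; the function name find_conf_file is made up for this example.

```shell
# Illustration only: model of "first directory that has the file wins".
# find_conf_file is a hypothetical helper, not part of Nutch.
find_conf_file() {
  # $1 = colon-separated directory list, $2 = file name to look for
  old_ifs=$IFS
  IFS=:
  for dir in $1; do
    if [ -f "$dir/$2" ]; then
      IFS=$old_ifs
      printf '%s\n' "$dir/$2"
      return 0
    fi
  done
  IFS=$old_ifs
  # Not found in any listed directory.
  return 1
}
```

Called as `find_conf_file "$NUTCH_CONF_DIR" index-writers.xml`, it prints the path of the copy that would take precedence under this model, which can help when checking that a per-test file really shadows the stock one.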

This also allows you to test different URL filter rules etc.

You may also set NUTCH_LOG_DIR for each test to log into different hadoop.log files.
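
Put together, a per-test wrapper could look roughly like the sketch below. TESTS_ROOT, the logs/ subdirectory and the function name run_test_crawl are assumptions made for this example; the arguments to bin/crawl are passed through unchanged, so use whatever your Nutch version expects.

```shell
# Sketch of a per-test wrapper. TESTS_ROOT and the logs/ subdirectory are
# hypothetical; NUTCH_HOME must point to the Nutch installation.
run_test_crawl() {
  # $1 = test name (e.g. test1); remaining arguments go to bin/crawl as-is
  test_name=$1
  shift
  conf_dir="$TESTS_ROOT/$test_name/conf"
  log_dir="$TESTS_ROOT/$test_name/logs"
  if [ ! -f "$conf_dir/index-writers.xml" ]; then
    echo "no index-writers.xml in $conf_dir" >&2
    return 1
  fi
  mkdir -p "$log_dir"
  # Test-specific files first, stock configuration as fallback;
  # each test writes its own hadoop.log.
  NUTCH_CONF_DIR="$conf_dir:$NUTCH_HOME/conf" \
  NUTCH_LOG_DIR="$log_dir" \
    "$NUTCH_HOME/bin/crawl" "$@"
}
```

The check for index-writers.xml is there so that a mistyped test name fails early instead of silently falling back to the stock configuration.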


That's the easiest way I see so far. Unfortunately, the file names themselves are not
configurable for index writers and exchanges configuration files. I've opened
  https://issues.apache.org/jira/browse/NUTCH-2718
to get this resolved.


Best,
Sebastian


On 5/22/19 11:19 AM, Felix von Zadow wrote:

AW: Nutch 1.15 IndexWriter -- how to explicitly choose one?

Felix von Zadow

Hi Sebastian!

Thank you for your suggestion and detailed explanation!

Putting my index-writers.xml in a separate directory for each test system but leaving the rest in a common directory does the trick!
Being able to configure the file names would sure be nice but for now I don't mind having separate directories.
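
For anyone with a similar setup, the per-system directories can be generated from one shared template, along the lines of the sketch below. It assumes the template keeps the Solr URL in a line of the form `<param name="url" value="..."/>`; TESTS_ROOT, the template path, the URL in the usage line and the function name make_test_conf are placeholders for this example.

```shell
# Sketch: derive one conf directory per test system from a shared template.
# Assumes the template holds the Solr URL in a line of the form
#   <param name="url" value="..."/>
# TESTS_ROOT and make_test_conf are hypothetical names.
make_test_conf() {
  # $1 = test system name, $2 = Solr core URL, $3 = template index-writers.xml
  conf_dir="$TESTS_ROOT/$1/conf"
  mkdir -p "$conf_dir"
  # Only index-writers.xml lives here; everything else stays in the common conf.
  sed "s|\(<param name=\"url\" value=\"\)[^\"]*\"|\1$2\"|" "$3" \
    > "$conf_dir/index-writers.xml"
}
```

For example, `make_test_conf test_system_1 http://localhost:8983/solr/test_system_1 "$NUTCH_HOME/conf/index-writers.xml"` would write a copy that points at that core, provided the template matches the assumed format.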

Felix

Re: AW: Nutch 1.15 IndexWriter -- how to explicitly choose one?

Sebastian Nagel-2
> Putting my index-writers.xml in a separate directory for each test system but leaving
> the rest in a common directory does the trick!

Great! Thanks for the notice!

> Being able to configure the file names would sure be nice but for now I don't mind
> having separate directories.

It's a rather trivial improvement. But we'll do it. :)

On 5/27/19 11:46 AM, Felix von Zadow wrote:
