Fwd: Crawling 3 websites from one nutch

Fwd: Crawling 3 websites from one nutch

Zara Parst
Hi, is it possible to crawl three different websites, such as

1. https://www.urgenthomework.com/
2. https://www.myassignmenthelp.net/
3. https://www.assignmenthelp.net/

in a single Nutch configuration and then send the indexed pages of each site
to its corresponding core [uah, mah, yah] in Solr? I tried to achieve this
with exchanges and writer ids. Please see my configuration below.

-------------exchange.xml---------------------------------

<exchange id="uahIndexernew" class="default">
  <writers>
    <writer id="indexer_solr_1" />
  </writers>
  <params>
    <param name="expr" value="doc.getFieldValue('host')=='urgenthomework.com'" />
  </params>
</exchange>

<exchange id="mahIndexernew" class="default">
  <writers>
    <writer id="indexer_solr_2" />
  </writers>
  <params>
    <param name="expr" value="doc.getFieldValue('host')=='myassignmenthelp.net'" />
  </params>
</exchange>

<exchange id="yahIndexernew" class="default">
  <writers>
    <writer id="indexer_solr_3" />
  </writers>
  <params>
    <param name="expr" value="doc.getFieldValue('host')=='assignmenthelp.net'" />
  </params>
</exchange>



---------------------------------index.writers.xml----------------------------------------

 <writer id="indexer_solr_1"
class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
    <parameters>
      <param name="type" value="http" />
      <param name="url" value="http://localhost:8983/solr/uah" />
      <param name="collection" value="" />
      <param name="weight.field" value="" />
      <param name="commitSize" value="1000" />
      <param name="auth" value="false" />
      <param name="username" value="username" />
      <param name="password" value="password" />
    </parameters>
    <mapping>
      <copy>
        <!-- <field source="title" dest="content" />
        <field source="metatag.description" dest="content" />
        <field source="metatag.keywords" dest="content" /> -->
      </copy>
      <rename></rename>
      <remove>
        <field source="segment" />
        <field source="host" />
        <field source="url" />
        <!-- <field source="metatag.description" />
        <field source="metatag.keywords" />
        <field source="date" />
        <field source="url" />
         -->
      </remove>
    </mapping>
  </writer>


  <writer id="indexer_solr_2"
class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
    <parameters>
      <param name="type" value="http" />
      <param name="url" value="http://localhost:8983/solr/mah" />
      <param name="collection" value="" />
      <param name="weight.field" value="" />
      <param name="commitSize" value="1000" />
      <param name="auth" value="false" />
      <param name="username" value="username" />
      <param name="password" value="password" />
    </parameters>
    <mapping>
      <copy>
      </copy>
      <rename></rename>
      <remove>
        <field source="segment" />
        <field source="host" />
        <field source="url" />
      </remove>
    </mapping>
  </writer>



  <writer id="indexer_solr_3"
class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
    <parameters>
      <param name="type" value="http" />
      <param name="url" value="http://localhost:8983/solr/yah" />
      <param name="collection" value="" />
      <param name="weight.field" value="" />
      <param name="commitSize" value="1000" />
      <param name="auth" value="false" />
      <param name="username" value="username" />
      <param name="password" value="password" />
    </parameters>
    <mapping>
      <copy>
      </copy>
      <rename></rename>
      <remove>
        <field source="segment" />
        <field source="host" />
        <field source="url" />
      </remove>
    </mapping>
  </writer>

---------------------------------------------------------------------------------------------------------------

But it is not pushing data into the corresponding cores; instead, it sends
the data from all the domains into a single core. Please let me know what is
wrong. I am sure there has to be a way to achieve this. I didn't try with
subcollections.xml. Do you think I can achieve it using subcollections?
Re: Fwd: Crawling 3 websites from one nutch

Sebastian Nagel-2
Hi,

the test compares names of the "host" and the registered domain:
  doc.getFieldValue('host')=='urgenthomework.com'

The host name is "www.urgenthomework.com". You can test it via:

  $> bin/nutch indexchecker https://www.urgenthomework.com/
  fetching: https://www.urgenthomework.com/
  ...
  host :  www.urgenthomework.com
  ...
  title : Homework Help for College, University and School Students
  ...
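
As a sketch of what that implies for the configuration, each expression
would compare against the full host name, including the "www." prefix. Only
the compared strings differ from the posted exchanges. The second and third
host names are assumed from the posted URLs and were not checked with
indexchecker here. The class attribute is left exactly as posted; whether it
should instead name the JEXL exchange class from the exchange-jexl plugin is
an assumption worth checking against the exchanges.xml shipped with Nutch.

```xml
<!-- Sketch only: same exchanges as posted, with the full host names. -->
<exchange id="uahIndexernew" class="default">
  <writers>
    <writer id="indexer_solr_1" />
  </writers>
  <params>
    <!-- host as reported by indexchecker above -->
    <param name="expr" value="doc.getFieldValue('host')=='www.urgenthomework.com'" />
  </params>
</exchange>

<exchange id="mahIndexernew" class="default">
  <writers>
    <writer id="indexer_solr_2" />
  </writers>
  <params>
    <!-- assumed host, taken from https://www.myassignmenthelp.net/ -->
    <param name="expr" value="doc.getFieldValue('host')=='www.myassignmenthelp.net'" />
  </params>
</exchange>

<exchange id="yahIndexernew" class="default">
  <writers>
    <writer id="indexer_solr_3" />
  </writers>
  <params>
    <!-- assumed host, taken from https://www.assignmenthelp.net/ -->
    <param name="expr" value="doc.getFieldValue('host')=='www.assignmenthelp.net'" />
  </params>
</exchange>
```

Running indexchecker against one URL from each site, as above, is the way to
confirm the exact value of the "host" field before relying on these strings.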

Best,
Sebastian


On 12/26/19 11:29 AM, Zara Parst wrote:

> [...]

Re: Fwd: Crawling 3 websites from one nutch

Zara Parst
Thanks @Sebastian, but that didn't help either. I think that is not the
right way to push to different cores.

On Fri, Dec 27, 2019 at 5:10 PM Sebastian Nagel
<[hidden email]> wrote:

> [...]
Re: Crawling 3 websites from one nutch

Richard Lavin
Thanks Rick


> On Dec 26, 2019, at 2:30 AM, Zara Parst <[hidden email]> wrote:
>
> [...]

Re: Crawling 3 websites from one nutch

Richard Lavin
Thanks


> On Dec 27, 2019, at 7:58 PM, Richard Lavin <[hidden email]> wrote:
>
> [...]