[jira] Created: (NUTCH-760) Allow field mapping from nutch to solr index


[jira] Created: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)
Allow field mapping from nutch to solr index
--------------------------------------------

                 Key: NUTCH-760
                 URL: https://issues.apache.org/jira/browse/NUTCH-760
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
            Reporter: David Stuart


I am using nutch to crawl sites and have combined it
with solr, pushing the nutch index using the solrindex command. I have
set it up as specified on the wiki, using the copyField url to id in the
schema. Whilst this works fine, it messes up my inputs from other
sources in solr (e.g. using the solr data import handler), as they have
both ids and urls. I have a patch that implements a nutch xml schema
defining what basic nutch fields map to in your solr push.
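
To illustrate the idea, here is a minimal sketch of renaming fields on the nutch side before the push, so that the solr schema needs no copyField. It is not the attached patch: the class name, method name and the example field names are all assumptions made up for this illustration, and a plain Map stands in for the real document classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: renames nutch document fields to solr field names. */
class FieldMappingSketch {

  /**
   * Returns a copy of nutchDoc whose field names are translated through
   * mapping. Fields without a mapping entry keep their original name.
   */
  static Map<String, Object> toSolrFields(Map<String, Object> nutchDoc,
                                          Map<String, String> mapping) {
    Map<String, Object> solrDoc = new LinkedHashMap<>();
    for (Map.Entry<String, Object> field : nutchDoc.entrySet()) {
      String solrName = mapping.getOrDefault(field.getKey(), field.getKey());
      solrDoc.put(solrName, field.getValue());
    }
    return solrDoc;
  }

  public static void main(String[] args) {
    // Assumed example mapping: the nutch "url" field is sent as solr "id".
    Map<String, String> mapping = new LinkedHashMap<>();
    mapping.put("url", "id");

    Map<String, Object> nutchDoc = new LinkedHashMap<>();
    nutchDoc.put("url", "http://example.com/");
    nutchDoc.put("title", "Example");

    System.out.println(toSolrFields(nutchDoc, mapping));
  }
}
```

Because the renaming happens in the client, documents arriving from other sources are left untouched by it.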

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Stuart updated NUTCH-760:
-------------------------------

    Attachment: solrindex_schema.patch

First pass at a schema reader for mapping basic nutch fields to solr


[jira] Updated: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Stuart updated NUTCH-760:
-------------------------------

    Attachment: solrindex_schema.patch

Oops, left out the schema file.


[jira] Commented: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766213#action_12766213 ]

Andrzej Bialecki  commented on NUTCH-760:
-----------------------------------------

Thanks David, this is a good start. We also need to address the searching part, i.e. SolrSearchBean, where Nutch hardcodes the same field names.
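
As a rough sketch of what the searching side could look like once field names are no longer hardcoded: the query builder below looks up each field name through the same kind of mapping before building a query string. This is not the SolrSearchBean code; the class, its methods and the example field names are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: resolves field names through a mapping at query time. */
class SearchFieldNamesSketch {
  private final Map<String, String> mapping;

  SearchFieldNamesSketch(Map<String, String> mapping) {
    this.mapping = mapping;
  }

  /** Returns the solr name for a nutch field, or the same name if unmapped. */
  String solrName(String nutchField) {
    return mapping.getOrDefault(nutchField, nutchField);
  }

  /** Builds a quoted field:"value" query, escaping backslashes and quotes. */
  String termQuery(String nutchField, String value) {
    String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
    return solrName(nutchField) + ":\"" + escaped + "\"";
  }

  public static void main(String[] args) {
    Map<String, String> mapping = new HashMap<>();
    mapping.put("url", "id"); // assumed example mapping
    SearchFieldNamesSketch names = new SearchFieldNamesSketch(mapping);
    System.out.println(names.termQuery("url", "http://example.com/"));
  }
}
```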


[jira] Updated: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Stuart updated NUTCH-760:
-------------------------------

    Attachment: solrindex_schema.patch

Updated patch with the modifications to the SolrSearchBean. Have also refactored a wee bit to allow other classes to hook into the solr index schema.


[jira] Commented: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767918#action_12767918 ]

Andrzej Bialecki  commented on NUTCH-760:
-----------------------------------------

A few comments to the latest patch:

* the description of the property in nutch-default.xml could be more descriptive ;)

* <schema> element has name and version attributes - do we really need these? It's not a Solr schema.xml anyway, so we don't have to pretend that we follow the same format.

* SolrSchemaReader uses static instance of NutchConfiguration - this is a big no-no, the whole point of using the property in nutch-default.xml is that you could set different values, and making this field static basically pins down the configuration to the version set on the first instantiation of the class ... Please do as other similar classes do - implement Configurable, or add Configuration to the constructor, and pass the current job configuration where appropriate.

* consequently, static references to SolrSchemaReader need to be un-staticized in other places.

* minor nits: code formatting should use indents of 2 literal spaces. There are some accidental changes in NutchBean and SolrWriter.
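
The point about static configuration can be sketched roughly as follows. To keep the example self-contained, java.util.Properties stands in for the job Configuration, and the class name, property key and default file name are hypothetical, not the ones used in the patch.

```java
import java.util.Properties;

/** Hypothetical sketch: a reader that receives its configuration per instance. */
class MappingReaderSketch {
  // Assumed property key, made up for this illustration.
  static final String MAPPING_FILE_KEY = "example.mapping.file";

  private final String mappingFile;

  /** The configuration is passed in, so each job can supply its own values. */
  MappingReaderSketch(Properties conf) {
    this.mappingFile = conf.getProperty(MAPPING_FILE_KEY, "mapping-default.xml");
  }

  String getMappingFile() {
    return mappingFile;
  }

  public static void main(String[] args) {
    Properties jobA = new Properties();
    jobA.setProperty(MAPPING_FILE_KEY, "mapping-a.xml");
    Properties jobB = new Properties();
    jobB.setProperty(MAPPING_FILE_KEY, "mapping-b.xml");

    // Each instance sees the value from the configuration it was given.
    System.out.println(new MappingReaderSketch(jobA).getMappingFile());
    System.out.println(new MappingReaderSketch(jobB).getMappingFile());
  }
}
```

With a static field, the second reader would have kept whatever the first instantiation loaded; holding the value per instance avoids that.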


[jira] Commented: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767934#action_12767934 ]

David Stuart commented on NUTCH-760:
------------------------------------

Thanks,

I will have another go. It is quite a big task getting my head around all of the
ins and outs of nutch, but it's good to help contribute to a great product.

Regards,

Dave







[jira] Updated: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Stuart updated NUTCH-760:
-------------------------------

    Attachment: solrindex_schema.patch

Have updated the patch as per the comments below:
    *  the description of the property in nutch-default.xml could be more descriptive

    * <schema> element has name and version attributes - do we really need these? It's not a Solr schema.xml anyway, so we don't have to pretend that we follow the same format.

    * SolrSchemaReader uses static instance of NutchConfiguration - this is a big no-no, the whole point of using the property in nutch-default.xml is that you could set different values, and making this field static basically pins down the configuration to the version set on the first instantiation of the class ... Please do as other similar classes do - implement Configurable, or add Configuration to the constructor, and pass the current job configuration where appropriate.

    * consequently, static references to SolrSchemaReader need to be un-staticized in other places.

    * minor nits: code formatting should use 2 literal spaces indents. There are some accidental changes in NutchBean and SolrWriter.



[jira] Commented: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770464#action_12770464 ]

David Stuart commented on NUTCH-760:
------------------------------------

Hi Andrzej,

I have amended the patch to incorporate your suggestions
https://issues.apache.org/jira/browse/NUTCH-760

Regards,


Dave




[jira] Updated: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-760:
------------------------------------

    Assignee: Andrzej Bialecki


[jira] Closed: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-760.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1


[jira] Commented: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782617#action_12782617 ]

Andrzej Bialecki  commented on NUTCH-760:
-----------------------------------------

I reworked the patch to get rid of any left-overs of static Configuration, and changed the concept of "schema" (which was misleading) to "mapping" throughout the patch and class names.

This is now committed in rev. 884269 - thanks!


[jira] Commented: (NUTCH-760) Allow field mapping from nutch to solr index

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783237#action_12783237 ]

Hudson commented on NUTCH-760:
------------------------------

Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/])
    Add part of .
 Allow field mapping from nutch to solr index.

