[jira] Created: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type


[jira] Created: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

JIRA jira@apache.org
Proposition: Enable Nutch to use a parser plugin not just based on content type
-------------------------------------------------------------------------------

         Key: NUTCH-179
         URL: http://issues.apache.org/jira/browse/NUTCH-179
     Project: Nutch
        Type: Improvement
  Components: fetcher  
    Versions: 0.8-dev    
    Reporter: Gal Nitzan


Sometimes "real world" requirements (usually your boss) call for a special parser for a certain site.

Example: I am required to crawl certain sites, some of which are partner sites. When fetching from a partner site, I need to look for certain entries in the text and boost the score.

Currently the ParserFactory looks for a plugin based only on the content type.

Facing this issue myself, I noticed that it would be much easier for others to implement this if ParserFactory could use NutchConf to check for certain properties and, if they match, choose the plugin based on the URL and not just the content type.

The implementation shouldn't be too complicated.

Looking forward to hearing more ideas.
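
To make the proposal concrete, here is a minimal Java sketch of URL-aware parser selection. It is only an illustration and does not use the real Nutch API: the class `UrlAwareParserSelector`, its methods, the in-memory rule maps (standing in for configuration properties), and the `parse-partner` plugin id are all hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch: choose a parser plugin id by URL first, then by content type. */
class UrlAwareParserSelector {
    // URL pattern -> parser plugin id, checked in insertion order.
    // Assumption: in a real setup these rules would come from configuration.
    private final Map<Pattern, String> urlRules = new LinkedHashMap<>();
    // Content type -> parser plugin id (the content-type-only lookup described above).
    private final Map<String, String> contentTypeRules = new LinkedHashMap<>();

    void addUrlRule(String regex, String parserId) {
        urlRules.put(Pattern.compile(regex), parserId);
    }

    void addContentTypeRule(String contentType, String parserId) {
        contentTypeRules.put(contentType, parserId);
    }

    /** Returns the parser id for this document, or null if no rule matches. */
    String select(String url, String contentType) {
        // A matching URL rule takes precedence over the content type.
        for (Map.Entry<Pattern, String> rule : urlRules.entrySet()) {
            if (rule.getKey().matcher(url).find()) {
                return rule.getValue();
            }
        }
        // No URL rule matched: fall back to the content-type lookup.
        return contentTypeRules.get(contentType);
    }
}
```

With a URL rule for a partner host and a content-type rule for `text/html`, a partner page would be routed to the special parser even though its content type is `text/html`, while every other HTML page would still go to the default parser.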

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-179?page=all ]

Gal Nitzan updated NUTCH-179:
-----------------------------

    Description:
Sometimes "real world" requirements (usually your boss) call for a special parser for a certain site. Even though the content type is text/html, a specialized parser is needed.

Example: I am required to crawl certain sites, some of which are partner sites. When fetching from a partner site, I need to look for certain entries in the text and boost the score.

Currently the ParserFactory looks for a plugin based only on the content type.

Facing this issue myself, I noticed that it would be much easier for others to implement this if ParserFactory could use NutchConf to check for certain properties and, if they match, choose the plugin based on the URL and not just the content type.

The implementation shouldn't be too complicated.

Looking forward to hearing more ideas.




[jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-179?page=all ]

Gal Nitzan updated NUTCH-179:
-----------------------------

    Description:
Sorry, please close this issue.

I figured out that if I set my parse plugin first, it will always be called first and can then decide whether or not to parse.
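
The workaround above can be sketched as an ordered chain in which the site-specific parser is tried first and may decline. This is a self-contained illustration under stated assumptions, not Nutch code: `SketchParser`, `OrderedParserChain`, `OrderedParserChainDemo`, and the `partner.example.com` host are hypothetical, and an empty `Optional` stands in for "this parser chose not to parse".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical parser contract: return empty to decline and let the next parser try. */
interface SketchParser {
    Optional<String> parse(String url, String content);
}

/** Tries parsers in registration order and returns the first result produced. */
class OrderedParserChain {
    private final List<SketchParser> parsers = new ArrayList<>();

    OrderedParserChain add(SketchParser parser) {
        parsers.add(parser);
        return this;
    }

    Optional<String> parse(String url, String content) {
        for (SketchParser parser : parsers) {
            Optional<String> result = parser.parse(url, content);
            if (result.isPresent()) {
                return result;
            }
        }
        // Every parser declined.
        return Optional.empty();
    }
}

class OrderedParserChainDemo {
    static OrderedParserChain build() {
        // Registered first, so it always sees the page before the default parser.
        SketchParser partnerParser = (url, content) ->
            url.contains("partner.example.com")
                ? Optional.of("partner:" + content)
                : Optional.empty(); // not a partner site: decline
        // Registered last: accepts anything the earlier parsers declined.
        SketchParser defaultParser = (url, content) -> Optional.of("default:" + content);
        return new OrderedParserChain().add(partnerParser).add(defaultParser);
    }
}
```

The design choice here is that the URL check lives inside the plugin itself, so the factory needs no change: ordering alone guarantees the special parser gets the first look.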




RE: [jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

chrismattmann
Hi Gal,

Check out:

http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/

That's the way the parser factory currently works. Also added, but not
described in that proposal, is the ability to call a parser by its id, via
a method in ParseUtil.java.

G'luck!

Cheers,
  Chris


______________________________________________
Chris A. Mattmann
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -----Original Message-----
> From: Gal Nitzan (JIRA) [mailto:[hidden email]]
> Sent: Sunday, January 15, 2006 4:10 PM
> To: [hidden email]
> Subject: [jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a
> parser plugin not just based on content type


[jira] Closed: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-179?page=all ]
     
Doug Cutting closed NUTCH-179:
------------------------------

    Resolution: Invalid

Closed at submitter's request.

