Plugin Development Help


Plugin Development Help

David Stuart
Hi All,

I think I have just about finished my plugin (Nutch 1.0), which adds extra metadata during parsing. The problem I am having is that the data doesn't seem to be added to the system (checked via Luke or readseg). I looked at the wiki, but it seems to be written for 0.9 and the syntax looks different.

{code}
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) {
  Metadata metadata = new Metadata();
  // parse the content
  DocumentFragment root;
  String docTrans;
  try {
    byte[] contentInOctets = content.getContent();
    String input = new String(contentInOctets);
    XSLTSimpleTransform DocTransform = new XSLTSimpleTransform();
    docTrans = DocTransform.doTransform(input);
    Parse parse = parseResult.get(content.getUrl());
    metadata = parse.getData().getParseMeta();
    metadata.add("filter_html_data", docTrans);
  } catch (Exception e) {
    e.printStackTrace(LogUtil.getWarnStream(LOG));
  }

  return parseResult;
}
{code}

Cheers,

Dave

Re: Plugin Development Help

Andrzej Białecki-2

Did you declare that you are adding this field in
IndexingFilter.addIndexBackendOptions(..)? See how other indexing
plugins do this.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Plugin Development Help

David Stuart
I thought I did, but I expected the data to be stored somewhere before I ran bin/nutch index (or solrindex). It does seem to be getting to the doc.add bit, which makes me think the variable is empty.
{code}
    public void addIndexBackendOptions(Configuration conf) {
      LOG.warn("+_+_You called me _+_+");
      LuceneWriter.addFieldOptions("html_filter_data", STORE.YES, INDEX.UNTOKENIZED, conf);
    }
   
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException {
      LOG.warn("________________________FILTER_______________________");
      String html_filter_data = parse.getData().getMeta("html_filter_data");
      if (html_filter_data != null){
          LOG.warn("________________________Adding filter data_______________________");
          doc.add("html_filter_data", html_filter_data);
      }
      return doc;
    }
{code}

Re: Plugin Development Help

David Stuart
Sorry, I meant it doesn't get to doc.add.

David


Re: Plugin Development Help

David Stuart
Sorry, that was supposed to say it DOESN'T seem to be getting to the doc.add bit.


Re: Plugin Development Help

David Stuart
Sorry, I keep pressing send.

But I don't quite understand how the metadata is passed from the parse step to the index step. If in my
public ParseResult filter...

I do this
        Parse parse = parseResult.get(content.getUrl());
        metadata = parse.getData().getParseMeta();
        metadata.add("filter_html_data", docTrans);

and then return
return parseResult;

is the data passed by reference into parseResult? I ask because when I try to retrieve it in
public NutchDocument filter...

by doing
      String html_filter_data = parse.getData().getMeta("html_filter_data");
      LOG.warn(html_filter_data);
      if (html_filter_data != null){
          LOG.warn("________________________Adding filter data_______________________");
          doc.add("html_filter_data", html_filter_data);
      }
I never reach the add, because the variable html_filter_data is empty.

Any ideas?

Thanks for your help.
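
One thing worth checking in the snippets above: the parse filter stores the value under the key "filter_html_data", while the indexing filter looks it up under "html_filter_data". If those really are the keys in use, the lookup would come back null even though the add succeeded. The sketch below illustrates only that point and the "by reference" question; it is not Nutch code. FakeParseData and MetadataKeyDemo are hypothetical stand-ins built on a plain Map, and they assume that getParseMeta() hands back the live metadata object rather than a copy.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataKeyDemo {

    // Hypothetical stand-in for Nutch's ParseData (an assumption, not the real class):
    // it owns one metadata map and returns a reference to it, not a copy.
    static final class FakeParseData {
        private final Map<String, String> parseMeta = new HashMap<>();

        Map<String, String> getParseMeta() {
            return parseMeta;
        }

        String getMeta(String key) {
            return parseMeta.get(key);
        }
    }

    // Mimics the parse filter from the thread: writes under "filter_html_data".
    static void parseFilter(FakeParseData data, String transformed) {
        Map<String, String> metadata = data.getParseMeta();
        // Mutating the returned map changes the map held by FakeParseData,
        // because Java passes the object reference, not a copy of the map.
        metadata.put("filter_html_data", transformed);
    }

    public static void main(String[] args) {
        FakeParseData data = new FakeParseData();
        parseFilter(data, "<p>transformed</p>");

        // Key used by the indexing filter in the thread: no entry, prints "null".
        System.out.println(data.getMeta("html_filter_data"));
        // Key the parse filter actually wrote: prints "<p>transformed</p>".
        System.out.println(data.getMeta("filter_html_data"));
    }
}
```

If the same holds in the real plugin, using one shared constant for the key in both filters would rule this out; if the value is still missing after that, the problem lies elsewhere (for example in how the segment data is written).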


