Re: writing a new parse-exe plugin [NullPointerException]

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Re: writing a new parse-exe plugin [NullPointerException]

Tranquil
hello again,

I've added a printStackTrace to where the fetcher throws the exception:

java.lang.NullPointerException
        at org.apache.hadoop.io.Text.encode(Text.java:375)
        at org.apache.hadoop.io.Text.encode(Text.java:356)
        at org.apache.hadoop.io.Text.writeString(Text.java:396)
        at org.apache.nutch.parse.ParseData.write(ParseData.java:159)
        at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:55)
        at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java
:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(
MapTask.java:315)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(
Fetcher.java:397)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java
:163)

here is the snippet from the end of my code:

    // don't need those fields...
    String resultText = null;
    String resultTitle = null;
    Outlink[] outlinks = null;
   final ParseData parseData = new ParseData( ParseStatus.STATUS_SUCCESS,
                                              resultTitle, outlinks,
                                              content.getMetadata());
    return new ParseImpl(resultText, parseData);



On 10/18/07, eyal edri <[hidden email]> wrote:

>
> Found how to associate multiple contentTypes to a certain plugin:
>
> just add the content type to the conf/parse-plugins.xml file: (the plugin
> can take more than one type)
>
>  <mimeType name="application/x-dosexec">
>                 <plugin id="parse-exe" />
>   </mimeType>
>
> On 10/18/07, eyal edri <[hidden email]> wrote:
> >
> > Excellent !! :)
> >
> > that did the trick!
> >
> > Any chance to create a new page on the plugin central for writing a
> > nutch-0.9 plugin, stating the checklist (written below)?
> > (i would have uploaded, but dont have the rights to open a new page)
> >
> > The checklist: (relevant for a parse plugin, implementing the Parse
> > extention point)
> >
> >    1. Create new dir under $NUTCH_HOME/src/plugins/parse-XXX
> >    2. Create new $NUTCH_HOME/src/plugins/parse-XXX/plugin.xml
> >    [displayed below]
> >    3. Create new $NUTCH_HOME/src/plugins/parse-XXX/build.xml
> >    [displayed below]
> >    4. Write the java code
> >    $NUTCH_HOME/src/plugin/parse-XXX/src/java/org/apache/nutch/parse/XXX/XXXParser.java
> >    5. Add "nutch-extensionpoints" & "parse-XXX" to the
> >    'plugins-include' property in $NUTCH_HOME/conf/nutch- site.xml
> >    6. Add code to the $NUTCH_HOME/conf/parse-plugins.xml [written
> >    below] (new mime type & alias)
> >    7. Added code the the $NUTCH_HOME/src/plugins/build.xml [written
> >    below]
> >    8. copied $NUTCH_HOME/build/plugins/parse-XXX/parse- XXX.jar to
> >    $NUTCH_HOME/plugins/parse-XXX
> >    9. run ant (build successful)
> >
> > I've got a few of more questions just to tie the loose ends..:
> >
> > 1. Exe extension has a few content types related to it (e.g .
> > application(x-exe|x-msdos|x-msdownload|octet-strem))
> >    how can i config parse-exe to capture all of them? (solved)
> > 2. i've noticed that after every build i need to copy
> > build/parse-exe/parse- exe.jar to plugins/parse-exe, any way to tell him
> > to build it directly
> >    to plugins/parse-exe?
> > 3. i get a nullPointerException from fetcher after the parse-exe works,
> > can you guide me on what i should return from the parse-exe?
> >
> >  the parse-exe plugin: ( the getParse funtion)
> >
> > public class ExeParser implements Parser {
> >   public static final Log LOG = LogFactory.getLog (ExeParser.class );
> >   private Configuration conf;
> >   public static final String DOWNLOAD_DIR = "/home/eyale/HTTPSEC/nutch-
> > 0.9/DOWNLOADS/";
> >
> >   public ExeParser() {
> >     LOG.info ("EDRI:: created exe-parser object");
> >   }
> >
> >   public Parse getParse(Content content) {
> >     String resultText = null;
> >     String resultTitle = null;
> >     Outlink[] outlinks = null;
> >
> >     try {
> >
> >       byte[] raw = content.getContent();
> >
> >       // enter here my code
> >
> >        String contentLength = content.getMetadata().get(
> > Response.CONTENT_LENGTH);
> >       if (contentLength != null && raw.length != Integer.parseInt(contentLength))
> > {
> >           return new ParseStatus(ParseStatus.FAILED ,
> > ParseStatus.FAILED_TRUNCATED,
> >                   "Content truncated at "+raw.length
> >             +" bytes. Parser can't handle incomplete exe
> > file.").getEmptyParse(getConf());
> >       }
> >       // download the file (private void function)
> >       downloadContentType(content);
> >
> >     }catch (Exception e) { // run time exception
> >         if (LOG.isWarnEnabled()) {
> >           LOG.warn("General exception in EXE parser: "+e.getMessage());
> >           e.printStackTrace (LogUtil.getWarnStream(LOG));
> >         }
> >         return new ParseStatus(ParseStatus.FAILED,
> >               "Can't be handled as exe document. " +
> > e).getEmptyParse(getConf());
> >      }
> >
>
           // this is where i don't know what to return....

    final ParseData parseData = new ParseData( ParseStatus.STATUS_SUCCESS,

> >                                               resultTitle, outlinks,
> >                                               content.getMetadata());
> >     return new ParseImpl(resultText, parseData);
> >   }
> >
> > Thanks!!!
> >
> >
> >
> >
> >
> >
> > On 10/17/07, eyal edri < [hidden email]> wrote:
> > >
> > > Hi all,
> > >
> > > I'm trying to write a new plugin that will download pages with
> > > contentType: x-dosexec (EXE) files.
> > > i've followed the "write your own plugin tutorial" in the wiki and
> > > done the following actions: (some actions are not mentioned in the tutorial)
> > >
> > >
> > >    1. Created a new dir under $NUTCH_HOME/src/plugins/parse-exe
> > >    2. Created new $NUTCH_HOME/src/plugins/parse-exe/plugin.xml
> > >    [displayed below]
> > >    3. Created new $NUTCH_HOME/src/plugins/parse-exe/build.xml
> > >    [displayed below]
> > >    4. Written the java code
> > >    $NUTCH_HOME/src/plugin/parse-exe/src/java/org/apache/nutch/parse/exe/ExeParser.java
> > >    5. Add "nutch-extensionpoints" & "parse-exe" to the
> > >    'plugins-include' property in $NUTCH_HOME/conf/nutch- site.xml
> > >    6. Add code to the $NUTCH_HOME/conf/parse-plugins.xml [written
> > >    below]
> > >    7. Added code the the $NUTCH_HOME/src/plugins/build.xml [written
> > >    below]
> > >    8. copied $NUTCH_HOME/build/plugins/parse-exe/parse- exe.jar to
> > >    $NUTCH_HOME/plugins/parse-exe
> > >    9. run ant (build successful)
> > >
> > > the log shows that nutch identifies the plugin:
> > >
> > > 2007-10-17 15:15:55,657 INFO  plugin.PluginRepository - Registered
> > > Plugins:
> > > 2007-10-17 15:15:55,657 INFO  plugin.PluginRepository -         the
> > > nutch core extension points (nutch-extensionpoints)
> > > 2007-10-17 15:15:55,657 INFO  plugin.PluginRepository -         Html
> > > Parse Plug-in (parse-html)
> > > 2007-10-17 15:15:55,657 INFO  plugin.PluginRepository -         Exe
> > > Parse Plug-in (parse-exe)
> > >
> > > but when the fetcher encounters a x-dosexec file it thorws an
> > > exception:
> > >
> > > 2007-10-17 15:17:16,146 WARN  parse.ParseUtil - No suitable parser
> > > found when trying to parse content http://www.foo.com/yyy/foo.exe of
> > > type application/x-dosexec
> > > 2007-10-17 15:17:16,146 WARN  fetcher.Fetcher - Error parsing:
> > > http://www.foo.com/yyy/foo.exe: failed(2,200):
> > > org.apache.nutch.parse.ParseException: parser not found for
> > > contentType=application/x-dosexec url=http://www.foo.com/yyy/movie30.exe
> > >
> > > (sorry, but the url has been masked for security reasons)
> > >
> > > Am i missing something??
> > >
> > > thanks !!
> > >
> > >
> > >
> > > [$NUTCH_HOME/src/plugins/build.xml]
> > >
> > > <ant dir="parse-exe" target="deploy"/>
> > >
> > > [parse-plugins.xml]
> > >
> > >  <mimeType name="application/x-dosexec">
> > >                 <plugin id="parse-exe" />
> > >   </mimeType>
> > >
> > >
> > > [plugin.xml] // copied and changed from parse-pdf
> > >
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <plugin
> > >    id="parse-exe"
> > >    name="Exe Parse Plug-in"
> > >    version="1.0.0"
> > >    provider-name="nutch.org">
> > >
> > >    <runtime>
> > >       <library name="parse-exe.jar">
> > >          <export name="*"/>
> > >       </library>
> > >    </runtime>
> > >
> > >    <requires>
> > >       <import plugin="nutch-extensionpoints"/>
> > >       <import plugin="lib-log4j"/>
> > >    </requires>
> > >
> > >    <extension id="org.apache.nutch.parse.exe"
> > >               name="ExeParse"
> > >               point="org.apache.nutch.parse.Parser">
> > >
> > >       <implementation id="org.apache.nutch.parse.exe.ExeParse"
> > >                       class=" org.apache.nutch.parse.exe.ExeParse">
> > >         <parameter name="contentType" value="application/x-dosexec"/>
> > >         <parameter name="pathSuffix"  value=""/>
> > >       </implementation>
> > >    </extension>
> > >
> > > </plugin>
> > >
> > >
> > > -----------------------------------------------------------------------------------------------------------------
> > >
> > > [build.xml]
> > >
> > > <?xml version=" 1.0"?>
> > >
> > > <project name="parse-exe" default="jar-core">
> > >
> > >   <import file="../build-plugin.xml"/>
> > >
> > > </project>
> > >
> > >
> > > ------------------------------------------------------------------------
> > > [ExeParser.java]
> > >
> > > public class ExeParser implements Parser {
> > >   public static final Log LOG = LogFactory.getLog("
> > > org.apache.nutch.parse.exe");
> > >   private Configuration conf;
> > >
> > >   public Parse getParse(Content content) {
> > >
> > >     try {
> > >
> > >       byte[] raw = content.getContent();
> > >
> > >       // enter here my code ( i will replace this with real code)
> > >       LOG.info ("EDRI:: you have reached the parse-exe plugin!");
> > >       System.out.println("EDRI:: system.out.print... parse-exe");
> > >
> > >
> > >
> > >
> > >       String contentLength = content.getMetadata().get(
> > > Response.CONTENT_LENGTH );
> > >       if (contentLength != null && raw.length != Integer.parseInt(contentLength))
> > > {
> > >           return new ParseStatus(ParseStatus.FAILED,
> > > ParseStatus.FAILED_TRUNCATED,
> > >                   "Content truncated at "+raw.length
> > >             +" bytes. Parser can't handle incomplete exe
> > > file.").getEmptyParse(getConf());
> > >       }
> > >
> > >     } catch (Exception e) { // run time exception
> > >         if (LOG.isWarnEnabled()) {
> > >           LOG.warn("General exception in EXE parser:
> > > "+e.getMessage());
> > >           e.printStackTrace(LogUtil.getWarnStream(LOG));
> > >         }
> > >         return new ParseStatus(ParseStatus.FAILED,
> > >               "Can't be handled as exe document. " +
> > > e).getEmptyParse(getConf());
> > >       }
> > >
> > >     /// i'm not sure what to return here if i only need to d/l the
> > > file
> > >
> > >     ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
> > > "",null, null, null);
> > >     parseData.setConf(this.conf);
> > >     return new ParseImpl("", parseData);
> > >   }
> > >
> > >   public void setConf(Configuration conf) {
> > >     this.conf = conf;
> > >   }
> > >
> > >   public Configuration getConf() {
> > >     return this.conf;
> > >   }
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Eyal Edri
> >
> >
> >
> >
> > --
> > Eyal Edri
> >
>
>
>
> --
> Eyal Edri




--
Eyal Edri