No external command defined for contentType:

Jon Shoberg
Anyone else get the message "No external command defined for
contentType:" without any sort of MIME content type declaration?

I can see HTML, PDF, and other documents getting fetched but failing on
the parse with the above message.  When I go directly to the server and
manually get the document I see a valid MIME header for content type
returned in the HTTP response header.

Anyone else seen this?  I'm fetching content but not parsing it reliably.

-j

RE: No external command defined for contentType:

Vanderdray, Jacob
What's the URL?  I think someone else had a similar problem, and it
turned out that the URL produced a redirect to a URL containing a query
string.  Since Nutch was configured not to fetch URLs with query
strings, it just failed.

Jake.


Parcer Policy - Re: No external command defined for contentType:

Jon Shoberg

Following is output from the fetcher, along with headers from the
Firefox Web Developer toolbar.

I'd appreciate any thoughts.  Perhaps something for parser policy.  I've
traced the source code a bit and nothing jumped out at me...

-j

--

050923 020413 fetch okay, but can't parse
http://medicalcenter.osu.edu/pdfs/PatientEd/Materials/PDFDocs/procedure/handwsh.pdf,
reason: failed(2,0): No external command defined for contentType:

Response Headers -
http://medicalcenter.osu.edu/pdfs/PatientEd/Materials/PDFDocs/procedure/handwsh.pdf

Server: Microsoft-IIS/5.0
X-Powered-By: ASP.NET
Date: Fri, 23 Sep 2005 17:14:19 GMT
Content-Type: application/pdf
Accept-Ranges: bytes
Last-Modified: Mon, 21 Jun 2004 16:10:22 GMT
Etag: "02b341aa57c41:96b"
Content-Length: 85604

200 OK


050923 020507 fetch okay, but can't parse
http://vet.osu.edu/sa/atcenter/vm522/webweek2/bovhd9.html, reason:
failed(2,0): No external command defined for contentType:

Response Headers - http://vet.osu.edu/sa/atcenter/vm522/webweek2/bovhd9.html

Date: Fri, 23 Sep 2005 17:20:57 GMT
Server: Apache/1.3.33 (Darwin) PHP/4.3.11
Cache-Control: max-age=60
Expires: Fri, 23 Sep 2005 17:21:57 GMT
Last-Modified: Fri, 15 Apr 2005 15:49:06 GMT
Etag: "31dd9-1c0-425fe272"
Accept-Ranges: bytes
Content-Length: 448
Connection: close
Content-Type: text/html

200 OK


050923 021427 fetch okay, but can't parse
http://felix.us.ohio-state.edu/search/o?SEARCH=21305366, reason:
failed(2,0): No external command defined for contentType:

Response Headers - http://felix.us.ohio-state.edu/search/o?SEARCH=1755564

Server: III 100
Pragma: no-cache
Expires: 0
Date: Fri Sep 23 17:25:05 2005 GMT
MIME-version: 1.0
Set-Cookie: SESSION_ID=1127496305.29650; path=/
Content-Type: text/html; charset=UTF-8

200 OK









Re: Parcer Policy - Re: No external command defined for contentType:

Jérôme Charron
> Following is output from the fetcher, along with headers from the
> Firefox Web Developer toolbar.
>
> I'd appreciate any thoughts. Perhaps something for parser policy. I've
> traced the source code a bit and nothing jumped out at me...

Could you provide your plugin configuration and the Nutch startup logs?

Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/

Re: Parcer Policy - Re: No external command defined for contentType:

Jon Shoberg
Jerome,

   See below.

--

<property>
   <name>plugin.includes</name>
   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword|rss|ext)|index-basic|query-(basic|site|url)</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.  By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins.
   </description>
</property>


--

050923 020323 parsing file:/usr/local/nutch/conf/nutch-default.xml
050923 020323 parsing file:/usr/local/nutch/conf/nutch-site.xml
050923 020323 No FS indicated, using default:local
050923 020323 Plugins: looking in: /usr/local/nutch/plugins
050923 020323 not including: /usr/local/nutch/plugins/protocol-ftp
050923 020323 not including: /usr/local/nutch/plugins/urlfilter-prefix
050923 020323 parsing: /usr/local/nutch/plugins/parse-text/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.text.TextParser
050923 020323 not including: /usr/local/nutch/plugins/ontology
050923 020323 parsing: /usr/local/nutch/plugins/parse-ext/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.ext.ExtParser
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.ext.ExtParser
050923 020323 parsing: /usr/local/nutch/plugins/parse-rss/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.rss.RSSParser
050923 020323 parsing:
/usr/local/nutch/plugins/protocol-httpclient/plugin.xml
050923 020323 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.httpclient.Http
050923 020323 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.httpclient.Http
050923 020323 parsing: /usr/local/nutch/plugins/parse-pdf/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.pdf.PdfParser
050923 020323 not including: /usr/local/nutch/plugins/creativecommons
050923 020323 parsing: /usr/local/nutch/plugins/parse-html/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.html.HtmlParser
050923 020323 parsing: /usr/local/nutch/plugins/parse-msword/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.msword.MSWordParser
050923 020323 parsing: /usr/local/nutch/plugins/query-basic/plugin.xml
050923 020323 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.basic.BasicQueryFilter
050923 020323 not including: /usr/local/nutch/plugins/protocol-http
050923 020323 not including: /usr/local/nutch/plugins/index-more
050923 020323 not including: /usr/local/nutch/plugins/query-more
050923 020323 not including: /usr/local/nutch/plugins/parse-js
050923 020323 parsing: /usr/local/nutch/plugins/index-basic/plugin.xml
050923 020323 impl: point=org.apache.nutch.indexer.IndexingFilter
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
050923 020323 not including: /usr/local/nutch/plugins/language-identifier
050923 020323 parsing: /usr/local/nutch/plugins/query-site/plugin.xml
050923 020323 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.site.SiteQueryFilter
050923 020323 not including: /usr/local/nutch/plugins/clustering-carrot2
050923 020323 not including: /usr/local/nutch/plugins/protocol-file
050923 020323 parsing: /usr/local/nutch/plugins/urlfilter-regex/plugin.xml
050923 020323 impl: point=org.apache.nutch.net.URLFilter
class=org.apache.nutch.net.RegexURLFilter
050923 020323 parsing: /usr/local/nutch/plugins/query-url/plugin.xml
050923 020323 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.url.URLQueryFilter
050923 020323 logging at INFO
050923 020323 fetching
http://vet.osu.edu/assets/courses/vm602/quotes/quote46.html
050923 020323 fetching
http://vet.osu.edu/assets/courses/vm562/muir/sedatives.pdf
050923 020323 http.proxy.host = null
050923 020323 http.proxy.port = 8080
050923 020323 http.timeout = 10000
050923 020323 http.content.limit = 7168000
050923 020323 http.agent = Nutch/0.7 ( nutch; http://xxxxxxx,
xxxxxx@xxxxxxxx)
050923 020323 http.auth.ntlm.username =
050923 020323 fetcher.server.delay = 3000
050923 020323 http.max.delays = 10
050923 020324 Configured Client

Re: Parcer Policy - Re: No external command defined for contentType:

Jérôme Charron
Hello Jon, and sorry for the late response,

> I'd appreciate any thoughts. Perhaps something for parser policy. I've
> traced the source code a bit and nothing jumped out at me...

There are some known issues with the parser policy (i.e.
ParserFactory), and we are actively working on them.
I don't understand why the parse-ext plugin is called in your case, when
it should be the parse-pdf or parse-html plugin.
Here's a workaround: if you don't need parse-ext (the plugin that performs
parsing using external commands), simply remove it and all should be OK.
Could you please send me your /usr/local/nutch/plugins/parse-ext/plugin.xml
file so that I can check whether something is wrong in it?
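
A minimal sketch of that workaround, assuming plugin.includes is
overridden in conf/nutch-site.xml and otherwise matches the value posted
earlier in this thread. The only change is dropping "ext" from the
parse-(...) group; the file location and the comments are illustrative
assumptions, not a confirmed fix:

```xml
<!-- Assumed location: conf/nutch-site.xml, which overrides nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <!-- Same value as posted in the thread, with "ext" removed from the parse group -->
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)</value>
</property>
```

If the change takes effect, the startup log would presumably list
parse-ext on a "not including:" line, like the other excluded plugins.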

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Parcer Policy - Re: No external command defined for contentType:

Jon Shoberg

"should be ok" ... as in content will be parsed correctly or that we
will not see the error message?  Lack of an error message does not mean
things are OK. :)

Pasted below is the file.  This is from the release-0.7 build with
patches, as 0.7.1 is being prepared.

<?xml version="1.0" encoding="UTF-8"?>
<plugin
    id="parse-ext"
    name="External Parser Plug-in"
    version="1.0.0"
    provider-name="nutch.org">



    <runtime>
       <library name="parse-ext.jar">
          <export name="*"/>
       </library>
    </runtime>

    <extension id="org.apache.nutch.parse.ext"
               name="ExtParse"
               point="org.apache.nutch.parse.Parser">

       <implementation id="ExtParser"
                       class="org.apache.nutch.parse.ext.ExtParser"
                       contentType="application/vnd.nutch.example.cat"
                       pathSuffix=""
                       command="./build/plugins/parse-ext/command"
                       timeout="10"/>

       <implementation id="ExtParser"
                       class="org.apache.nutch.parse.ext.ExtParser"
                       contentType="application/vnd.nutch.example.md5sum"
                       pathSuffix=""
                       command="./build/plugins/parse-ext/command"
                       timeout="20"/>

    </extension>

</plugin>


Re: Parcer Policy - Re: No external command defined for contentType:

Jérôme Charron
> "should be ok" ... as in content will be parsed correctly or that we
> will not see the error message.

This is a workaround, not a cosmetic fix that merely hides the error.
So yes, parsing should be OK!

> Lack of an error message does not mean
> things are OK. :)

Thanks for this great geeks lesson!

> Pasted below is the file. This is from the release-0.7 build with
> patches, as 0.7.1 is being prepared.

OK, everything seems fine in the file.
Thanks in advance for giving us feedback on the workaround.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/