Cached page (like google) with hits highlighted


webdev1977
Hello Everyone!

I am up and running with my Nutch 1.4 / Solr 3.3 setup and am looking to add a few new features.

My users want to view their Solr results as XHTML with the hits highlighted in the document, so a Word document or PDF would first have to be converted to XHTML.

I see that Tika can produce XHTML, but I don't see a way to integrate that with the parsing that Nutch does in the parse-tika plugin. It seems that what gets sent to Solr for the "content" field is just the text of the document.

Is there a way to do this?

Thanks!

RE: Cached page (like google) with hits highlighted

Markus Jelsma-2
Hi,

You can catch the XML in a parse filter by walking over the DocumentFragment that is passed to it. It should contain the proper markup.
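
For illustration only, one way to "catch the XML" is to serialize the whole fragment back to markup with the JDK's identity Transformer. This is a minimal sketch and not Nutch code: the class name FragmentSerializer and the hand-built sample fragment are made up for the example. In a real parse filter the fragment would be the DocumentFragment argument handed to filter().

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Nutch or Tika.
public class FragmentSerializer {

    /** Serializes all nodes of the fragment, including nested elements, to a string. */
    public static String toXml(DocumentFragment fragment) {
        try {
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(fragment), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize fragment", e);
        }
    }

    /** Builds a small stand-in for the fragment a parser would produce. */
    public static DocumentFragment sampleFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            DocumentFragment fragment = doc.createDocumentFragment();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element h1 = doc.createElement("h1");
            h1.setTextContent("Title");
            Element p = doc.createElement("p");
            p.setTextContent("Body text");
            body.appendChild(h1);
            body.appendChild(p);
            html.appendChild(body);
            fragment.appendChild(html);
            return fragment;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sampleFragment()));
    }
}
```

If the serialized string contains only an html element with text inside, the markup really is missing from the fragment; if it contains headings and paragraphs, the markup is there and only the inspection code was hiding it.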

Cheers,

 
 
-----Original message-----

> From:webdev1977 <[hidden email]>
> Sent: Wed 15-Aug-2012 14:09
> To: [hidden email]
> Subject: Cached page (like google) with hits highlighted

RE: Cached page (like google) with hits highlighted

webdev1977
Thanks Markus!

So after some testing and walking the DocumentFragment, I see that all I get is one node:
<html>
some content here and here
</html>

I guess I expected to see more from a PDF/Word document (H1 tags, etc.) that would help make the XHTML more readable.

Am I missing something? Do I have to do anything special to the DocumentFragment to format it?

Thanks!

RE: Cached page (like google) with hits highlighted

Markus Jelsma-2
Hmm, I would also expect PDF and office documents to have at least paragraph and heading tags in Tika's XHTML representation. You can test whether that's true with java -jar tika-app -x <URL>. I think it was -x; use --help to see all options.
 
 
-----Original message-----

> From:webdev1977 <[hidden email]>
> Sent: Wed 15-Aug-2012 18:22
> To: [hidden email]
> Subject: RE: Cached page (like google) with hits highlighted

RE: Cached page (like google) with hits highlighted

webdev1977
Does the 1.4 version of Nutch have tika-app? Also, maybe I am not using the DocumentFragment object properly? Below is a summary version of my code:

public ParseResult filter(Content content, ParseResult parseResult,
           HTMLMetaTags metaTags, DocumentFragment doc) {

   // Only the direct children of the fragment are visited here.
   NodeList children = doc.getChildNodes();
   for (int x = 0; x < children.getLength(); x++) {
     Node child = children.item(x);
     System.out.println("xml node name: " + child.getNodeName());
     System.out.println("xml node value: " + child.getNodeValue());
     // getTextContent() returns the concatenated text of all descendants.
     System.out.println("xml text content: " + child.getTextContent());
   }

   return parseResult;
}

RE: Cached page (like google) with hits highlighted

Markus Jelsma-2
No, it doesn't come with Nutch. You can download Tika 1.2 or build trunk from source.

The code looks fine, but you might want to check the headings plugin; it uses the NodeWalker to make things easier:
http://svn.apache.org/viewvc/nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java?revision=1349233&view=markup
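
As a rough sketch of what a recursive walk looks like without Nutch's NodeWalker, using plain org.w3c.dom only (FragmentWalk and its sample fragment are hypothetical names for this example): getChildNodes() on the fragment returns only its direct children, and getTextContent() on a node returns the text of all its descendants joined together. A loop over the top level alone can therefore look like a single html node wrapping plain text even when nested elements exist. Whether that is what happens in the case discussed here is not confirmed anywhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Nutch or Tika.
public class FragmentWalk {

    /** Collects the path of every element below the given node, depth first. */
    public static List<String> elementPaths(Node node) {
        List<String> paths = new ArrayList<>();
        collect(node, "", paths);
        return paths;
    }

    private static void collect(Node node, String prefix, List<String> paths) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                String path = prefix.isEmpty()
                        ? child.getNodeName()
                        : prefix + "/" + child.getNodeName();
                paths.add(path);
                // Descend into the element so nested markup is not skipped.
                collect(child, path, paths);
            }
        }
    }

    /** Builds a small stand-in for the fragment a parser would produce. */
    public static DocumentFragment sampleFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            DocumentFragment fragment = doc.createDocumentFragment();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element h1 = doc.createElement("h1");
            h1.setTextContent("Title");
            Element p = doc.createElement("p");
            p.setTextContent("Body text");
            body.appendChild(h1);
            body.appendChild(p);
            html.appendChild(body);
            fragment.appendChild(html);
            return fragment;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        DocumentFragment fragment = sampleFragment();
        // The top level holds a single node even though markup is nested below it.
        System.out.println("top-level children: " + fragment.getChildNodes().getLength());
        System.out.println("text of first child: " + fragment.getFirstChild().getTextContent());
        elementPaths(fragment).forEach(System.out::println);
    }
}
```

Running the same kind of walk inside the parse filter would show directly whether heading and paragraph elements are present below the html node.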

 
 
-----Original message-----

> From:webdev1977 <[hidden email]>
> Sent: Wed 15-Aug-2012 19:00
> To: [hidden email]
> Subject: RE: Cached page (like google) with hits highlighted

RE: Cached page (like google) with hits highlighted

webdev1977
tika-app (the GUI) gives me back the XHTML just fine, so I'm not sure what is going on here. Maybe it is not stored properly in the DocumentFragment upon parsing?

Re: Cached page (like google) with hits highlighted

Julien Nioche-4
In reply to this post by webdev1977
You need to use parse-tika. However, the underlying parser for PDF does not
currently generate much markup; the Word one does, IIRC.

Why don't you try Tika standalone with its GUI to explore what is given per
mime-type?

Julien


On 15 August 2012 17:19, webdev1977 <[hidden email]> wrote:

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Cached page (like google) with hits highlighted

Julien Nioche-4
Sorry I had missed your previous comments.

On 16 August 2012 09:32, Julien Nioche <[hidden email]>wrote:


Re: Cached page (like google) with hits highlighted

webdev1977
Thanks Julien and Markus for all your help.

I poked around the code some more yesterday and it seems like the markup is just not getting into the DocumentFragment. All I get (for both Word and PDF) is a single html tag with the text of the document inside it. Maybe something is not using parse-tika properly (somewhere in the Nutch implementation of the parser?).

The same two documents give me tons of markup using the tika-app GUI. The versions are the same. I am out of ideas. Anyone?

Thanks!

RE: Cached page (like google) with hits highlighted

Markus Jelsma-2
Tika has a PDF2XHTML.java in the PDF parser, but I think the standard PDFParser.java is executed for the MIME-type. In parse-tika's TikaParser.java we ask TikaConfig for the parser of a given MIME-type. To quickly test whether it works like that, you can try to hack TikaParser to load PDF2XHTML instead of getting the parser via TikaConfig.

You can also call CompositeParser.setParsers(Map<MediaType, Parser> parsers) in Tika, on the parser obtained via TikaConfig.getParser(), to map the PDF2XHTML parser to the PDF MIME-type. From reading the code, I think that should work.
 
 
-----Original message-----

> From:webdev1977 <[hidden email]>
> Sent: Thu 16-Aug-2012 12:51
> To: [hidden email]
> Subject: Re: Cached page (like google) with hits highlighted

RE: Cached page (like google) with hits highlighted

webdev1977
PDF2XHTML is already being loaded by the PDF parser. However, something is not adding its output to the DocumentFragment, and I can't seem to find out where.

Any other ideas?
I don't want to run Tika separately during the parse step to get the XHTML (seems silly), but I will if I absolutely have to.