Highlighter Example


Mag Gam
Hey

Does anyone have a search result highlighter example?

I have various document types (PDF, DOC, TXT, PPT), and I would like to show
highlighted hits, similar to how Google does it...

tia

Re: Highlighter Example

Erik Hatcher
There are test cases in the Highlighter codebase that exercise it and  
show its use, as well as a few examples of it in the "Lucene in  
Action" codebase.

These examples output plain text with some prefix and suffix  
surrounding the highlighted terms.  Highlighting text in a PDF is  
possible, I'm pretty sure, but I don't think the same would be easily  
possible with Microsoft document formats.  I'm not sure if you are  
asking for these document types to be highlighted or just a plain  
text representation of them, though.

        Erik



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Highlighter Example

Mag Gam
Thanks for the quick response, Erik. I will be getting my LIA book back very
soon; I forgot it at a destination :-(

Let's assume there is a document called "hello.pdf" with the content
"this is hello.pdf. It uses Acrobat".

When I search for "Acrobat", I want hello.pdf to show up, along with
the fragment 'It uses <highlight>Acrobat</highlight>',

or something like that.

tia
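
A minimal sketch of the behaviour asked for above, assuming the PDF's text has already been extracted to a plain string. This is not the Lucene Highlighter API: `SnippetSketch` and `snippet` are hypothetical names, and the sentence splitting is deliberately naive.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Lucene: picks a snippet and tags the hits. */
final class SnippetSketch {

    /**
     * Returns the first sentence of text that contains term (case-insensitive,
     * whole word), with each hit wrapped in pre/post. Returns null when the
     * term does not occur anywhere in the text.
     */
    static String snippet(String text, String term, String pre, String post) {
        Pattern hit = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        // Naive sentence split: break after '.', '!' or '?' followed by whitespace,
        // so the dot inside "hello.pdf" does not end a sentence.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            Matcher m = hit.matcher(sentence);
            if (m.find()) {
                // $0 is the matched text, so the original casing is preserved.
                return m.replaceAll(Matcher.quoteReplacement(pre) + "$0"
                        + Matcher.quoteReplacement(post));
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(snippet("this is hello.pdf. It uses Acrobat",
                "acrobat", "<highlight>", "</highlight>"));
    }
}
```

Called with the hello.pdf content, the term "acrobat", and the `<highlight>` tags, this returns `It uses <highlight>Acrobat</highlight>`.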




Re: Highlighter Example

Mark Miller-3
Highlighting a PDF document, last time I looked (quite a while ago),
involves supplying an XML file that describes offsets for highlighting.
You can specify the file in the URL. You can also do simple highlighting
by passing in a list of words to be highlighted, but this does not even
catch minor differences, like singular versus plural.

If someone knows more about using the Lucene highlighter to highlight
PDFs, then please speak up. I think I will have to get into this soon.

- Mark
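
As a rough illustration of the offset approach described above, the sketch below computes (page, position, length) triples for query hits in per-page extracted text. Everything here is an assumption for illustration: `OffsetSketch`, `Loc`, and `normalize` are hypothetical names, the trailing-"s" rule is only a crude stand-in for a real analyzer or stemmer, and the thread does not give the XML layout, so the sketch stops before writing any file. The offsets also only help if the viewer counts characters the same way the text extractor does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: per-page character offsets for query-term hits. */
final class OffsetSketch {

    /** One highlight location: zero-based page, start offset, length in chars. */
    static final class Loc {
        final int page, pos, len;
        Loc(int page, int pos, int len) {
            this.page = page;
            this.pos = pos;
            this.len = len;
        }
    }

    /** Crude stand-in for an analyzer: lowercase and drop one trailing "s". */
    static String normalize(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    /** Scans each page's text and records every token whose normalized form matches. */
    static List<Loc> locate(List<String> pageTexts, String term) {
        String wanted = normalize(term);
        Pattern token = Pattern.compile("\\p{L}+"); // runs of letters
        List<Loc> locs = new ArrayList<>();
        for (int page = 0; page < pageTexts.size(); page++) {
            Matcher m = token.matcher(pageTexts.get(page));
            while (m.find()) {
                if (normalize(m.group()).equals(wanted)) {
                    locs.add(new Loc(page, m.start(), m.end() - m.start()));
                }
            }
        }
        return locs;
    }
}
```

Because matching happens on normalized tokens rather than on a literal word list, a query for "document" also yields a location for "documents", which is the kind of minor variation the plain word-list approach misses.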


Re: Highlighter Example

mark harwood
If you have a budget for this stuff then Stellent provide tools for parsing multiple document types and also have a viewer that can display documents with their original formatting, plus your highlights. See http://www.stellent.com/en/products/outside_in/viewer_tech/index.htm

I don't work for Stellent and haven't used it, but I do know this stuff is hard to do, and they are the only ones I'm aware of trying to provide tools that cover all document types, which is why I mention it. If anyone has any other similar recommendations, I would be interested to hear them.



RE: Highlighter Example

Dejan Nenov-2
Seconded. I was a client of Stellent; the libs work great but are
expensive. To see Stellent in action, get a copy of the free X1 desktop
search or the X1 server (Lucene based).
Another alternative is KeyView from Verity, now Autonomy.


Re: Highlighter Example

Daniel Noll-3
Dejan Nenov wrote:
> Second that - I was a client of Stellent - the libs work great but are
> expensive. To see Stellent in action - get a copy of the free X1 desktop
> search or the X1 server (Lucene based).

I would say that the libs work great but are slow.

One problem is that they don't provide a Java API.  The "Java" API they
provide is sample code which calls a native executable, not even a JNI
library.  So you pay the penalty of that native app starting up every
time you extract a document.

If all you want is the plain text, for many document types it's actually
fairly fast, and beats having to write code for every document type
yourself (or locating libraries to do it for you.)  But as soon as you
want the marked-up text, it becomes a completely different story.  In our
benchmarks, handling markup was something like 10 times slower than
handling raw text and metadata.  Most of this extra time was spent
parsing the XML it outputs, which is often far more verbose than it
needs to be for the amount of formatting it actually contains.

Daniel
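
For readers unfamiliar with the pattern, a wrapper that launches a native executable per document might look roughly like the sketch below. It is not Stellent's sample code: `ExternalExtractSketch`, the `exporter` command name, and the "path as the only argument, text on stdout" convention are all placeholders assumed for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical wrapper: one native process launch per extracted document. */
final class ExternalExtractSketch {

    /** Builds the command line; "exporter" stands in for the real tool's name. */
    static List<String> command(String exporter, Path document) {
        return List.of(exporter, document.toString());
    }

    /** Launches the tool, reads its stdout as UTF-8, and fails on a non-zero exit. */
    static String extract(String exporter, Path document)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command(exporter, document))
                // Discard stderr so a chatty tool cannot block on a full pipe.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        byte[] out = p.getInputStream().readAllBytes();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("extractor exited with status " + exit);
        }
        return new String(out, StandardCharsets.UTF_8);
    }
}
```

Every call to `extract` starts a fresh process, so the startup cost is paid once per document; a JNI binding or a long-lived helper process would avoid that, at the price of more integration work.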


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/                        Fax: +61 2 9212 6902


Re: Highlighter Example

Shane Perry
In reply to this post by Erik Hatcher
Not sure if this is of interest, but there is an open source
project called File2XLIFF4j on SourceForge.net
(http://file2xliff4j.sourceforge.net/).  The project converts many
common file formats to XLIFF.  It may be useful for getting a common
format, highlighting, and then recreating the original file with its formatting.


Re: Highlighter Example

Till Kinstler
In reply to this post by Mark Miller-3
Mark Miller wrote:
> Highlighting a PDF document, last time I looked (quite a while ago),
> involves supplying an xml file that describes offsets for highlighting.
> You can specify the file in the URL.

PDFBox (http://www.pdfbox.org/), which is also convenient for parsing
PDFs, can generate those XML files through its class PDFHighlighter
(http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFHighlighter.html).
There is a page describing the various options for highlighting PDFs
with PDFBox: http://www.pdfbox.org/userguide/highlighting.html.
Unfortunately, highlighting through these XML files does not seem to work in
the Acrobat Reader plugin for Linux.

Till

--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
[hidden email], +49 (0) 551 39-13431, http://www.gbv.de


Re: Highlighter Example

Tom Emerson-3
In reply to this post by mark harwood
Autonomy's KeyView is an alternative to Stellent. It does not cover all of
the file formats that Stellent does, though many of those are probably not
interesting for most applications. When I last looked at it, it did not
handle mail archives, though there was a plan to add that. I found it more
stable than Stellent, and it has a JNI interface that works quite well. It
is still quite expensive, however.

PDFBox works, but we found it to be really, really slow.

YMMV,

     -tree

--
Tom Emerson
[hidden email]
http://www.dreamersrealm.net/~tree