Payloads in Solr

7 messages

Payloads in Solr

Tricia Williams
Hi All,

    I was wondering how Solr people feel about the inclusion of Payload
functionality in the Solr codebase?

    From a recent message to the [hidden email] mailing list:

>   I'm working on the issue
> https://issues.apache.org/jira/browse/SOLR-380 which is a feature
> request that allows one to index a "Structured Document" which is
> anything that can be represented by XML in order to provide more
> context to hits in the result set.  This allows us to do things like
> query the index for "Canada" and be able to not only say that that
> query matched a document titled "Some Nonsense" but also that the
> query term appeared on page 7 of chapter 1.  We can then take this one
> step further and markup/highlight the image of this page based on our
> OCR and position hit.
> For example:
>
> <book title='Some Nonsense'><chapter title='One'><page name='1'>Some
> text from page one of a book.</page><page name='7'>Some more text from
> page seven of a book. Oh and I'm from Canada.</page></chapter></book>
>
>   I accomplished this by creating a custom Tokenizer which strips the
> xml elements and stores them as a Payload at each of the Tokens
> created from the character data in the input.  The payload is the
> string that describes the XPath at that location.  So for <Canada> the
> payload is "/book[title='Some
> Nonsense']/chapter[title='One']/page[name='7']"
>
>   The other part of this work is the SolrHighlighter which is less
> important to this list.  I retrieve the TermPositions for the Query's
> Terms and use the TermPosition functionality to get back the payload
> for the hits and build output which shows hit positions categorized by
> the payload they are associated with.
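A minimal sketch of that round trip, assuming the payload simply carries the XPath as UTF-8 bytes (the class and method names here are hypothetical; in Lucene 2.3 the bytes would be wrapped in a Payload and attached with Token.setPayload):

```java
import java.nio.charset.Charset;

// Sketch: carry the XPath context of a token as raw bytes, the way a
// Lucene Payload holds an arbitrary byte[] at each token position.
public class XPathPayload {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Encode the XPath describing the token's location as UTF-8 bytes.
    public static byte[] encode(String xpath) {
        return xpath.getBytes(UTF8);
    }

    // Decode the bytes back into the XPath string at highlight time.
    public static String decode(byte[] payload) {
        return new String(payload, UTF8);
    }
}
```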
    Using Payloads requires me to include lucene-core-2.3-dev.jar, which
might be a barrier.  Also, using my Tokenizer with Solr-specific
TokenFilter(s) loses the Payload at modified tokens.  I probably
shouldn't generalize this, but I suspect it is true.  My only issue has
come from the WordDelimiterFilter so far.

> In the following example I will denote a token by {pos,<term
> text>,<payload>}:
>
> input: <class name='mammalia'>Dog, and Cat</class>
>
> XmlPayloadTokenizer:
> {1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<and>,</class[name='mammalia'][startPos='0']>},{3,<Cat>,</class[name='mammalia'][startPos='0']>}
>
> StopFilter:
> {1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<Cat>,</class[name='mammalia'][startPos='0']>}
>
> WordDelimiterFilter:
> {1,<Dog>,<>} {2,<Cat>,</class[name='mammalia'][startPos='0']>}
> LowerCaseFilter:
> {1,<dog>,<>} {2,<cat>,</class[name='mammalia'][startPos='0']>}
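One possible shape of the fix, sketched with a hypothetical splitter (this is not Solr's actual WordDelimiterFilter, just an illustration of copying the source token's payload onto every token produced from it):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: when a filter splits one token into several, propagate the
// source token's payload to each new token instead of dropping it.
public class PayloadCopier {
    public static class Tok {
        public final String text;
        public final byte[] payload;
        public Tok(String text, byte[] payload) {
            this.text = text;
            this.payload = payload;
        }
    }

    // Split on non-letter characters, preserving the payload on each part.
    public static List<Tok> split(Tok source) {
        List<Tok> out = new ArrayList<Tok>();
        for (String part : source.text.split("[^A-Za-z]+")) {
            if (part.length() > 0) {
                out.add(new Tok(part, source.payload)); // payload preserved
            }
        }
        return out;
    }
}
```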
    Should I create a JIRA issue about the Filters and post a patch?

Thanks,
Tricia

Re: Payloads in Solr

Yonik Seeley-2
On Nov 17, 2007 2:18 PM, Tricia Williams <[hidden email]> wrote:
>     I was wondering how Solr people feel about the inclusion of Payload
> functionality in the Solr codebase?

All for it... depending on what one means by "payload functionality" of course.
We should probably hold off on adding a new lucene version to Solr
until the Payload API has stabilized (it will most likely be changing
very soon).

>     From a recent message to the [hidden email] mailing list:
> >   I'm working on the issue
> > https://issues.apache.org/jira/browse/SOLR-380 which is a feature
> > request that allows one to index a "Structured Document" which is
> > anything that can be represented by XML in order to provide more
> > context to hits in the result set.  This allows us to do things like
> > query the index for "Canada" and be able to not only say that that
> > query matched a document titled "Some Nonsense" but also that the
> > query term appeared on page 7 of chapter 1.  We can then take this one
> > step further and markup/highlight the image of this page based on our
> > OCR and position hit.
> > For example:
> >
> > <book title='Some Nonsense'><chapter title='One'><page name='1'>Some
> > text from page one of a book.</page><page name='7'>Some more text from
> > page seven of a book. Oh and I'm from Canada.</page></chapter></book>
> >
> >   I accomplished this by creating a custom Tokenizer which strips the
> > xml elements and stores them as a Payload at each of the Tokens
> > created from the character data in the input.  The payload is the
> > string that describes the XPath at that location.  So for <Canada> the
> > payload is "/book[title='Some
> > Nonsense']/chapter[title='One']/page[name='7']"

That's a lot of data to associate with every token... I wonder how
others have accomplished this?
One could compress it with a dictionary somewhere.
I wonder if one could index special begin_tag and end_tag tokens, and
somehow use span queries?
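The dictionary idea might look something like this (a sketch, not existing Solr code): distinct XPaths are interned once, and each token's payload is just a small integer id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: since the same XPath repeats for every token on a page, store
// each distinct path once and use a compact integer id as the payload.
public class PayloadDictionary {
    private final Map<String, Integer> ids = new HashMap<String, Integer>();
    private final List<String> paths = new ArrayList<String>();

    // Return the id for this XPath, assigning a new one on first sight.
    public int intern(String xpath) {
        Integer id = ids.get(xpath);
        if (id == null) {
            id = paths.size();
            ids.put(xpath, id);
            paths.add(xpath);
        }
        return id;
    }

    // Recover the XPath from the id stored in the payload.
    public String lookup(int id) {
        return paths.get(id);
    }
}
```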

>     Using Payloads requires me to include lucene-core-2.3-dev.jar  which
> might be a barrier.  Also, using my Tokenizer with Solr specific
> TokenFilter(s) loses the Payload at modified tokens.

Yes, this will be an issue for many custom tokenizers that don't yet
know about payloads but that create tokens.  It's not clear what to do
in some cases when multiple tokens are created from one... should
identical payloads be created for the new tokens... it depends on what
the semantics of those payloads are.

-Yonik

Re: Payloads in Solr

Tricia Williams
Thanks for your comments, Yonik!
> All for it... depending on what one means by "payload functionality" of course.
> We should probably hold off on adding a new lucene version to Solr
> until the Payload API has stabilized (it will most likely be changing
> very soon).
>
>  
It sounds like Lucene 2.3 is going to be released soonish
(http://www.nabble.com/How%27s-2.3-doing--tf4802426.html#a13740605).  As
best I can tell it will include the Payload stuff marked experimental.  
The new Lucene version will have many improvements besides Payloads
which would benefit Solr (examples galore in CHANGES.txt
http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?view=log).  
So I find it hard to believe that the new release will not be included.  
I recognize that the experimental status would be worrisome.  What will
it take to get Payloads to the place where they would be accepted for use
in the Solr community?  You probably know more about the projected
changes to the API than I.  Care to fill me in or suggest who I should
ask?  On the [hidden email] list Grant Ingersoll
suggested that the Payload object would be done away with and the API
would just deal with byte arrays directly.
> That's a lot of data to associate with every token... I wonder how
> others have accomplished this?
> One could compress it with a dictionary somewhere.
> I wonder if one could index special begin_tag and end_tag tokens, and
> somehow use span queries?
>
>  
I agree that is a lot of data to associate with every token - especially
since the data is repetitive in nature.  Erik Hatcher suggested I store
a representation of the structure of the document in a separate field,
store a numeric representation of the mapping of the token to the
structure as the payload for each token, and do a lookup at query time
based on the numeric mapping in the payload at the position hit to get
the structure/context back for the token.
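As I understand that suggestion, the per-token payload could be as small as a packed integer keying into the separately stored structure; a sketch of the byte packing (hypothetical helper, not part of Lucene):

```java
// Sketch: the payload holds only a small numeric key, packed big-endian
// into 4 bytes; the full structure lives in a separate stored field
// keyed by that number and is looked up at query/highlight time.
public class NumericPayload {
    public static byte[] encode(int id) {
        return new byte[] {
            (byte) (id >>> 24), (byte) (id >>> 16),
            (byte) (id >>> 8), (byte) id
        };
    }

    public static int decode(byte[] b) {
        return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
    }
}
```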

I'm also wondering how others have accomplished this.  Grant Ingersoll
noted that one of the original use cases was XPath queries so I'm
particularly interested in finding out if anyone has implemented that,
and how.
> Yes, this will be an issue for many custom tokenizers that don't yet
> know about payloads but that create tokens.  It's not clear what to do
> in some cases when multiple tokens are created from one... should
> identical payloads be created for the new tokens... it depends on what
> the semantics of those payloads are.
>
>  
I suppose that it is only fair to take this on a case-by-case basis.
Maybe we will have to write new TokenFilters for each Tokenizer that
uses Payloads (but I sure hope not!).  Maybe we can build some optional
configuration options into the TokenFilter constructor that guide their
behavior with regard to Payloads.  Maybe there is something stored in
the TokenStream that dictates how the Payloads are handled by the
TokenFilters.  Maybe there is no case where identical payloads would not
be created for new tokens and we can just change the TokenFilter to deal
with payloads directly in a uniform way.

Tricia

Re: Payloads in Solr

Yonik Seeley-2
On Nov 18, 2007 2:25 PM, Tricia Williams <[hidden email]> wrote:

> Thanks for your comments, Yonik!
> > All for it... depending on what one means by "payload functionality" of course.
> > We should probably hold off on adding a new lucene version to Solr
> > until the Payload API has stabilized (it will most likely be changing
> > very soon).
> >
> >
> It sounds like Lucene 2.3 is going to be released soonish
> (http://www.nabble.com/How%27s-2.3-doing--tf4802426.html#a13740605).  As
> best I can tell it will include the Payload stuff marked experimental.
> The new Lucene version will have many improvements besides Payloads
> which would benefit Solr (examples galore in CHANGES.txt
> http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?view=log).
> So I find it hard to believe that the new release will not be included.

Sorry for the misunderstanding... Solr will include Lucene 2.3 when
it comes out (or even before).  When I mentioned holding off, my
assumption was that the payload API would be nailed down before 2.3
was released.
http://www.nabble.com/Payload-API-tf4828837.html#a13815548

> I agree that is a lot of data to associate with every token - especially
> since the data is repetitive in nature.  Erik Hatcher suggested I store
> a representation of the structure of the document in a separate field,
> store a numeric representation of the mapping of the token to the
> structure as the payload for each token, and do a lookup at query time
> based on the numeric mapping in the payload at the position hit to get
> the structure/context back for the token.

That seems like it would work for highlighting-type scenarios (where
few stored fields would be loaded), but not during querying.

> I'm also wondering how others have accomplished this.  Grant Ingersoll
> noted that one of the original use cases was XPath queries so I'm
> particularly interested in finding out if anyone has implemented that,
> and how.

Me too.   Any clarifications on that Grant???

> Maybe we will have to write new TokenFilters for each Tokenizer that
> uses Payloads (but I sure hope not!).  Maybe we can build some optional
> configuration options into the TokenFilter constructor that guide their
> behavior with regard to Payloads.

Yes, that was my thought too.

>  Maybe there is something stored in
> the TokenStream that dictates how the Payloads are handled by the
> TokenFilters.

Interesting idea.... it could be easily implemented as a flag in a bitfield:
http://www.nabble.com/new-Token-API-tf4828894.html#a13815702
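Such a bitfield might look like this (flag names purely hypothetical, just to illustrate the idea):

```java
// Sketch: a flags int carried on the token (or stream) where individual
// bits tell downstream filters how to treat payloads when they create
// new tokens from existing ones.
public class PayloadFlags {
    public static final int COPY_PAYLOAD_ON_SPLIT = 1 << 0;
    public static final int DROP_PAYLOAD_ON_SPLIT = 1 << 1;

    // Check whether a given flag bit is set.
    public static boolean isSet(int flags, int flag) {
        return (flags & flag) != 0;
    }
}
```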

-Yonik

Re: Payloads in Solr

Grant Ingersoll-2

On Nov 18, 2007, at 11:09 PM, Yonik Seeley wrote:
>
>> I'm also wondering how others have accomplished this.  Grant  
>> Ingersoll
>> noted that one of the original use cases was XPath queries so I'm
>> particularly interested in finding out if anyone has implemented  
>> that,
>> and how.
>
> Me too.   Any clarifications on that Grant???

From what I understand from Michael Busch, you can store the path at
each token, but this doesn't seem efficient to me.  I would think you
may want to come up with some more efficient encoding.  I am cc'ing
Michael on this thread to see if he is able to shed any light on the
subject (he may not be able to b/c of employer reasons).  If he
can't, then we can brainstorm a bit more on how to do it most
efficiently.

An interesting thing here to think about is how we can come up with  
more general support for XML documents and other structured docs.  For  
instance, a common syntax used in NLP for tokens is something like:
The|DET quick|JJ red|JJ fox|NN jumped|VB over|??? the|DET lazy|JJ brown|JJ dogs|NN

or other variations that also apply phrase identification, semantic
relationships, etc.  These things, to me, all logically fit as payloads,
so it may be wise to think about coming up with one or two generic
supports for these kinds of things.  One could be the default XML/XPath
marked-up document, but another might be this pipe notation that is
common in NLP.
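A sketch of how that pipe notation could be split into token/payload pairs (a hypothetical helper, not an existing analyzer):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: each "word|TAG" pair becomes a token whose payload is the
// part-of-speech tag (or other annotation) after the bar.
public class PipeNotation {
    public static Map<String, String> parse(String line) {
        Map<String, String> tokens = new LinkedHashMap<String, String>();
        for (String pair : line.split("\\s+")) {
            int bar = pair.indexOf('|');
            if (bar > 0) {
                tokens.put(pair.substring(0, bar), pair.substring(bar + 1));
            }
        }
        return tokens;
    }
}
```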

See http://wiki.apache.org/lucene-java/Payload_Planning and the
related threads.

-Grant

Re: Payloads in Solr

Tricia Williams
In reply to this post by Yonik Seeley-2
Yonik Seeley wrote:
>
> http://www.nabble.com/Payload-API-tf4828837.html#a13815548
>
>  
> http://www.nabble.com/new-Token-API-tf4828894.html#a13815702
>
>  
Thanks for these links.  I didn't even realize you had started these
conversations.

Thank you!
Tricia

Re: Payloads in Solr

pgwillia
In reply to this post by Grant Ingersoll-2
I started this thread back in November.  Recall that I'm indexing XML and storing the XPath as a payload in each token.  I am not encoding or mapping the XPath but storing the text directly as String.getBytes().  We're not using this to query in any way, just to add context to our search results.  Presently, I'm ready to bounce around some more ideas about encoding XPaths, or strings in general.

Back in the day Grant said:
 
From what I understand from Michael Busch, you can store the path at
each token, but this doesn't seem efficient to me.  I would think you
may want to come up with some more efficient encoding.  I am cc'ing
Michael on this thread to see if he is able to shed any light on the
subject (he may not be able to b/c of employer reasons).  If he
can't, then we can brainstorm a bit more on how to do it most
efficiently.
The word "encoding" in Grant's response brings to mind Huffman coding (http://en.wikipedia.org/wiki/Huffman_coding).  This would not solve the query on payload problem that Yonik pointed out because the encoding would be document centric, but could reduce the amount of total bytes that I need to store.

Any ideas?

Tricia