Structured Lucene documents

Pierre-Yves LANDRON
Hello,

Is it possible to structure Lucene documents via Solr, so that one document can be nested inside another?

What I would like to do, for example: I want to retrieve full-text articles, each of which spans several pages. Results must take into account both the pages and the article that the search terms come from. I can create a Lucene document for each page of the article AND for the article itself, and do two requests to get my results, but that would duplicate the full text in the index and would not be very efficient.

Ideally, I would like to create a document for indexing the text of each page of the article, and group these documents in one document that describes the article: this way, when Lucene retrieves a requested term, I'll get the article and the page that contains the term.

I wonder if there's a way to emulate this behavior elegantly with Solr?

Kind Regards,
Pierre-Yves Landron

Re: Structured Lucene documents

Pieter Berkel
In theory, you could store all your pages in a single document using a
dynamic field type:

<dynamicField name="page*" type="text" indexed="true" stored="true" />

Store each page in a separate field (e.g. page1, page2, page3 .. pageN) then
at query time, use the highlighting parameters to highlight matches in the
page fields. You should be able to determine which page field matched the
query by observing the highlighted results (I'm not certain whether the hl.fl
parameter accepts dynamic field names; you may need to specify them all
manually):

hl=true&hl.fl=page1,page2,page3,pageN&hl.requireFieldMatch=true
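
As a rough illustration of the parameter string above, here is a minimal Python sketch that builds the explicit hl.fl list for a known number of pages. The function name, the page count and the sample query are assumptions made for the example; only the page1..pageN naming and the hl parameters come from the post.

```python
from urllib.parse import urlencode


def highlight_params(query, page_count, require_field_match=True):
    """Build query parameters that list every page field explicitly in hl.fl."""
    # Field names follow the page1..pageN convention described above.
    page_fields = [f"page{n}" for n in range(1, page_count + 1)]
    return {
        "q": query,
        "hl": "true",
        "hl.fl": ",".join(page_fields),
        "hl.requireFieldMatch": "true" if require_field_match else "false",
    }


# Hypothetical query against one page field of a three-page document.
params = highlight_params("page2:lucene", page_count=3)
print(urlencode(params))
```

The page count has to be known (or over-estimated) on the client side, which is the weak point of listing the fields by hand.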

It sounds like a bit of a rough hack and I haven't actually tried to do this
myself, maybe someone else has a better idea?

cheers,
Piete



RE: Structured Lucene documents

Pierre-Yves LANDRON
In reply to this post by Pierre-Yves LANDRON
Hello!

Thanks Pieter,

That seems a good idea, if not an ideal one, even if it is sort of a hack. I will try it as soon as possible and keep you informed. The hl.fl parameter doesn't have to be set, I think, so it won't be a problem.

On the other hand, I will have the exact same problem specifying the (dynamic) fields on which the query is performed... I need to be able to run the query on the full text of the pages only: must I specify the (highly variable) name of each page field in my query?

I think that structured index documents could be of great value for indexing complex documents. Is there a chance that Solr will someday include such a feature, or is it basically impossible (due to the way Lucene works, for example)?

Kind Regards,
Pierre-Yves Landron

Re: Structured Lucene documents

Pieter Berkel
On 13/08/07, Pierre-Yves LANDRON <[hidden email]> wrote:

>
> Hello! Thanks Pieter. That seems a good idea, if not an ideal one, even
> if it is sort of a hack. I will try it as soon as possible and keep you
> informed. The hl.fl parameter doesn't have to be set, I think, so
> it won't be a problem. On the other hand, I will have the exact same
> problem specifying the (dynamic) fields on which the query is performed... I
> need to be able to run the query on the full text of the pages only:
> must I specify the (highly variable) name of each page field in my
> query? I think that structured index documents could be of great value for
> indexing complex documents. Is there a chance that Solr will someday include
> such a feature, or is it basically impossible (due to the way Lucene works,
> for example)? Kind Regards, Pierre-Yves Landron


Hi Pierre-Yves,

Maybe you could use a dynamic copyField in your schema.xml to index the content
of all the pages stored in your document in a separate field, something like:

<copyField source="page*" dest="all_pages" />

and then you would only need to query on the "all_pages" field.  Not quite
sure how this might be affected by the hl.requireFieldMatch=true parameter
but it's worth a try.
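
If the query runs against "all_pages" while highlighting is still requested on the individual page fields, the matching pages could be read off the highlighting section of the response. The sketch below assumes a JSON-style response in which highlighting is a mapping of document id to field name to snippets, and that hl.requireFieldMatch is left off; the document id and snippets are made up for the example.

```python
import re


def matched_pages(highlighting, doc_id, prefix="page"):
    """Return the sorted page numbers whose fields carry highlight snippets."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    pages = []
    for field, snippets in highlighting.get(doc_id, {}).items():
        match = pattern.match(field)
        # Skip fields such as all_pages that are not pageN, and empty snippet lists.
        if match and snippets:
            pages.append(int(match.group(1)))
    return sorted(pages)


# Hypothetical fragment shaped like the "highlighting" part of a response.
sample = {
    "article-42": {
        "page2": ["... <em>lucene</em> ..."],
        "page7": ["... <em>lucene</em> index ..."],
        "all_pages": ["... <em>lucene</em> ..."],
    }
}
print(matched_pages(sample, "article-42"))
```

Note that the destination field of the copyField has to be declared in the schema as well, since "all_pages" does not match the page* pattern.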

cheers,
Piete

RE: Structured Lucene documents

Pierre-Yves LANDRON
Hello!

At last, I've had the opportunity to test your solution, Pieter, which was to use a dynamic field:


> <dynamicField name="page*" type="text" indexed="true" stored="true" />
>
> Store each page in a separate field (e.g. page1, page2, page3 .. pageN) then
> at query time, use the highlighting parameters to highlight matches in the
> page fields. You should be able to determine the page field that matched the
> query by observing the highlighted results (I'm not certain if the hl.fl
> parameter accepts dynamic field names, you may need to specify them all
> manually):
>
> hl=true&hl.fl=page1,page2,page3,pageN&hl.requireFieldMatch=true



As expected, it does not work when the option hl.requireFieldMatch=true is used...

But when the option is set to false, it seems to work fine, and at first thought I don't need it.
As you said, I need to specify each field when querying the index... It's a big letdown in my case: it's a shame, because your solution nearly answers my problem.

It seems the highlight fields must be specified, and that I can't use the * wildcard to do so.

Am I right? Is there a way around this requirement?

Anyway, thank you very much!

Kind Regards,

Pierre-Yves Landron


Re: Structured Lucene documents

Pieter Berkel
On 21/08/07, Pierre-Yves LANDRON <[hidden email]> wrote:
>
> It seems the highlight fields must be specified, and that I can't use the
> * wildcard to do so.
> Am I right? Is there a way around this requirement?


As far as I know, dynamic fields are used mainly during indexing and
aren't expandable at query time.  It would be quite cool if Solr could do
query-time expansion of dynamic fields (e.g. hl.fl=page_*); however, that
would require some knowledge of the dynamic fields already stored in the
index, which I don't think is currently available in either Solr or Lucene.

Piete

Re: Structured Lucene documents

hossman

: aren't expandable at query time.  It would be quite cool if Solr could do
: query-time expansions of dynamic fields (e.g. hl.fl=page_*) however that
: would require some knowledge of the dynamic fields already stored in the
: index, which I don't think is currently available in either Solr or Lucene.

it is possible to get a list of all indexed fields from the underlying
Lucene IndexReader, so it's certainly possible .. the notion of
supporting "glob" syntax in all the situations where a list of field names
is used has been talked about before, but no one has attempted a
comprehensive patch yet.

note the comments in this issue, and the two threads it links to...

http://issues.apache.org/jira/browse/SOLR-247
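
Until something like that exists on the server side, the glob could be expanded on the client before the request is sent. This is only a sketch of that workaround, not of anything Solr does: it assumes the client already has a list of the field names present in the index (how that list is obtained is left open), and the function and variable names are hypothetical.

```python
from fnmatch import fnmatchcase


def expand_field_globs(requested, known_fields):
    """Expand glob patterns in a comma-separated field list against known names."""
    expanded = []
    for pattern in requested.split(","):
        pattern = pattern.strip()
        if not pattern:
            continue
        matches = sorted(f for f in known_fields if fnmatchcase(f, pattern))
        # Fall back to the pattern itself when nothing matches, so a plain
        # field name that is missing from the list is passed through unchanged.
        for name in matches or [pattern]:
            if name not in expanded:
                expanded.append(name)
    return expanded


# Hypothetical list of field names known to exist in the index.
known = ["id", "title", "page1", "page2", "page10", "all_pages"]
hl_fl = ",".join(expand_field_globs("title,page*", known))
print(hl_fl)
```

The expanded names come back in plain string order, so page10 sorts before page2; that does not matter for hl.fl but is worth knowing if the order is reused elsewhere.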



-Hoss


Re: Structured Lucene documents

pgwillia
In reply to this post by Pierre-Yves LANDRON
Hi All,

The Structured (or Multi-Page, Multi-Part) document problem is one I've been thinking about for a while.  A couple of years ago, when the project I was working on was using Lucene only (no Solr), we solved this problem in several steps.  At the point of ingestion we created a custom analyzer and surrounding Java code that built a mapping from term positions to the page each position falls on (recall that analyzers tokenize the terms in a given field and mark the position of each token).  This mapping was stored outside of the Lucene index.  At query time, we used home-built Java code to pull the position hits matching the query from the index and augmented the results generated by Lucene.  At presentation time the results were molded into XML and then transformed by several XSL sheets, one of which translated the position hits to the page they were on using the information gleaned from the ingestion stage.

When we moved to Solr, we created a custom QueryResponseWriter in order to get the position locations into the xml results and kept the same transformation to obtain the page level hits.  The ingestion stage stays the same -- so really we're using Lucene to build the index, but Solr sits on top of it to serve results.

I admit this is an awkward hack.  Peter Binkley (peter.binkley@ualberta.ca), who I worked with on the project, suggested this improvement:

"Paged-Text" FieldType for Solr

A chance to dig into the guts of Solr. The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.

The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.
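
To make the idea of the structural map concrete, here is a small Python sketch that walks text containing page milestones and records the position of the first term on each page. Whitespace splitting stands in for the real tokenizers and filters, so positions from an actual analyzer would differ; the function name and the sample text are assumptions.

```python
import re

# Matches milestones of the form <page id="234"/> and captures the id.
PAGE_MARK = re.compile(r'<page id="([^"]+)"\s*/>')


def build_page_map(text):
    """Return (tokens, page_map); page_map holds (first_term_position, page_id)."""
    tokens, page_map = [], []
    # With one capture group, re.split alternates text chunks and page ids.
    for index, part in enumerate(PAGE_MARK.split(text)):
        if index % 2 == 1:
            # A milestone: the next token emitted is the first term of this page.
            page_map.append((len(tokens), part))
        else:
            tokens.extend(part.split())
    return tokens, page_map


tokens, page_map = build_page_map(
    '<page id="1"/>solr sits on <page id="2"/>top of lucene'
)
print(page_map)
```

The resulting list is already sorted by position, which is what makes a cheap lookup possible at search time.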

At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:

<lst name="pages">
        <lst name="doc1">
                <int name="pageid">234</int>
                <int name="pageid">236</int>
        </lst>
        <lst name="doc2">
                <int name="pageid">19</int>
        </lst>
</lst>
<lst name="hitpos">
        <lst name="doc1">
                <lst name="234">
                        <int name="pos">14325</int>
                </lst>
        </lst>
        ...
</lst>
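
The search-time step described above, turning term positions into page ids, amounts to a binary search over the stored map. A minimal sketch, assuming the map is a list of (firstterm, page id) pairs sorted by position; apart from page 234 starting at 14324 and the hit at 14325, the values are invented for the example.

```python
from bisect import bisect_right


def page_for_position(page_map, position):
    """Return the id of the page containing the given term position, or None."""
    starts = [start for start, _ in page_map]
    index = bisect_right(starts, position) - 1
    if index < 0:
        return None  # the position precedes the first page milestone
    return page_map[index][1]


def pages_for_hits(page_map, positions):
    """Group hit positions by the page they fall on."""
    pages = {}
    for position in positions:
        pages.setdefault(page_for_position(page_map, position), []).append(position)
    return pages


# Hypothetical map: only the entry for page 234 is taken from the proposal.
page_map = [(0, "233"), (14324, "234"), (14980, "235")]
print(pages_for_hits(page_map, [14325, 15000, 12]))
```

Grouping by page gives roughly the shape of the "hitpos" section in the example response.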

We have some code that does something like this in a Lucene context, which could form the basis for a Solr fieldtype; but it would probably be just as easy to start fresh.

My current project would also like to include some metadata about each sub-part of the document.  For example, each page would have a URL and/or a title associated with its content.  This becomes meaningful when we index things like newspapers and monographs, which may have page-, chapter-, or section-level content.  So a solution would ideally take this into consideration.
 
Does anyone with more experience know if this is a reasonable approach?  Does an issue exist for this feature request?  Other comments or questions?

Thanks,
Tricia
