Newbie question: using Lucene to index hierarchical information.

Newbie question: using Lucene to index hierarchical information.

leonardinius
Hi all,

First of all, sorry for my poor English; it's not my native language.

I'm trying to use Lucene to index hierarchical information. I have
structured HTML and PDF/Word documents, and I want to index them so that I
can search in titles, text, paragraphs, or tables only, or in any
combination of these. At the moment I see three possible solutions:

   - Create a set of all possible fields, such as contents, title, heading,
   table, etc., and index the data into each of them accordingly. Possible
   impacts:
      - a large number of fields
      - data duplication (a search in paragraphs has to look inside all
      the inner elements, so every indexed outer element will contain the
      content of all its inner elements as well)
   - Create a hierarchy of fields, such as "title", "paragraph/title",
   "paragraph/title/subparagraph/table". Possible impacts:
      - the number of fields remains the same
      - a loose (inconsistent) set of fields
      - I'm not sure how I would process the required information and
      perform the search
      - performance issues?
   - Use one field for content and add a location prefix to the content,
   for example "contents:paragraph/heading:token1 token2". Here
   "paragraph/heading:" is a prefix carrying additional information, so I
   could (possibly?) reuse PrefixQuery functionality or something similar.
   Impacts:
      - a small, fixed set of index fields
      - additional processing: all the queries I use will have to work as
      PrefixQuery
      - performance issues?
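
To make the third option concrete, here is a minimal plain-Java sketch of
location-prefixed terms. It does not use the Lucene API; the class and
method names (PrefixedTerms, prefixTerms, matches) are hypothetical, and the
whitespace tokenization is only an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration of the location-prefix idea. It does not use the
// Lucene API; the class and method names are hypothetical.
final class PrefixedTerms {

    // Turns the words of one element into terms such as "paragraph/heading:token1".
    static List<String> prefixTerms(String path, String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(path + ":" + word);
            }
        }
        return terms;
    }

    // True if some term holds the word at the given path or anywhere below it.
    static boolean matches(List<String> terms, String pathPrefix, String word) {
        String suffix = ":" + word.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            if (!term.endsWith(suffix)) {
                continue;
            }
            String path = term.substring(0, term.length() - suffix.length());
            if (path.equals(pathPrefix) || path.startsWith(pathPrefix + "/")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = prefixTerms("paragraph/heading", "token1 token2");
        System.out.println(terms);
        System.out.println(matches(terms, "paragraph", "token1"));
        System.out.println(matches(terms, "table", "token1"));
    }
}
```

Note that "word anywhere below paragraph" needs both the leading path and
the trailing word to match, which is more than a plain leading-characters
match. A prefix match alone would cover only the case of a fully known path
plus the beginning of a word.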


So, has anyone tried to make things work like that? Or am I trying to use a
wrench to hammer in nails? I assume Lucene wasn't designed to be used like
that, but it's worth trying (or at least asking).
Any results or suggestions are welcome!

--
Best regards,
Leonid Maslov!
Adrienne Gusoff  - "Opportunity knocked. My doorman threw him out."

Re: Newbie question: using Lucene to index hierarchical information.

leonardinius
Any comments or suggestions? Maybe I should rephrase my original message or
describe it in more detail?
I would really appreciate any response.

Thanks a lot in advance!

On Mon, Sep 1, 2008 at 10:25 AM, Leonid Maslov <[hidden email]> wrote:

> Hi all,
> [...]

Re: Newbie question: using Lucene to index hierarchical information.

Karsten F.-2
Hi Leonid,

What kind of query is your use case?

Complex scenario:
You need all the hierarchical structure information in one query. This means you want to search with XPath in a real XML database (for example: all documents with a subtitle XY that is directly followed by a table with the same column as ...).

Normal scenario:
You want to search in only one part of your hierarchical information, such as "documents with word xy in the title" or "documents with word xy in a table".

I am not familiar with using Lucene in XML databases, but I can advise on the "normal scenario":

Take a look at the XML-aware search in XTF ( http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7 ).
The idea is to use one Lucene document for each section, with only two fields: "text" and "sectionType".
All hits belonging to one hierarchical unit (e.g. one HTML file) are then collected and collapsed into one representative hit.
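
As a rough illustration of this idea, the following plain-Java sketch models one index entry per section and collapses the hits to one result per source file. It does not use the Lucene or XTF APIs; the class names and the "sourceFile" attribute used for grouping are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java model of "one index document per section". It does not use the
// Lucene API; all names here are hypothetical.
final class SectionSearchSketch {

    // Stands in for one Lucene document with the fields "sectionType" and "text".
    // "sourceFile" is an assumed extra attribute used to group hits.
    static final class Section {
        final String sourceFile;
        final String sectionType;
        final String text;

        Section(String sourceFile, String sectionType, String text) {
            this.sourceFile = sourceFile;
            this.sectionType = sectionType;
            this.text = text;
        }
    }

    // Finds sections of the given type that contain the word, then collapses
    // the hits so that each source file is returned at most once.
    static List<String> search(List<Section> index, String sectionType, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        Set<String> files = new LinkedHashSet<>();
        for (Section s : index) {
            if (!s.sectionType.equals(sectionType)) {
                continue;
            }
            List<String> tokens = Arrays.asList(s.text.toLowerCase(Locale.ROOT).split("\\W+"));
            if (tokens.contains(needle)) {
                files.add(s.sourceFile);
            }
        }
        return new ArrayList<>(files);
    }

    public static void main(String[] args) {
        List<Section> index = Arrays.asList(
                new Section("a.html", "title", "Lucene basics"),
                new Section("a.html", "table", "Lucene release notes"),
                new Section("a.html", "table", "lucene index size"),
                new Section("b.html", "paragraph", "Lucene in practice"));
        System.out.println(search(index, "table", "lucene"));
    }
}
```

Two table sections of a.html match in the example, but the file is reported only once, which is the "one representative hit" behaviour described above.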

Best regards
  Karsten


Re: Newbie question: using Lucene to index hierarchical information.

leonardinius
Hi all,
Thanks a lot for such a quick reply.

Both scenarios sound good to me. I would like to implement one of them as a
proof of concept and then incrementally improve, retest, investigate, and
rewrite :)

So, moving from the soap opera to the questions:

   - How can I implement these two scenarios on top of the Lucene and
   Lucene contrib codebase?
      - I looked at
      http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
      and didn't like it (too big, too heavy, a ready-to-use solution
      rather than a toolkit). I also didn't understand how to implement the
      "Normal scenario" on top of it.
   - Any suggestions on how I could begin implementing these things, moving
   gently from the "Normal" scenario to the more advanced "Complex" one?
   What should I be wary of, and what are the possible impacts, if any?

Has anybody tried to use Lucene to analyse things like that? What would be
possible ways to store the indexed data and run queries on it? If Lucene
isn't the right tool for this job, maybe some other toolkit (possibly on
top of Lucene) would be more useful.

Thanks in advance for any suggestions and comments. I would appreciate any
ideas and directions to look into.


On Tue, Sep 2, 2008 at 11:46 AM, Karsten F. <[hidden email]> wrote:

> Hi Leonid,
> [...]

Re: Newbie question: using Lucene to index hierarchical information.

Karsten F.-2
Hi Leonid,

Do you really need the "Complex scenario"?
What kind of query is your use case?

If you really need XPath, please look at XML databases.

Otherwise you can possibly use XTF out of the box, because "indexing of large structured documents" is exactly the use case for which XTF was developed (TEI documents, but HTML is less complex than TEI).
Again, the main idea:
1. Use XML elements (with their descendants) to divide the structured document into sections.
2. Index each section as a Lucene document (field "text") with an extra field "sectionType".
3. After all sections of one structured document, insert one (terminal) Lucene document with the other metadata of the structured document (e.g. creation date, title, ...).

The document from point 3 is the representative of the structured document (and the representative is the unit of retrieval, because the user searches for a document, not for a section).
If you search, e.g., for
sectionType:table text:words inside section
you get hits for the Lucene documents belonging to the sections.

For your use case it may be enough to add a stored Lucene field "document key".
In XTF, the Lucene document number of each hit is incremented until the representative is reached.
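
The "increment until the representative is reached" step can be sketched in plain Java as follows. This is not XTF or Lucene code: the list position stands in for the Lucene document number, a null section type marks the representative, and all names are hypothetical. The sketch also assumes that the sections of one structured document and its representative are added contiguously, in that order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java model of section documents followed by one terminal
// representative document. Not XTF or Lucene code; names are hypothetical.
final class RepresentativeLookup {

    static final class Doc {
        final String sectionType; // null marks the representative document
        final String text;        // section text, or metadata for the representative

        Doc(String sectionType, String text) {
            this.sectionType = sectionType;
            this.text = text;
        }

        boolean isRepresentative() {
            return sectionType == null;
        }
    }

    // Walks forward from a hit until the next representative document.
    static int representativeOf(List<Doc> index, int hit) {
        for (int n = hit; n < index.size(); n++) {
            if (index.get(n).isRepresentative()) {
                return n;
            }
        }
        throw new IllegalStateException("no representative after document " + hit);
    }

    // Searches sections of one type and returns the distinct representatives.
    static List<Integer> search(List<Doc> index, String sectionType, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        Set<Integer> result = new LinkedHashSet<>();
        for (int n = 0; n < index.size(); n++) {
            Doc d = index.get(n);
            if (!sectionType.equals(d.sectionType)) {
                continue;
            }
            List<String> tokens = Arrays.asList(d.text.toLowerCase(Locale.ROOT).split("\\W+"));
            if (tokens.contains(needle)) {
                result.add(representativeOf(index, n));
            }
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<Doc> index = Arrays.asList(
                new Doc("title", "Intro"),
                new Doc("table", "alpha beta"),
                new Doc(null, "Report A"),
                new Doc("table", "beta gamma"),
                new Doc("table", "beta"),
                new Doc(null, "Report B"));
        System.out.println(search(index, "table", "beta"));
    }
}
```

With a stored "document key" field instead, the forward walk would be replaced by grouping the hits on that key, at the cost of storing the key in every section document.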

This is a rough description, but the source code of XTF is very readable.

best regards

  Karsten



Re: Newbie question: using Lucene to index hierarchical information.

leonardinius
Hi Karsten,
Thanks a lot. I finally got your idea.

OK, I think it's time to do the real work now :) Thanks for the advice; I
finally understand the directions I could take.

> do you really need the "Complex scenario"?
> what kind of query is your use case?

My query use case is something like this: find documents whose paragraphs
are similar to the paragraphs of a given document, or to a single paragraph
or part of one (using n-grams or similar/modified tokenizers, and
stemming/NLP-style similarity).
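
One simple way to make "similar paragraphs" concrete is the Jaccard overlap
of character n-gram sets. The plain-Java sketch below is only an assumption
about what such a similarity could look like; it does not use Lucene, and
the names (NGramSimilarity, charNGrams, jaccard) are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Character n-gram overlap between two paragraphs. Illustrative only; not a
// Lucene tokenizer or similarity implementation.
final class NGramSimilarity {

    // Returns the set of character n-grams of the normalized text.
    static Set<String> charNGrams(String text, int n) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> first = charNGrams("Lucene indexes structured documents", 3);
        Set<String> second = charNGrams("Lucene indexes structured document sections", 3);
        System.out.println(jaccard(first, second));
    }
}
```

In an index built along the one-document-per-section lines, the n-grams of
each paragraph would become the terms of that section, so paragraphs sharing
many n-grams with the query paragraph would tend to rank higher.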

I finally understood the idea behind the XML-based approach. I think it
isn't suitable for me anyway, for these reasons:

   - DB support (MSSQL and Oracle, or some Java ad-hoc solutions)
   - Speed of XPath-like queries on big datasets.

So I assume the variant you recommend suits me best.
However, it's hard to understand what XTF does just by opening its source
code while being a newbie in Lucene. But what has to be done has to be
done; no one will do my job for me anyway. :))

I'll try to make some time to dig into the XTF code. If something is
unclear or questionable, I assume the XTF mailing list would be the right
place to ask, rather than this one (java-lucene-user)?

Thanks a lot for pointing out possible directions and solutions. I really
appreciate your help and the time you spent providing such helpful
descriptions. God bless the OSS community!

On Tue, Sep 9, 2008 at 12:26 AM, Karsten F. <[hidden email]> wrote:

> Hi Leonid,
> [...]

--
Best regards,
Leonid Maslov!
Personal blog: http://leonardinius.blogspot.com/

Random thought:
Marcel Marceau  - "Never get a mime talking. He won't stop."

Re: Newbie question: using Lucene to index hierarchical information.

Marcelo F. Ochoa
In reply to this post by leonardinius
Hi Leonid,
   If you are not familiar with Oracle XMLDB schema mappings, here is an
example of how to store Wikipedia XML dumps in an Oracle database using
an XML-to-relational model:
http://marceloochoa.blogspot.com/2007/12/uploading-wikipedia-dumps-to-oracle.html
   The structure of Wikipedia dumps seems to be similar to your data
model, so if you are using Oracle you can use this example as a jump
start for efficiently mapping XML inside Oracle.
   There is also an example of how to index it with Lucene running as
a new Domain Index for Oracle databases, to get the best of both
worlds :) Lucene for efficient free-text searching, and the relational
DB for quickly sorting and filtering relational data.
   Best regards, Marcelo.
--
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
______________
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/1555583296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/
