Building Lucene index for XML document


maureen tanuwidjaja
Hi,
  I am a final-year undergraduate. My final-year project is a search engine for XML documents, and I am currently building it with Lucene.
 
  An example XML element from one of the documents:
  ----------------------------------------------
  <article>
      <body>
          <section>
              This is my first text
          </section>
          <p>
              This is my second text
          </p>
      </body>
  </article>
  ----------------------------------------------
  After the XML document is parsed, I will get:
 
  tag      : article
  value    :

  tag      : article/body
  value    :

  tag      : article/body/section
  value    : This is my first text

  tag      : article/body/p
  value    : This is my second text
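
  A minimal sketch of how such tag/value pairs could be produced, using only the SAX parser that ships with the JDK. The class name XmlPathFlattener and its flatten method are hypothetical names chosen for this illustration; they are not part of Lucene, and this is only one possible way to do the parsing step.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Lucene class): flattens XML into {path, text} pairs.
class XmlPathFlattener extends DefaultHandler {
    private final Deque<String> openTags = new ArrayDeque<>();          // outermost first
    private final Deque<String[]> openPairs = new ArrayDeque<>();       // pair of each open element
    private final Deque<StringBuilder> openText = new ArrayDeque<>();   // direct text of each open element
    final List<String[]> pairs = new ArrayList<>();                     // result, in document order

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openTags.addLast(qName);
        // Register the pair now so the output follows the order of the opening tags.
        String[] pair = { String.join("/", openTags), "" };
        pairs.add(pair);
        openPairs.addLast(pair);
        openText.addLast(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text is credited to the innermost open element only.
        if (!openText.isEmpty()) {
            openText.peekLast().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openPairs.removeLast()[1] = openText.removeLast().toString().trim();
        openTags.removeLast();
    }

    // Parses the given XML string and returns the {path, text} pairs.
    static List<String[]> flatten(String xml) {
        XmlPathFlattener handler = new XmlPathFlattener();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.pairs;
    }

    public static void main(String[] args) {
        String xml = "<article><body><section>This is my first text</section>"
                   + "<p>This is my second text</p></body></article>";
        for (String[] pair : flatten(xml)) {
            System.out.println("tag   : " + pair[0]);
            System.out.println("value : " + pair[1]);
        }
    }
}
```

  Each pair could then become one field of a document in the index, with pair[0] as the field name and pair[1] as the text to analyze.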
 
 
  When constructing the Lucene index, I treat:
  1. the XML tag as the field
  2. its value as the terms to be indexed
 
 
  Before implementing this search engine, I had planned to build the index so that every XML tag is converted to a binary value, in order to reduce the index size and perhaps make searching faster. To illustrate:
 
  article will be converted to 0
  article/body will be converted to 0.0
  article/body/section will be converted to 0.0.0
  article/body/p will be converted to 0.0.1
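
  The conversion illustrated above could be sketched roughly as follows. This assumes each level's number is the position of the tag among the distinct children of its parent, in order of first appearance, which matches the four examples but is otherwise a guess at the intended rule. PathCodes and encode are hypothetical names, not Lucene APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the path-to-code scheme; not a Lucene API.
class PathCodes {
    private final Map<String, String> codeByPath = new HashMap<>();
    // For each parent path, how many distinct child paths have been numbered so far.
    private final Map<String, Integer> childCount = new HashMap<>();

    // Returns the dotted code for a path such as "article/body/p",
    // assigning new codes to the path and its ancestors as needed.
    String encode(String path) {
        String known = codeByPath.get(path);
        if (known != null) {
            return known;
        }
        int slash = path.lastIndexOf('/');
        String parent = slash < 0 ? "" : path.substring(0, slash);
        String prefix = parent.isEmpty() ? "" : encode(parent) + ".";
        int index = childCount.merge(parent, 1, Integer::sum) - 1;
        String code = prefix + index;
        codeByPath.put(path, code);
        return code;
    }

    public static void main(String[] args) {
        PathCodes codes = new PathCodes();
        String[] paths = { "article", "article/body", "article/body/section", "article/body/p" };
        for (String path : paths) {
            System.out.println(path + " will be converted to " + codes.encode(path));
        }
    }
}
```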
 
  Now that I am using Lucene for the implementation, I wonder whether such a conversion would still help efficiency. I wonder whether the Lucene index itself already does this kind of conversion, or perhaps even further optimization, to reduce the index size or speed up searching.
 
  Can anyone give me some information?
 
  Really need help...Thanks a lot
 
 
  Regards,
 
  Maureen
 
 
 
 
 

Re: Building Lucene index for XML document

Daniel Noll-3-2
maureen tanuwidjaja wrote:

> Before implementing this search engine,I have designed to build the
> index in such a way that every XML tag is converted using binary
> value,in order to reduce the size index and perhaps for faster
> searching.To illustrate:
>
>   article will be converted to 0
>   article/body will be converted to 0.0
>   article/body/section will be converted to 0.0.0
>     article/body/p will be converted to 0.0.1
>  
> Now,because of using lucene for the implementation,i wonder wheter
> such conversion will still be useful for efficiency..I wonder wheter
> inside the lucene index itself, such kind of conversion or perhaps even
> further optimization is already done in order to reduce the size index
> or for faster searching.

Indeed you don't need to do this because each field stores its name as
an integer lookup already.  See SegmentTermEnum / FieldInfos.
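
As a rough illustration of the idea in this reply (a long field name is recorded once, and later references use a small integer instead), here is a minimal sketch. The class FieldNumberTable and its methods are hypothetical and written only for this example; this is not Lucene's FieldInfos implementation and does not reproduce its on-disk layout.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a tiny name <-> number table. NOT Lucene's FieldInfos class.
class FieldNumberTable {
    private final Map<String, Integer> numberByName = new HashMap<>();
    private final List<String> nameByNumber = new ArrayList<>();

    // Returns the number already assigned to a field name, or assigns the next free one.
    int numberOf(String fieldName) {
        Integer existing = numberByName.get(fieldName);
        if (existing != null) {
            return existing;
        }
        int number = nameByNumber.size();
        numberByName.put(fieldName, number);
        nameByNumber.add(fieldName);
        return number;
    }

    // Looks the name back up from its number.
    String nameOf(int number) {
        return nameByNumber.get(number);
    }

    public static void main(String[] args) {
        FieldNumberTable table = new FieldNumberTable();
        String[] fields = { "article/body/section", "article/body/p", "article/body/section" };
        for (String field : fields) {
            System.out.println(field + " -> " + table.numberOf(field));
        }
    }
}
```

With a table like this, the length of a field name such as article/body/section is paid once per table rather than once per term, which is why shortening the names by hand tends to gain little.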

Daniel

--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/                               Fax: +61 2 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Building Lucene index for XML document

maureen tanuwidjaja

 
  Thanks a lot Daniel :)
   
  Regards,
  Maureen




 

Re: Building Lucene index for XML document

maureen tanuwidjaja
By the way, Daniel, could you please point me to a reference that explains SegmentTermEnum / FieldInfos, if one exists? I searched, but the best I could find is http://lucene.apache.org/java/docs/clover/org/apache/lucene/index/SegmentTermEnum.html which is only the source code...
   
  Many thanks and Best regards ^^
  Maureen
   
 






Re: Building Lucene index for XML document

Doron Cohen
Hi Maureen,

Some relevant info in the file formats doc -
http://lucene.apache.org/java/docs/fileformats.html

Regards,
Doron

maureen tanuwidjaja <[hidden email]> wrote on 25/01/2007
01:31:25:

> btw Daniel,can please give me the reference to find the explanation
> about SegmentTermEnum/Field Infos if such one exist? I search but
> best can only find
> http://lucene.apache.org/java/docs/clover/org/apache/lucene/index/SegmentTermEnum.html
> which is the source code only...
>
>   Many thanks and Best regards ^^
>   Maureen
>



Re: Building Lucene index for XML document

maureen tanuwidjaja
Thanks Doron =)
 
  Regards,
  Maureen




 

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

maureen tanuwidjaja
Hi Mike and Erick and all,
 
  I have fixed my code and yes, indexing is now much faster than it was when I was "hammering" the IndexWriter.
 
  However, I am now getting an error while indexing:
 
  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 
  This error never happened before; it did not occur even when I was hammering the IndexWriter.
 
  Is it because I am indexing too many documents? I am indexing thousands of XML documents, and the total size is about 1 GB...
 
 
 

Re: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Пустовалов Михаил
Try this: -XX:MaxPermSize=128m

On Fri, 26 Jan 2007 19:32:45 +0300, maureen tanuwidjaja  
<[hidden email]> wrote:

> Hi Mike and Erick and all,
>  I have fixed my code and yes,indexing is much faster than previously  
> when I do such "hammering" with IndexWriter
>  However,I am now encountering the error while indexing
>  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>  This error never happens before..it doesnt even exist when I do such  
> hammering with the IndexWriter
>  Is it because the document I am going to index is too much?coz I am  
> indexing thousands of XML Documents and total size is about 1Giga...



--
----====[]====----

Пустовалов Михаил.


Re: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

maureen tanuwidjaja

 
 
  Errr... where should I put that "-XX:MaxPermSize=128m"?
 
 
  Thanks Pustovalov
 
  Regards,
  Maureen
 




 

Re: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Пустовалов Михаил
In your java command line, of course :)

Example : java -Xms128m -Xmx1024m -server -Djava.awt.headless=true  
-XX:MaxPermSize=128m protei.Starter
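
One way to confirm that flags like these actually reached the JVM is to print the heap limit from inside the program. The sketch below uses only the standard Runtime API; HeapCheck is a hypothetical class name. As far as I know, the "Java heap space" message is tied to the heap limit set with -Xmx, while -XX:MaxPermSize sizes the separate permanent generation on older HotSpot JVMs, so -Xmx is probably the flag in the example above that matters most for this error.

```java
// Hypothetical helper for checking which heap limit the running JVM received.
class HeapCheck {
    private static final long MIB = 1024L * 1024L;

    // Upper bound the JVM will try to use for the heap (driven by -Xmx).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    // Heap currently occupied: memory claimed from the OS minus the free part of it.
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB):  " + maxHeapMiB());
        System.out.println("used heap (MiB): " + usedHeapMiB());
    }
}
```

Running it as java -Xmx1024m HeapCheck should report a maximum close to 1024 MiB (the exact figure can be slightly lower, depending on the JVM and garbage collector).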


On Fri, 26 Jan 2007 19:39:13 +0300, maureen tanuwidjaja  
<[hidden email]> wrote:

>
>  Errrr...where shall I put that" -XX:MaxPermSize=128m"?
>  Thanks Pustovalov
>  Regards,
>   Maureen



--
----====[]====----

Пустовалов Михаил.


Re: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

maureen tanuwidjaja
Oh, thanks then :)





 

Re: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Пустовалов Михаил
In my applications the JVM throws [java.lang.OutOfMemoryError: Java heap
space] when too many Java classes have been loaded and/or when I use
bytecode manipulation libraries (Hibernate, ASM, cglib, for example) -
the JVM has no more memory left to compile bytecode.

On Fri, 26 Jan 2007 19:46:06 +0300, maureen tanuwidjaja  
<[hidden email]> wrote:

> java.lang.OutOfMemoryError: Java heap space



--
----====[]====----

WBR, Pustovalov Mike.
