lucene functionality

lucene functionality

Mark Mei
At the bottom of this email is the sample xml file that we are using today.
We have about 10 million of these.

We need to know whether Lucene can support the following features.
(1) Each field is searchable and indexable.
(2) Fields such as STARTTIME and ENDTIME need to be treated as a pair so
that we can apply timestamp operations such as searching by date/time ranges.
(3) Fields such as DMA need to be treated as numeric so that we can use
comparison operators (>, <, =) on them.

We also use Apache Commons Digester to parse the xml files. So we want to
know, can all of the above requirements be supported by combining both
Digester and Lucene together, or do we need other modules in order for us to
support those requirements?
If these features can be supported, please tell us about the effort
involved (i.e., do I need to rewrite 90% of Lucene/Digester to support
these requirements, or is it more like spending one or two afternoons
extending some classes?)

<DOCUMENT>
  <DREREFERENCE>61926433</DREREFERENCE>
  <DREDBNAME>News</DREDBNAME>
  <SEGMENTID>61829557</SEGMENTID>
  <SHOWID>2051460</SHOWID>
  <PROGRAMID>21181</PROGRAMID>
  <PROGRAMNAME>Action 10 News This Morning</PROGRAMNAME>
  <PREFIX>wthi0600</PREFIX>
  <STATIONID>903</STATIONID>
  <STATIONNAME>WTHI-TV</STATIONNAME>
  <AFFILIATEID>17</AFFILIATEID>
  <AFFILIATENAME>CBS</AFFILIATENAME>
  <MARKETID>141</MARKETID>
  <MARKETNAME>Terre Haute</MARKETNAME>
  <MEDIATYPE>T</MEDIATYPE>
  <DMA>149</DMA>
  <SOURCETYPE>CC</SOURCETYPE>
  <STARTTIME>2005-07-04 06:00:00</STARTTIME>
  <ENDTIME>2005-07-04 07:00:00</ENDTIME>
  <STARTMETER>00:42:53</STARTMETER>
  <ENDMETER>00:45:02</ENDMETER>
  <DREDATE>2006-01-25 00:00:00</DREDATE>
  <DRETITLE>At we take you to break with a look at some of the fourth of
July fun going on around the wabash valley today.</DRETITLE>
  <DRECONTENT>At we take you to break with a look at some of the fourth of
July fun going on around the wabash valley today. This is action 10 news
this morning on wthi. He's been the US Attorney general for only a few
months. But alberto gonzales may already be in the running for a new job.
And not just any job, either.</DRECONTENT>
</DOCUMENT>

Re: lucene functionality

Patrek
I would suggest you take a look at exist-db (http://exist-db.org/), a
database for XML documents that supports XQuery.

We are using both products here (Lucene and exist-db), and for what you are
looking for, exist-db seems better.

Our documents are far more complex than yours (about 500 different elements
in the structure), and even though we don't have millions, we have more than
53K documents.

Once the documents are loaded in the database, performance is impressive
when looking up information in parts of documents (XPath), even where no
index exists. And for your structure, you could even create indexes, which
would boost performance even more.

Don't hesitate to contact me directly if you have more questions.

Patrick


Re: lucene functionality

Marcelo F. Ochoa
In reply to this post by Mark Mei
Hi Mark:
  For 10 million records, we recommend a strong database such as Oracle.
  You can annotate the schema (.xsd) that describes your XML records so
that some fields are stored in traditional VARCHAR2 or NUMBER columns,
which makes them faster to query, and <DRECONTENT> is stored in a CLOB column.
  You can find more information at:
http://www.oracle.com/technology/tech/xml/xmldb/index.html
http://www.oracle.com/technology/oramag/oracle/05-mar/o25xmlex.html
  If you annotate the schema, you don't need to parse the XML records
with Digester to store them in the Oracle database; you can simply insert
them using an XMLType object, FTP, or WebDAV.
  Best regards, Marcelo.



--
Marcelo F. Ochoa
http://marcelo.ochoa.googlepages.com/home
______________
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/1555583296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: lucene functionality

Doron Cohen
In reply to this post by Mark Mei
Lucene's RangeQuery would cover the "time" and "numeric" requirements.

"Mark Mei" <[hidden email]> wrote:
> At the bottom of this email is the sample xml file that we are using
> today.
> We have about 10 million of these.
>
> We need to know whether Lucene can support the following functionalities.
> (1) Each field is searchable and indexable.

This is natural in Lucene.

> (2) Fields such as STARTTIME and ENDTIME need to be treated as a pair so
> that we can apply timestamp operation such as search by data time ranges

You should be able to do it with two fields, start and end, and then you
have all the flexibility you need for queries. Conditions can be set on
either one of these or on both. If both are used, i.e. a doc matches only
if its start-to-end range falls within a certain minStart-to-maxEnd range,
you would need two open-ended range queries (or range filters): one applied
to the start date and one applied to the end date, because both bounds of a
single range query must be on the same field. Also notice that open-ended
range queries must be created programmatically; the current QueryParser does
not support them. See DateTools for storing the time values at a
resolution that matches your needs.
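
As a rough illustration of the two-field approach (a plain Python sketch, not
Lucene's Java API; `to_sortable` and `falls_within` are hypothetical names):
each timestamp is first turned into a fixed-width digit string at a chosen
resolution, so that string order matches time order, and a document is then
tested with two open-ended conditions, one per field. The exact string format
that DateTools produces is not assumed here.

```python
from datetime import datetime

# Output formats per resolution; coarser resolutions give fewer distinct terms.
_FORMATS = {
    "day": "%Y%m%d",
    "hour": "%Y%m%d%H",
    "minute": "%Y%m%d%H%M",
    "second": "%Y%m%d%H%M%S",
}


def to_sortable(timestamp, resolution="minute"):
    """Convert 'YYYY-MM-DD HH:MM:SS' to a fixed-width digit string."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime(_FORMATS[resolution])


def falls_within(doc, min_start, max_end):
    """True if the doc's start-to-end range lies inside [min_start, max_end]."""
    starts_late_enough = doc["STARTTIME"] >= min_start  # open-ended: no upper bound
    ends_early_enough = doc["ENDTIME"] <= max_end       # open-ended: no lower bound
    return starts_late_enough and ends_early_enough


# Field values taken from the sample document in the original post.
doc = {
    "STARTTIME": to_sortable("2005-07-04 06:00:00"),
    "ENDTIME": to_sortable("2005-07-04 07:00:00"),
}
```

Because each condition constrains a different field, neither can be folded
into a single two-sided range; the two tests are simply combined with AND.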

> (3) Fields such as DMA need to be treated as numerical and be able to use
> math operators ( > < =) for those fields.

Same comments on range queries / filters apply. Be aware though that
comparison is lexicographic, so numeric values should be indexed as
strings whose lexicographic order matches their numeric order (for
example, zero-padded to a fixed width). See NumberTools.
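
To see why the lexicographic comparison matters, here is a small Python
sketch (not Lucene code; `pad` and `in_range` are hypothetical helpers, and
the DMA values 20 and 1000 are made up for illustration next to the 149 from
the sample document). It assumes non-negative integers and plain zero-padding,
which is only one possible encoding and not necessarily what NumberTools does.

```python
def pad(value, width=10):
    """Zero-pad a non-negative integer to a fixed width."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    text = str(value)
    if len(text) > width:
        raise ValueError("value does not fit in the chosen width")
    return text.zfill(width)


def in_range(term, lower=None, upper=None):
    """Lexicographic range test; None leaves that side open-ended."""
    if lower is not None and term < lower:
        return False
    if upper is not None and term > upper:
        return False
    return True


dma_values = [149, 20, 1000]

# Unpadded strings sort in dictionary order, not numeric order.
raw_order = sorted(str(v) for v in dma_values)

# Padded strings sort the same way the numbers do.
padded_order = sorted(pad(v) for v in dma_values)
```

With unpadded terms, a query such as "DMA >= 100" would wrongly match "20",
because "2" sorts after "1"; with padded terms the same string comparison
gives the numeric answer.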

>
> We also use Apache Commons Digester to parse the xml files. So we want to
> know, can all of the above requirements be supported by combining both
> Digester and Lucene together, or do we need other modules in order for us
> to support those requirements?
> If these functionalities can be supported, please tell us about the
> effort involved (ie, do I need to rewrite 90% of Lucene/Digester to include
> support for these requirements, or is it more like spending one/two
> afternoons extending some classes ? )

Apart from the XML handling (I don't know about that), the other
requirements only need Lucene's existing API.




Re: lucene functionality

Chris Hostetter-3
In reply to this post by Marcelo F. Ochoa

:   For 10 million records We recommend an strong database such as Oracle.

eh ... who is "We" in that statement?

I suspect you'll find other people on this list who have no problems
running Lucene indexes containing 10 million documents.

If you want a database, then by all means use a database, but if you want
a Lucene index of 10 million documents, you can build one.


-Hoss



Re: lucene functionality

Marcelo F. Ochoa
Hi Chris:

On 12/13/06, Chris Hostetter <[hidden email]> wrote:
>
> :   For 10 million records We recommend an strong database such as Oracle.
>
> eh ... who is "We" in that statement?
  We are independent consultants who have worked with Oracle databases for many years ;)
>
> I Suspect you'll find other people on this list who have no problems
> running Lucene indexes containing 10 million documents.
  I know that Lucene can manage more than 10 million documents
perfectly well, but IMO the problem here is different. I think the XML
shown in the example implies that searching for a document by the XPath
/DOCUMENT[DREREFERENCE=61926433] is like searching a table by
primary key, not looking in an inverted index.
>
> If you want a database, then by all means use a database, but if you want
> a Lucene index of 10 million documents, you can build one.
  Yes, you can build an inverted index for 10 million documents
perfectly well, but the XML documents shown look like simple relational
data.
>
>
> -Hoss
  Best regards, Marcelo.

Re: lucene functionality

Chris Hostetter-3

: > :   For 10 million records We recommend an strong database such as Oracle.
: >
: > eh ... who is "We" in that statement?
:   We are independent consultants working for many years with Oracle databases ;)

And that's a perfectly acceptable answer; I just don't want any first-time
Lucene users to read your statement as "we the Lucene community recommend
you use a database".

: perfectly, but IMO the problem here is different, I think that the XML
: showed on the example implied that searching a document by the xpath
: /DOCUMENT/[DREREFERENCE=61926433] is like searching in a table by
: primary key, not looking at the inverted index.

That's a matter of perception: you looked at the example XML and felt that
it implied searching by XPath in a way that could easily be done using simple
select statements ... I looked at the part of the question that said...

> (1) Each field is searchable and indexable.

...and I assumed the real problem is being able to address use cases like
"find all documents where the DRECONTENT contains the word "Action" and
the word "News" near each other" (using stemming and other text analysis
tricks I may want to customize on a per-field basis), which makes me think
Lucene is a better choice than a straight relational database.
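
To make the "near each other" use case concrete, here is a minimal Python
sketch of positional matching (not Lucene's implementation; `positions` and
`near` are hypothetical names). It assumes simple lowercasing with no
stemming, and `max_distance` is just the largest allowed gap between token
positions, which is not claimed to be identical to Lucene's slop.

```python
import re


def positions(text):
    """Map each lowercased token to the list of positions where it occurs."""
    index = {}
    for pos, token in enumerate(re.findall(r"[a-z0-9']+", text.lower())):
        index.setdefault(token, []).append(pos)
    return index


def near(text, first, second, max_distance=2):
    """True if the two words occur within max_distance positions of each other."""
    index = positions(text)
    return any(
        abs(i - j) <= max_distance
        for i in index.get(first.lower(), [])
        for j in index.get(second.lower(), [])
    )


# A sentence from the DRECONTENT of the sample document.
content = "This is action 10 news this morning on wthi."
```

Here "action" is at position 2 and "news" at position 4, so they match with a
distance of 2 but not with a distance of 1. A plain substring test cannot
express this kind of query, which is the sort of thing a positional inverted
index is built for.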

:   Yes, you can build an inverted index for 10 million documents
: perfectly, but the XML documents showed look like a simple relational
: data.

Again, perception ... nothing in the question asked about doing relational
queries, so I don't think it's wise to immediately suggest a relational
database as the "recommended" solution.



-Hoss



Re: lucene functionality

Marcelo F. Ochoa
Hi Chris:
<snip>
>
> > (1) Each field is searchable and indexable.
>
> ...and I assumed the real problem is being able to address use cases like
> "find all documents where the DRECONTENT contains the word "Action" and
> the word "News" near each other" (using stemming and other text analysis
> tricks I may want to customize on a per-field basis), which makes me think
> Lucene is a better choice than a straight relational database.
  Yep, maybe Mark can clarify the expected use cases for searching.
  But the two modes can coexist.
  I am working on an Oracle/Lucene integration, so you can store the
content of the XML document in a relational table, leaving DRECONTENT
in a CLOB column that is indexed with Lucene.
  A query for /DOCUMENT[DREREFERENCE=61926433] can be transformed by
the optimizer into "select ... from ... where DREREFERENCE=61926433"
(using a B-tree index),
  and "find all documents where the DRECONTENT contains the word
"Action"" into
  select ... from ... where lcontains(DRECONTENT,'Action')>0
  The two worlds can coexist very well :)
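
The mixed mode can be sketched in a few lines of Python, under loud
assumptions: SQLite stands in for the relational database, a plain dictionary
stands in for the Lucene index, and the table name, helper names, and second
row are hypothetical. It only shows the two access paths side by side and
says nothing about how the Oracle/Lucene integration itself works.

```python
import sqlite3
from collections import defaultdict

# Relational side: structured fields, with DREREFERENCE as the primary key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news ("
    "DREREFERENCE INTEGER PRIMARY KEY, DMA INTEGER, DRECONTENT TEXT)"
)
rows = [
    (61926433, 149, "This is action 10 news this morning on wthi"),
    (61926434, 20, "Weather update for the wabash valley"),  # made-up row
]
conn.executemany("INSERT INTO news VALUES (?, ?, ?)", rows)

# Full-text side: a toy inverted index from word to document references.
inverted = defaultdict(set)
for reference, _dma, content in rows:
    for word in content.lower().split():
        inverted[word].add(reference)


def by_reference(reference):
    """Primary-key lookup, answered by the table's key index."""
    cursor = conn.execute(
        "SELECT DREREFERENCE, DMA FROM news WHERE DREREFERENCE = ?",
        (reference,),
    )
    return cursor.fetchone()


def containing(word):
    """Full-text lookup, answered by the inverted index."""
    return sorted(inverted.get(word.lower(), set()))
```

Exact-match questions go to the key lookup and word-level questions go to the
inverted index; each structure answers the kind of query it is suited to.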
>
> :   Yes, you can build an inverted index for 10 million documents
> : perfectly, but the XML documents showed look like a simple relational
> : data.
>
> Again, perception ... nothing in the question asked about doing relational
> queries, so I don't think it's wise to immediately suggest a relational
> database as the "recommended" solution.
  Best regards, Marcelo.

Re: lucene functionality

Erik Hatcher
In reply to this post by Patrek

On Dec 13, 2006, at 1:51 PM, Patrick Turcotte wrote:

> I would suggest you take a look at exist-db (http://exist-db.org/).

I really doubt eXist can handle 10M XML files.  Last time I tried it,  
it choked on 20k of them.

        Erik





Re: lucene functionality

Patrek
On 12/14/06, Erik Hatcher <[hidden email]> wrote:
>
>
> On Dec 13, 2006, at 1:51 PM, Patrick Turcotte wrote:
>
> > I would suggest you take a look at exist-db (http://exist-db.org/).
>
> I really doubt eXist can handle 10M XML files.  Last time I tried it,
> it choked on 20k of them.


It is true I don't know about 10M, but we are working with 3x 53k without
any problems (version 1.1.1).

Patrick