Getting only the Ids, not the whole documents.

Getting only the Ids, not the whole documents.

makkhar
Hi all,

   Can I get just a list of document IDs for a given search criterion? To elaborate, here is my situation:

I store 20000 contracts in a file system index, each with some parameterName and Value. Given a search criterion such as (paramValue='draft'), I need to get just an ArrayList of Strings containing the contract IDs. I don't need the Lucene documents, just the IDs.

Can this be done?

-thanks

Re: Getting only the Ids, not the whole documents.

is_maximum
Yes. If you extend HitCollector and override the collect()
method, which has the following signature, you can get the IDs:

public void collect(int id, float score)
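
The callback idea can be sketched without a Lucene dependency. This is only an illustration under stated assumptions: `MatchCollector` is a hypothetical stand-in that mirrors the `collect(int id, float score)` signature quoted above, and `DocNumberCollector` and the simulated hits are made up for the demo. Note that the `id` passed to `collect()` is Lucene's internal document number, not the value of one of your own fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring the collect(int, float) callback shape.
// It is NOT Lucene's HitCollector, just enough to make the example run.
abstract class MatchCollector {
    public abstract void collect(int id, float score);
}

// Records only the document numbers of the hits and ignores the scores.
final class DocNumberCollector extends MatchCollector {
    private final List<Integer> docNumbers = new ArrayList<>();

    @Override
    public void collect(int id, float score) {
        docNumbers.add(id);
    }

    List<Integer> docNumbers() {
        return docNumbers;
    }
}

class CollectorDemo {
    public static void main(String[] args) {
        DocNumberCollector collector = new DocNumberCollector();
        // A real search would invoke collect() once per matching document;
        // these three calls simulate that.
        collector.collect(4, 0.9f);
        collector.collect(17, 0.4f);
        collector.collect(42, 0.1f);
        System.out.println(collector.docNumbers());
    }
}
```

Collecting document numbers this way touches no stored fields, so turning them into contract IDs is a separate step.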

> --
> View this message in context:
> http://www.nabble.com/Getting-only-the-Ids%2C-not-the-whole-documents.-tf4204907.html#a11960750
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]


--
Regards,
Mohammad
--------------------------
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/
--
Regards
Mohammad
Pixelshot

RE: Getting only the Ids, not the whole documents.

Chhabra, Kapil
In reply to this post by makkhar
What is the structure of your index?
If you haven't already, add a new field to your index that stores the
contractId. For all other fields, set the "store" flag to false while
indexing.

You can now safely retrieve the value of this contractId field based on
your search results.

Regards,
kapilChhabra



Re: Getting only the Ids, not the whole documents.

makkhar
In reply to this post by is_maximum

Hi,

   The solution you suggested will definitely work, but it will also slow down my search by an order of magnitude. The problem I am trying to solve is performance; that's why I need the collection of IDs and not the whole documents.

- thanks for the prompt reply.


RE: Getting only the Ids, not the whole documents.

makkhar
In reply to this post by Chhabra, Kapil
Here's my index structure:

Document -> contract ID  -  id    (index AND store)
         -> paramName    -  name  (index AND store)
         -> paramValue   -  value (index BUT NOT store)

When I get back 20000 hits, each document contains the ID and paramName. I have no interest in paramName (but I have to STORE it for some other reason). Can I not just get a plain Java String array of the contract IDs that matched?

-thanks for the prompt reply.



Re: Getting only the Ids, not the whole documents.

is_maximum
In reply to this post by makkhar
Yes, it decreases performance, but it is the only solution.
I spent many weeks looking for the best way to retrieve my own IDs and ended
up with this approach as a last resort.

Now I am storing the IDs in a BitSet structure, and it's fast enough:

public void collect(...){
    // Load the stored document for this hit and record its own ID in the bit set
    idBitSet.set(Integer.valueOf(searcher.doc(id).get("MyOwnID")));
}
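
The BitSet bookkeeping on its own can be sketched as follows. This assumes the application's IDs are non-negative integers stored as strings, as the `Integer.valueOf(...)` call above implies. `IdBitSetDemo`, its helper methods, and the sample values are made up for illustration, and the Lucene lookup is left out.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class IdBitSetDemo {
    // Parses each stored ID string and marks it in the bit set.
    // Assumes every ID is a non-negative integer; BitSet.set throws
    // IndexOutOfBoundsException for a negative index.
    static BitSet toBitSet(List<String> storedIds) {
        BitSet ids = new BitSet();
        for (String s : storedIds) {
            ids.set(Integer.parseInt(s));
        }
        return ids;
    }

    // Walks the set bits in ascending order and returns them as a list.
    static List<Integer> toList(BitSet ids) {
        List<Integer> out = new ArrayList<>();
        for (int i = ids.nextSetBit(0); i >= 0; i = ids.nextSetBit(i + 1)) {
            out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // "7" appears twice on purpose: a bit set keeps each ID only once.
        BitSet ids = toBitSet(List.of("1003", "7", "250", "7"));
        System.out.println(ids.cardinality() + " distinct IDs: " + toList(ids));
    }
}
```

Two trade-offs follow from the data structure: duplicates collapse and hit order is lost (iteration is by ascending ID), and memory grows with the largest ID rather than with the number of hits.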


Re: Getting only the Ids, not the whole documents.

is_maximum
In reply to this post by makkhar
You should not store them in an array, since they will take up
memory.
A BitSet is the best structure to store them.



Re: Getting only the Ids, not the whole documents.

testn
In reply to this post by is_maximum
Hi,

Why don't you consider using a FieldSelector? LoadFirstFieldSelector lets you load only the first field of a document without loading all the fields. After that, you can still load the whole document if you like. It should help improve performance.



Re: Getting only the Ids, not the whole documents.

Daniel Noll-3-2
In reply to this post by is_maximum
On Thursday 02 August 2007 19:28:48 Mohammad Norouzi wrote:
> you should not store them in an Array structure since they will take up the
> memory.
> the BitSet is the best structure to store them

You can't store strings in a BitSet.

What I would do is return a List<String> but make a custom subclass of
AbstractList<String> which creates the strings on demand from the Hits
object.  We use similar tricks to convert Hits into a List of another more
complex object type and it works great.  You can cache the strings as they're
retrieved if you're planning to use some strings much more than others.
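
A minimal sketch of such an on-demand list, using only the JDK. It assumes a hypothetical `IntFunction<String>` loader standing in for "fetch the ID of hit i from the Hits object"; `LazyIdList` and the sample values are illustrative names, not anything from Lucene.

```java
import java.util.AbstractList;
import java.util.List;
import java.util.function.IntFunction;

// Read-only list whose elements are produced on first access and then cached.
// The loader is a hypothetical stand-in for reading the ID of hit i.
final class LazyIdList extends AbstractList<String> {
    private final int size;
    private final IntFunction<String> loader;
    private final String[] cache;

    LazyIdList(int size, IntFunction<String> loader) {
        this.size = size;
        this.loader = loader;
        this.cache = new String[size];
    }

    @Override
    public String get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        String value = cache[index];
        if (value == null) {
            value = loader.apply(index); // loaded only when not yet cached
            cache[index] = value;
        }
        return value;
    }

    @Override
    public int size() {
        return size;
    }
}

class LazyIdListDemo {
    public static void main(String[] args) {
        // No ID is computed until it is actually requested.
        List<String> ids = new LazyIdList(20000, i -> "contract-" + i);
        System.out.println(ids.get(5));
        System.out.println(ids.subList(0, 3));
    }
}
```

Callers see an ordinary `List<String>`, while the cost of producing each string is paid only for the entries they actually read.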

Daniel


--
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/                               Fax: +61 2 9212 6902


Re: Getting only the Ids, not the whole documents.

Mark Miller-3
In reply to this post by testn
If you are just retrieving your custom id and you have more stored
fields (and they are not tiny) you certainly do want to use a field
selector. I would suggest SetBasedFieldSelector.

- Mark


Re: Getting only the Ids, not the whole documents.

Mike Klaas
You still have a disk seek per doc if the index can't fit in memory
(usually more costly than reading the fields).

Why not use FieldCache?
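
To show why a per-document array sidesteps the seek, here is a sketch with no Lucene dependency. It assumes the ID field's values are already in memory as an array indexed by document number, which is roughly the shape of data a field cache hands back. The `idByDoc` array, the class name, and the sample values are made up for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class CachedIdLookupDemo {
    // Maps matching document numbers to IDs with plain array reads,
    // so no stored document has to be loaded per hit.
    static List<String> idsFor(BitSet matchingDocs, String[] idByDoc) {
        List<String> ids = new ArrayList<>(matchingDocs.cardinality());
        for (int doc = matchingDocs.nextSetBit(0); doc >= 0; doc = matchingDocs.nextSetBit(doc + 1)) {
            ids.add(idByDoc[doc]);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical in-memory column: position = document number, value = contract ID.
        String[] idByDoc = {"C-100", "C-101", "C-102", "C-103"};
        BitSet matchingDocs = new BitSet();
        matchingDocs.set(1);
        matchingDocs.set(3);
        System.out.println(idsFor(matchingDocs, idByDoc));
    }
}
```

The cost moves from one disk read per hit to a one-time load of the whole column, which is the trade-off to weigh for an index of this size.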

-Mike
