Get All terms from all documents

Roberto Martins-3
Hello,

I need to get all terms from all documents to be placed in my interface,
almost like the facets. How can I do it?

Thanks

--
"Without love, we are birds with broken wings."
Morrie

Re: Get All terms from all documents

Grant Ingersoll-2
All terms from all docs? Really?

At any rate, see http://wiki.apache.org/solr/TermsComponent. It may need a
modification so that it does not require a field, but for now you can pass in
all fields (which you can get from LukeRequestHandler).

-Grant
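
To make the suggestion above concrete, here is a minimal sketch that only builds the two request URLs: one for LukeRequestHandler (to discover field names) and one for the TermsComponent (to list terms for those fields). The host, the `/admin/luke` and `/terms` handler paths, and the field names `title` and `body` are assumptions for illustration; check them against your own solrconfig.xml and Solr version before relying on them.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust to your own Solr deployment.
SOLR_BASE = "http://localhost:8983/solr"


def luke_url(base=SOLR_BASE):
    """URL for LukeRequestHandler; numTerms=0 asks for field info without top terms."""
    return base + "/admin/luke?" + urlencode({"numTerms": 0, "wt": "json"})


def terms_url(fields, prefix=None, limit=10, base=SOLR_BASE):
    """URL for the TermsComponent, repeating terms.fl once per field."""
    params = [("terms", "true")]
    params += [("terms.fl", name) for name in fields]
    params.append(("terms.limit", limit))
    if prefix:
        # Restrict the result to terms starting with this prefix.
        params.append(("terms.prefix", prefix))
    params.append(("wt", "json"))
    return base + "/terms?" + urlencode(params)


if __name__ == "__main__":
    print(luke_url())
    print(terms_url(["title", "body"], prefix="so"))
```

The sketch does not send any request. Fetching the URLs and parsing the response is left out because the response layout depends on the Solr version and the `wt` setting.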


On Dec 17, 2008, at 2:17 PM, roberto wrote:

> Hello,
>
> I need to get all terms from all documents to be placed in my interface,
> almost like the facets. How can I do it?
>
> Thanks

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Get All terms from all documents

Roberto Martins-3
Grant,

I know it is completely crazy to do something like this, but the customer
wants it. I'm really trying to figure out how to do it in a better way, maybe
using the auto-suggest filter from Solr 1.3 to get all the words starting with
some letter and caching the results for each letter on the client side. Our
client is going to be written in Swing. What do you guys think?

Thanks,

On Wed, Dec 17, 2008 at 8:05 PM, Grant Ingersoll <[hidden email]> wrote:

> All terms from all docs? Really?
>
> At any rate, see http://wiki.apache.org/solr/TermsComponent. It may need a
> modification so that it does not require a field, but for now you can pass in
> all fields (which you can get from LukeRequestHandler).
>
> -Grant

Re: Get All terms from all documents

Erick Erickson
I think I'd pin the user down and have him give me the real-world
use-cases that require this, then see if there's a more reasonable
way to satisfy that use-case. Do they want type-ahead? What
is the user of the system going to see? Because, for instance,
a drop-down of 10,000 terms is totally useless.

Best
Erick

On Wed, Dec 17, 2008 at 10:02 PM, roberto <[hidden email]> wrote:

> Grant,
>
> I know it is completely crazy to do something like this, but the customer
> wants it. I'm really trying to figure out how to do it in a better way, maybe
> using the auto-suggest filter from Solr 1.3 to get all the words starting with
> some letter and caching the results for each letter on the client side. Our
> client is going to be written in Swing. What do you guys think?
>
> Thanks,

Re: Get All terms from all documents

Roberto Martins-3
Erick,

Thanks for the answer. Let me clarify: we would like to have a combobox with
the terms to guide the user in the search. I mean, if I have thousands of
documents and want to tell users how many documents in the base contain a
particular word, how can I do that?

Thanks

On Thu, Dec 18, 2008 at 11:25 AM, Erick Erickson <[hidden email]> wrote:

> I think I'd pin the user down and have him give me the real-world
> use-cases that require this, then see if there's a more reasonable
> way to satisfy that use-case. Do they want type-ahead? What
> is the user of the system going to see? Because, for instance,
> a drop-down of 10,000 terms is totally useless.
>
> Best
> Erick

Re: Get All terms from all documents

Erick Erickson
How do you get the word in the first place? If the combobox
is for all words in your index, it's probably completely useless
to provide this information because there is too much data to
guide the user at all. I mean a list of 10,000 words with some sort
of document frequency seems to me to require significant
developer work without adding to the user experience at all...

If that's the case, I'd really work with your customer and try
to persuade them that this is a feature that adds little value,
and that there are higher-value features you should do first.

But if you really, really require the information, here's what I
would recommend:

Use TermDocs/TermEnum to traverse your index gathering
this data *at index time*. Then create a *very special* document
that you also put in your index (stored, but not indexed
in this case) that contains a unique field (say, "frequencies").

Upon startup of your searcher, read in this very special document,
parse it and create a map of words and frequencies that you use
to find the number of documents containing that word.

Hope this helps
Erick
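
The recommendation above can be sketched roughly as follows. This is not Lucene code: the TermDocs/TermEnum traversal is replaced by a plain Python count over already-tokenized documents, and the document id `term-frequencies`, the field name `frequencies`, and the JSON encoding are all assumptions made for illustration.

```python
import json
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        # set() so a term repeated inside one document is counted once.
        df.update(set(tokens))
    return df


def to_special_document(df):
    """Pack the counts into one stored field of a single 'special' document."""
    return {"id": "term-frequencies", "frequencies": json.dumps(df, sort_keys=True)}


def load_frequencies(special_doc):
    """At searcher startup: parse the stored field back into a term -> count map."""
    return json.loads(special_doc["frequencies"])


if __name__ == "__main__":
    docs = [["solr", "terms"], ["solr", "solr", "facets"]]
    special = to_special_document(document_frequencies(docs))
    print(load_frequencies(special))
```

The map only needs rebuilding when the index changes, which is why the counting happens at index time rather than per request.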


On Thu, Dec 18, 2008 at 1:53 PM, roberto <[hidden email]> wrote:

> Erick,
>
> Thanks for the answer. Let me clarify: we would like to have a combobox with
> the terms to guide the user in the search. I mean, if I have thousands of
> documents and want to tell users how many documents in the base contain a
> particular word, how can I do that?
>
> Thanks

Re: Get All terms from all documents

Mike Klaas
In reply to this post by Roberto Martins-3

On 18-Dec-08, at 10:53 AM, roberto wrote:

> Erick,
>
> Thanks for the answer. Let me clarify: we would like to have a combobox with
> the terms to guide the user in the search. I mean, if I have thousands of
> documents and want to tell users how many documents in the base contain a
> particular word, how can I do that?

Sounds like you want query autocomplete. The best way to do this
(including if you want the box filled with some queries) is to use
the query logs, not the documents.

-Mike
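
As a rough illustration of the query-log idea, the sketch below counts how often each normalized query appears in a log and suggests the most frequent ones for a typed prefix. The log is assumed to be a plain list of query strings; how you extract those from your own server logs is not covered, and the function names are hypothetical.

```python
from collections import Counter


def build_suggestions(logged_queries):
    """Count normalized (trimmed, lower-cased) queries, skipping empty ones."""
    return Counter(q.strip().lower() for q in logged_queries if q.strip())


def suggest(counts, prefix, k=5):
    """Return up to k logged queries starting with prefix, most frequent first."""
    prefix = prefix.strip().lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Sort by descending count, then alphabetically for a stable order.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:k]]


if __name__ == "__main__":
    log = ["Solr facets", "solr facets ", "solr terms", "lucene"]
    print(suggest(build_suggestions(log), "so"))
```

A linear scan is fine for a small log; for a large one a sorted list or a trie would be the usual next step.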

Re: Get All terms from all documents

Roberto Martins-3
Erick,

Thanks, this sounds good. I'll try it.

Mike,

Could you give more details about query logs?

Thanks

On Fri, Dec 19, 2008 at 12:02 AM, Mike Klaas <[hidden email]> wrote:

> Sounds like you want query autocomplete. The best way to do this
> (including if you want the box filled with some queries) is to use the
> query logs, not the documents.
>
> -Mike

Re: Get All terms from all documents

Grant Ingersoll-2
I'd add that you probably don't want just the query logs; people may search
for things that aren't in the index, too. Your call as to whether
that is useful or not. Also, have a look at the TermsComponent, as it
will tell you the doc freq for terms.
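
One way to combine the two sources, sketched under assumptions: take the candidate queries from the logs and keep only those whose every term has a non-zero document frequency in the index. Here `doc_freq` is a hypothetical term -> count dictionary, for example one you have filled from TermsComponent output, and queries are split on whitespace rather than run through your real analyzer.

```python
def in_index(query, doc_freq):
    """True if every whitespace-separated term of the query occurs in the index."""
    terms = query.lower().split()
    return bool(terms) and all(doc_freq.get(term, 0) > 0 for term in terms)


def filter_suggestions(queries, doc_freq):
    """Drop logged queries that would match nothing in the index."""
    return [q for q in queries if in_index(q, doc_freq)]


if __name__ == "__main__":
    doc_freq = {"solr": 12, "facets": 3}
    print(filter_suggestions(["solr facets", "solr termz"], doc_freq))
```

Whether to drop such queries or keep them is the judgment call mentioned above; the filter just makes the choice explicit.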


On Dec 19, 2008, at 10:08 AM, roberto wrote:

> Erick,
>
> Thanks, this sounds good. I'll try it.
>
> Mike,
>
> Could you give more details about query logs?
>
> Thanks

Re: Get All terms from all documents

Walter Underwood, Netflix
At Netflix, we load the completion lexicon with movie titles, person
names, and a few aliases. Even then, we find a few misspellings in
our metadata (is it "NWA" or "N.W.A."?). Extracting terms from
documents will find a lot of misspellings.

You really do not want to rely on random users to correctly spell
things like Ratatouille and Koyaanisqatsi. Trust me.

Autocomplete needs to be really fast, so we use a dedicated
in-memory index (RAMDirectory) in the front end webapp and
also use an HTTP cache in the load balancer.

We get at least 25 million autocomplete requests a day, more
than 10X the number of search requests. I would plan for
10-15X search traffic.

wunder
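
For readers who want to see the shape of a curated, in-memory completion lexicon, here is a minimal sketch. It is not the Netflix setup: a sorted Python list with binary search stands in for the dedicated RAMDirectory index, and the class and function names are hypothetical. The last function just turns the daily request figure quoted above into an average rate; real peaks will be higher than the average.

```python
import bisect


class CompletionLexicon:
    """Curated completion entries held in memory as a sorted list."""

    def __init__(self, entries):
        # Lower-case and de-duplicate, then sort for binary search.
        self._keys = sorted({entry.lower() for entry in entries})

    def complete(self, prefix, k=10):
        """Return up to k entries that start with prefix, in sorted order."""
        prefix = prefix.lower()
        start = bisect.bisect_left(self._keys, prefix)
        results = []
        for key in self._keys[start:start + k]:
            if not key.startswith(prefix):
                break  # sorted order: no later key can match either
            results.append(key)
        return results


def average_requests_per_second(requests_per_day):
    """Average rate only; it says nothing about peak load."""
    return requests_per_day / 86_400


if __name__ == "__main__":
    lexicon = CompletionLexicon(["Ratatouille", "Koyaanisqatsi", "Rat Race", "N.W.A."])
    print(lexicon.complete("rat"))
    print(round(average_requests_per_second(25_000_000)))
```

Loading the lexicon from curated titles and names, rather than from extracted document terms, is what keeps misspellings out of the suggestions.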

On 12/19/08 10:45 AM, "Grant Ingersoll" <[hidden email]> wrote:

> I'd add that you probably don't want just the query logs; people may search
> for things that aren't in the index, too. Your call as to whether
> that is useful or not. Also, have a look at the TermsComponent, as it
> will tell you the doc freq for terms.