Problems about using Lucene to generate tag cloud..

Problems about using Lucene to generate tag cloud..

wuqi-2
Hi,
I am trying to use a Lucene index to implement a tag cloud system. I added a new field named "tags" to the index to store all the tags. We don't support tags of more than one word, so the tags of a document are simply separated by whitespace. The "tags" field of a document may look like this:
doc1 tags: travel Beijing news
doc2 tags: beijing sports news
I can easily retrieve the tags of a single document, and also the documents that carry a certain tag, but it's hard to find an efficient way to get the most frequent tags from a set of documents in this index. The set of documents is always generated dynamically: it may be a search result or a category generated dynamically through clustering. The document set is very large, maybe tens of thousands or hundreds of thousands of documents, so simply iterating over all the documents in the set to find the frequent tags might not be practical. Do you have any better ideas?

Thanks
-Qi

RE: Problems about using Lucene to generate tag cloud..

Dominique Bejean
Maybe you can index the set of documents in a temporary index. This index
needs only one field (tag).

Then you can browse the term collection of the index and get each
term/frequency pair:

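        // temp_index: the temporary index holding only the documents of interest
        IndexReader reader = IndexReader.open(temp_index);
        // terms() with no argument starts positioned before the first term
        TermEnum terms = reader.terms();

        while (terms.next()) {
            String field = terms.term().field();

            // skip terms that belong to other fields
            if (!"tag".equals(field)) continue;

            String term = terms.term().text();
            // number of documents in the temporary index that contain this tag
            int freq = terms.docFreq();
        }

        terms.close();
        reader.close();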
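
The loop above reads each term and its document frequency but does not yet keep them. A minimal sketch of one way to retain only the most frequent tags, in plain Java with no Lucene dependency. The class name `TopTags`, the method `topN`, and the sample counts are hypothetical; the map stands in for the term/docFreq pairs the enumeration would yield.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopTags {

    /** Returns the n entries with the highest counts, most frequent first. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> freqs, int n) {
        // Min-heap on count: the head is always the weakest tag kept so far.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // drop the least frequent entry held so far
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        // Highest count first; ties are ordered alphabetically by tag.
        result.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical counts, as if collected from terms.term().text() and terms.docFreq().
        Map<String, Integer> freqs = new LinkedHashMap<>();
        freqs.put("travel", 1);
        freqs.put("beijing", 2);
        freqs.put("news", 3);
        for (Map.Entry<String, Integer> e : topN(freqs, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

The heap never holds more than n + 1 entries, so memory stays small even when the temporary index contains many distinct tags.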



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Problems about using Lucene to generate tag cloud..

wuqi-2
So the idea is to build an index for the dynamically generated document set, and then find the frequency of each term in that index. I'm not sure it's fast enough, but it's worth a try.
Thank you, Dominique!

RE: Problems about using Lucene to generate tag cloud..

Dominique Bejean
On www.crossfeeds.com, I use this method to update, every hour, a tag
cloud based on the titles of 20,000 RSS articles (those published during
the last 24 hours). It takes 1 minute.
 


Re: Problems about using Lucene to generate tag cloud..

wuqi-2
I just registered; it's an interesting website.
It seems crossfeeds generates its tag cloud offline every hour? I have a stricter time requirement: a user submits a query on my website and may get tens of thousands of search results, and I need to generate a tag cloud for all of the returned documents within a few seconds.
I think your solution might be able to meet this if the indexing and the term ordering were done entirely in memory.
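
One way to stay in memory is to skip per-query indexing altogether and count tags for the hit set from a structure built once. The following is a minimal sketch of that different technique (not the temporary-index method proposed above), in plain Java without Lucene. The class `TagCounter`, its methods, and the idea of passing hits as an `int[]` of document ids are assumptions for illustration. Lowercasing tags so that "Beijing" and "beijing" count together is also an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TagCounter {
    private final Map<String, Integer> ordinalOf = new HashMap<>(); // tag -> ordinal
    private final List<String> tagOf = new ArrayList<>();           // ordinal -> tag
    private final List<int[]> docTags = new ArrayList<>();          // doc id -> tag ordinals

    /** Registers the next document (ids are assigned 0, 1, 2, ...) and returns its id. */
    int addDocument(String tagsField) {
        String trimmed = tagsField.trim();
        String[] parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        Set<Integer> seen = new LinkedHashSet<>(); // a tag repeated in one document counts once
        for (String part : parts) {
            String tag = part.toLowerCase(Locale.ROOT); // assumption: tags are case-insensitive
            Integer ord = ordinalOf.get(tag);
            if (ord == null) {
                ord = tagOf.size();
                ordinalOf.put(tag, ord);
                tagOf.add(tag);
            }
            seen.add(ord);
        }
        int[] ords = new int[seen.size()];
        int i = 0;
        for (int ord : seen) {
            ords[i++] = ord;
        }
        docTags.add(ords);
        return docTags.size() - 1;
    }

    /** Counts, for each tag, how many of the given documents carry it. */
    Map<String, Integer> count(int[] hitDocIds) {
        int[] counts = new int[tagOf.size()];
        for (int doc : hitDocIds) {
            for (int ord : docTags.get(doc)) {
                counts[ord]++;
            }
        }
        Map<String, Integer> result = new HashMap<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                result.put(tagOf.get(ord), counts[ord]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TagCounter counter = new TagCounter();
        int doc1 = counter.addDocument("travel Beijing  news");
        int doc2 = counter.addDocument("beijing sports news");
        System.out.println(counter.count(new int[] {doc1, doc2}));
    }
}
```

This still visits every document in the hit set, but each visit is only a few array reads and no index is written per query. Whether that is fast enough for a given collection would need measuring.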



Re: Problems about using Lucene to generate tag cloud..

Daniel Noll-3-2
In reply to this post by Dominique Bejean
On Tuesday 01 April 2008 18:51:55 Dominique Béjean wrote:
>         IndexReader reader = IndexReader.open(temp_index);
>         TermEnum terms = reader.terms();
>
>         while (terms.next()) {
>             String field = terms.term().field();

Gotcha: after calling terms() it's already pointing at the first term.  So you
need to rewrite this as a do-while loop.

Possibly my least favourite feature of Lucene. :-(

Daniel


RE: Problems about using Lucene to generate tag cloud..

Dominique Bejean
Hmm, that does not seem to be true.
Using a do-while loop makes the first terms.term().field() call throw a null
pointer exception.


Re: Problems about using Lucene to generate tag cloud..

Daniel Noll-3-2
On Thursday 03 April 2008 08:08:09 Dominique Béjean wrote:
> Hmm, that does not seem to be true.
> Using a do-while loop makes the first terms.term().field() call throw a null
> pointer exception.

Depends which terms method you use.

    TermEnum terms = reader.terms();
    System.out.println(terms.term());   // => null

    terms = reader.terms(new Term("id", ""));
    System.out.println(terms.term());   // => id:0

The first method makes a normal while loop work but it also makes the
assumption that there is only one field in the index, which may not be the
case forever even if it's the case initially.

Daniel
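
The difference between the two calls can be modelled without Lucene. The sketch below is a stand-in, not Lucene code: `SketchTermEnum` is a hypothetical class that mimics only the positioning behaviour described above, over a sorted set of "field:text" strings. It shows why a `while (next())` loop suits the unpositioned enumeration, why a `do ... while (next())` loop suits the seeked one, and why the seeked loop has to stop when the field changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical stand-in for a term enumeration; it only models cursor positioning. */
class SketchTermEnum {
    private final List<String> sorted; // entries look like "field:text"
    private int pos;                   // -1 means "before the first term"

    private SketchTermEnum(List<String> sorted, int pos) {
        this.sorted = sorted;
        this.pos = pos;
    }

    /** Models terms(): starts before the first term, so term() is null until next(). */
    static SketchTermEnum all(TreeSet<String> terms) {
        return new SketchTermEnum(new ArrayList<>(terms), -1);
    }

    /** Models terms(Term): starts on the first term at or after the given one. */
    static SketchTermEnum from(TreeSet<String> terms, String start) {
        return new SketchTermEnum(new ArrayList<>(terms.tailSet(start, true)), 0);
    }

    String term() {
        return (pos >= 0 && pos < sorted.size()) ? sorted.get(pos) : null;
    }

    boolean next() {
        if (pos < sorted.size()) {
            pos++;
        }
        return pos < sorted.size();
    }

    /** Unpositioned enumeration: plain while loop over every term, filtering by field. */
    static List<String> tagsByScan(TreeSet<String> terms) {
        List<String> out = new ArrayList<>();
        SketchTermEnum e = all(terms);
        while (e.next()) {
            if (e.term().startsWith("tag:")) {
                out.add(e.term());
            }
        }
        return out;
    }

    /** Seeked enumeration: do-while loop that stops once the field changes. */
    static List<String> tagsBySeek(TreeSet<String> terms) {
        List<String> out = new ArrayList<>();
        SketchTermEnum e = from(terms, "tag:");
        do {
            String t = e.term();
            if (t == null || !t.startsWith("tag:")) {
                break; // no tag terms left: end of enumeration or another field
            }
            out.add(t);
        } while (e.next());
        return out;
    }
}
```

In this model the scan visits every term of every field, while the seek starts at the first "tag:" entry and stops at the first entry of another field.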


Re: Problems about using Lucene to generate tag cloud..

John Wang-9
Check out http://www.browseengine.com; a tag cloud implementation on top of
Lucene is available there.

-John


Re: Problems about using Lucene to generate tag cloud..

wuqi-2
Very useful. Thank you!


Re: Problems about using Lucene to generate tag cloud..

Marvin Humphrey
In reply to this post by Daniel Noll-3-2

On Apr 1, 2008, at 2:57 PM, Daniel Noll wrote:

> On Tuesday 01 April 2008 18:51:55 Dominique Béjean wrote:
>>        IndexReader reader = IndexReader.open(temp_index);
>>        TermEnum terms = reader.terms();
>>
>>        while (terms.next()) {
>>            String field = terms.term().field();
>
> Gotcha: after calling terms() it's already pointing at the first term.
> So you need to rewrite this as a do-while loop.
>
> Possibly my least favourite feature of Lucene. :-(

What would a better API look like?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

