Max Frequency and Tf/Idf

Max Frequency and Tf/Idf

Danilo Cicognani
Hello everybody.
We are building a complex automatic classification system using Lucene.
We need to work with normalized Tf/Idf (Term Frequency / Inverse Document
Frequency) weights.
We understand that Lucene can give us Tf and Df, and we are using these
values to calculate the normalized Tf/Idf, but we would like to optimize this
calculation for better performance.
Is there any way to get the maximum term frequency in a document from
Lucene, and perhaps to obtain the normalized Tf/Idf from Lucene directly?
There are no public methods that return these values, but maybe Lucene holds
this information internally, and by modifying the Lucene source we could
have the work done there to speed up the system.
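
For reference, the normalized weight meant here, as computed by the code posted later in this thread, can be written as follows. The symbols ($w_{t,d}$, $N$) are notation assumed for this note and are not names from Lucene.

```latex
w_{t,d} \;=\; \frac{\mathrm{tf}_{t,d}}{\max_{t'} \mathrm{tf}_{t',d}} \cdot \ln\frac{N}{\mathrm{df}_{t}}
```

Here $\mathrm{tf}_{t,d}$ is the frequency of term $t$ in document $d$, $\mathrm{df}_{t}$ is the number of documents containing $t$, and $N$ is the number of documents in the index.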

P.S. Sorry for my English: I hope I explained my question clearly.

**** 1000 KBye ****

 [) /\ |\| | |_ ()

web: www.ciconet.it
Web Portal Now: www.webportalnow.com


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Max Frequency and Tf/Idf

Grant Ingersoll
The Term Vector code can be used to get the term frequencies from a
specific document. Search this list, see the Lucene in Action book, or
look at http://www.cnlp.org/apachecon2005 for examples of how to use
Term Vectors.

--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org 
Voice:  315-443-5484
Fax: 315-443-6886


Re: Max Frequency and Tf/Idf

Danilo Cicognani
Hi Grant Ingersoll and everybody.

> The Term Vector code can be used to get the term frequencies from a
> specific document.  Search this list, see the Lucene In Action book or
> look at http://www.cnlp.org/apachecon2005 for examples on how to use
> Term Vectors

Maybe I didn't explain my question well.
Below is the code we are using now. We were considering the possibility of
getting more information from Lucene (for example, the maximum term frequency
in a document) to optimize the calculations.
The first method starts the calculation of Tf/Idf using the class TTfIdf,
whose constructor is shown after it.

public TTfIdf getFieldTfIdf(long tid, long aid, String field)
        throws RisorseMultipleException, IOException,
               RisorsaNonTrovataException, TTfIdfException {
    reader = IndexReader.open(indexDir);
    try {
        int id = getDocumentId(tid, aid);
        // Terms of this field in this document and their frequencies.
        // Note: may be null if no term vector was stored for the field.
        TermFreqVector tfv = reader.getTermFreqVector(id, field);
        int[] freqs = tfv.getTermFrequencies();
        String[] terms = tfv.getTerms();
        // Document frequency of each term across the whole index.
        int[] df = new int[terms.length];
        for (int i = 0; i < df.length; i++)
            df[i] = reader.docFreq(new Term(field, terms[i]));
        return new TTfIdf(terms, freqs, df, reader.numDocs());
    } finally {
        // Close the reader even if one of the calls above throws.
        reader.close();
    }
}

public TTfIdf(String[] terms, int[] freqs, int[] df, int docs)
        throws TTfIdfException {
    if (terms.length != freqs.length || terms.length != df.length)
        throw new TTfIdfException(
            "The term and frequency vectors have different lengths!");
    this.terms = terms;
    int l = freqs.length;
    // Find the maximum raw term frequency in the document.
    int maxfreq = 0;
    for (int i = 0; i < l; i++) { // CAN BE OPTIMIZED IN SOME WAY?
        if (freqs[i] > maxfreq) maxfreq = freqs[i];
    }
    // Weight each term as (tf / max tf) * ln(docs / df).
    this.freqs = new double[l];
    double tf;
    double idf;
    for (int i = 0; i < l; i++) { // CAN BE OPTIMIZED IN SOME WAY?
        tf = (double) freqs[i] / (double) maxfreq;
        idf = Math.log((double) docs / (double) df[i]);
        this.freqs[i] = tf * idf;
    }
}

Do you have any suggestions?
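
One small saving inside the constructor, offered as a sketch rather than a definitive answer: since ln(N / df) = ln(N) - ln(df), the value ln(N) can be computed once per document instead of once per term, and the division by the maximum frequency can become a multiplication by its reciprocal. The class name `NormalizedTfIdf` and the method `weights` are hypothetical, and the Lucene calls are left out so that the block runs on its own.

```java
final class NormalizedTfIdf {

    /**
     * Computes (tf / maxTf) * ln(numDocs / df) for each term.
     * freqs and df are parallel arrays, one entry per term of the document.
     */
    static double[] weights(int[] freqs, int[] df, int numDocs) {
        if (freqs.length != df.length) {
            throw new IllegalArgumentException("freqs and df differ in length");
        }
        int n = freqs.length;
        double[] w = new double[n];

        // The maximum must be known before any weight can be normalized,
        // so this pass cannot be merged with the weighting pass below.
        int maxFreq = 0;
        for (int i = 0; i < n; i++) {
            if (freqs[i] > maxFreq) maxFreq = freqs[i];
        }
        if (maxFreq == 0) {
            return w; // empty document: nothing to weight
        }

        // Hoist the loop invariants: ln(N) and 1/maxFreq are computed once.
        double logDocs = Math.log(numDocs);
        double invMax = 1.0 / maxFreq;
        for (int i = 0; i < n; i++) {
            // ln(N / df) == ln(N) - ln(df)
            w[i] = freqs[i] * invMax * (logDocs - Math.log(df[i]));
        }
        return w;
    }
}
```

A call such as `NormalizedTfIdf.weights(freqs, df, reader.numDocs())` would take the place of the two loops. The gain from this is probably modest; the per-term `docFreq` lookups in the calling method may well cost more than the arithmetic, so caching the idf value of each term across documents is another thing to measure.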


Re: Max Frequency and Tf/Idf

Karl Wettin-3

On 18 Apr 2006 at 11:45, Danilo Cicognani wrote:

> Below is the code we are using now. We were considering the possibility of
> getting more information from Lucene (for example, the maximum term
> frequency in a document) to optimize the calculations.
> The first method starts the calculation of Tf/Idf using the class TTfIdf,
> whose constructor is shown below.
>
> for (int i = 0; i < l; i++) { // CAN BE OPTIMIZED IN SOME WAY?
>     if (freqs[i] > maxfreq) maxfreq = freqs[i];
> }
> this.freqs = new double[l];
> double tf;
> double idf;
> for (int i = 0; i < l; i++) { // CAN BE OPTIMIZED IN SOME WAY?
>     tf = (double) freqs[i] / (double) maxfreq;
>     idf = Math.log((double) docs / (double) df[i]);
>     this.freqs[i] = tf * idf;
> }

I'm not quite sure what you are doing above, but I guess you could calculate
the information at index time. To persist it in the index, extend or hack
TermFreqVector and the related I/O classes.
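
The index-time idea can be sketched without touching Lucene internals, under the assumption that the analyzed tokens of a field are available as a list while the document is being built. The class `FieldStats` and its methods are hypothetical. How the resulting value is persisted (a custom term vector format as suggested above, or simply an extra stored field) is left open.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FieldStats {

    /** Counts how often each token occurs in one field of one document. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            // Insert 1 for a new token, otherwise add 1 to its count.
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Returns the largest term frequency, or 0 for an empty field. */
    static int maxTermFrequency(Map<String, Integer> tf) {
        int max = 0;
        for (int f : tf.values()) {
            if (f > max) max = f;
        }
        return max;
    }
}
```

With the maximum stored alongside the document, the search-time code would only need to read one number instead of scanning the whole frequency array.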
