I need to delete entries from a posting list. How can I do this in Lucene 4.0? I need it to test different pruning algorithms.
Thanks in advance,
ZP
On 19/03/2012 11:24, Zeynep P. wrote:
> I need to delete entries from posting list. How to do it in Lucene 4.0? I
> need to do this to test different pruning algorithms.
>
> Thanks in advance

http://issues.apache.org/jira/browse/LUCENE-1812
http://issues.apache.org/jira/browse/LUCENE-2632

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
That is perfect.
Thank you very much.
Best regards,
ZP
While using the pruning package, I realised that ridf is calculated in RIDFTermPruningPolicy as follows:

    Math.log(1 - Math.pow(Math.E, termPositions.freq() / maxDoc)) - df

However, according to the original paper for residual idf, it should be

    -log(df/D) + log(1 - e^(-tf/D))

Thus, the Math.pow call in the code should be

    Math.pow(Math.E, -(termPositions.freq() / maxDoc))

Am I missing something in the calculation, or is this a bug?
Thanks in advance,
ZP
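
To make the sign question concrete, here is a minimal standalone sketch in plain Java. It is not the code from RIDFTermPruningPolicy: the class and method names are hypothetical, and the inputs (df = 10, tf = 50, D = 1000) are made up for illustration. It evaluates the paper's expression once with the minus sign in the exponent and once without it.

```java
// Hypothetical helper class for illustration; not part of Lucene.
final class RidfSignCheck {

    // Residual idf as quoted from the paper: -log(df/D) + log(1 - e^(-tf/D)).
    // df = document frequency, tf = number of term occurrences, numDocs = D.
    static double ridf(long df, long tf, long numDocs) {
        // Cast to double so that the divisions are not integer divisions.
        double idf = -Math.log((double) df / numDocs);
        double expectedIdf = -Math.log(1.0 - Math.exp(-(double) tf / numDocs));
        return idf - expectedIdf;
    }

    // The same expression with the minus sign missing from the exponent.
    static double ridfMissingMinus(long df, long tf, long numDocs) {
        return -Math.log((double) df / numDocs)
                + Math.log(1.0 - Math.exp((double) tf / numDocs));
    }

    public static void main(String[] args) {
        // Made-up example values: df = 10, tf = 50, D = 1000.
        System.out.println("with minus:    " + ridf(10, 50, 1000));
        System.out.println("without minus: " + ridfMissingMinus(10, 50, 1000));
    }
}
```

Without the minus sign, e^(tf/D) is greater than 1 for any tf > 0, so the argument of the logarithm is negative and Math.log returns NaN. With the minus sign, the argument lies strictly between 0 and 1 and the result is finite.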
On 27/03/2012 20:25, Zeynep P. wrote:
> While using the pruning package, I realised that ridf is calculated in
> RIDFTermPruningPolicy as follows:
> Math.log(1 - Math.pow(Math.E, termPositions.freq() / maxDoc)) - df
>
> However, according to the original paper (Blanco et al.) for residual idf,
> it should be -log(df/D) + log (1 - e^(-tf/D)). Thus, in the equation,
> Math.pow should be Math.pow(Math.E, - (termPositions.freq() / maxDoc))
>
> Do I miss something in the calculation or is this a bug?

Hmm, good question! After checking the original paper again, and then checking our implementation, I think that this is indeed a bug, and we should add the minus there, but ... this formula may be completely broken either way. The paper that you mention (http://www.dc.fi.udc.es/~barreiro/publications/blanco_barreiro_ecir2007.pdf) says:

"Residual idf is defined in [3] as the difference between the observed idf (IDF) and the idf expected under the assumption that the terms follow an independence model, such as Poisson (IDF^). [...] If tf is the total number of tokens for a term t, then the ridf devised by a Poisson distribution is

    RIDF = IDF − IDF^ = −log(df/D) + log(1 − e^(-tf/D)) [2]"

Since the purpose of the RIDF metric is to select informative words collection-wide, and not per document, it makes sense that they compare one collection-wide metric (IDF) against another collection-wide metric based on the total term frequency, that is, the total number of term occurrences in the collection.

The problem in our implementation is that we use a within-document term frequency (the number of occurrences of t in the current document) and not a collection-wide term frequency. So it looks to me that the fix would be to first fully traverse the doc enumeration and calculate the total number of term occurrences in all documents (e.g. in RIDFTermPruningPolicy.initPositionsTerm(..)), and to use this value in the formula in place of termPositions.freq().

--
Best regards,
Andrzej Bialecki
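
The fix proposed above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the committed Lucene change: the posting list of one term is modelled as a hypothetical int array of within-document frequencies (one entry per document containing the term), and the class and method names are invented for this example.

```java
// Hypothetical illustration; the int[] stands in for a real doc enumeration.
final class CollectionRidf {

    // First pass: sum the within-document frequencies over all documents
    // that contain the term, giving the collection-wide occurrence count.
    static long totalTermFreq(int[] perDocFreqs) {
        long total = 0;
        for (int freq : perDocFreqs) {
            total += freq;
        }
        return total;
    }

    // RIDF = -log(df/D) + log(1 - e^(-tf/D)), with tf taken collection-wide.
    static double ridf(int[] perDocFreqs, int numDocs) {
        int df = perDocFreqs.length;          // one entry per matching document
        long tf = totalTermFreq(perDocFreqs); // total occurrences in the collection
        double idf = -Math.log((double) df / numDocs);
        double expectedIdf = -Math.log(1.0 - Math.exp(-(double) tf / numDocs));
        return idf - expectedIdf;
    }

    public static void main(String[] args) {
        int numDocs = 1000; // made-up collection size
        // Term occurring many times in each of 3 documents.
        System.out.println(ridf(new int[] {10, 12, 8}, numDocs));
        // Term occurring exactly once in each of 3 documents.
        System.out.println(ridf(new int[] {1, 1, 1}, numDocs));
    }
}
```

In this sketch a term that appears once in each of its documents gets an RIDF close to zero, while a term with the same document frequency but many occurrences per document gets a clearly higher value, which matches the collection-wide reading of the metric described above.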
On 29/03/2012 11:14, Andrzej Bialecki wrote:
> The problem in our implementation is that we use a within-document term
> frequency (the number of occurrences of t in the current document) and
> not a collection-wide term frequency... so, it looks to me that the fix
> would be to first fully traverse the doc enumeration and calculate the
> total number of term occurrences in all documents (e.g. in
> RIDFTermPruningPolicy.initPositionsTerm(..) ), and use this value in the
> formula in place of termPositions.freq().

This is the fix that I implemented; it is now committed to branch_3x and will be included in release 3.6.

--
Best regards,
Andrzej Bialecki