Multi-field IDF

Multi-field IDF

Nicolás Lichtmaier
IDF measures the selectivity of a term, but it is calculated per field.
That can be bad for very short fields (like titles). One example of this
problem: if I don't remove stop words, then "or", "and", etc. should end
up with low IDF values. However, "or" is probably not that common in
titles, so there it gets a high IDF value and is treated as an important
term. That's bad.
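
To make the effect concrete, here is a minimal sketch on a made-up four-document corpus. The documents, the helper names and the BM25-style IDF formula are all assumptions for illustration; any IDF variant shows the same skew.

```python
import math

def idf(doc_freq, num_docs):
    # BM25-style IDF (an assumed formula, chosen only for illustration).
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical toy corpus: a short title and a longer body per document.
docs = [
    {"title": "lucene scoring", "body": "scoring in lucene uses tf or idf and norms"},
    {"title": "stop words", "body": "words like or and and are stop words"},
    {"title": "to be or not to be", "body": "a play or a poem and more"},
    {"title": "field norms", "body": "norms shrink or grow with field length"},
]

def doc_freq(term, field):
    # Number of documents whose given field contains the term.
    return sum(1 for d in docs if term in d[field].split())

title_idf = idf(doc_freq("or", "title"), len(docs))
body_idf = idf(doc_freq("or", "body"), len(docs))

# "or" appears in every body but in only one title, so the title-field IDF
# is much higher even though "or" carries no real meaning.
print(f"title idf={title_idf:.3f}  body idf={body_idf:.3f}")
```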

One solution I see is to modify the Similarity to use a global, or
multi-field, IDF value. Its calculation would include longer fields that
have more "normal text"-like statistics. However, this is not trivial,
because I can't just add up the per-field document frequencies (I would
count some documents several times if "or" is present in more than one
field). I would need to OR the bit-vectors that signal the presence of
the term, right? Not trivial.
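
The double-counting can be sketched as follows. This is only an illustration under stated assumptions: the postings, field names and document count are invented, Python integers stand in for bit-vectors, and nothing here is Lucene API.

```python
import math

# Hypothetical postings: for each field, the ids of documents containing "or".
postings = {
    "title": {2},
    "body": {0, 1, 2, 3, 5},
    "abstract": {1, 2, 5},
}
num_docs = 8

# Naive approach: add the per-field document frequencies.
# Documents 1, 2 and 5 are counted more than once.
naive_df = sum(len(ids) for ids in postings.values())

# Correct multi-field document frequency: size of the union of the postings.
union_df = len(set().union(*postings.values()))

def to_bits(ids):
    # Encode a set of doc ids as an integer used as a bit-vector.
    bits = 0
    for doc_id in ids:
        bits |= 1 << doc_id
    return bits

# Same result by OR-ing one bit-vector per field and counting the set bits.
combined = 0
for ids in postings.values():
    combined |= to_bits(ids)
bitset_df = bin(combined).count("1")

def idf(doc_freq, n):
    # BM25-style IDF, assumed here only to turn the count into a weight.
    return math.log(1 + (n - doc_freq + 0.5) / (doc_freq + 0.5))

print(naive_df, union_df, bitset_df, round(idf(union_df, num_docs), 3))
```

In this toy data the naive sum is even larger than the number of documents, which is why the per-field counts cannot simply be added.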

Has anyone encountered this issue? Has it been solved? Is my thinking wrong?

Should I also try the developers' list?

Thanks!

Nicolás.-

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Multi-field IDF

Ahmet Arslan
Hi Nicholas,

IDF is, among other things, a measure of term specificity. If 'or' is not that common in titles, then it has some discriminating power in that domain.

I think it's OK for 'or' to get a high IDF value in this case.

Ahmet



(In reply to Nicolás Lichtmaier's message of Thursday, November 17, 2016 9:09 PM.)

Re: Multi-field IDF

Nicolás Lichtmaier
That depends on what you want. In this case I want the discrimination
power to be based on all the body text, not just the titles. Otherwise,
terms that are really not that relevant end up with very high weights!
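
As a rough sketch of that idea: the weight of a title match can be computed with document frequencies taken from the body field instead of the title field. The numbers, the `DOC_FREQ` table and the `title_term_weight` helper are hypothetical, and the IDF formula is again an assumed BM25-style one.

```python
import math

def idf(doc_freq, num_docs):
    # BM25-style IDF (assumed formula, for illustration only).
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical per-field document frequencies in a 1000-document index.
NUM_DOCS = 1000
DOC_FREQ = {
    "title": {"or": 12, "lucene": 40},
    "body": {"or": 950, "lucene": 60},
}

def title_term_weight(term, stats_field):
    # Weight of a title match, with statistics borrowed from stats_field.
    return idf(DOC_FREQ[stats_field][term], NUM_DOCS)

per_field = {t: title_term_weight(t, "title") for t in ("or", "lucene")}
body_based = {t: title_term_weight(t, "body") for t in ("or", "lucene")}

# With title statistics "or" outweighs "lucene";
# with body statistics the order is reversed.
print(per_field)
print(body_based)
```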


(In reply to Ahmet Arslan's message of 17/11/16, 18:25.)

Re: Multi-field IDF

wmartinusa
Are you familiar with pivoted document length normalization, in practice
or in theory? Or with Croft's recent work on relevance algorithms that
account for structured field presence?



(In reply to Nicolás Lichtmaier's message of 11/17/2016 5:20 PM.)


Re: Multi-field IDF

Ahmet Arslan
In reply to this post by Nicolás Lichtmaier
Hi Nicholas,

Aha, I see that you are into field-based scoring, which is an unsolved problem.

Then, you might find BlendedTermQuery and SynonymQuery relevant.

Ahmet




(In reply to Nicolás Lichtmaier's message of Friday, November 18, 2016 12:22 AM.)

Re: Multi-field IDF

wmartinusa
In this work, we aim to improve the field weighting for structured
document retrieval. We first introduce the notion of field relevance as
the generalization of field weights, and discuss how it can be estimated
using relevant documents, which effectively implements relevance feedback
for field weighting. We then propose a framework for estimating field
relevance based on the combination of several sources. Evaluation on
several structured document collections show that field weighting based
on the suggested framework improves retrieval effectiveness significantly.


https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1051




(In reply to Ahmet Arslan's message of 11/18/2016 3:57 AM.)
