Most efficient way to find related terms

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Most efficient way to find related terms

Martin Bayly
I'm wondering what the most efficient approach is to finding all terms
which are related to a specific term X.

 

By related I mean terms which appear in a specific document field that
also contains the target term X.

 

e.g. Document has a keyword field, field1 that can contain multiple
keywords.

 

document1 - field1 HAS key1, key2, key3

document2 - field1 HAS key2, key4

document3 - field1 HAS key5

 

If I want to find terms related to key2, I need to return key1, key3,
key4

 

Obviously I can do a search for key2, iterate all the docs and collect
there field1 terms manually.

 

But presumably a more efficient way is to use TermDocs:

 

1. TermDocs termdocs = IndexReader.termDocs(new Term("field1", "key2")

2. Iterate term docs to get documents containing that term

3. Now this is the bit I'm not sure of:

a. I could call Document doc = IndexReader.document(n), but that will
load all fields and I only want the field1

b. Presumably better to call Document doc = IndexReader.document(n,
fieldSelector)

c. Or would I be better to turn on TermFrequencyVectors for this field
so I can call IndexReader.getTermFrequencyVector(n, "field1") - don't
particularly care about the frequencies as it will always be 1 for a
particular doc.

 

Other approaches?

 

I'm going to perf test to see how (b) and (c) compare but would be glad
if anyone has any insights.

 

Thanks

Martin