Most efficient way to find related terms

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

Most efficient way to find related terms

Martin Bayly
I'm wondering what the most efficient approach is to finding all terms
which are related to a specific term X.


By related I mean terms which appear in a specific document field that
also contains the target term X.


e.g. Document has a keyword field, field1 that can contain multiple


document1 - field1 HAS key1, key2, key3

document2 - field1 HAS key2, key4

document3 - field1 HAS key5


If I want to find terms related to key2, I need to return key1, key3,


Obviously I can do a search for key2, iterate all the docs and collect
there field1 terms manually.


But presumably a more efficient way is to use TermDocs:


1. TermDocs termdocs = IndexReader.termDocs(new Term("field1", "key2")

2. Iterate term docs to get documents containing that term

3. Now this is the bit I'm not sure of:

a. I could call Document doc = IndexReader.document(n), but that will
load all fields and I only want the field1

b. Presumably better to call Document doc = IndexReader.document(n,

c. Or would I be better to turn on TermFrequencyVectors for this field
so I can call IndexReader.getTermFrequencyVector(n, "field1") - don't
particularly care about the frequencies as it will always be 1 for a
particular doc.


Other approaches?


I'm going to perf test to see how (b) and (c) compare but would be glad
if anyone has any insights.