DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG-RELATED
COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.
------- Additional Comments From [hidden email] 2005-04-28 16:04 -------
(In reply to comment #2)
> Nicolas, thanks for the contribution! I took a quick look at the ZIP file.
> Would it be possible for you to describe (here and/or in the Javadocs) how these
> 12+ classes work to provide Document update functionality?
The goal of this contribution is to overwrite only the files that contain
information about the term posting lists (.tis, .tii, .frq, etc.).
In the Lucene API, the term posting lists are accessible through
IndexReader.terms() (enumerates all the terms) and IndexReader.termPositions()
(for a specific term, enumerates each tuple <doc number, freq, <position>^freq>).
So if I modify the output of these two methods (add new terms, delete relations
between documents and terms, etc.) and write the result back to the Lucene index,
I get a new Lucene term posting list. That's what this contribution does!
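To make the <doc number, freq, positions> structure concrete, here is a small in-memory model of a posting list. It is only a sketch: the class PostingListSketch and its build method are invented for illustration, and it uses neither Lucene nor its file format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of a posting list; not Lucene's on-disk format.
class PostingListSketch {

    // term -> (doc number -> positions of the term in that document).
    // The frequency of a term in a document is the size of its position list.
    static Map<String, Map<Integer, List<Integer>>> build(String[][] docs) {
        Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
        for (int doc = 0; doc < docs.length; doc++) {
            for (int pos = 0; pos < docs[doc].length; pos++) {
                postings.computeIfAbsent(docs[doc][pos], t -> new TreeMap<>())
                        .computeIfAbsent(doc, d -> new ArrayList<>())
                        .add(pos);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        // Two tiny documents, already split into terms.
        String[][] docs = {
            {"lucene", "index", "lucene"},
            {"index", "update"}
        };
        // Enumerate every term, then each <doc number, freq, positions> entry.
        build(docs).forEach((term, byDoc) ->
            byDoc.forEach((doc, positions) ->
                System.out.println(term + " -> <" + doc + ", "
                        + positions.size() + ", " + positions + ">")));
    }
}
```

Changing the posting list then amounts to changing this map before it is written back, which is the idea behind the contribution.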
To do this, I created an interface called TermProducter containing these two methods
(Terms() and TermPositions()). A class implementing this interface has to
produce these two kinds of output (so it produces the posting lists). For example, an
IndexReader could implement this interface, but you can also create your own
term posting list producer, or create a TermProducter that modifies the content
of the original IndexReader output.
Then, with the TermWriter class, which takes a TermProducter and a Lucene
index as input, you can rewrite the Lucene term posting list with the content of the
TermProducter.
So now the question is: how can I modify the term posting list? What are
my tools?
There are two types of tools: TermGenerator and TermTransformer.
* The TermGenerator interface. It generates a TermProducter instance. Its goal
is to create a new posting list. The interface is simple:
public TermProducter CreateTermProducter();
There are two proposed implementations:
- TermReader. An IndexReader wrapper implementing TermProducter.
- TermAdder. You can create your own posting list by adding term/document
relations. It's like a virtual index.
* The TermTransformer interface. It modifies the output of a TermProducter. The
interface is simple:
public TermProducter transform(TermProducter producter);
There are two proposed implementations:
- TermFilter. Filters out some term/doc relations.
- TermReplacer. Replaces some term/doc relations with other relations.
* There is also a special TermProducter implementation called TermMerger. It
merges several TermProducters. (useful )
void add(TermProducter producter )
Now we can combine these pieces to create a kind of pipeline. For example, an
update process:
(1) TermReader ----> (2) TermFilter ----> (4) TermMerger ( ----> (5) TermWriter )
(3) TermAdder ---->-----------------------+
1 - we read the Lucene posting list
2 - we delete some terms
3 - we add new terms
4 - we merge the two TermProducters to create the final TermProducter
5 - we write the TermProducter information to the Lucene index.
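The five steps above can be sketched with simplified stand-ins. Everything here is an assumption made for illustration: Producer, Transformer, dropTerm and Merger are not the contribution's classes, and postings are reduced to a map from term to document numbers so the example runs without Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class UpdatePipelineSketch {

    // Simplified stand-in for TermProducter: term -> doc numbers.
    interface Producer {
        Map<String, Set<Integer>> postings();
    }

    // Simplified stand-in for TermTransformer.
    interface Transformer {
        Producer transform(Producer source);
    }

    // Stand-in for a TermFilter that drops one term entirely.
    static Transformer dropTerm(String term) {
        return source -> () -> {
            Map<String, Set<Integer>> out = new TreeMap<>(source.postings());
            out.remove(term);
            return out;
        };
    }

    // Stand-in for TermMerger: the union of every producer added so far.
    static class Merger implements Producer {
        private final List<Producer> inputs = new ArrayList<>();

        void add(Producer producer) {
            inputs.add(producer);
        }

        @Override
        public Map<String, Set<Integer>> postings() {
            Map<String, Set<Integer>> out = new TreeMap<>();
            for (Producer p : inputs) {
                p.postings().forEach((term, docs) ->
                    out.computeIfAbsent(term, t -> new TreeSet<>()).addAll(docs));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> stored = new TreeMap<>();
        stored.put("old", new TreeSet<>(Arrays.asList(0, 1)));
        stored.put("kept", new TreeSet<>(Arrays.asList(1)));
        Producer reader = () -> stored;                         // (1) read
        Producer filtered = dropTerm("old").transform(reader);  // (2) delete

        Map<String, Set<Integer>> fresh = new TreeMap<>();
        fresh.put("new", new TreeSet<>(Arrays.asList(0)));
        Producer adder = () -> fresh;                           // (3) add

        Merger merger = new Merger();                           // (4) merge
        merger.add(filtered);
        merger.add(adder);

        // (5) a real pipeline would hand this to the writer instead of printing it
        System.out.println(merger.postings());
    }
}
```

Because each stage only wraps the previous one, nothing is computed until the last stage asks for the postings, which is one plausible reason for building the tools as composable producers.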
This design is flexible: if I just want to replace terms, I can use
this simpler, optimized process:
(1) TermReader ----> (2) TermReplacer ( ----> TermWriter )
So you can create your own pipeline of term transformations.
--- A COMPLETE EXAMPLE ---
Use case: I have to delete a term in several documents.
1 - I have to know all the Lucene document numbers.
The main class is IndexUpdater. It contains a TermWriter and lets me select
the desired docs.
So I must create an instance:
IndexUpdater updater = new IndexUpdater(indexReader); // indexReader is an IndexReader
After this, I can execute a Lucene query to select all the desired documents:
DocumentSelection docsel = updater.selectDoc(query); // query is a Query
OK, now I have a DocumentSelection instance that tells a
TermGenerator/TermTransformer which documents are selected, so it knows where to
delete the terms.
2 - Delete their relations with the desired terms.
So now I create a TermFilter and delete the term in the selected documents:
filter.deleteTerm(new Term("field","deletedvalue"), docsel);
3 - Now I create a pipeline like this: TermReader ----> TermFilter ( ----> TermWriter )
IndexUpdater has a method to create a TermReader for the Lucene index:
TermReader reader = updater.getTermReader();
4 - I close, which writes the new posting lists to the index.
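The use case above (select documents, then delete one term only in those documents) can be imitated without Lucene. This is a sketch under stated assumptions: selectDocs and deleteTerm below are hypothetical stand-ins for updater.selectDoc(...) and filter.deleteTerm(...), a predicate on doc numbers replaces the query, and postings are reduced to term -> doc numbers.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.IntPredicate;

class DeleteTermSketch {

    // Stand-in for selectDoc(query): collect the doc numbers a predicate accepts.
    static Set<Integer> selectDocs(int maxDoc, IntPredicate matches) {
        Set<Integer> selected = new TreeSet<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            if (matches.test(doc)) {
                selected.add(doc);
            }
        }
        return selected;
    }

    // Stand-in for deleteTerm(term, docsel): remove the term from the selected
    // documents only; a term left with no documents disappears from the result.
    static Map<String, Set<Integer>> deleteTerm(Map<String, Set<Integer>> postings,
                                                String term, Set<Integer> selected) {
        Map<String, Set<Integer>> out = new TreeMap<>();
        postings.forEach((t, docs) -> {
            Set<Integer> kept = new TreeSet<>(docs);
            if (t.equals(term)) {
                kept.removeAll(selected);
            }
            if (!kept.isEmpty()) {
                out.put(t, kept);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = new TreeMap<>();
        postings.put("field:deletedvalue", new TreeSet<>(Arrays.asList(0, 1, 2)));
        postings.put("field:other", new TreeSet<>(Arrays.asList(1)));

        // Pretend the query matched the even-numbered documents.
        Set<Integer> docsel = selectDocs(3, doc -> doc % 2 == 0);

        System.out.println(deleteTerm(postings, "field:deletedvalue", docsel));
    }
}
```

The input map is left untouched, mirroring the idea that the index only changes when the final posting lists are written.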
OK, is it clear?
PS: 1 - sorry for my English; 2 - I know this contribution is not perfect (class
names, design, implementation) and can certainly be improved, but it's a first
step toward easy updating of the posting lists, which Lucene currently lacks.