Merging indexes - which is best option?

adb
I am building several temporary batch indexes and will periodically merge
those batches into a set of master indexes.  I'm using
IndexWriter#addIndexesNoOptimize(), but the problem is that the master may
already contain an entry for a given document, so I end up with a duplicate.

Duplicates are prevented in the temporary index, because when adding Documents,
I call IndexWriter#deleteDocuments(Term) with my UID, before I add the Document.

I have two choices

a) merge indexes then clean up any duplicates in the master (or vice versa).
Probably IndexWriter.deleteDocuments(Term[]) would suit here with all the UIDs
of the incoming documents.

b) iterate through the Documents in the temporary index and add them to the master

Option b) sounds worse: it seems an IndexWriter's Analyzer cannot be null, and
I guess there's a penalty in reassembling each Document from the reader.

Any views?
Antony







---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Merging indexes - which is best option?

Karsten F.-2
Hi Antony,

I decided to first delete all duplicates from the master (iW) and then add all temporary indices (other).
Any other opinions?

Best regards
  Karsten

<code>
    public static synchronized void merge(IndexWriter iW, Directory[] other, final String uniqueID_FieldName) throws IOException{
        final Term firstFieldTerm = new Term(uniqueID_FieldName, "");
        boolean rollback = true;
        try {
            Term[] possibleDuplicates;
            for(Directory toAddDir : other){
                IndexReader toAddIR = IndexReader.open(toAddDir);
                try{
                    int indexSize = toAddIR.numDocs();
                    possibleDuplicates = new Term[indexSize];

                    int cnt = 0;
                    TermEnum possibleDuplicateTerms = toAddIR.terms(firstFieldTerm);
                    Term possibleDuplicateTerm = possibleDuplicateTerms.term();
                    while(true){
                        if(possibleDuplicateTerm == null){
                            break;
                        }
                        // Stop once the enumeration moves past the unique-ID field.
                        // Compare with equals(): != is only safe if both strings are interned.
                        if(!possibleDuplicateTerm.field().equals(uniqueID_FieldName)){
                            break;
                        }
                        // Sanity check: a unique ID should match at most one document.
                        if(moreThenOneDocument(toAddIR, possibleDuplicateTerm)){
                            System.out.println("unique ID is not unique: " + possibleDuplicateTerm);
                        }
                        assert cnt < indexSize : "please don't use more than one unique ID per document";
                        possibleDuplicates[cnt++]=possibleDuplicateTerm;
                        possibleDuplicateTerms.next();
                        possibleDuplicateTerm = possibleDuplicateTerms.term();
                    }
                    possibleDuplicateTerms.close();
                    if( indexSize != cnt ){
                        possibleDuplicates = Arrays.copyOf(possibleDuplicates, cnt);
                        System.out.println("log: " + indexSize  + " != " + cnt);
                    }
                } finally {
                    toAddIR.close();
                }
                iW.deleteDocuments(possibleDuplicates);
            }
            iW.addIndexes(other);
            rollback = false;
        } finally {
            if(rollback){
                iW.abort();
            } else {
                iW.flush();
            }
        }
    }
    public static boolean moreThenOneDocument(IndexReader iR, Term term) throws IOException{
        TermDocs tDoc = iR.termDocs(term);
        try {
            // true only if at least two documents contain the term
            return tDoc.next() && tDoc.next();
        } finally {
            tDoc.close();
        }
    }
</code>
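
To show the ordering in isolation, here is a minimal plain-Java sketch of the same idea: collect the unique IDs from the incoming batches, delete those IDs from the master, then append the batches. It is only a model and not Lucene code. MergeSketch, Doc and the list-based "index" are hypothetical names invented for illustration; the ID set stands in for the Term[] passed to deleteDocuments, and the final addAll stands in for addIndexes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java model of "delete duplicates from master, then add the batches". Not Lucene code. */
class MergeSketch {

    /** Hypothetical stand-in for an indexed document: a unique ID plus some content. */
    static final class Doc {
        final String uid;
        final String body;

        Doc(String uid, String body) {
            this.uid = uid;
            this.body = body;
        }
    }

    /**
     * Merges the batches into master so that every UID found in a batch
     * replaces any document with the same UID already in master.
     */
    static void merge(List<Doc> master, List<List<Doc>> batches) {
        // Step 1: collect every UID present in the incoming batches.
        Set<String> incoming = new HashSet<>();
        for (List<Doc> batch : batches) {
            for (Doc d : batch) {
                incoming.add(d.uid);
            }
        }
        // Step 2: delete the matching documents from master.
        master.removeIf(d -> incoming.contains(d.uid));
        // Step 3: append the batches unchanged.
        for (List<Doc> batch : batches) {
            master.addAll(batch);
        }
    }
}
```

As in the method above, only the master is cleaned: if two incoming batches contain the same UID, both copies end up in the master.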

Re: Merging indexes - which is best option?

adb
Thanks Karsten,

> I decided first to delete all duplicates from master(iW) and then to insert
> all temporary indices(other).

I reached the same conclusion.  As your code shows, it's a simple enough
solution.  You had a good point with the iW.abort() in the rollback case.
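
The rollback flag used with iW.abort() is a general pattern, and a small plain-Java sketch may make it easier to see. This is an illustration under stated assumptions, not Lucene code: PendingWriter is a hypothetical writer that buffers changes until commit() or abort(), standing in for the IndexWriter.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the "rollback unless we reached the end" pattern. Not Lucene code. */
class RollbackSketch {

    /** Hypothetical writer that buffers changes until commit() or abort(). */
    static final class PendingWriter {
        final List<String> committed = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();

        void add(String item) {
            if (item == null) {
                throw new IllegalArgumentException("null item");
            }
            pending.add(item);
        }

        void commit() {
            committed.addAll(pending);
            pending.clear();
        }

        void abort() {
            pending.clear();
        }
    }

    /** Adds all items, or none of them if any add fails. */
    static void addAll(PendingWriter writer, List<String> items) {
        boolean rollback = true;
        try {
            for (String item : items) {
                writer.add(item);
            }
            rollback = false; // reached only if no exception was thrown
        } finally {
            if (rollback) {
                writer.abort();  // discard the partial work
            } else {
                writer.commit(); // make the whole batch visible
            }
        }
    }
}
```

The flag starts as true, so any exception thrown before the last statement of the try block leaves it set and the partial batch is discarded.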

Antony





Scoring

Ulrich Vachon-2
Hi all,

Is it possible to get the score of each individual term in a query? For example:
 - query = "foo bar"

I would like to have separate scores for "foo" and for "bar". Currently the score is computed for the full query "foo bar".

Regards,
Ulrich

-----Original Message-----
From: Antony Bowesman [mailto:[hidden email]]
Sent: Tuesday, 9 September 2008 08:12
To: [hidden email]
Subject: Re: Merging indexes - which is best option?

Thanks Karsten,

> I decided first to delete all duplicates from master(iW) and then to
> insert all temporary indices(other).

I reached the same conclusion.  As your code shows, it's a simple enough solution.  You had a good point with the iW.abort() in the rollback case.

Antony



