Rebuilding Document from index?


Rebuilding Document from index?

Itamar Syn-Hershko
Hi,
 
Is it possible to re-create a document from an index, if it's not stored?
What I'm looking for is a way to have a text document with the text AFTER it
was analyzed, so I can see how my analyzer handles certain cases. So that
means I don't care if I will not get the original document. I want to see
the document as the index knows it.
 
Thanks in advance,
 
Itamar.

RE: Rebuilding Document from index?

spring
You can use Luke to rebuild the document. It will show you the terms of the
analyzed document, not the original content.
And this is what you want, if I understood you correctly.
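
For readers wondering what such a reconstruction involves: conceptually, an unstored field can be rebuilt by walking every term in the index and putting each one back at the positions recorded for the document. The sketch below is a toy model in plain Java under that assumption. It is not Luke's or Lucene's code, and the class name `ReconstructDemo` and its crude "analyzer" (lowercase, split on non-alphanumerics) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy positional inverted index. Hypothetical illustration only, not Lucene's API.
class ReconstructDemo {
    // term -> (docId -> positions of that term inside the document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // "Analyze" and index a document. The analyzer is a stand-in:
    // lowercase, then split on anything that is not a letter or a digit.
    void add(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    // Rebuild the analyzed form of a document: visit every term and place it
    // back at each position recorded for this document.
    String reconstruct(int docId) {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, List<Integer>>> e : postings.entrySet()) {
            List<Integer> positions = e.getValue().get(docId);
            if (positions == null) continue;
            for (int p : positions) byPosition.put(p, e.getKey());
        }
        return String.join(" ", byPosition.values());
    }

    public static void main(String[] args) {
        ReconstructDemo index = new ReconstructDemo();
        index.add(0, "The Quick, brown fox!");
        index.add(1, "A lazy dog.");
        System.out.println(index.reconstruct(0)); // prints "the quick brown fox"
    }
}
```

The output is the text as the index knows it: punctuation and case are gone because the analyzer dropped them before indexing.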

> -----Original Message-----
> From: Itamar Syn-Hershko [mailto:[hidden email]]
> Sent: Friday, 22 February 2008 14:02
> To: [hidden email]
> Subject: Rebuilding Document from index?
>
> Hi,
>  
> Is it possible to re-create a document from an index, if it's
> not stored?
> What I'm looking for is a way to have a text document with
> the text AFTER it
> was analyzed, so I can see how my analyzer handles certain
> cases. So that
> means I don't care if I will not get the original document. I
> want to see
> the document as the index knows it.
>  
> Thanks in advance,
>  
> Itamar.
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: Rebuilding Document from index?

Itamar Syn-Hershko
Yes, that's exactly what I wanted. I used Luke for a while but never noticed
the Reconstruct feature, thanks.

Itamar.

-----Original Message-----
From: [hidden email] [mailto:[hidden email]]
Sent: Friday, February 22, 2008 3:22 PM
To: [hidden email]
Subject: RE: Rebuilding Document from index?

You can use Luke to rebuild the document. It will show you the terms of the
analyzed document, not the original content.
And this is what you want, if I understood you correctly.


RE: Rebuilding Document from index?

Itamar Syn-Hershko
In reply to this post by spring
Hello again,

If I wanted to do this programmatically, how would I do it (retrieve a
list of all terms in a field for a specific document, preferably in
alphabetical order and with frequency data)?

Thanks,

Itamar.


Re: Rebuilding Document from index?

Erick Erickson
See TermDocs/TermEnum. Or perhaps TermFreqVector. I admit I haven't
used that last, but that family of methods ought to fix you up.

What problem are you trying to solve? Perhaps there are better
solutions to suggest....

Best
Erick
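
To make the suggestion concrete without tying it to a particular Lucene version: the idea behind enumerating terms and their postings is to visit each term in sorted order and keep the ones whose postings mention the document in question. Below is a minimal sketch of that idea in plain Java. It is a toy model, not the Lucene API; `TermFreqDemo`, `add` and `termsOf` are hypothetical names, and the tokenizer is a stand-in for a real analyzer.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index with per-document term frequencies.
// Hypothetical illustration only, not Lucene's TermEnum/TermDocs classes.
class TermFreqDemo {
    // term (kept sorted) -> (docId -> number of occurrences in that document)
    private final TreeMap<String, Map<Integer, Integer>> postings = new TreeMap<>();

    // Stand-in analyzer: lowercase and split on non-alphanumeric characters.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Walk all terms in alphabetical order and collect those that occur in
    // the given document, together with their frequency there.
    TreeMap<String, Integer> termsOf(int docId) {
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
            Integer freq = e.getValue().get(docId);
            if (freq != null) result.put(e.getKey(), freq);
        }
        return result;
    }

    public static void main(String[] args) {
        TermFreqDemo index = new TermFreqDemo();
        index.add(0, "to be or not to be");
        index.add(1, "not now");
        System.out.println(index.termsOf(0)); // prints {be=2, not=1, or=1, to=2}
    }
}
```

Note that this walks the whole term dictionary for a single document, which is expensive on a large index. A per-document structure such as a term vector avoids the full walk, at the cost of having to store it at indexing time.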

On Mon, Feb 25, 2008 at 6:04 PM, Itamar Syn-Hershko <[hidden email]>
wrote:

> Hello again,
>
> If I wanted to do this programmatically, how would I do this (retrieve a
> list of all terms in a field for a specific document - better if it was in
> alphabetical order and with frequency data)?
>
> Thanks,
>
> Itamar.

RE: Rebuilding Document from index?

Itamar Syn-Hershko

I'm implementing something like MoreLikeThis for Hebrew. Non-Hebrew
implementations are relevant but much less accurate, since a word like PURIM
can show up in the actual document with initial letters attached (LPURIM,
BPURIM, etc.) or even with 1-4 letters after it, all of which refer to the
same term. The score it gets when analyzed by the current MoreLikeThis
implementation therefore will not reflect its real importance.

I'm still trying to engineer the best possible solution for Lucene with
Hebrew. Right now my approach is NOT to use a stemmer by default, only by
explicit request of the user. MoreLikeThis would only return relevant
results if I use non-stemmed scoring and lookup.

Itamar.

-----Original Message-----
From: Erick Erickson [mailto:[hidden email]]
Sent: Tuesday, February 26, 2008 4:29 PM
To: [hidden email]
Subject: Re: Rebuilding Document from index?

See TermDocs/TermEnum. Or perhaps TermFreqVector. I admit I haven't used
that last, but that family of methods ought to fix you up.

What problem are you trying to solve? Perhaps there are better solutions to
suggest....

Best
Erick


Re: Rebuilding Document from index?

Mathieu Lecarme
Yes, I've found a tester!
A patch was submitted for this kind of job:
https://issues.apache.org/jira/browse/LUCENE-1190

And here is the svn work in progress:
https://admin.garambrogne.net/subversion/revuedepresse/trunk/src/java/lexicon

And the web version:
https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java/lexicon


On 26 Feb 2008, at 17:33, Itamar Syn-Hershko wrote:

>
> Implementing something like MoreLikeThis for Hebrew. Non-Hebrew
> implementations are relevant, but much less accurate since a word like
> PURIM can show up in the actual document with initials (LPURIM, BPURIM
> etc.) or even with 1-4 letters after it which all refer to the same
> term, and then the score it will get upon analyzing using the current
> MoreLikeThis implementation will not reflect its real importance.
>
> I'm still trying to engineer the best possible solution for Lucene
> with Hebrew, right now my path is NOT using a stemmer by default, only
> by explicit request of the user. MoreLikeThis would only return
> relevant results if I will use a non-stemmed scoring and lookup.
>
> Itamar.

RE: Rebuilding Document from index?

Itamar Syn-Hershko
Not to ruin your party, but I'm not sure exactly what this Lexicon object is
for and how it should work. Plus, the requirements I have for analyzing
Hebrew (not only for the MoreLikeThis functionality) are far more demanding
than what is needed for French.

But I'm open to any suggestion on this matter (BTW, if I understand what
you're trying to do correctly, this post of mine should be related as well:
http://www.mail-archive.com/java-user@.../msg18650.html).

Itamar.

-----Original Message-----
From: Mathieu Lecarme [mailto:[hidden email]]
Sent: Tuesday, February 26, 2008 11:18 PM
To: [hidden email]
Subject: Re: Rebuilding Document from index?

Yes, I've found a tester!
A patch was submitted for this kind of job:
https://issues.apache.org/jira/browse/LUCENE-1190

And here is the svn work in progress:
https://admin.garambrogne.net/subversion/revuedepresse/trunk/src/java/lexicon

And the web version:
https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java/lexicon



Re: Rebuilding Document from index?

Daniel Noll-3-2
In reply to this post by Itamar Syn-Hershko
On Wednesday 27 February 2008 03:33:53 Itamar Syn-Hershko wrote:
> I'm still trying to engineer the best possible solution for Lucene with
> Hebrew, right now my path is NOT using a stemmer by default, only by
> explicit request of the user. MoreLikeThis would only return relevant
> results if I will use a non-stemmed scoring and lookup.

This appears to be the case for all languages too: stemming will skew
similarity and cause unrelated documents to score higher than they should.

Some people seem to be working around this by having two fields where one is
stemmed and the other isn't.  You could then use the stemmed field when doing
queries but use the non-stemmed field for MoreLikeThis.

Daniel
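
The two-field workaround can be sketched as follows. This is a plain-Java toy model rather than Lucene code: the "stemmer" just strips a trailing "s", the similarity measure is a crude count of shared exact terms standing in for MoreLikeThis, and all names (`TwoFieldDemo`, `stem`, `search`, `sharedExactTerms`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Each document is indexed twice: once as-is and once stemmed.
// Hypothetical illustration only, not Lucene's API.
class TwoFieldDemo {
    private final Map<Integer, Set<String>> exactField = new TreeMap<>();
    private final Map<Integer, Set<String>> stemmedField = new TreeMap<>();

    // Toy stemmer (assumption): strip one trailing "s" from longer tokens.
    static String stem(String token) {
        return token.length() > 3 && token.endsWith("s")
                ? token.substring(0, token.length() - 1) : token;
    }

    // Stand-in analyzer: lowercase and split on non-alphanumeric characters.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    void add(int docId, String text) {
        Set<String> exact = new TreeSet<>();
        Set<String> stemmed = new TreeSet<>();
        for (String t : tokens(text)) {
            exact.add(t);
            stemmed.add(stem(t));
        }
        exactField.put(docId, exact);
        stemmedField.put(docId, stemmed);
    }

    // Query path: match against the stemmed field so variants are found.
    List<Integer> search(String word) {
        String q = stem(word.toLowerCase(Locale.ROOT));
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, Set<String>> e : stemmedField.entrySet()) {
            if (e.getValue().contains(q)) hits.add(e.getKey());
        }
        return hits;
    }

    // Similarity path: compare documents on the non-stemmed field only.
    int sharedExactTerms(int docA, int docB) {
        Set<String> common = new TreeSet<>(exactField.get(docA));
        common.retainAll(exactField.get(docB));
        return common.size();
    }

    public static void main(String[] args) {
        TwoFieldDemo index = new TwoFieldDemo();
        index.add(0, "cats chase mice");
        index.add(1, "a cat sleeps");
        index.add(2, "cats chase dogs");
        System.out.println(index.search("cat"));          // prints [0, 1, 2]
        System.out.println(index.sharedExactTerms(0, 2)); // prints 2
        System.out.println(index.sharedExactTerms(0, 1)); // prints 0
    }
}
```

The query for "cat" reaches all three documents through the stemmed field, while the similarity comparison on the exact field does not treat documents 0 and 1 as related.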


RE: Rebuilding Document from index?

Itamar Syn-Hershko

This is exactly where Hebrew is different from all Latin languages. I did
think about the approach you mentioned, of having two fields, one stemmed
and the other not, but even with it the search will be performed on the
non-stemmed field by default. The stemmed field will only be searched upon
explicit request, since one stem in Hebrew can be related to many nouns,
adjectives, and verbs (too many of them), and the stemming process itself is
not deterministic enough.

I would rather use the non-stemmed field for MoreLikeThis as well, but as I
said I will need some sort of synonyms engine, so I would be able to score
related words by their real frequency and not be tricked by any initials (as
I said before - "the", "and" and other so-called stop words are initial
letters in Hebrew, and are tough to omit).
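
As a rough illustration of scoring related words by their real frequency, here is one conservative heuristic in plain Java: fold the count of a prefixed form into its base term, but only when the base term itself also occurs in the counts. Everything here is an assumption made for the example: the transliterated single-letter prefixes, the sample counts, and the names `PrefixFoldDemo` and `fold`. It does not handle trailing letters and is not a real Hebrew morphological analysis.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: merge frequencies of prefixed variants into the base term.
class PrefixFoldDemo {
    // counts:   term -> raw frequency as produced by a non-stemming analyzer
    // prefixes: single initial letters that may be attached to a word
    static Map<String, Integer> fold(Map<String, Integer> counts, Set<Character> prefixes) {
        Map<String, Integer> folded = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String term = e.getKey();
            String key = term;
            // Strip the first letter only if it is a known prefix AND the
            // remainder is itself an attested term; otherwise leave it alone.
            if (term.length() > 1
                    && prefixes.contains(term.charAt(0))
                    && counts.containsKey(term.substring(1))) {
                key = term.substring(1);
            }
            folded.merge(key, e.getValue(), Integer::sum);
        }
        return folded;
    }

    public static void main(String[] args) {
        // Made-up sample counts using the transliterations from this thread.
        Map<String, Integer> counts = Map.of(
                "PURIM", 3, "LPURIM", 2, "BPURIM", 1, "LEVANON", 4);
        System.out.println(fold(counts, Set.of('L', 'B'))); // prints {LEVANON=4, PURIM=6}
    }
}
```

LEVANON is left untouched because "EVANON" never occurs on its own, which is the kind of false split a blind prefix stripper would make.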

That is mainly why I'm interested in an easy and inexpensive solution.
Mathieu seems to have gone off this topic, unfortunately...

Itamar.

-----Original Message-----
From: Daniel Noll [mailto:[hidden email]]
Sent: Friday, February 29, 2008 5:35 AM
To: [hidden email]
Subject: Re: Rebuilding Document from index?

On Wednesday 27 February 2008 03:33:53 Itamar Syn-Hershko wrote:
> I'm still trying to engineer the best possible solution for Lucene
> with Hebrew, right now my path is NOT using a stemmer by default, only
> by explicit request of the user. MoreLikeThis would only return
> relevant results if I will use a non-stemmed scoring and lookup.

This appears to be the case for all languages too: stemming will skew
similarity and cause unrelated documents to score higher than they should.

Some people seem to be working around this by having two fields where one is
stemmed and the other isn't.  You could then use the stemmed field when
doing queries but use the non-stemmed field for MoreLikeThis.

Daniel
