Persian Implementation

Patrick Estarian
Hi,

I am trying to get the Persian part of Lucene to work, but apparently the current implementation is just a simple stop-word tokenizer with no stemmer, etc. I was trying to find the contact details of the person who wrote it but couldn't find them anywhere in the code.

In addition, I went through the source and even made some class diagrams out of the code just to understand the project better. In fact I was looking for a TokenStream that can give me the previous tokens in a stream but apparently all the existing classes can only traverse forward and not backward.

The problem I am facing with a Persian stemmer is that verbs in Persian can consist of multiple words. A simple example of that in English would be the verb "give up", which has a completely different meaning than "give" or "up":

   We had given the dog up as lost.

So, a proper search query should understand this and give us the right search results. In English it is easier to find such verbs because the main verb (give) comes first and the second word (up) comes after it. But in Persian it is usually the other way around. Something like:

   We had up the dog given as lost.

Now, when you reach the token "given", you really need to know whether this verb is a plain verb or a compound verb. Therefore, you have to find the token "up" earlier in the stream in order to produce the correct verb.

So, please correct me if I am wrong. Given this requirement, my understanding is that we need a new TokenStream that holds a few of the previous tokens in a list or an array. If this is correct, please let me know how I can write such a class and what considerations I should keep in mind: memory consumption, performance, staying consistent with the architecture, etc.
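
To make the idea of "holding a few of the previous tokens" concrete, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: it wraps an ordinary `Iterator<String>` rather than Lucene's actual TokenStream API, and the class name `HistoryTokens` and its `previous` method are hypothetical, not existing Lucene classes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

/**
 * Hypothetical helper (not a Lucene class): wraps a forward-only token
 * iterator and remembers the last `capacity` tokens it has returned.
 */
final class HistoryTokens implements Iterator<String> {
    private final Iterator<String> source;
    private final int capacity;
    // The most recently returned token sits at the head of the deque.
    private final Deque<String> history = new ArrayDeque<>();

    HistoryTokens(Iterator<String> source, int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.source = source;
        this.capacity = capacity;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public String next() {
        String token = source.next();
        history.addFirst(token);
        if (history.size() > capacity) {
            history.removeLast(); // drop the oldest token to bound memory use
        }
        return token;
    }

    /** Token returned `distance` steps ago (0 = the current one), or null if not retained. */
    String previous(int distance) {
        if (distance < 0 || distance >= history.size()) {
            return null;
        }
        Iterator<String> it = history.iterator(); // iterates newest to oldest
        for (int i = 0; i < distance; i++) {
            it.next();
        }
        return it.next();
    }
}
```

With a capacity of 4, after reading "we had up the dog given" the call `previous(3)` would return "up" while the current token is "given". Memory stays bounded by the capacity, but note that tokens already handed to the consumer cannot be taken back with this approach.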

Your help will be greatly appreciated!

Thanks,
-Patrick

Re: Persian Implementation

Robert Muir
On Mon, Jul 18, 2011 at 6:24 PM, Patrick Estarian
<[hidden email]> wrote:
> Hi,
>
> I am trying to get the Persian part of Lucene to work but apparently the
> current implementation is just a simple version of stop word tokenizer and
> no stemmer, etc. I was trying to find the contact of the person who had done
> this but couldn't find it any where in the code.
>

The absence of a stemmer is intentional, as my findings (and others') seem to
correspond with this statement:

Our various experiments clearly show that a stemming
procedure decreases retrieval effectiveness when applied
to the Persian language.

http://portal.acm.org/citation.cfm?id=1674748

But YMMV,

--
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Persian Implementation

Patrick Estarian
Robert,

Thank you very much for the reply.

If I understand it correctly, you have a main project and a contrib section. It is very important for us that this Persian search works "correctly" rather than efficiently. Would a Persian stemmer be a good candidate for the contrib section if I wrote the code for it?

Thanks,
-Patrick



Re: Persian Implementation

Robert Muir
On Mon, Jul 18, 2011 at 11:37 PM, Patrick Estarian
<[hidden email]> wrote:
> Robert,
>
> Thank you very much for the reply.
>
> If I understand it correctly, you have a main project and a contrib section.
> It is very important for us to have this Persian search work "correctly"
> rather than efficiently. Would this be a good candidate for the contrib
> section if I wrote some code for the Persian stemmer?
>

We can always have the option available!


Re: Persian Implementation

Patrick Estarian
That's great! thanks!

Now, would you please answer my original questions?

1- Contact details of the developer who wrote the Persian classes?

2- Which class should I start from to build my own TokenStream that holds a few of the previous tokens? (Or is there already something like that among the existing classes?)

3- If I find that two or more of the already-emitted tokens need to be folded into one token, how can I go back through the stream and modify those tokens (e.g. delete or edit them)?


Sorry if my questions are too simple. I spent a lot of time on the code, and only after I couldn't figure out how to do this did I send the question to you.

Thanks,
-Patrick


Re: Persian Implementation

Robert Muir
On Tue, Jul 19, 2011 at 11:03 AM, Patrick Estarian
<[hidden email]> wrote:
> That's great! thanks!
>
> Now, would you please answer my original questions?
>
> 1- contact of the developer who did the Persian classes?

[hidden email] :)

here is the original issue with discussion:
https://issues.apache.org/jira/browse/LUCENE-1628

>
> 2- from which class I can start making my own TokenStream that holds a few
> of the previous tokens? (or is there already something like that in the
> existing classes?)

Instead of look-behind, since it's a TokenStream it's better to use
lookahead: have a look at captureState() and restoreState().

>
> 3- if I find that two or more of the already-appended-tokens need to be
> folded into one token, how can I go back through the stream and modify those
> tokens (e.g. delete or edit them)?

Hopefully my answer to #2 helps: instead of going back to modify tokens,
you just never return them. You look ahead as needed and always return
only what is needed.
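
The control flow of that lookahead approach can be sketched in plain Java. This is a rough model under stated assumptions, not Lucene code: a real filter would buffer attribute states with captureState() and restoreState() instead of strings, and the class name `CompoundVerbFilter`, the `compounds` lookup table, and the `window` parameter are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/**
 * Plain-Java model of a lookahead filter (hypothetical; not Lucene's API).
 * When a particle is read, up to `window` following tokens are read ahead.
 * If one of them forms a known compound with the particle, the particle is
 * never returned and that verb is replaced by the compound.
 */
final class CompoundVerbFilter implements Iterator<String> {
    private final Iterator<String> input;
    // particle -> (verb form -> compound to emit); assumed to be supplied by the caller
    private final Map<String, Map<String, String>> compounds;
    private final int window;
    // Tokens already read from the input but not yet returned to the consumer.
    private final Deque<String> pending = new ArrayDeque<>();

    CompoundVerbFilter(Iterator<String> input,
                       Map<String, Map<String, String>> compounds,
                       int window) {
        this.input = input;
        this.compounds = compounds;
        this.window = window;
    }

    @Override
    public boolean hasNext() {
        return !pending.isEmpty() || input.hasNext();
    }

    @Override
    public String next() {
        while (true) {
            String token = read();
            Map<String, String> verbs = compounds.get(token);
            if (verbs == null) {
                return token; // not a particle: pass through unchanged
            }
            // Look ahead for a verb that combines with this particle.
            List<String> ahead = new ArrayList<>();
            boolean merged = false;
            while (!merged && ahead.size() < window && hasNext()) {
                String candidate = read();
                String compound = verbs.get(candidate);
                merged = compound != null;
                ahead.add(merged ? compound : candidate);
            }
            // Put the looked-ahead tokens back, preserving their order.
            for (int i = ahead.size() - 1; i >= 0; i--) {
                pending.addFirst(ahead.get(i));
            }
            if (!merged) {
                return token; // no compound found: the particle is an ordinary token
            }
            // Compound found: the particle is dropped; loop to return the next token.
        }
    }

    private String read() {
        return pending.isEmpty() ? input.next() : pending.removeFirst();
    }
}
```

Fed the tokens of "we had up the dog given as lost" with a table mapping the particle "up" and the verb "given" to "give up" and a window of 3, this sketch would yield "we had the dog give up as lost": the particle is simply never returned, so nothing has to be edited after the fact. The window bounds how many tokens are buffered at once.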

