lang identifier and nutch analyzer in trunk

lang identifier and nutch analyzer in trunk

Jack.Tang
Hi All

I am wondering whether the Analyzer in Nutch svn trunk is chosen by the
languageidentifier plugin or not. (I know that in Nutch 0.7.1-dev it was.)

In org.apache.nutch.indexer.Indexer.class line 104

writer.addDocument((Document)((ObjectWritable)value).get());

It should be

NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
writer.addDocument((Document)((ObjectWritable)value).get(), analyzer );

right?

One more thing: should query parsing call AnalyzerFactory too? The query
input is also multilingual.

Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: lang identifier and nutch analyzer in trunk

Jack.Tang
On 1/21/06, Jack Tang <[hidden email]> wrote:

> Hi All
>
> I am wondering Analyzer of nutch in svn trunk is chosen by
> languageidentifer plugin or not? (I knew in nutch 0.7.1-dev it did).
>
> In org.apache.nutch.indexer.Indexer.class line 104
>
> writer.addDocument((Document)((ObjectWritable)value).get());
>
> It should be
>
> NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
> writer.addDocument((Document)((ObjectWritable)value).get(), analyzer );

Sorry, it should be

        Document doc = (Document)((ObjectWritable)value).get();
        NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
        writer.addDocument(doc, analyzer);

> right?
>
> Once more,query parsing should call AnalyzerFactory?? The query input
> is multi-lingual also.
>
> Regards
> /Jack
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: lang identifier and nutch analyzer in trunk

Jérôme Charron
In reply to this post by Jack.Tang
> I am wondering Analyzer of nutch in svn trunk is chosen by
> languageidentifer plugin or not? (I knew in nutch 0.7.1-dev it did).

It's not really chosen by the languageidentifier itself, but according to
the value of the lang attribute (for now, you're right, only the
languageidentifier adds this attribute).


> In org.apache.nutch.indexer.Indexer.class line 104
> writer.addDocument((Document)((ObjectWritable)value).get());
> It should be
> NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
> writer.addDocument((Document)((ObjectWritable)value).get(), analyzer );
> right?

Yes, it should.
Thanks for noticing this.
Merge problem?
(I don't remember adding this in nutch-0.7 ...)
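
The per-document analyzer selection discussed above can be sketched as a lookup keyed on the lang value. This is only an illustration: AnalyzerRegistrySketch and NamedAnalyzer are hypothetical stand-ins, not Nutch's AnalyzerFactory or NutchAnalyzer, and falling back to a default analyzer when lang is missing or unknown is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical language-keyed registry; not Nutch's AnalyzerFactory. */
class AnalyzerRegistrySketch {
    /** Hypothetical analyzer type that only carries a name. */
    static final class NamedAnalyzer {
        final String name;
        NamedAnalyzer(String name) { this.name = name; }
    }

    private final Map<String, NamedAnalyzer> byLang = new HashMap<>();
    private final NamedAnalyzer fallback;

    AnalyzerRegistrySketch(NamedAnalyzer fallback) { this.fallback = fallback; }

    /** Registers an analyzer under a case-insensitive language code. */
    void register(String lang, NamedAnalyzer analyzer) {
        byLang.put(lang.toLowerCase(Locale.ROOT), analyzer);
    }

    /** Returns the analyzer for lang, or the fallback when lang is null or unknown (assumed policy). */
    NamedAnalyzer get(String lang) {
        if (lang == null) return fallback;
        NamedAnalyzer analyzer = byLang.get(lang.toLowerCase(Locale.ROOT));
        return analyzer != null ? analyzer : fallback;
    }
}
```

With such a registry, the indexing loop would look up the analyzer once per document from its lang field and pass it along with the document, as in the corrected snippet.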


> Once more,query parsing should call AnalyzerFactory?? The query input
> is multi-lingual also.

The query part is not yet implemented.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

Jack.Tang
Hi Jérôme

On 1/21/06, Jérôme Charron <[hidden email]> wrote:

> > I am wondering Analyzer of nutch in svn trunk is chosen by
> > languageidentifer plugin or not? (I knew in nutch 0.7.1-dev it did).
>
> It's not really choosen by the languageidentifier, but coosen regarding the
> value of the lang attribute (for now, that's right, only the
> languageidentifier add this attribute).
>
>
> > In org.apache.nutch.indexer.Indexer.class line 104
> > writer.addDocument((Document)((ObjectWritable)value).get());
> > It should be
> > NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
> > writer.addDocument((Document)((ObjectWritable)value).get(), analyzer );
> > right?
>
> Yes, it should.
> Thanks for noticing this.
> Merge problem?
> (I don't remember to add this in nutch-0.7 ...)
>
>
> > Once more,query parsing should call AnalyzerFactory?? The query input
> > is multi-lingual also.
>
> The query part is not yet implemented.

Any plans to implement this? I mean moving the LanguageIdentifier class
into Nutch core.

Thanks
/Jack

> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: lang identifier and nutch analyzer in trunk

Jérôme Charron
> Any plan to implement this ? I mean move LanguageIdentifier class
> intto nutch core.

As I already suggested on this list, I would really like to move the
LanguageIdentifier class (and its profiles) to
an independent Lucene sub-project (and the MimeType repository too).
I don't remember why, but there were some objections to this...

Here is a short status of what I have in mind for the next improvements to
the LanguageIdentifier / multi-language support:
* Enhance the LanguageIdentifier APIs by returning something like an ordered
LangDetail[] array when guessing the language (each LangDetail should contain
the language code and its score). I have a prototype version of this on my
disk, but I haven't taken the time to finalize it.
* I encountered identification problems with some specific sites (Blogger,
for instance), and I plan to investigate this.
* Another pending task: the analysis (and coding) of multilingual querying
support.
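
As a rough illustration of the ordered LangDetail[] idea in the first item, a minimal sketch might look like the following. The field names, the ranked helper, and the convention that a higher score means a better match are all assumptions, since the prototype mentioned above is not shown in this thread.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical result type: a language code and its score. */
final class LangDetail {
    final String code;
    final double score;
    LangDetail(String code, double score) { this.code = code; this.score = score; }
}

class LangDetailSketch {
    /** Returns a copy of the candidates ordered best-first (highest score first, by assumption). */
    static LangDetail[] ranked(LangDetail[] candidates) {
        LangDetail[] out = candidates.clone();
        Arrays.sort(out, Comparator.comparingDouble((LangDetail d) -> d.score).reversed());
        return out;
    }
}
```

A caller could then take the first element as the best guess and still inspect the runners-up when the top scores are close.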

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

Andrzej Białecki-2
Jérôme Charron wrote:

>> Any plan to implement this ? I mean move LanguageIdentifier class
>> intto nutch core.
>>    
>
> As I already suggested it on this list, I really would like to move the
> LanguageIdentifier class (and profiles) to
> an independant Lucene sub-project (and the MimeType repository too).
> I don't remember why but there were some objections about this...
>
>  

I think most people agree that it would be worthwhile to un-tie this
component from Nutch internals. The only objections were related not to
the idea itself, but to the management aspects of creating a full-blown
sub-project, both wrt. the initial setup and the continuing
maintenance. An alternative solution was proposed (creating a contrib/
package). This would still help to separate the code from Nutch
internals, so that it can be used in other projects, but it would
require much less effort to set up and maintain.

> Here is a short status of what I have in mind for next improvements with the
> LanguageIdentifier / MultiLanguage support :
> * Enhance LanguageIdentifier APIs by returning something like an ordered
> LangDetail[] array when guessing language (each LangDetail should contains
> the language code and its score) - I have a prototype version of this on my
> disk but I doesn't take time to finalize it
>  

+1. Other local modifications which I use frequently:

* exporting a list of supported languages,

* exporting an NGramProfile of the analyzed text,

* allowing processing of chunks of input (i.e.
LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is
very useful if the text to be analyzed is already present in memory, and
the choice of sections (chunks) is made elsewhere, e.g. for documents
with clearly outlined sections, or for multi-language documents.
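
To make the chunk-based idea concrete, here is a minimal sketch of counting character n-grams over a region of an in-memory buffer. ChunkNGramSketch and its count method are hypothetical; this is not the Nutch NGramProfile class or the local modification described above, only the same (buf, start, len) calling shape.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; not the Nutch NGramProfile class. */
class ChunkNGramSketch {
    /** Counts character n-grams of length n inside buf[start, start + len). */
    static Map<String, Integer> count(char[] buf, int start, int len, int n) {
        if (n <= 0 || start < 0 || len < 0 || len > buf.length - start) {
            throw new IllegalArgumentException("invalid chunk bounds or n-gram size");
        }
        Map<String, Integer> counts = new HashMap<>();
        int end = start + len;
        // Slide a window of width n over the chunk only, never over the whole buffer.
        for (int i = start; i + n <= end; i++) {
            counts.merge(new String(buf, i, n), 1, Integer::sum);
        }
        return counts;
    }
}
```

Because the caller chooses start and len, each section of a multi-language document can be profiled separately without copying the text.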

> * I encountered some identification problems with some specific sites (with
> blogger for instance), and I plan to investigate on this point.
> * Another pending task : the analysis (and coding) of multilingual querying
> support.
>  

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: lang identifier and nutch analyzer in trunk

Jérôme Charron
> +1. Other local modifications which I use frequently:
>
> * exporting a list of supported languages,
>
> * exporting an NGramProfile of the analyzed text,
>
> * allow processing of chunks of input (i.e.
> LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is
> very useful if the text to be analyzed is already present in memory, and
> the choice of sections (chunks) is made elsewhere, e.g. for documents
> with clearly outlined sections, or for multi-language documents.

Thanks for these interesting comments, Andrzej. I'll add them to my todo
list.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

Stefan Groschupf-2
In reply to this post by Andrzej Białecki-2
>> As I already suggested it on this list, I really would like to  
>> move the
>> LanguageIdentifier class (and profiles) to
>> an independant Lucene sub-project (and the MimeType repository too).
>> I don't remember why but there were some objections about this...
>>
>>
>
> I think most people agree that it would be worthwhile to un-tie  
> this component from Nutch internals. The only objections were  
> related not to the idea itself, but to the management aspects of  
> creating a full-blown sub-project, both wrt. to the initial setup  
> and the continuing maintenance. An alternative solution was  
> proposed (creating a contrib/ package). This would still help to  
> separate the code from Nutch internals, so that it can be used in  
> other projects, but it would require much less effort to set up and  
> maintain.

+1. What about the Lucene sandbox, or just opening a SourceForge project
with an Apache 2 license? Then we could just use the jar.

Stefan




Re: lang identifier and nutch analyzer in trunk

Otis Gospodnetic-2-2
I would like to decouple Lang Id from Nutch and move it into Lucene contrib/ in the near future.

Does that sound ok?

Otis


----- Original Message ----
From: Stefan Groschupf <[hidden email]>
To: [hidden email]
Sent: Mon 23 Jan 2006 02:55:46 PM EST
Subject: Re: lang identifier and nutch analyzer in trunk

>> As I already suggested it on this list, I really would like to  
>> move the
>> LanguageIdentifier class (and profiles) to
>> an independant Lucene sub-project (and the MimeType repository too).
>> I don't remember why but there were some objections about this...
>>
>>
>
> I think most people agree that it would be worthwhile to un-tie  
> this component from Nutch internals. The only objections were  
> related not to the idea itself, but to the management aspects of  
> creating a full-blown sub-project, both wrt. to the initial setup  
> and the continuing maintenance. An alternative solution was  
> proposed (creating a contrib/ package). This would still help to  
> separate the code from Nutch internals, so that it can be used in  
> other projects, but it would require much less effort to set up and  
> maintain.

+1, what's about lucene sandbox or jsut open a source forge project  
with Apache 2 license, than we can use just the jar.

Stefan







Re: lang identifier and nutch analyzer in trunk

Andrzej Białecki-2
[hidden email] wrote:
> I would like to decouple Lang Id from Nutch and move it in Lucene contrib/ in the near future.
>
> Does that sound ok?
>  

+1 from me.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: lang identifier and nutch analyzer in trunk

Jack.Tang
In reply to this post by Jérôme Charron
Hi

Is it reasonable to guess the language from the target server's geographical information?

/Jack

On 1/23/06, Jérôme Charron <[hidden email]> wrote:

> > Any plan to implement this ? I mean move LanguageIdentifier class
> > intto nutch core.
>
> As I already suggested it on this list, I really would like to move the
> LanguageIdentifier class (and profiles) to
> an independant Lucene sub-project (and the MimeType repository too).
> I don't remember why but there were some objections about this...
>
> Here is a short status of what I have in mind for next improvements with the
> LanguageIdentifier / MultiLanguage support :
> * Enhance LanguageIdentifier APIs by returning something like an ordered
> LangDetail[] array when guessing language (each LangDetail should contains
> the language code and its score) - I have a prototype version of this on my
> disk but I doesn't take time to finalize it
> * I encountered some identification problems with some specific sites (with
> blogger for instance), and I plan to investigate on this point.
> * Another pending task : the analysis (and coding) of multilingual querying
> support.
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: lang identifier and nutch analyzer in trunk

Jérôme Charron
In reply to this post by Andrzej Białecki-2
> > I would like to decouple Lang Id from Nutch and move it in Lucene
> contrib/ in the near future.
> > Does that sound ok?
> +1 from me.

+1 from me too
(if I can have commit access to the contrib code)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

Jérôme Charron
In reply to this post by Jack.Tang
> Is it reasonable to guess language info. from target servers geographical
> info.?

Yes, it could be another clue for guessing the language.
But the problem then is figuring out how to combine all these clues.

For instance, the current solution is the simplest one, but certainly not
the most effective:
For HTML documents, the HTMLLanguageParser scans the document looking for
possible indications of the content language:
1. html lang attribute
2. meta dc.language
3. meta http-equiv
The first one found is assumed to be the document's language.
If none is found, the statistical language identifier is used.
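
The lookup order described above can be sketched as a chain of checks that stops at the first hit. This is a simplified illustration, not the Nutch HTMLLanguageParser: it matches with regular expressions instead of parsing the HTML, expects the identifying attribute before content inside each meta tag, and assumes the http-equiv hint is Content-Language, which the message does not spell out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the lookup order; not the Nutch HTMLLanguageParser. */
class HtmlLangHintSketch {
    private static final Pattern HTML_LANG = Pattern.compile(
        "<html[^>]*\\slang\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_DC = Pattern.compile(
        "<meta[^>]*name\\s*=\\s*[\"']dc\\.language[\"'][^>]*content\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);
    // Assumption: the http-equiv hint is the Content-Language header equivalent.
    private static final Pattern META_HTTP_EQUIV = Pattern.compile(
        "<meta[^>]*http-equiv\\s*=\\s*[\"']content-language[\"'][^>]*content\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    /** Returns the first declared language found, or null when the page declares none. */
    static String declaredLanguage(String html) {
        // Order matters: html lang, then meta dc.language, then meta http-equiv.
        for (Pattern p : new Pattern[] { HTML_LANG, META_DC, META_HTTP_EQUIV }) {
            Matcher m = p.matcher(html);
            if (m.find()) return m.group(1).trim();
        }
        return null;
    }
}
```

A null result is the point where the statistical language identifier would take over.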

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

Andrzej Białecki-2
Jérôme Charron wrote:

>> Is it reasonable to guess language info. from target servers geographical
>> info.?
>>    
>
> Yes, it could be another clue to guess language.
> But the problem is then to find how to use all these indices.
>
> For instance, the actual solution is the easiest one, but certainly not the
> more efficient one:
> For HTML documents, the HTMLLanguageParser scans HTML documents looking at
> possible indications of content language:
> 1. html lang attribute
> 2. meta dc.language
> 3. meta http-equiv
> The first one found is assumed to be the document's language.
> Then if no language is found, the statistical language identifier is
> used....
>  

We're going back to the old discussion: most web pages out there either
don't have these tags at all or, even if they have them, they contain wrong
values. So I think this policy is not going to give the best results.

IMHO we should always try to guess the language if we have enough text,
unless we can be sure that we are dealing with properly marked documents
(not such an uncommon case on intranets).

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: lang identifier and nutch analyzer in trunk

Jérôme Charron
> We're going back to the old discussion - most web pages out there either
> don't have these tags at all, or even if they have it it contains wrong
> values ... so, I think this policy is not going to give the best results.

Yes, I know, Andrzej; I was just explaining to Jack how it currently works.


> IMHO we should always try to guess the language if we have enough text,
> unless we can be sure that we deal with properly marked documents (not
> such uncommon case in Intranets).

I think we should do something like what the MimeType detection does:
If metadata is found, check that its value is plausible given the
statistical score of that language.
If the score is too low, or no metadata is found, perform a full
statistical analysis.
No?
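
The policy proposed above can be sketched as follows. The Scorer interface, the decide method, and the minScore threshold are hypothetical names introduced for illustration; the thread does not say how the score would be computed or what threshold would be used.

```java
/** Hypothetical sketch of the "verify the metadata, else guess" policy. */
class LangDecisionSketch {
    /** Hypothetical statistical back end; a higher score means a better match (assumption). */
    interface Scorer {
        /** Score of the text against one specific language. */
        double score(String text, String lang);
        /** Best language according to a full statistical analysis. */
        String best(String text);
    }

    /**
     * Trusts the declared language only when its statistical score reaches minScore;
     * otherwise, or when nothing is declared, falls back to the full analysis.
     */
    static String decide(String declared, String text, Scorer scorer, double minScore) {
        if (declared != null && scorer.score(text, declared) >= minScore) {
            return declared;
        }
        return scorer.best(text);
    }
}
```

Checking a single declared language is cheaper than scoring every profile, so well-marked documents avoid the full analysis while mislabelled ones still get corrected.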

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

Andrzej Białecki-2
Jérôme Charron wrote:
>> We're going back to the old discussion - most web pages out there either
>> don't have these tags at all, or even if they have it it contains wrong
>> values ... so, I think this policy is not going to give the best results.
>>    
>
> Yes I know Andrzej, it was just to explain to Jack how it actually works
>
>  

Ok.

>> IMHO we should always try to guess the language if we have enough text,
>> unless we can be sure that we deal with properly marked documents (not
>> such uncommon case in Intranets).
>>    
>
> I think we should have something like in the MimeType detection:
> If a meta data is found, then checks that it is the correct value regarding
> the score of this language (statistical analyis).
> If the score is too low or no meta data is found, then we perform a full
> statistical analysis.
> No?
>  
Yes :-)


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com