Looking for more information about Lucene


Looking for more information about Lucene

BABAUD Alexandre

Good afternoon everyone,

 

I work for a French company and, as part of my job, I am collecting information on open source NLP tools available on the “market” worldwide.

I have been looking for this information online and in user comments, but I figured: why not contact the people directly involved?

I am especially interested in the following points:

 

·         Does the software handle French-language text in all of its features?

·         Exactly which file types can the software handle?

·         What about data storage? Is the data stored in-house? (I am very concerned about data privacy.)

·         Is it easily customizable?

·         Finally, exactly which Natural Language Processing features does Lucene provide?

 

If you prefer, we can discuss this by phone, should you have some time in the coming week.

 

I am looking forward to hearing from you soon and in the meantime I wish you a great day.

 

Best regards,

 

 


Alexandre BABAUD

Consultant Big Data

Sopra Steria
20 avenue de Pythagore - Le Galilée Bâtiment A
Domaine de Pelus
33700 Merignac - France
Phone: +33 (0)6 86 35 60 05
[hidden email] - www.soprasteria.com


    


 


Re: Looking for more information about Lucene

Adrien Grand
Hi Alexandre,

I don't have time for a call, but here are some pointers. Lucene provides the
following features related to natural language processing:
 - Word segmentation via the `Tokenizer` class. It is rather simple for
Western languages (including French, see `StandardTokenizer`), but less so for
e.g. Japanese or Korean, which we also support.
 - A few stemmers implemented as `TokenFilter`s, including for
French; see the `org.apache.lucene.analysis.fr` package.
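
To give a rough idea of what word segmentation means for a Western language, here is a minimal sketch. Note that it uses the JDK's `java.text.BreakIterator` as a stand-in, not Lucene's `Tokenizer`, so that it runs without any dependency. The class and method names (`WordSegmentationSketch`, `segment`) are made up for this illustration, and Lucene's own tokenizers may split some inputs differently.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: illustrates word segmentation with the JDK only (not Lucene).
final class WordSegmentationSketch {

    // Splits text into words using the JDK's locale-aware word BreakIterator.
    static List<String> segment(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> words = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String candidate = text.substring(start, end);
            // Keep only segments containing a letter or digit; this drops spaces and punctuation.
            if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(candidate);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [Bonjour, tout, le, monde]
        System.out.println(segment("Bonjour tout le monde !", Locale.FRENCH));
    }
}
```

In Lucene itself the equivalent step is done by a `Tokenizer`, and stemming is then applied on top of the resulting tokens by a `TokenFilter`.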

More answers inline below:


On Tue, May 22, 2018 at 5:33 PM, BABAUD Alexandre <
[hidden email]> wrote:

> ·         Exactly which file types can the software handle?
>

Lucene doesn't deal with file types directly: you need to pass it a
string or a stream of characters. If you have a text file, this is easy. If
you have PDF files, you will need a third-party library such as Apache Tika
to extract the content.
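
For the plain text case, a minimal sketch of getting a string or a character stream out of a file with the JDK alone could look like the following. The class and method names (`TextInputSketch`, `readAsString`, `openAsReader`) are hypothetical, and UTF-8 is an assumption about how the file is encoded. The Lucene side, i.e. what you then do with the string or `Reader`, is deliberately left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: turns a text file into the kinds of input described above.
final class TextInputSketch {

    // Reads a whole text file into a String, decoding it as UTF-8 (assumed encoding).
    static String readAsString(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Opens the file as a character stream; the caller is responsible for closing it.
    static Reader openAsReader(Path file) throws IOException {
        return Files.newBufferedReader(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Create a small temporary file so the example is self-contained.
        Path file = Files.createTempFile("sample", ".txt");
        try {
            Files.write(file, "Un petit texte.".getBytes(StandardCharsets.UTF_8));
            System.out.println(readAsString(file));
            try (Reader reader = openAsReader(file)) {
                // Reads the first character of the stream.
                System.out.println((char) reader.read());
            }
        } finally {
            Files.delete(file);
        }
    }
}
```

The `Reader` variant is the better fit for large files, since the content does not have to be held in memory all at once.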


> ·         What about data storage? Is the data stored in-house? (I am very
> concerned about data privacy.)
>
Not really relevant: it's up to you to decide where you store your data.

> ·         Is it easily customizable?
>
Since Lucene is a library, I guess the answer is yes.