Using Lucene for technical documentation


Using Lucene for technical documentation

Trevor Nicholls
Hello, I'd better begin by identifying myself as a newbie.

I am investigating using Lucene as a search tool for a library of technical documents, much of which consists of pieces of source code and discussion of the content.

The standard analyzer does an adequate job with normal text but strips out non-alpha characters in code fragments; the whitespace analyzer does an adequate job with source code, but at the expense of treating punctuation characters as significant text.

As a couple of trivial examples, the line "The !F1 key." ideally needs to be analyzed as [the] [!f1] [key]. The standard analyzer turns it into [the] [f1] [key], while the whitespace analyzer (which also preserves case) turns it into [The] [!F1] [key.].

Similarly, "the abort() function, or the stop() function." ideally needs to be analyzed as [the] [abort()] [function] [or] [the] [stop()] [function]. But no analyzer will retain the parentheses while discarding the comma and full stop.
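
For concreteness, here is a throwaway sketch of the kind of token dump I am describing (the field name "f" is just a placeholder):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenDump {
        static void dump(Analyzer analyzer, String text) throws IOException {
            // print each token the analyzer produces for the given text
            try (TokenStream ts = analyzer.tokenStream("f", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.print("[" + term + "] ");
                }
                ts.end();
                System.out.println();
            }
        }

        public static void main(String[] args) throws IOException {
            dump(new StandardAnalyzer(), "The !F1 key.");   // [the] [f1] [key]
            dump(new WhitespaceAnalyzer(), "The !F1 key."); // [The] [!F1] [key.]
        }
    }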

Are there examples of analyzers for technical documentation around, or any helpful pointers? Or am I barking up a rotten tree here?

cheers

T

Re: Using Lucene for technical documentation

Paul Libbrecht-7
Hello Trevor,

I don’t know of an analyzer for mixes of code and text, but I know of an analyzer for mixes of code and formulæ.

Clearly, you could build a custom analyzer that would tokenize differently depending on whether you’re in code or in text. That’s not super hard.

However, where things get complicated is the mixing, and that happens at the latest at query time: if you query `while`, you want to find matches for the real word and the stemmed word too. If you use Lucene for tasks other than search, however, this may be a problem (e.g. clustering, LSA…).

In the case of the formula-enabled search I built, the query modalities were different (two different input fields), so you knew how to transform the query (for math, span queries were used).

I suspect you should decide on this first: if you just want to search and query by a mix, then I’d recommend simply using different field names with a whitespace and a standard analyzer. Later on, the code-oriented field can, say, enrich code tokens with alternative names (e.g. use “loop” as a weaker alternative of the “for” token). Solr and Lucene can do this really well (eDismax provides an easy parametrisation).
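
A minimal sketch of that two-field idea (untested, and the field names here are invented):

    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class TwoFieldIndexing {
        public static void main(String[] args) throws Exception {
            // "body_text" gets standard analysis; "body_code" splits on
            // whitespace only, so punctuation survives in its tokens.
            Map<String, Analyzer> fields = new HashMap<>();
            fields.put("body_code", new WhitespaceAnalyzer());
            Analyzer perField =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer(), fields);

            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("index")),
                    new IndexWriterConfig(perField))) {
                String content = "the abort() function, or the stop() function.";
                Document doc = new Document();
                // the same content is indexed twice, once per analysis
                doc.add(new TextField("body_text", content, Field.Store.YES));
                doc.add(new TextField("body_code", content, Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }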

But I’d be happy to read about others’ work on this!

In the W3C Math working group at the time, work stopped once the complexity of compound documents was considered: the alternatives above (mix the words, or recognise the math pieces?) certainly made things difficult.

paul


PS: [paper for my math search here](https://hoplahup.net/paul_pubs/AccessRetrievalAM.html). Please ask for the source code; it is old and built on Lucene 3.5, so it would need quite an upgrade.


Re: Using Lucene for technical documentation

Erick Erickson
In reply to this post by Trevor Nicholls
You might be able to get something “good enough” with one of the pattern tokenizers, see: https://lucene.apache.org/solr/guide/8_6/tokenizers.html.

Won’t be 100% of course.
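
For instance, a rough sketch with Lucene’s PatternTokenizer, where the regex is only a first stab at your two examples:

    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.pattern.PatternTokenizer;

    public class CodeAwareAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // group 0 = every regex match becomes a token; the pattern keeps
            // a leading "!" and a trailing "()" but drops other punctuation:
            //   "The !F1 key."          -> [the] [!f1] [key]
            //   "the abort() function." -> [the] [abort()] [function]
            Tokenizer source = new PatternTokenizer(
                Pattern.compile("!?[A-Za-z0-9_]+(?:\\(\\))?"), 0);
            TokenStream sink = new LowerCaseFilter(source);
            return new TokenStreamComponents(source, sink);
        }
    }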

And Paul’s comments are well taken, especially since your input will be inconsistent, I’d guess. How much do you want to bet that the same document will have “the abort() function” in one paragraph and “the abort function” in the next, with abort italicized?

Best,
Erick

> On Nov 23, 2020, at 2:42 AM, Trevor Nicholls <[hidden email]> wrote:
>
> the abort() function

RE: Using Lucene for technical documentation

Trevor Nicholls
In reply to this post by Paul Libbrecht-7
Hi Paul

My apologies for not acknowledging your message earlier.

I had not thought of indexing the same content twice, with the whitespace analyzer as one field and the standard analyzer as the other, but that may be sufficient for our needs, at least to begin with. Then I can do a crude test of each search pattern to decide which field to query against.
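
Something like this is what I have in mind for the crude test (just a sketch, reusing the field names from your example):

    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    public class QueryRouter {
        // Anything that isn't letters, digits and spaces is treated as a
        // code search; otherwise the plain-text field is queried.
        public static Query route(String userInput) throws Exception {
            boolean looksLikeCode = userInput.matches(".*[^A-Za-z0-9\\s].*");
            String field = looksLikeCode ? "body_code" : "body_text";
            QueryParser parser = new QueryParser(field,
                looksLikeCode ? new WhitespaceAnalyzer() : new StandardAnalyzer());
            return parser.parse(QueryParser.escape(userInput));
        }
    }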

When I get stuck I will be back!

Cheers
T
