Hello Trevor,
I don’t know of an analyzer for mixes of code and text, but I do know of
an analyzer for mixes of code and formulæ.
Clearly, you could build a custom analyzer that tokenizes differently
depending on whether you’re in code or in text. That’s not super hard.
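To give a flavour of what such a custom chain could look like, here is a
minimal sketch (assuming a recent Lucene; the class name and the punctuation
pattern are mine, not anything shipped with Lucene) that keeps code-ish tokens
such as `!f1` or `abort()` intact while still dropping trailing sentence
punctuation:

```java
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;

// Sketch only: whitespace tokens, lowercased, with trailing sentence
// punctuation stripped, so that "key." becomes "key" but "!f1" and
// "abort()" survive as single tokens.
public final class TechDocAnalyzer extends Analyzer {
  private static final Pattern TRAILING_PUNCT = Pattern.compile("[.,;:]+$");

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream result = new LowerCaseFilter(source);
    result = new PatternReplaceFilter(result, TRAILING_PUNCT, "", true);
    return new TokenStreamComponents(source, result);
  }
}
```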
However, where things get complicated is the mixing, and that surfaces at
query time at the latest: if you query `while`, you want to find matches
for the literal code token and for the stemmed word too. If you use Lucene
for tasks other than search, this may also be a problem (e.g. clustering,
LSA…).
In the case of the formula-enabled search I built, the query modalities
were different (two separate input fields), so you knew how to transform
the query (for math, span queries were used).
I suspect you should decide on this first: if you just want to search and
query across the mix, then I’d recommend simply using different field
names, one with a whitespace analyzer and one with a standard analyzer.
Later on, the code-oriented field can, say, enrich code tokens with
alternative names (e.g. use “loop” as a weaker alternative of the “for”
token). Solr and Lucene can do this really well (eDismax provides an easy
parametrisation).
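As a rough illustration of the two-field idea (the field names `prose` and
`code` are just my placeholders), a PerFieldAnalyzerWrapper lets you index the
same document into both fields with different analyzers:

```java
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

public class TwoFieldSetup {
  public static IndexWriterConfig config() {
    // The default (e.g. the "prose" field) is normalised by the
    // StandardAnalyzer; the "code" field keeps its tokens verbatim
    // through the WhitespaceAnalyzer.
    Analyzer perField = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(),
        Map.of("code", new WhitespaceAnalyzer()));
    return new IndexWriterConfig(perField);
  }
}
```

On the Solr side, eDismax can then spread a single query box over both fields,
e.g. `defType=edismax&qf=prose code^2` (the boost is only an example), and a
synonym filter on the code field is one way to add “loop” as a weaker
alternative for “for”.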
But I’d be happy to read about others’ work on this!
In the W3C Math working group at the time, work stopped once the complexity
of compound documents was considered: alternatives like the above (mix
words, or recognise math pieces?) certainly made things difficult.
paul
PS: [paper for my math search here](https://hoplahup.net/paul_pubs/AccessRetrievalAM.html).
Please ask for the source code; it is old and built on Lucene 3.5, so it
would need quite an upgrade.
On 23 Nov 2020, at 8:42, Trevor Nicholls wrote:
> Hello, I'd better begin by identifying myself as a newbie.
>
> I am investigating using Lucene as a search tool for a library of technical
> documents, much of which consists of pieces of source code and discussion of
> the content.
>
> The standard analyzer does an adequate job with normal text but strips out
> non-alpha characters in code fragments; the whitespace analyzer does an
> adequate job with source code but at the expense of treating punctuation
> characters as significant text.
>
> As a couple of trivial examples, the line "The !F1 key." ideally needs to be
> analyzed as [the] [!f1] [key]. The standard analyzer turns it into [the]
> [f1] [key] while the Whitespace analyzer turns it into [the] [!f1] [key.].
>
> Similarly "the abort() function, or the stop() function." ideally
> needs to
> be analyzed as [the] [abort()] [function] [or] [the] [stop()]
> [function].
> But no analyzer will retain the parentheses while discarding the comma
> and
> full stop.
>
> Are there examples of analyzers for technical documentation around, or any
> helpful pointers? Or am I barking up a rotten tree here?
>
> cheers
>
> T