[More Like This] I would like to contribute

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

[More Like This] I would like to contribute

Alessandro Benedetti
Hi gents,
I have spent some time in the last year or so working on the Lucene More Like This ( and related Solr components ) .

Initially I just wanted to improve it, adding BM25[1] but then I noted a lot of areas of possible improvements.

I started then with a refactor of the functionality with these objectives in mind :

1) make the MLT more readable
2) make the MLT more modular and easy to extend
3) make the MLT more tested

This is just a start, I want to invest significant time with my company to work on the functionality and contribute it back.

I split my effort in small Pull Requests to make it easy a review and possible contribution.

Unfortunately I didn't get much feedback so far.
The More Like This functionality seems mostly abandoned.
I tried also to contact one of the last committers that apparently got involved in the developments ( Mark Harwood [hidden email] ), but I had no luck.

This is the current Jira Issue that start with a first small refactor + tests :


I would love to contribute it and much more, but I need some feedback and review ( unfortunately I am not a committer yet).

Let me know what can I do to speed up the process from my side.

Regards


--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
Reply | Threaded
Open this post in threaded view
|

Re: [More Like This] I would like to contribute

Robert Muir
For proposed api, behavior changes or refactoring to these classes, I really recommend using LUCENE issues for those instead of SOLR ones. Otherwise they can get missed.

As far as feedback, personally I tried to give it on LUCENE-7498 a year ago but wasn't sure what happened as further comments dropped off. As I mentioned there, I definitely think changing the algorithm to MoreLikeThis is a big deal and really shouldn't be mixed in with refactorings or api changes: it makes for too much to worry about at once. Just changing the algorithm is a big deal: since this class supports blind relevance feedback it means we can do some rough measurements with relevance tests before doing that. As I have personally not seen the BM25 algorithm used for these purposes anywhere, that's why I was concerned/curious about performance.

On Mon, May 21, 2018 at 7:23 AM, Alessandro Benedetti <[hidden email]> wrote:
Hi gents,
I have spent some time in the last year or so working on the Lucene More Like This ( and related Solr components ) .

Initially I just wanted to improve it, adding BM25[1] but then I noted a lot of areas of possible improvements.

I started then with a refactor of the functionality with these objectives in mind :

1) make the MLT more readable
2) make the MLT more modular and easy to extend
3) make the MLT more tested

This is just a start, I want to invest significant time with my company to work on the functionality and contribute it back.

I split my effort in small Pull Requests to make it easy a review and possible contribution.

Unfortunately I didn't get much feedback so far.
The More Like This functionality seems mostly abandoned.
I tried also to contact one of the last committers that apparently got involved in the developments ( Mark Harwood [hidden email] ), but I had no luck.

This is the current Jira Issue that start with a first small refactor + tests :


I would love to contribute it and much more, but I need some feedback and review ( unfortunately I am not a committer yet).

Let me know what can I do to speed up the process from my side.

Regards


--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director

Reply | Threaded
Open this post in threaded view
|

Re: [More Like This] I would like to contribute

Alessandro Benedetti
Hi Robert,
thanks for the feedback.
I read your comment last year and you I agreed completely.
So I started step by step, with the refactor first.

The first contribution is isolating a part of the refactor, so no functional change in the algorithms nor a complete refactor in place.
I basically tried to decompose the refactor is unitary pull requests as small as possible.
It just focused on the MLT parameters first to reduce the size of the original MoreLikeThis class ( and relegating the parameter modelling responsibility to a separate class)


The reason I used SOLR is because the refactor affects some Solr components using the MLT.
But I agree with you, it can (should) be moved to LUCENE ( I tried via JIRA but I don't think I have the right permissions).

Should I just create a new JIRA issue completely ( closing the SOLR one) or some JIRA admin can directly move the Jira to the LUCENE project ?

Thank you again for your support,

Regards

--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director

On Tue, May 22, 2018 at 10:18 AM, Robert Muir <[hidden email]> wrote:
For proposed api, behavior changes or refactoring to these classes, I really recommend using LUCENE issues for those instead of SOLR ones. Otherwise they can get missed.

As far as feedback, personally I tried to give it on LUCENE-7498 a year ago but wasn't sure what happened as further comments dropped off. As I mentioned there, I definitely think changing the algorithm to MoreLikeThis is a big deal and really shouldn't be mixed in with refactorings or api changes: it makes for too much to worry about at once. Just changing the algorithm is a big deal: since this class supports blind relevance feedback it means we can do some rough measurements with relevance tests before doing that. As I have personally not seen the BM25 algorithm used for these purposes anywhere, that's why I was concerned/curious about performance.

On Mon, May 21, 2018 at 7:23 AM, Alessandro Benedetti <[hidden email]> wrote:
Hi gents,
I have spent some time in the last year or so working on the Lucene More Like This ( and related Solr components ) .

Initially I just wanted to improve it, adding BM25[1] but then I noted a lot of areas of possible improvements.

I started then with a refactor of the functionality with these objectives in mind :

1) make the MLT more readable
2) make the MLT more modular and easy to extend
3) make the MLT more tested

This is just a start, I want to invest significant time with my company to work on the functionality and contribute it back.

I split my effort in small Pull Requests to make it easy a review and possible contribution.

Unfortunately I didn't get much feedback so far.
The More Like This functionality seems mostly abandoned.
I tried also to contact one of the last committers that apparently got involved in the developments ( Mark Harwood [hidden email] ), but I had no luck.

This is the current Jira Issue that start with a first small refactor + tests :


I would love to contribute it and much more, but I need some feedback and review ( unfortunately I am not a committer yet).

Let me know what can I do to speed up the process from my side.

Regards


--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director


---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
Reply | Threaded
Open this post in threaded view
|

Re: [More Like This] I would like to contribute

Robert Muir


On Tue, May 22, 2018 at 7:03 AM, Alessandro Benedetti <[hidden email]> wrote:
Hi Robert,
thanks for the feedback.
I read your comment last year and you I agreed completely.
So I started step by step, with the refactor first.

The first contribution is isolating a part of the refactor, so no functional change in the algorithms nor a complete refactor in place.
I basically tried to decompose the refactor is unitary pull requests as small as possible.
It just focused on the MLT parameters first to reduce the size of the original MoreLikeThis class ( and relegating the parameter modelling responsibility to a separate class)


The reason I used SOLR is because the refactor affects some Solr components using the MLT.
But I agree with you, it can (should) be moved to LUCENE ( I tried via JIRA but I don't think I have the right permissions).

Should I just create a new JIRA issue completely ( closing the SOLR one) or some JIRA admin can directly move the Jira to the LUCENE project ?



Reply | Threaded
Open this post in threaded view
|

Re: [More Like This] I would like to contribute

Alessandro Benedetti
Thank you very much!
i will update the patch name and open a Pull Request with the correct Jira issue key!
Thank you again, this is definitely a start!

Cheers

--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director

On Tue, May 22, 2018 at 1:29 PM, Robert Muir <[hidden email]> wrote:


On Tue, May 22, 2018 at 7:03 AM, Alessandro Benedetti <[hidden email]> wrote:
Hi Robert,
thanks for the feedback.
I read your comment last year and you I agreed completely.
So I started step by step, with the refactor first.

The first contribution is isolating a part of the refactor, so no functional change in the algorithms nor a complete refactor in place.
I basically tried to decompose the refactor is unitary pull requests as small as possible.
It just focused on the MLT parameters first to reduce the size of the original MoreLikeThis class ( and relegating the parameter modelling responsibility to a separate class)


The reason I used SOLR is because the refactor affects some Solr components using the MLT.
But I agree with you, it can (should) be moved to LUCENE ( I tried via JIRA but I don't think I have the right permissions).

Should I just create a new JIRA issue completely ( closing the SOLR one) or some JIRA admin can directly move the Jira to the LUCENE project ?




---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
Reply | Threaded
Open this post in threaded view
|

Re: [More Like This] I would like to contribute

Alessandro Benedetti
Any update on this from the community?
I would like to move forward with the contribution.
The first part is involving just the More Like This parameters, but the rest will involve the interesting terms retrieval and quite a big piece of testing.
No hard position on this, I was thinking that extracting the parameters was a good idea to improve the readability ( I noticed a similar approach in Solr for the edismax for example :
DisMaxParams, ExtendedDismaxConfiguration ... ) but if the community prefers to keep them there, it's ok and I can move to the next steps.

I attach here my considerations in response to Robert ones :
"First of all thanks for your feedback, much appreciated.

My initial goal was to make More Like This original 1000 lines class:

  • More Readable
  • Easier to Maintain and Extend
  • More Testable

So I started identifying the different responsibilities in the More Like This class, to separate them, in the way that if I just need to change the scoring algorithm for the interesting terms I just touch the TermScorer ect ect :
You see the overall idea in these slides : https://www.slideshare.net/AlessandroBenedetti/advanced-document-similarity-with-apache-lucene

I tried to modify as less as possible the logic and tests at this stage.

So let me wrap my considerations under different topics :

Parameters Abstraction
I see your point for just additional indirection/abstraction ( it is exactly just that with readability in mind).
My scope for this was :
"We have 600 lines of code of default and parameters for the MLT. How to make them isolated, more readable and extendable ?"
And I initially thought to just put them in a separate class to remove that responsibility from the original MLT .
So the focus was exclusively better readability and easy maintenance at this stage.
Can you elaborate why you think this is undesired ?
I don't have any strong feeling regarding this bit, so I am open to suggestions with the forementioned objective in mind.

MLT Immutable
I didn't consider it , but I am completely open to do that.
In such case it could be worth to add a MoreLikeThis factory that manages the parameters and create the immutable MLT object ?

MoreLikeThisQuery
It was not in the scope of this refactor but I am absolutely happy to tackle that immediately as a next step, it could give it a try to see if there is space for improvement there.

Solr Tests

I completely agree, indeed as part of my additional tests which I have ready for the sequent refactors I introducedmuch more tests Lucene side than Solr side.
We can elaborate this further at the right moment "


Regards

--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director

On Tue, May 22, 2018 at 1:32 PM, Alessandro Benedetti <[hidden email]> wrote:
Thank you very much!
i will update the patch name and open a Pull Request with the correct Jira issue key!
Thank you again, this is definitely a start!

Cheers

--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director

On Tue, May 22, 2018 at 1:29 PM, Robert Muir <[hidden email]> wrote:


On Tue, May 22, 2018 at 7:03 AM, Alessandro Benedetti <[hidden email]> wrote:
Hi Robert,
thanks for the feedback.
I read your comment last year and you I agreed completely.
So I started step by step, with the refactor first.

The first contribution is isolating a part of the refactor, so no functional change in the algorithms nor a complete refactor in place.
I basically tried to decompose the refactor is unitary pull requests as small as possible.
It just focused on the MLT parameters first to reduce the size of the original MoreLikeThis class ( and relegating the parameter modelling responsibility to a separate class)


The reason I used SOLR is because the refactor affects some Solr components using the MLT.
But I agree with you, it can (should) be moved to LUCENE ( I tried via JIRA but I don't think I have the right permissions).

Should I just create a new JIRA issue completely ( closing the SOLR one) or some JIRA admin can directly move the Jira to the LUCENE project ?





---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io