Announcement: Lucene powering Monster job search index (Beta)

Peter Keegan
I am pleased to announce the launch of Monster's new job search Beta web
site, powered by Lucene, at: http://jobsearch.beta.monster.com (notice the
Lucene logo at the bottom of the page!).

The jobs index is implemented with Java Lucene 2.0 on 64-bit Windows (AMD
and Intel processors).

Here are some of the new features:

1. 'Improve your search by'...

The job search results page allows you to browse and 'drill down' through
the results by job category, status, type and salary. The number of matching
jobs in each facet is displayed. There will likely be many more facets to
browse by in the future.

This feature is currently implemented with a custom HitCollector and the
DocSet class from Solr.

2. 'More like this'

Find more jobs like the one you see by clicking on the 'MORE LIKE THIS'
link, which is visible when you hover the mouse over the job title.

This feature is implemented with Lucene's term vectors and the
'MoreLikeThis' contribution class. If you are in 'detailed view', the term
vectors from the job description are used. In 'brief' view, the job title's
term vectors are used.

3. 'Related Titles'

When you do a 'keywords' search, click on a 'related titles' link to filter
your search by similar job titles.

This feature is implemented via a separate Lucene.Net index.

4. Sort by 'Miles'

Find jobs close to you via zip code/radius search. In the search results
page, click on the 'Miles' column to sort the results by distance from your
zip code/radius.

This custom sorting feature is implemented via Lucene's
'SortComparatorSource' interface.

5. Search by date, salary, distance.

Find jobs posted in the last day (or 2, 3, etc.) or by salary range or
distance.

Numeric range search is one of Lucene's weak points (performance-wise), so we
have implemented this with a custom HitCollector and an extension to the
Lucene index files that stores the numeric field values for all documents.

It is important to point out that this has all been implemented with the
stock Lucene 2.0 library. No code changes were made to the Lucene core.

If you have any feedback regarding the UI, please use the link on the web
page ("send us your feedback"). You can hit me with any other
questions/comments.

Peter

Re: Announcement: Lucene powering Monster job search index (Beta)

chrislusf
Hi, Peter,

Really great job!

I am interested to know how you implemented "4. Sort by 'Miles'". For
example, if starting from a zip code, how to match items within 20
miles?

--
Chris Lu
-------------------------
Instant Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com


Re: Announcement: Lucene powering Monster job search index (Beta)

Peter Keegan
On 10/27/06, Chris Lu <[hidden email]> wrote:
>
> Hi, Peter,
>
> Really great job!


Thanks. (I'll tell the team)

> I am interested to know how you implemented "4. Sort by 'Miles'". For
> example, if starting from a zip code, how to match items within 20
> miles?


I can tell you how we use Lucene to accomplish this.
At indexing time, each job's location is indexed as a special field. How you
represent the location is up to you. Each time a new index is built, the
location data for all documents in the index is fetched via TermEnum and
TermDocs and saved. This is practical because the searcher refresh is done at
predictable times. At query time, a custom SortComparatorSource is created
using the 'reference' location (the zip/radius). The 'compare' method
calculates the distance of each of the 2 documents from the reference
location, using the location values saved above, and compares the results.
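
The approach described above can be sketched in plain Java, without any Lucene classes. This is only an illustration under stated assumptions: the `lat` and `lon` arrays stand in for the per-document location values loaded at searcher refresh, the class and method names are hypothetical, and the haversine formula is an assumed choice of distance calculation (the post does not say which one is used). In Lucene 2.0 the comparison would presumably live in the comparator returned by the custom SortComparatorSource.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch: sort doc ids by distance from a per-query reference location. */
final class DistanceSort {
    static final double EARTH_RADIUS_MILES = 3958.8;

    /** Great-circle (haversine) distance in miles; inputs are in degrees. */
    static double miles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }

    /** Comparator over doc ids; lat/lon are indexed by doc id (assumed cached at refresh). */
    static Comparator<Integer> byDistance(double[] lat, double[] lon,
                                          double refLat, double refLon) {
        return Comparator.comparingDouble(
                doc -> miles(refLat, refLon, lat[doc], lon[doc]));
    }

    /** Returns a copy of the doc ids ordered nearest-first. */
    static Integer[] sortDocs(Integer[] docs, double[] lat, double[] lon,
                              double refLat, double refLon) {
        Integer[] sorted = docs.clone();
        Arrays.sort(sorted, byDistance(lat, lon, refLat, refLon));
        return sorted;
    }
}
```

Because the location arrays are built once per index refresh, each query only pays for the distance arithmetic on the documents actually being compared.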

I believe this can also be accomplished with Solr's FunctionQuery, but I
haven't tried that yet.

Peter


Re: Announcement: Lucene powering Monster job search index (Beta)

Otis Gospodnetic-2
Hi,

--- Peter Keegan <[hidden email]> wrote:

> On 10/27/06, Chris Lu <[hidden email]> wrote:
> >
> > Hi, Peter,
> >
> > Really great job!
>
>
> Thanks. (I'll tell the team)

If it's not a secret, can you tell us a bit more about what's behind
the search in terms of hardware, and how much pounding that hardware
takes in terms of QPS?  People always ask about this stuff.

Thanks,
Otis



Re: Announcement: Lucene powering Monster job search index (Beta)

Alex Popescu
Peter, it looks impressive. Congrats! One small suggestion, though: after
performing a search, the filtering criteria are not displayed anywhere.
I guess it would make sense to show them in read-only form somewhere
on the results pages, for example by changing

"Jobs 1-50 of 7896 matches" to "Jobs 1-50 of 7896 matching criteria" (with a
small hidden element showing the criteria).

./alex
--
.w( the_mindstorm )p.



Re: Announcement: Lucene powering Monster job search index (Beta)

Peter Keegan
In reply to this post by Otis Gospodnetic-2
Otis,

The Lucene components for this beta are running on 4 dual-core AMD Opteron
(2.6 GHz) processors, for a total of 8 CPUs. The machine has 32 GB RAM,
although 16 GB would probably suffice. The query rate is currently quite low,
probably because of the low visibility of the beta page. We haven't measured
QPS rates for this configuration yet, but if you look at some of my previous
posts, you'll see some QPS data on somewhat similar hardware. I think that
actual rates will be lower, though, because the complexity of the queries,
counting, sorting, etc. has increased.

Peter


Re: Announcement: Lucene powering Monster job search index (Beta)

Peter Keegan
In reply to this post by Alex Popescu
Alex,

I like your suggestion (I've found myself wondering what the last search
was, too), and I've forwarded it to the UI developer.

Thanks,
Peter



Re: Announcement: Lucene powering Monster job search index (Beta)

Joe Shaw
In reply to this post by Peter Keegan
Hi Peter,

On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote:
> Numeric range search is one of Lucene's weak points (performance-wise) so we
> have implemented this with a custom HitCollector and an extension to the
> Lucene index files that stores the numeric field values for all documents.
>
> It is important to point out that this has all been implemented with the
> stock Lucene 2.0 library. No code changes were made to the Lucene core.

Can you give some technical details on the extension to the Lucene index
files?  How did you do it without making any changes to the Lucene core?

Thanks,
Joe



Re: Announcement: Lucene powering Monster job search index (Beta)

KEGan
In reply to this post by Peter Keegan
Peter,

Congratulations on the beta launch :)

If you don't mind, I would like to ask you more about the feature "4. Sort by
Miles".

When you sort by miles, I suppose the sorting by relevance (of the search
keywords) is lost, since this is implemented using a custom
SortComparatorSource?

Also, I suppose that if FunctionQuery were used, we could make "job distance
in miles" part of the relevancy of the search results?

Could you comment on or confirm my assumptions? Thanks :)



Re: Announcement: Lucene powering Monster job search index (Beta)

Peter Keegan
In reply to this post by Joe Shaw
Joe,

Fields with numeric values are stored in a separate file as binary values in
an internal format. Lucene is unaware of this file and unaware of the range
expression in the query. The range expression is parsed outside of Lucene
and used in a custom HitCollector to filter out documents that aren't in the
requested range(s). A goal was to do this without having to modify Lucene.
Our scheme is pretty efficient, but not very general purpose in its current
form.
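
As a rough illustration of the idea, here is a minimal plain-Java sketch that does not use Lucene itself. The `collect(int doc, float score)` method mirrors the shape of the callback in Lucene's HitCollector. Everything else is an assumption made for the example: the class name, the `valueByDoc` array standing in for the separate binary file (assumed to be loaded into memory and indexed by doc id), and the inclusive `min`/`max` bounds standing in for the range expression parsed outside Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: drop hits whose side-stored numeric value is out of range. */
final class RangeFilteringCollector {
    private final long[] valueByDoc;  // assumed: numeric field values, indexed by doc id
    private final long min;           // inclusive lower bound, parsed outside the engine
    private final long max;           // inclusive upper bound
    private final List<Integer> accepted = new ArrayList<>();

    RangeFilteringCollector(long[] valueByDoc, long min, long max) {
        this.valueByDoc = valueByDoc;
        this.min = min;
        this.max = max;
    }

    /** Called once per document matching the text query; score is unused here. */
    void collect(int doc, float score) {
        long value = valueByDoc[doc];
        if (value >= min && value <= max) {
            accepted.add(doc);
        }
    }

    /** Doc ids that matched the query and fall inside the requested range. */
    List<Integer> accepted() {
        return accepted;
    }
}
```

The range check is a single array lookup per hit, so its cost depends on the number of matching documents and not on how many distinct numeric values the index contains.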

Peter



Re: Announcement: Lucene powering Monster job search index (Beta)

Peter Keegan
In reply to this post by KEGan
KEGan,

>When you search by "4. Sort by Miles", I suppose the sorting by relevance
>(of the search keyword) is lost? Since this is implemented using a custom
>SortComparatorSource.

Miles becomes the primary sort key; score and date become secondary sort
keys (used in the case of ties).
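
A multi-key ordering of this kind can be sketched in plain Java as a chained comparator. The `Hit` class and its field names are hypothetical, and the tie-break directions (higher score first, then more recent date first) are assumptions, since the post does not state them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical search hit carrying the three sort keys. */
final class Hit {
    final int doc;
    final double miles;
    final float score;
    final long postedDay;  // assumed: larger value means more recent

    Hit(int doc, double miles, float score, long postedDay) {
        this.doc = doc;
        this.miles = miles;
        this.score = score;
        this.postedDay = postedDay;
    }
}

final class HitOrdering {
    /** Nearest first; ties broken by higher score, then by more recent date. */
    static final Comparator<Hit> MILES_THEN_SCORE_THEN_DATE =
            Comparator.comparingDouble((Hit h) -> h.miles)
                    .thenComparing(Comparator.comparingDouble((Hit h) -> h.score).reversed())
                    .thenComparing(Comparator.comparingLong((Hit h) -> h.postedDay).reversed());

    /** Returns the doc ids of the hits in sorted order, leaving the input untouched. */
    static List<Integer> sortedDocs(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(MILES_THEN_SCORE_THEN_DATE);
        List<Integer> docs = new ArrayList<>();
        for (Hit h : copy) {
            docs.add(h.doc);
        }
        return docs;
    }
}
```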

>Also, I suppose, if FunctionQuery were used, we can make "job distance by
>miles" part of the relavancy of the search results?

Yes, this is my understanding of the power of FunctionQuery.

Peter


Re: Announcement: Lucene powering Monster job search index (Beta)

sri-2
In reply to this post by Peter Keegan
Hi Peter

When I use a custom HitCollector, it affects the application's performance.
How do you accomplish the grouping of the results without affecting
performance? Also, if possible, could you give a code snippet for a custom
HitCollector?

TIA

Sri

Re: Announcement: Lucene powering Monster job search index (Beta)

Daniel Rosher-2
In reply to this post by Peter Keegan
Hi Peter,

Does this mean you are calculating the Euclidean distance twice: once for
the HitCollector to filter 'out of range' documents, and then again for the
custom Comparator to sort the returned documents? I ask especially since the
filtering is done outside Lucene.

Regards,
Dan

Re: Announcement: Lucene powering Monster job search index (Beta)

Peter Keegan
In reply to this post by sri-2
Paramasivam,

Take a look at Solr, in particular the DocSetHitCollector class. The
collector simply sets a bit in a BitSet, or saves the docIds in an array
(for low hit counts). Solr's BitSet was optimized (by Yonik, I believe) to
be faster than Java's BitSet, so this HitCollector is very fast. This is
essentially what we are doing for counting.
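
For readers who want something concrete, here is a minimal plain-Java sketch of the bit-setting idea. It is not Solr's DocSetHitCollector: it uses the standard `java.util.BitSet` rather than Solr's optimized bit set, it omits the array variant for low hit counts, and the class name is hypothetical. The `facetCount` helper, which intersects the hits with a precomputed set of documents carrying a given facet value, is an assumed way of turning the collected bits into per-facet counts.

```java
import java.util.BitSet;

/** Hypothetical sketch: record each hit as one bit, then count facets by intersection. */
final class BitSetCollector {
    private final BitSet bits;

    BitSetCollector(int maxDoc) {
        this.bits = new BitSet(maxDoc);
    }

    /** Called once per matching document; only the doc id is recorded. */
    void collect(int doc, float score) {
        bits.set(doc);
    }

    BitSet hits() {
        return bits;
    }

    /** Number of hits that also carry the facet value (assumed precomputed per value). */
    static int facetCount(BitSet hits, BitSet docsWithFacetValue) {
        BitSet both = (BitSet) hits.clone();  // clone so the hit set is not modified
        both.and(docsWithFacetValue);
        return both.cardinality();
    }
}
```

The per-hit work is a single bit write, which is why a collector of this shape adds little overhead to the search itself.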

Peter

On 11/2/06, Paramasivam Srinivasan <[hidden email]> wrote:

>
> Hi Peter
>
> When I use the CustomHitCollector, it affects the application's performance.
> Also, how do you accomplish the grouping of results without affecting
> performance? If possible, please give a code snippet for a custom
> HitCollector.
>
> TIA
>
> Sri

Re: Announcement: Lucene powering Monster job search index (Beta)

Peter Keegan
In reply to this post by Daniel Rosher-2
Daniel,
Yes, this is correct if you happen to be doing a radius search and sorting
by mileage.
Peter

On 11/3/06, Daniel Rosher <[hidden email]> wrote:

>
> Hi Peter,
>
> Does this mean you are calculating the Euclidean distance twice ... once for
> the HitCollector to filter 'out of range' documents, and then again for the
> custom Comparator to sort the returned documents? Especially since the
> filtering is done outside Lucene?
>
> Regards,
> Dan
>

Re: Announcement: Lucene powering Monster job search index (Beta)

dr fence
Isn't it extremely inefficient to compute the Euclidean distance twice?
Perhaps it's not a huge deal for a small result set, but at times I have
13,000 results that match my search terms in an index with 1.2 million docs.

Can't you do some simple radian math first to rule out documents that are way
out of bounds, then compute the Euclidean distance only for the subset within
bounds?  I'm currently doing the distance calc only once (post hit collector).
I don't have any performance numbers comparing the double vs. single distance
calc.
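
One way to avoid the double computation, sketched under assumptions: compute
the distance once per hit, keep it next to the doc id, and reuse it for both
the radius check and the sort. The class and method names, the flat-plane
approximation, and the constant of 69 miles per degree are illustrative and
not taken from either implementation discussed in this thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DistanceOnceSketch {
    static final double MILES_PER_DEGREE = 69.0; // rough average, an assumption

    /** A hit paired with its distance, so the distance is computed exactly once. */
    static final class Hit {
        final int doc;
        final double miles;

        Hit(int doc, double miles) {
            this.doc = doc;
            this.miles = miles;
        }
    }

    /** Flat-plane (Euclidean) approximation of the distance in miles. */
    static double miles(double lat1, double lon1, double lat2, double lon2) {
        double dy = (lat2 - lat1) * MILES_PER_DEGREE;
        double dx = (lon2 - lon1) * MILES_PER_DEGREE * Math.cos(Math.toRadians(lat1));
        return Math.sqrt(dx * dx + dy * dy);
    }

    /** lat and lon are indexed by doc id; docs holds the ids gathered by the hit collector. */
    static List<Hit> withinRadiusSorted(int[] docs, double[] lat, double[] lon,
                                        double cLat, double cLon, double radiusMiles) {
        List<Hit> kept = new ArrayList<>();
        for (int doc : docs) {
            double d = miles(cLat, cLon, lat[doc], lon[doc]);
            if (d <= radiusMiles) {
                kept.add(new Hit(doc, d)); // cache the distance for the sort below
            }
        }
        kept.sort(Comparator.comparingDouble(h -> h.miles));
        return kept;
    }
}
```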

I'm still working out the sort by radius myself.

Mark

On 11/3/06, Peter Keegan <[hidden email]> wrote:
>
> Daniel,
> Yes, this is correct if you happen to be doing a radius search and sorting
> by mileage.
> Peter
>
>

Re: Announcement: Lucene powering Monster job search index (Beta)

Peter Keegan
Correction:
We only do the Euclidean distance computation during sorting. For filtering, a
simple bounding box is computed to approximate the radius, and 2 range
comparisons are made to exclude documents. Because these comparisons are done
outside of Lucene as integer comparisons, they are pretty fast. With 13,000
results, the search time with distance sort is about 200 ms (compared to 30 ms
for a simple non-radius, date-sorted keyword search).
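
As a rough sketch of what such a bounding-box filter could look like (the
actual storage format and scaling are not described in this thread): the
per-document int arrays, the microdegree scaling, the class name, and the
constant of 69 miles per degree are all assumptions. The two range comparisons
are one on latitude and one on longitude.

```java
class BoundingBoxFilterSketch {
    static final double MILES_PER_DEGREE = 69.0; // rough average, an assumption

    private final int[] latE6; // latitude per doc id, in microdegrees (hypothetical format)
    private final int[] lonE6; // longitude per doc id, in microdegrees (hypothetical format)
    private final int minLat, maxLat, minLon, maxLon;

    BoundingBoxFilterSketch(int[] latE6, int[] lonE6,
                            double centerLat, double centerLon, double radiusMiles) {
        this.latE6 = latE6;
        this.lonE6 = lonE6;
        // Half-width of the box in degrees; longitude degrees shrink with latitude.
        double dLat = radiusMiles / MILES_PER_DEGREE;
        double dLon = radiusMiles / (MILES_PER_DEGREE * Math.cos(Math.toRadians(centerLat)));
        minLat = toE6(centerLat - dLat);
        maxLat = toE6(centerLat + dLat);
        minLon = toE6(centerLon - dLon);
        maxLon = toE6(centerLon + dLon);
    }

    static int toE6(double degrees) {
        return (int) Math.round(degrees * 1_000_000.0);
    }

    /** Two integer range comparisons per document; no trigonometry or square roots. */
    boolean accept(int doc) {
        int lat = latE6[doc];
        int lon = lonE6[doc];
        return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
    }
}
```

A collector would call accept(doc) for each hit and drop the ones that fail.
The sketch ignores the poles and the wrap-around at 180 degrees of longitude.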

Peter

On 1/27/07, no spam <[hidden email]> wrote:

>
> Isn't it extremely inefficient to compute the Euclidean distance twice?
> Perhaps it's not a huge deal for a small result set, but at times I have
> 13,000 results that match my search terms in an index with 1.2 million docs.
>
> Can't you do some simple radian math first to rule out documents that are way
> out of bounds, then compute the Euclidean distance only for the subset within
> bounds?  I'm currently doing the distance calc only once (post hit collector).
> I don't have any performance numbers comparing the double vs. single distance
> calc.
>
> I'm still working out the sort by radius myself.
>
> Mark

Re: Announcement: Lucene powering Monster job search index (Beta)

dr fence
This is very similar to what I do.  I use a hit collector to gather the
results, then filter out those outside a bounding box, then calculate the
Euclidean distance.

Last time I tried to check your search it was down.  We were talking the
other day at work how job search was lacking among the big boards.  I'm
excited to check out your new page.

Mark

On 1/28/07, Peter Keegan <[hidden email]> wrote:

>
> Correction:
> We only do the Euclidean distance computation during sorting. For filtering, a
> simple bounding box is computed to approximate the radius, and 2 range
> comparisons are made to exclude documents. Because these comparisons are done
> outside of Lucene as integer comparisons, they are pretty fast. With 13,000
> results, the search time with distance sort is about 200 ms (compared to 30 ms
> for a simple non-radius, date-sorted keyword search).
>
> Peter

Re: Announcement: Lucene powering Monster job search index (Beta)

Peter Keegan
Mark,

I'm sorry to hear that you weren't able to get to the job search site today.
I heard of a problem, but I can assure you that it had nothing to do with
Lucene and our back end tiers. Can you tell me what you think is lacking for
job search among the big boards? There is clearly a lot of room for
improvement.
How is the performance of your distance search and sort?

Peter


On 1/30/07, no spam <[hidden email]> wrote:

>
> This is very similar to what I do.  I use a hit collector to gather the
> results, then filter out those outside a bounding box, then calculate the
> Euclidean distance.
>
> Last time I tried to check your search it was down.  We were talking the
> other day at work how job search was lacking among the big boards.  I'm
> excited to check out your new page.
>
> Mark

Re: Announcement: Lucene powering Monster job search index (Beta)

Daniel Rosher-2
In reply to this post by Peter Keegan
Hi Peter,

Shouldn't the search also compute the Euclidean distance during filtering,
though? Otherwise, hits that are perhaps highly relevant but lie outside the
range the user specified will be reported, particularly as the search radius
gets larger.
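
To make the concern concrete, here is a small sketch showing a point that
passes a bounding-box test yet lies outside the radius. Near the corners of
the box, the distance from the center approaches radius * sqrt(2), so the
overshoot in miles grows with the radius. The names, the flat-plane
approximation, and the constant of 69 miles per degree are assumptions, not
anyone's production code.

```java
class RadiusCheckSketch {
    static final double MILES_PER_DEGREE = 69.0; // rough average, an assumption

    /** Box test: the point is within radiusMiles of the center on each axis separately. */
    static boolean inBoundingBox(double cLat, double cLon, double radiusMiles,
                                 double lat, double lon) {
        double dLat = radiusMiles / MILES_PER_DEGREE;
        double dLon = radiusMiles / (MILES_PER_DEGREE * Math.cos(Math.toRadians(cLat)));
        return Math.abs(lat - cLat) <= dLat && Math.abs(lon - cLon) <= dLon;
    }

    /** Flat-plane (Euclidean) approximation of the distance from the center, in miles. */
    static double planarMiles(double cLat, double cLon, double lat, double lon) {
        double dy = (lat - cLat) * MILES_PER_DEGREE;
        double dx = (lon - cLon) * MILES_PER_DEGREE * Math.cos(Math.toRadians(cLat));
        return Math.sqrt(dx * dx + dy * dy);
    }

    /** Exact-radius check that could follow the cheap box test. */
    static boolean withinRadius(double cLat, double cLon, double radiusMiles,
                                double lat, double lon) {
        return inBoundingBox(cLat, cLon, radiusMiles, lat, lon)
                && planarMiles(cLat, cLon, lat, lon) <= radiusMiles;
    }
}
```

For a 10-mile search centered at (42.0, -71.0), the point (42.14, -70.81) sits
near a corner of the box: inBoundingBox accepts it, while withinRadius rejects
it because its planar distance is more than 10 miles.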

Cheers,
Dan

On 1/28/07, Peter Keegan <[hidden email]> wrote:

>
> Correction:
> We only do the Euclidean distance computation during sorting. For filtering, a
> simple bounding box is computed to approximate the radius, and 2 range
> comparisons are made to exclude documents. Because these comparisons are done
> outside of Lucene as integer comparisons, they are pretty fast. With 13,000
> results, the search time with distance sort is about 200 ms (compared to 30 ms
> for a simple non-radius, date-sorted keyword search).
>
> Peter