[jira] Created: (TIKA-235) Site search powered by Lucene/Solr


[jira] Created: (TIKA-235) Site search powered by Lucene/Solr

JIRA jira@apache.org
Site search powered by Lucene/Solr
----------------------------------

                 Key: TIKA-235
                 URL: https://issues.apache.org/jira/browse/TIKA-235
             Project: Tika
          Issue Type: New Feature
            Reporter: Grant Ingersoll
            Priority: Minor


For a number of years now, the Lucene community has been criticized for not eating our own "dog food" when it comes to search. My company has built and hosts a site search (http://search.lucidimagination.com/) that is powered by Apache Solr and Lucene, and we'd like to donate its use to the Lucene community. Additionally, it allows one to search all of the Tika content from a single place, including web, wiki, JIRA and mail archives. See also http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org

A sample of what it _might_ look like is at http://people.apache.org/~gsingers/tika/ although I am not entirely sure how Tika deploys just yet, so there are a few issues with the display.

Lucid has a fault-tolerant setup with replication and failover, as well as monitoring services in place. We are committed to maintaining and expanding the search capabilities on the site.

The following patch adds the basics to Tika to support the search, but isn't entirely done yet because I'm not sure what look and feel Tika wants.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-235) Site search powered by Lucene/Solr

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated TIKA-235:
---------------------------------

    Attachment: TIKA-235.patch

First draft of a patch.  See also MAHOUT-120


[jira] Commented: (TIKA-235) Site search powered by Lucene/Solr

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715344#action_12715344 ]

Jukka Zitting commented on TIKA-235:
------------------------------------

Nice! Committed a somewhat modified version in revision 780908.


[jira] Commented: (TIKA-235) Site search powered by Lucene/Solr

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715414#action_12715414 ]

Jukka Zitting commented on TIKA-235:
------------------------------------

Some comments:

Is there a way to boost selected parts of the Tika site? For example, queries like "Office" and "OpenDocument" return lots of dev stuff like mailing list traffic, apidocs and Jira issues, while the main "Supported Formats" page is buried deep within the search results. It would be nice if we could control the boosting, for example by setting some specific <meta/> header tags in the HTML source. Alternatively, can we set a "Source" criterion that only covers the web site?

I have somewhat mixed feelings about whether documentation from the Lucid web site should be included in the search results. It's useful stuff, but may well become a problem if another similar company comes up in this space. In any case, the links to the Lucid web site are broken; s/search/www/ should fix that.

The Jira crawl contains multiple copies of the same documents, with links like http://issues.apache.org/jira/browse/TIKA-120?focusedCommentId=NNN#action_NNN, where just http://issues.apache.org/jira/browse/TIKA-120 would have been sufficient. This is IMHO a design mistake in Jira, but since we can't do much about that it would be nice to work around this issue in the crawler.
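
One way a crawler could work around the duplicate Jira links is to reduce every URL to a canonical form before indexing it. The following is only a minimal sketch, not the crawler Lucid actually runs: the function names are hypothetical, and the set of parameters to drop (`focusedCommentId` and `page`) is an assumption based on the URLs that appear in this thread.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed to change only the view of an issue, not its
# content. This set is a guess based on the URLs seen in this thread.
VIEW_ONLY_PARAMS = {"focusedCommentId", "page"}


def canonical_issue_url(url):
    """Drop view-only query parameters and the #fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in VIEW_ONLY_PARAMS
    ]
    # An empty fragment argument removes anchors such as "#action_NNN".
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


def dedupe(urls):
    """Return canonical URLs in first-seen order, without repeats."""
    seen = set()
    unique = []
    for url in urls:
        canonical = canonical_issue_url(url)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique
```

With this, every comment-specific link to TIKA-120 collapses to http://issues.apache.org/jira/browse/TIKA-120, so the issue would be indexed once.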



[jira] Commented: (TIKA-235) Site search powered by Lucene/Solr

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715506#action_12715506 ]

Grant Ingersoll commented on TIKA-235:
--------------------------------------

bq. I have somewhat mixed feelings about whether documentation from the Lucid web site should be included in the search results. It's useful stuff, but may well become a problem if another similar company comes up in this space. In any case, the links to the Lucid web site are broken; s/search/www/ should fix that.

That's easy enough to take out by just preselecting the facets in the URL, as in something like:
{quote}
http://search.lucidimagination.com/search/p:tika/s:email,issues
{quote}

In other words, feel free to pick whatever facets you feel are appropriate for Tika.
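
For illustration, a link with preselected facets could be assembled as in the sketch below. The `p:` (project) and `s:` (source) path segments are inferred solely from the example URL above; the helper name `facet_url` is hypothetical, and which facet values the service accepts beyond `email` and `issues` is not stated here.

```python
# Base URL taken from the example above.
SEARCH_BASE = "http://search.lucidimagination.com/search"


def facet_url(project, sources, base=SEARCH_BASE):
    """Build a search URL with the project and source facets preselected.

    The "p:<project>/s:<source>,<source>" layout is an assumption derived
    from the single example URL in this thread.
    """
    segments = ["p:" + project]
    if sources:
        segments.append("s:" + ",".join(sources))
    return base + "/" + "/".join(segments)


if __name__ == "__main__":
    print(facet_url("tika", ["email", "issues"]))
```

Called with the project `tika` and the sources `email` and `issues`, this reproduces the URL quoted above.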


[jira] Commented: (TIKA-235) Site search powered by Lucene/Solr

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715545#action_12715545 ]

Jukka Zitting commented on TIKA-235:
------------------------------------

Is there a facet for the web site content?


[jira] Closed: (TIKA-235) Site search powered by Lucene/Solr

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll closed TIKA-235.
--------------------------------

    Resolution: Fixed
