FetchedSegments.getSummary() for a PDF

FetchedSegments.getSummary() for a PDF

Lucas Rockwell
Hi all,

I have enabled the parse-pdf and index-more plugins and reindexed my
segments, and then enabled those plus the query-more plugin in my
front-end application. When I do a query, I still cannot get at the
contents of the PDFs in the index. Even when I search for "pdf" (which
gets me all PDF files because of the URL) and use
FetchedSegments.getSummary(), there is nothing there. Any idea what I am
doing wrong?

Thanks.

-lucas


Re: FetchedSegments.getSummary() for a PDF

Piotr Kosiorowski
As I understand it, if you had parse-pdf disabled you have to reparse
(and then reindex) the segments. There is no standard way to do it (I
think it might be done with some tricks). The easiest way would be to
refetch with PDF parsing enabled.
Piotr


Re: FetchedSegments.getSummary() for a PDF

Lucas Rockwell
Hi Piotr,

Thanks for the response.

So, I can't use:

    bin/nutch parse <segment directory>

and then reindex?

-lucas

On Aug 25, 2005, at 11:28 AM, Piotr Kosiorowski wrote:

> As I understand if you had parse-pdf disabled you have to reparse (and
> then reindex) segments. There is no standard way to do it (I think it
> might be done with some tricks). The easiest way would be to refetch
> it with pdf parsing enabled.
> Piotr


Re: FetchedSegments.getSummary() for a PDF

Piotr Kosiorowski
You can try it out, but I think parsing separately expects some
directories in the segment to have different names than the ones you
get after a standard fetch with parsing.
Regards
Piotr


Re: FetchedSegments.getSummary() for a PDF

Lucas Rockwell
Piotr,

I will give it a try and see what happens. If it fails, I'll refetch.

Thanks.

-lucas

On Aug 25, 2005, at 12:32 PM, Piotr Kosiorowski wrote:

> You can try it out, but I think parsing separately expects some
> directories in the segment to have different names than the ones you
> get after a standard fetch with parsing.
> Regards
> Piotr


Re: FetchedSegments.getSummary() for a PDF

Andrzej Białecki-2
In reply to this post by Piotr Kosiorowski
Piotr Kosiorowski wrote:
> You can try it out, but I think parsing separately expects some
> directories in the segment to have different names than the ones you
> get after a standard fetch with parsing.

Yes, just rename fetcher/ to fetcher_output/ .


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: FetchedSegments.getSummary() for a PDF

Lucas Rockwell
Andrzej,

Great. Thanks!

-lucas

On Aug 25, 2005, at 12:48 PM, Andrzej Bialecki wrote:

> Piotr Kosiorowski wrote:
>> You can try it out, but I think parsing separately expects some
>> directories in the segment to have different names than the ones you
>> get after a standard fetch with parsing.
>
> Yes, just rename fetcher/ to fetcher_output/ .


Re: FetchedSegments.getSummary() for a PDF

Lucas Rockwell
In reply to this post by Andrzej Białecki-2
Hi Andrzej and Piotr,

I just want to confirm that this worked.

I renamed fetcher/ to fetcher_output/ and removed index/ and index.done
and then ran the following:

    % bin/nutch parse <path to segment>
    % bin/nutch index <path to segment>

All of the PDFs appear to be parsed. However, when I did a search I
still got a few that would not show a summary, even though the search
clearly matched the contents of the PDF: the query string I used did
not appear in the doc title or the URL.

Again, thanks.

-lucas
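
For anyone repeating this, the preparation steps described above (rename fetcher/ to fetcher_output/, then remove index/ and index.done) can be sketched as a small shell helper. This is only a sketch under assumptions: the function name prepare_segment_for_reparse is made up for illustration and is not part of Nutch, and the directory names are taken from this thread rather than checked against any particular Nutch release.

```shell
#!/bin/sh
# Sketch only: prepare an already-fetched segment so it can be parsed and
# indexed again. prepare_segment_for_reparse is a hypothetical helper name.
prepare_segment_for_reparse() {
    segment=$1

    if [ ! -d "$segment" ]; then
        echo "not a segment directory: $segment" >&2
        return 1
    fi

    # Per the thread, the separate parse step expects fetcher_output/
    # instead of the fetcher/ directory left by a fetch with parsing.
    if [ -d "$segment/fetcher" ] && [ ! -e "$segment/fetcher_output" ]; then
        mv "$segment/fetcher" "$segment/fetcher_output"
    fi

    # Remove the old index and its marker so the segment can be reindexed.
    rm -rf "$segment/index" "$segment/index.done"
}
```

After calling the helper on a segment, the two commands from the post above (bin/nutch parse and bin/nutch index, each given the path to the segment) would be run from the Nutch directory. Working on a copy of the segment first is probably wise, since the helper deletes the existing index.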
