Extracting html pages from db

Extracting html pages from db

LoneEagle70
Hi,

I was able to install Nutch 0.9, crawl a site, and use the search web app to do full-text search of my db.

But we need to extract information from all the HTML pages.

So, is there a way to extract HTML pages from the db?

Re: Extracting html pages from db

Dennis Kubes-2
It depends on what you are trying to do.  Content in segments stores the
full content (html, etc.) of each page.  The cached.jsp page displays
full content.

Dennis Kubes
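
For reference, here is a minimal sketch of reading one page's stored content
back out of a segment, which is roughly what cached.jsp does behind the
scenes. It assumes Nutch 0.9 / Hadoop 0.12-era APIs; the class name
GetCachedPage, the example path, and the argument handling are made up for
illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapFileOutputFormat;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    public class GetCachedPage {
      public static void main(String[] args) throws Exception {
        String segment = args[0];  // e.g. crawl/segments/20071017123456 (made-up path)
        String url = args[1];      // the exact URL used as the record key

        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);

        // The raw fetched bytes live in <segment>/content as <Text url, Content> records.
        MapFile.Reader[] readers =
            MapFileOutputFormat.getReaders(fs, new Path(segment, Content.DIR_NAME), conf);

        Text key = new Text(url);
        Content value = new Content();
        // The part files are laid out per fetch task rather than hash-partitioned
        // by URL, so check every reader instead of computing a partition.
        for (int i = 0; i < readers.length; i++) {
          if (readers[i].get(key, value) != null) {
            System.out.println(new String(value.getContent()));  // charset handling omitted
            break;
          }
        }
        for (int i = 0; i < readers.length; i++) {
          readers[i].close();
        }
      }
    }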


Re: Extracting html pages from db

LoneEagle70
I do not want to do it through the web app.

Is there a way to extract all the HTML files from the command line into a directory, the way stats can be displayed? I tried the dump, but it was not what I wanted. I really only want the HTML pages so I can take information from them.

Here is my problem: we are looking for a program that will do web crawling, but it must be customized for each site we need because those pages are generated based on parameters. We also need to extract information (product, price, manufacturer, ...). So, if you have experience with Nutch, you could help me out. Can I customize it through hooks? What can and can't I do?

Thanks for your help! :)

Re: Extracting html pages from db

Dennis Kubes-2
Pulling out specific information for each site could be done through HtmlParseFilter implementations. Look at org.apache.nutch.parse.HtmlParseFilter and its implementations. The specific fields you extract can be stored in the Metadata within ParseData. You can then access that information in other jobs, such as the indexer. Hope this helps.

Dennis Kubes
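
As an illustration of that approach, here is a minimal sketch of a
site-specific HtmlParseFilter, assuming the Nutch 0.9 extension-point API. The
class name ProductParseFilter, the <span class="price"> markup it looks for,
and the metadata key are all hypothetical; a real filter would match the
actual markup of each target site.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.metadata.Metadata;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class ProductParseFilter implements HtmlParseFilter {

      private Configuration conf;

      public Parse filter(Content content, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        // Walk the DOM built by the html parser and pull out the field we care
        // about. Looking for <span class="price"> is purely illustrative.
        String price = findTextByClass(doc, "span", "price");
        if (price != null) {
          // Store the extracted field in the parse metadata; it is carried in
          // ParseData and can be read later by the indexer or other jobs.
          Metadata meta = parse.getData().getParseMeta();
          meta.set("product.price", price);  // hypothetical metadata key
        }
        return parse;
      }

      // Depth-first search for the first element with the given tag name and class.
      private String findTextByClass(Node node, String tag, String cssClass) {
        if (node.getNodeType() == Node.ELEMENT_NODE
            && tag.equalsIgnoreCase(node.getNodeName())) {
          Node cls = node.getAttributes().getNamedItem("class");
          if (cls != null && cssClass.equals(cls.getNodeValue())) {
            return getText(node).trim();
          }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          String found = findTextByClass(children.item(i), tag, cssClass);
          if (found != null) {
            return found;
          }
        }
        return null;
      }

      // Concatenate the text nodes under a node (DOM Level 1 only).
      private String getText(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
          return node.getNodeValue();
        }
        StringBuffer buf = new StringBuffer();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          buf.append(getText(children.item(i)));
        }
        return buf.toString();
      }

      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      public Configuration getConf() {
        return conf;
      }
    }

The filter would also need to be packaged as a plugin (a plugin.xml that
extends the org.apache.nutch.parse.HtmlParseFilter extension point) and
enabled through the plugin.includes property before the parse step will run it.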


Re: Extracting html pages from db

LoneEagle70
Do you have any idea how to extract, from the command line, all my HTML files stored in the db?

Re: Extracting html pages from db

Dennis Kubes-2
There is currently no way to do that.  You would need to write a map job
to pull the data from Content within Segments.

Dennis Kubes
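
Here is a minimal sketch of that kind of map job, assuming Nutch 0.9 with the
old org.apache.hadoop.mapred API; the class name DumpHtml and the plain-text
output are made up. The input is the <segment>/content directory, whose
records are <Text url, Content> pairs.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;
    import org.apache.nutch.util.NutchJob;

    public class DumpHtml implements Mapper {

      public void configure(JobConf job) {}
      public void close() {}

      // Each record under <segment>/content is a <Text url, Content> pair.
      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter) throws IOException {
        Content content = (Content) value;
        String type = content.getContentType();
        if (type != null && type.startsWith("text/html")) {
          // Emit url -> raw page bytes. Charset handling is deliberately
          // simplified; a real tool might write one file per URL instead.
          output.collect(key, new Text(new String(content.getContent())));
        }
      }

      public static void main(String[] args) throws IOException {
        // args[0] = segment directory, args[1] = output directory
        JobConf job = new NutchJob(NutchConfiguration.create());
        job.setJobName("dump-html " + args[0]);
        job.addInputPath(new Path(args[0], Content.DIR_NAME));  // <segment>/content
        job.setInputFormat(SequenceFileInputFormat.class);
        job.setMapperClass(DumpHtml.class);
        job.setNumReduceTasks(0);                               // map-only job
        job.setOutputPath(new Path(args[1]));
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        JobClient.runJob(job);
      }
    }

Run it with the Nutch and Hadoop jars on the classpath, passing the segment
directory and an output directory; the text/html check keeps only the pages
you are after.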


Re: Extracting html pages from db

misc
In reply to this post by LoneEagle70

Hello-

    I've done this. I think it is

    nutch readseg -dump <segment_dir> <dumpfile>

to dump all the HTML of everything in a segment. You can also specify which
URL you are interested in; type nutch readseg for details.

                        see you
                            -Jim

