Get Crawled Data in Java or C# Collections

Get Crawled Data in Java or C# Collections

Bing Li
Hi, all,

I am a new Nutch user. Before discovering Nutch, I designed a crawler myself, but its quality was not good, so I decided to try Nutch.

However, after reading some material about Nutch, I noticed that Nutch puts all of the crawled pages into persistent Lucene indexes. In my project, I would like to get the crawled data in memory so that I can manipulate it in Java or C# collections; I don't want to retrieve data from the indexes built by Nutch.

Could you give me a solution to that? Thanks so much!

Best regards,
Li Bing

Re: Get Crawled Data in Java or C# Collections

xiao yang
Hi, Bing,

Nutch puts all the crawled pages in HDFS or the local FS, in the "segments" directory. It provides APIs to retrieve the page content; you can find them in the web-app part of Nutch. The "cache" view of search results is read through those APIs. To process the content while crawling, you can try writing a Nutch plug-in; there is a tutorial on the official Nutch site.

Alternatively, you can try Nutch 2.0. It is still under development; you can check it out from SVN. It stores the crawled data in HBase or other database systems, which makes the data easier to manipulate.
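
For the original question of getting crawled data into plain Java collections, one low-tech route (outside Nutch's own APIs) is to dump a segment with `bin/nutch readseg -dump <segment> <outdir>` and parse the resulting text dump into a `Map`. The sketch below is only an illustration: the `Recno::` / `URL::` / `Content::` record markers are my assumptions about the dump format, so check them against what your Nutch version actually writes.

```java
import java.io.*;
import java.util.*;

// Load a text dump produced by "bin/nutch readseg -dump" into an in-memory map.
public class SegmentDumpParser {

    // Parses "URL:: <url>" / "Content::" delimited records into url -> raw content.
    // The marker strings are an assumption about the dump format; verify them
    // against an actual dump from your Nutch version.
    public static Map<String, String> parse(BufferedReader in) throws IOException {
        Map<String, String> pages = new LinkedHashMap<String, String>();
        String url = null;
        StringBuilder content = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("Recno::")) {
                // a new record starts: flush the previous one, if complete
                if (url != null && content != null) {
                    pages.put(url, content.toString());
                }
                url = null;
                content = null;
            } else if (line.startsWith("URL:: ")) {
                if (url != null && content != null) {
                    pages.put(url, content.toString());
                }
                url = line.substring("URL:: ".length()).trim();
                content = null;
            } else if (line.startsWith("Content::")) {
                content = new StringBuilder();
            } else if (content != null) {
                content.append(line).append('\n'); // accumulate the page body
            }
        }
        if (url != null && content != null) {
            pages.put(url, content.toString());
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
            Map<String, String> pages = parse(in);
            System.out.println(pages.size() + " pages loaded into memory");
        } finally {
            in.close();
        }
    }
}
```

After the dump step you would run it as `java SegmentDumpParser <outdir>/dump`, and from there the pages are ordinary map entries you can filter, transform, or re-index however you like.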

Thanks!
Xiao

On Wed, Dec 15, 2010 at 12:25 PM, Bing Li <[hidden email]> wrote:


Re: Get Crawled Data in Java or C# Collections

Anurag
In reply to this post by Bing Li
Can you tell us how you designed your crawler? Was it by writing code like CrawlDb.java (http://www.docjar.com/html/api/org/apache/nutch/crawl/CrawlDb.java.html)?

Actually, writing your own crawler is interesting and important; I would like to know.

Thanks

On Wed, Dec 15, 2010 at 9:56 AM, Bing Li [via Lucene] <[hidden email]> wrote:

--
Kumar Anurag

Re: Get Crawled Data in Java or C# Collections

Bing Li
Hi, Kumar,

Designing a crawler is not an easy job, and the effort depends on your goals; the most complicated case is crawling the entire Web.

http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669

This book might give you a hand.

Thanks,
LB
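
For what it's worth, the skeleton of most crawlers is the same breadth-first loop: a frontier queue, a visited set, and somewhere to put the fetched pages (here a plain Java `Map`, which also matches the original question about in-memory collections). This is only an illustrative sketch; `TinyCrawler`, `PageFetcher`, and the regex link extraction are my own simplifications, not Nutch code, and it omits robots.txt handling, politeness delays, and URL normalization.

```java
import java.util.*;
import java.util.regex.*;

// A toy breadth-first crawler: URL frontier + visited set + in-memory page store.
public class TinyCrawler {

    // Naive href extractor; a real crawler should use an HTML parser and
    // resolve relative URLs against the page's base URL.
    private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (url.startsWith("http://") || url.startsWith("https://")) {
                links.add(url);
            }
        }
        return links;
    }

    // Pluggable fetch step, so the loop can be exercised without network I/O.
    public interface PageFetcher {
        String fetch(String url); // return the page body, or null on failure
    }

    // The core loop of almost any crawler: pop a URL, fetch it, store the
    // page in an in-memory collection, enqueue every unseen outlink.
    public static void crawl(String seed, int maxPages,
                             PageFetcher fetcher, Map<String, String> store) {
        Queue<String> frontier = new LinkedList<String>();
        Set<String> seen = new HashSet<String>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && store.size() < maxPages) {
            String url = frontier.poll();
            String html = fetcher.fetch(url);
            if (html == null) continue;
            store.put(url, html);
            for (String link : extractLinks(html)) {
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
    }
}
```

To crawl for real, implement `PageFetcher` with `java.net.HttpURLConnection` or similar; the interface exists mainly so the loop can also be tested against a fake in-memory "web".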

On Thu, Dec 16, 2010 at 12:28 AM, Anurag <[hidden email]> wrote:

Re: Get Crawled Data in Java or C# Collections

Anurag

Thanks, Li!
On Thu, Dec 16, 2010 at 3:44 AM, Bing Li [via Lucene] <[hidden email]> wrote:

--
Kumar Anurag