Website Visualization Questions

Nils Hoeller
Hi,

I am currently working on a "service" that lets
you enter a URL and visualizes that domain
(inner links only).
There will then be some kind of adaptive behaviour,
so that the graph adapts to your wishes
(searches, ranks, ...).

I have a prototype that uses:

1. Arachnid as the crawler
2. Lucene as the indexer
3. TouchGraph for visualization

It works as a standalone client,
though it seems slow when you
enter a new URL for visualization
(which is OK, because of the crawling and indexing).

Now I'd like to change the application:
Arachnid and Lucene should be replaced
by Nutch.

My wish is a service that:
1. Visualizes already crawled and indexed sites
2. Lets you enter a new URL
and works for you while you are online.

So my questions:

1. Is it possible to do such things with Nutch?
I mean: can I start a process that works through
a list of URLs (doing the crawling, the indexing, and the creation
of a file that represents the graph structure),
while clients enter URLs that are inserted into this TO-DO list?

2. I've read about the web database (including the full link graph).
Where can I read more about it? Does it build this kind of
representation of the site for me automatically?

I mean, I need (and had, in the former application)
something like:

Node{
ID=2144181430
Title=Institute of Information Systems Universität zu Lübeck
Schleswig-Holstein
URL=http://www.ifis.uni-luebeck.de/index.html
Number of Request=0
}
Edge{
Node1=2144181430
Node2=-66623770
}
Edge{
Node1=2144181430
Node2=150343685
} .....

So I create a node for every site and an edge for every link.
Is this what the full link graph database does?

That's all for now.

I'll be glad if someone can help me.

Thanks, Nils

Re: Website Visualization Questions

Fredrik Andersson
Hi!

The crawler and the link-structure information come "free" with Nutch.
Once you have crawled a site, you can use the WebDBReader class to
extract the link information for further processing in a visualization
step. Simply put: iterate over the crawled pages with the SegmentReader
class (open the segment you just crawled), extract the URL from each
page (as an MD5Hash object), get the links to/from that URL with the
WebDBReader, and pass an appropriate structure to your visualization
application.

The structure that you suggested, with edges and nodes, would be very
easy to implement once you get the hang of the Reader classes for
accessing Nutch's guts.
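
For the node/edge dump itself, something along these lines should do.
This is a sketch from memory of the 0.7-era API (WebDBReader's pages(),
links(), getMD5(), getFromID() and so on), so double-check the names
against the javadocs before trusting it:

    import java.io.File;
    import java.util.Enumeration;

    import org.apache.nutch.db.Link;
    import org.apache.nutch.db.Page;
    import org.apache.nutch.db.WebDBReader;
    import org.apache.nutch.fs.NutchFileSystem;

    public class GraphDump {
      public static void main(String[] args) throws Exception {
        NutchFileSystem nfs = NutchFileSystem.get();
        // args[0] is the WebDB directory, e.g. "crawl/db"
        WebDBReader reader = new WebDBReader(nfs, new File(args[0]));
        try {
          // One node per crawled page, keyed by the page's MD5 hash.
          for (Enumeration e = reader.pages(); e.hasMoreElements(); ) {
            Page page = (Page) e.nextElement();
            System.out.println("Node{ ID=" + page.getMD5()
                + " URL=" + page.getURL() + " }");
          }
          // One edge per link: from the MD5 of the source page
          // to the URL that the source points at.
          for (Enumeration e = reader.links(); e.hasMoreElements(); ) {
            Link link = (Link) e.nextElement();
            System.out.println("Edge{ From=" + link.getFromID()
                + " To=" + link.getURL() + " }");
          }
        } finally {
          reader.close();
        }
      }
    }

Mapping an edge's target URL back to a node ID is then one more
reader.getPage() lookup per edge.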

Fredrik


Re: Website Visualization Questions

Nils Hoeller

Hi Fredrik,

thanks for that information.
That sounds really good to me.
It would be perfect to
handle just one product instead
of a different one for every single task.

Anyway, can you tell me whether it is possible for
clients to insert their URL requests into a URL list,
from which Nutch takes the next URL each time and
does the indexing and so on?

I mean, I have read about this in the FAQ:
1. add to the URL list
2. start Nutch on the list

but I'd like to know
whether it is possible to have Nutch
run as a permanent process that checks a specific
file whenever it is ready for a new job,
with clients inserting their URL wishes
into that list in the meantime.

Is Nutch smart enough to index only sites
that it has not indexed yet?
So that if a URL is already prepared, it won't start
the indexing again, and in that case the user will
be presented with the results.
(I am thinking of a method that presents
you the graph if the URL is indexed, or
a page saying "come back later when Nutch is finished"
when it is a job for Nutch, meaning the URL
is put into that URL list.)
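
Roughly, I picture the dispatch like this. Only getPage() is Nutch;
UrlRequestHandler, the TO-DO file and the true/false convention are
my own invention:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    import org.apache.nutch.db.Page;
    import org.apache.nutch.db.WebDBReader;

    public class UrlRequestHandler {
      private final WebDBReader reader;  // open on the existing WebDB
      private final String todoFile;     // shared TO-DO list, one URL per line

      public UrlRequestHandler(WebDBReader reader, String todoFile) {
        this.reader = reader;
        this.todoFile = todoFile;
      }

      // Returns true if the URL is already in the WebDB, so the graph
      // can be shown right away; otherwise appends the URL to the
      // TO-DO list and returns false ("come back later").
      public synchronized boolean handle(String url) throws IOException {
        Page page = reader.getPage(url);  // null if not crawled yet
        if (page != null) {
          return true;
        }
        PrintWriter out = new PrintWriter(new FileWriter(todoFile, true));
        out.println(url);
        out.close();
        return false;
      }
    }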


Thanks very much
Nils


--
-------------------------------------------------------------------
Nils Höller

[hidden email]
[hidden email]




Re: Website Visualization Questions

Fredrik Andersson
Hi Nils!

If I am not totally off track, the 0.7 version (currently 0.7-dev, in
the CVS trunk) runs as a daemon process, i.e. it will poll the file
with the URLs when it has nothing else to do, so that will solve your
problem.
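
If the daemon mode turns out not to fit, a crude stand-in is a loop
that polls the list and shells out to the crawl command from the
tutorial. This is my own sketch, not Nutch code; note that bin/nutch
crawl refuses an existing output directory, so each run gets a fresh
one, and merging the resulting DBs afterwards is left out here:

    import java.io.File;

    public class CrawlDaemon {
      public static void main(String[] args) throws Exception {
        File todo = new File("urls/todo.txt");  // list the clients append to
        long lastLength = 0;
        while (true) {
          if (todo.length() > lastLength) {     // new URLs were appended
            lastLength = todo.length();
            String dir = "crawl-" + System.currentTimeMillis();
            Process p = Runtime.getRuntime().exec(new String[] {
                "bin/nutch", "crawl", "urls", "-dir", dir, "-depth", "2"});
            p.waitFor();
          }
          Thread.sleep(30 * 1000);              // poll every 30 seconds
        }
      }
    }
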
Regarding the duplicate content: as you can see in the tutorial, there
is a very simple action for deleting duplicates once your crawl has
finished. Personally, I don't see why the Nutch crawler does not keep a
hashset or similar of visited pages. I often get loops where the same
site is crawled over and over again, so if you want to restrict that,
it is not a hard modification to make if you have ever written
some code. I'm sure Doug has a perfectly good reason as to why the
crawler runs the way it does, I just haven't figured it out (I'm also
quite new to Nutch).
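
The hashset idea is literally just this (my own sketch, there is
nothing like it in Nutch itself):

    import java.util.HashSet;
    import java.util.Set;

    // Filter candidate URLs against everything already fetched, so a
    // looping site is visited only once per crawl.
    public class VisitedFilter {
      private final Set visited = new HashSet();  // pre-generics Java

      // Returns true the first time a URL is seen, false on every
      // repeat; Set.add() already reports whether the element was new.
      public synchronized boolean firstVisit(String url) {
        return visited.add(url);
      }
    }

Call firstVisit() on every outlink before it goes back into the
fetchlist and the loops disappear, at the cost of keeping the whole
set in memory.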

Hope it helps,
Fidde
