How to walk a webgraph?

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

How to walk a webgraph?

Dennis Kubes-2
Does anybody know how to efficiently (non-exponentially) walk a web
graph to detect cycles.  This would be very useful in identifying spammy
webpage and tight knit communities.

I have a program that I will be releasing soon that does the detection
through converting a webgraph into a tree and walking the tree nodes,
but it is exponential in terms of intermediate map reduce output and
computation.

Dennis
Reply | Threaded
Open this post in threaded view
|

Re: How to walk a webgraph?

hank williams
AFAIK this is not a nutch or hadoop issue, but a basic math problem, which
is inherent in walking trees. You have to figure out how many levels of the
graph you are willing to walk, and live with the mathematical consequences.
The further away you walk, the more untenable the math gets, but for a few
levels it can be OK. But it is nothing but a brute force problem.

Hank

On Mon, Jul 14, 2008 at 11:57 AM, Dennis Kubes <[hidden email]> wrote:

> Does anybody know how to efficiently (non-exponentially) walk a web graph
> to detect cycles.  This would be very useful in identifying spammy webpage
> and tight knit communities.
>
> I have a program that I will be releasing soon that does the detection
> through converting a webgraph into a tree and walking the tree nodes, but it
> is exponential in terms of intermediate map reduce output and computation.
>
> Dennis
>



--
blog: whydoeseverythingsuck.com
Reply | Threaded
Open this post in threaded view
|

Re: How to walk a webgraph?

Dennis Kubes-2
You're correct, it isn't a nutch or hadoop problem, just related :). I
was hoping there was some mathematical trick out there which I wasn't
aware of which might reduce the intermediate permutations.  Guess not.
Thanks for the reply.

Dennis

hank williams wrote:

> AFAIK this is not a nutch or hadoop issue, but a basic math problem, which
> is inherent in walking trees. You have to figure out how many levels of the
> graph you are willing to walk, and live with the mathematical consequences.
> The further away you walk, the more untenable the math gets, but for a few
> levels it can be OK. But it is nothing but a brute force problem.
>
> Hank
>
> On Mon, Jul 14, 2008 at 11:57 AM, Dennis Kubes <[hidden email]> wrote:
>
>> Does anybody know how to efficiently (non-exponentially) walk a web graph
>> to detect cycles.  This would be very useful in identifying spammy webpage
>> and tight knit communities.
>>
>> I have a program that I will be releasing soon that does the detection
>> through converting a webgraph into a tree and walking the tree nodes, but it
>> is exponential in terms of intermediate map reduce output and computation.
>>
>> Dennis
>>
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: How to walk a webgraph?

Andrzej Białecki-2
In reply to this post by Dennis Kubes-2
Dennis Kubes wrote:
> Does anybody know how to efficiently (non-exponentially) walk a web
> graph to detect cycles.  This would be very useful in identifying spammy
> webpage and tight knit communities.
>
> I have a program that I will be releasing soon that does the detection
> through converting a webgraph into a tree and walking the tree nodes,
> but it is exponential in terms of intermediate map reduce output and
> computation.

Perhaps this could be helpful:

* http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf see the
last section.

* http://citeseer.ist.psu.edu/395003.html

* http://en.wikipedia.org/wiki/Cycle_detection


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Reply | Threaded
Open this post in threaded view
|

Re: How to walk a webgraph?

Dennis Kubes-2
Excellent.  Thanks was exactly what I needed.  Thanks Andrzej.

Dennis

Andrzej Bialecki wrote:

> Dennis Kubes wrote:
>> Does anybody know how to efficiently (non-exponentially) walk a web
>> graph to detect cycles.  This would be very useful in identifying
>> spammy webpage and tight knit communities.
>>
>> I have a program that I will be releasing soon that does the detection
>> through converting a webgraph into a tree and walking the tree nodes,
>> but it is exponential in terms of intermediate map reduce output and
>> computation.
>
> Perhaps this could be helpful:
>
> * http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf see the
> last section.
>
> * http://citeseer.ist.psu.edu/395003.html
>
> * http://en.wikipedia.org/wiki/Cycle_detection
>
>
Reply | Threaded
Open this post in threaded view
|

Re: How to walk a webgraph?

hank williams
Note that these algorithms will *not* resolve the fundamentally exponential
nature of the problem, they simply provide algorithms for doing it as
efficiently as possible, but this still involves essentially visiting every
node in the tree at least once until you hit a cycle in the tree or decide
you have traveled too far from the start. Moreover, after quickly perusing
(in other words I could have missed it) I did not see specific mention of
setting a depth of search limit. it is critical that you set a limit or you
will end up attempting to spider the entire internet in search of cycles.
The examples shown in the references are (for good reason) constrained in
terms of node count, but on the real web the pool is (obviously) much larger
and the problem becomes fundamentally different.

Hank

On Tue, Jul 15, 2008 at 9:43 AM, Dennis Kubes <[hidden email]> wrote:

> Excellent.  Thanks was exactly what I needed.  Thanks Andrzej.
>
> Dennis
>
>
> Andrzej Bialecki wrote:
>
>> Dennis Kubes wrote:
>>
>>> Does anybody know how to efficiently (non-exponentially) walk a web graph
>>> to detect cycles.  This would be very useful in identifying spammy webpage
>>> and tight knit communities.
>>>
>>> I have a program that I will be releasing soon that does the detection
>>> through converting a webgraph into a tree and walking the tree nodes, but it
>>> is exponential in terms of intermediate map reduce output and computation.
>>>
>>
>> Perhaps this could be helpful:
>>
>> * http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf<http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>see the last section.
>>
>> * http://citeseer.ist.psu.edu/395003.html
>>
>> * http://en.wikipedia.org/wiki/Cycle_detection
>>
>>
>>


--
blog: whydoeseverythingsuck.com
Reply | Threaded
Open this post in threaded view
|

Re: How to walk a webgraph?

brainstorm-2-2
On Tue, Jul 15, 2008 at 4:12 PM, hank williams <[hidden email]> wrote:
> Note that these algorithms will *not* resolve the fundamentally exponential
> nature of the problem,


O(|V|+|E|), IIRC


> efficiently as possible, but this still involves essentially visiting every
> node in the tree at least once until you hit a cycle in the tree or decide
> you have traveled too far from the start. Moreover, after quickly perusing
> (in other words I could have missed it) I did not see specific mention of
> setting a depth of search limit. it is critical that you set a limit or you
> will end up attempting to spider the entire internet in search of cycles.
> The examples shown in the references are (for good reason) constrained in
> terms of node count, but on the real web the pool is (obviously) much larger
> and the problem becomes fundamentally different.
>
> Hank
>
> On Tue, Jul 15, 2008 at 9:43 AM, Dennis Kubes <[hidden email]> wrote:
>
>> Excellent.  Thanks was exactly what I needed.  Thanks Andrzej.
>>
>> Dennis
>>
>>
>> Andrzej Bialecki wrote:
>>
>>> Dennis Kubes wrote:
>>>
>>>> Does anybody know how to efficiently (non-exponentially) walk a web graph
>>>> to detect cycles.  This would be very useful in identifying spammy webpage
>>>> and tight knit communities.
>>>>
>>>> I have a program that I will be releasing soon that does the detection
>>>> through converting a webgraph into a tree and walking the tree nodes, but it
>>>> is exponential in terms of intermediate map reduce output and computation.
>>>>
>>>
>>> Perhaps this could be helpful:
>>>
>>> * http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf<http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>see the last section.
>>>
>>> * http://citeseer.ist.psu.edu/395003.html
>>>
>>> * http://en.wikipedia.org/wiki/Cycle_detection
>>>
>>>
>>>
>
>
> --
> blog: whydoeseverythingsuck.com
>
Reply | Threaded
Open this post in threaded view
|

Re: How to walk a webgraph?

hank williams
On Tue, Jul 15, 2008 at 10:20 AM, brainstorm <[hidden email]> wrote:

> On Tue, Jul 15, 2008 at 4:12 PM, hank williams <[hidden email]> wrote:
> > Note that these algorithms will *not* resolve the fundamentally
> exponential
> > nature of the problem,
>
>
> O(|V|+|E|), IIRC
>

True. The problem is not actually exponential. It is the *dataset* that is
when you talk about the web graph. The problem is essentially linear with
the size of the dataset. If your dataset is constrained (by setting a limit)
then so is your execution.

>
>
> > efficiently as possible, but this still involves essentially visiting
> every
> > node in the tree at least once until you hit a cycle in the tree or
> decide
> > you have traveled too far from the start. Moreover, after quickly
> perusing
> > (in other words I could have missed it) I did not see specific mention of
> > setting a depth of search limit. it is critical that you set a limit or
> you
> > will end up attempting to spider the entire internet in search of cycles.
> > The examples shown in the references are (for good reason) constrained in
> > terms of node count, but on the real web the pool is (obviously) much
> larger
> > and the problem becomes fundamentally different.
> >
> > Hank
> >
> > On Tue, Jul 15, 2008 at 9:43 AM, Dennis Kubes <[hidden email]> wrote:
> >
> >> Excellent.  Thanks was exactly what I needed.  Thanks Andrzej.
> >>
> >> Dennis
> >>
> >>
> >> Andrzej Bialecki wrote:
> >>
> >>> Dennis Kubes wrote:
> >>>
> >>>> Does anybody know how to efficiently (non-exponentially) walk a web
> graph
> >>>> to detect cycles.  This would be very useful in identifying spammy
> webpage
> >>>> and tight knit communities.
> >>>>
> >>>> I have a program that I will be releasing soon that does the detection
> >>>> through converting a webgraph into a tree and walking the tree nodes,
> but it
> >>>> is exponential in terms of intermediate map reduce output and
> computation.
> >>>>
> >>>
> >>> Perhaps this could be helpful:
> >>>
> >>> * http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf<http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>
> <http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>see the last
> section.
> >>>
> >>> * http://citeseer.ist.psu.edu/395003.html
> >>>
> >>> * http://en.wikipedia.org/wiki/Cycle_detection
> >>>
> >>>
> >>>
> >
> >
> > --
> > blog: whydoeseverythingsuck.com
> >
>



--
blog: whydoeseverythingsuck.com
Reply | Threaded
Open this post in threaded view
|

Re: How to walk a webgraph?

Dennis Kubes-2
True you would definitely want to set a depth limit.  Not just to limit
run time but also because of the fundamental nature of the problem.
Reciprocal links and link loops of 2,3,4,5, maybe even six tend to show
tight knit communities.  Beyond that you start getting into the random
nature of the web connections.

Right now the problem is not so much the run time as the intermediate
output.  A two depth run, which identifies reciprocal and 2-node cycles
such as A->B->C->A creates a few hundred gigs of output on an 800M url
database.  A three depth run to identify A->B->C->D->A generates > 10T
of intermediate output.

What I am really trying to do is find a more efficient way to either:

1) Store the webgraph to apply a computation
2) Process the graph as in walking a tree in a way that doesn't lead to
storing (processing is fine) outlinks of outlinks of outlinks, etc.

Don't know if that helps to explain the problem or not.

Dennis

hank williams wrote:

> On Tue, Jul 15, 2008 at 10:20 AM, brainstorm <[hidden email]> wrote:
>
>> On Tue, Jul 15, 2008 at 4:12 PM, hank williams <[hidden email]> wrote:
>>> Note that these algorithms will *not* resolve the fundamentally
>> exponential
>>> nature of the problem,
>>
>> O(|V|+|E|), IIRC
>>
>
> True. The problem is not actually exponential. It is the *dataset* that is
> when you talk about the web graph. The problem is essentially linear with
> the size of the dataset. If your dataset is constrained (by setting a limit)
> then so is your execution.
>
>>
>>> efficiently as possible, but this still involves essentially visiting
>> every
>>> node in the tree at least once until you hit a cycle in the tree or
>> decide
>>> you have traveled too far from the start. Moreover, after quickly
>> perusing
>>> (in other words I could have missed it) I did not see specific mention of
>>> setting a depth of search limit. it is critical that you set a limit or
>> you
>>> will end up attempting to spider the entire internet in search of cycles.
>>> The examples shown in the references are (for good reason) constrained in
>>> terms of node count, but on the real web the pool is (obviously) much
>> larger
>>> and the problem becomes fundamentally different.
>>>
>>> Hank
>>>
>>> On Tue, Jul 15, 2008 at 9:43 AM, Dennis Kubes <[hidden email]> wrote:
>>>
>>>> Excellent.  Thanks was exactly what I needed.  Thanks Andrzej.
>>>>
>>>> Dennis
>>>>
>>>>
>>>> Andrzej Bialecki wrote:
>>>>
>>>>> Dennis Kubes wrote:
>>>>>
>>>>>> Does anybody know how to efficiently (non-exponentially) walk a web
>> graph
>>>>>> to detect cycles.  This would be very useful in identifying spammy
>> webpage
>>>>>> and tight knit communities.
>>>>>>
>>>>>> I have a program that I will be releasing soon that does the detection
>>>>>> through converting a webgraph into a tree and walking the tree nodes,
>> but it
>>>>>> is exponential in terms of intermediate map reduce output and
>> computation.
>>>>> Perhaps this could be helpful:
>>>>>
>>>>> * http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf<http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>
>> <http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>see the last
>> section.
>>>>> * http://citeseer.ist.psu.edu/395003.html
>>>>>
>>>>> * http://en.wikipedia.org/wiki/Cycle_detection
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> blog: whydoeseverythingsuck.com
>>>
>
>
>
Reply | Threaded
Open this post in threaded view
|

Remote connection from search.jsp to nutchbean

Fritz Bein
Hi,

can anybody tell me if it is possible to run the search interface (i.e. search.jsp) on a different computer than that computer where the nutchbean runs on?

I want to setup a server for the purpose of delivering search results to another server that generates the html-pages using the results.

Any help is very much appreciated.
Regards,
Fritz

--
Der GMX SmartSurfer hilft bis zu 70% Ihrer Onlinekosten zu sparen!
Ideal für Modem und ISDN: http://www.gmx.net/de/go/smartsurfer