General question about subdomains


General question about subdomains

Joseph Naegele
This is more of a general question, not Nutch-specific:

Our crawler discovered some URLs pointing to a number of subdomains of a Chinese-owned domain. It then proceeded to discover millions more URLs pointing to other subdomains (hosts) of the same domain. Most of the names appear to be gibberish, but they do have robots.txt files and the URLs appear to serve HTML. A few days later I found that our crawler machine was no longer able to resolve these subdomains, as if it were blocked by their DNS servers, which significantly slowed our crawl due to DNS timeouts. This led me to investigate and find that 40% of all our known URLs were hosts on this same parent domain.

Since the hosts are actually different, is Nutch able to prevent this trap-like behavior? Are there any established methods for preventing similar issues in web crawlers?

Thanks

---
Joe Naegele
Grier Forensics



Re: General question about subdomains

Julien Nioche
Hi Joe,

Do these subdomains point to the same IP address? Did they blacklist your
server, i.e. can you still connect to these domains from the crawl server
using a different tool like curl?

It's not a silver bullet, but one way of preventing this is to group by IP
or domain (fetcher.queue.mode and partition.url.mode) so that the politeness
settings are applied across all the subdomains. This will reduce the risk of
being blacklisted (assuming you were) and slow down the discovery of URLs
for that parent domain.

fetcher.max.exceptions.per.queue should also help by preventing a long tail
of fetch errors during the fetch step.
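
For example, in conf/nutch-site.xml (the mode and the numeric value below are
illustrative choices, not tuned recommendations):

  <property>
    <name>fetcher.queue.mode</name>
    <value>byIP</value> <!-- or byDomain; the default is byHost -->
  </property>
  <property>
    <name>partition.url.mode</name>
    <value>byIP</value> <!-- keep consistent with fetcher.queue.mode -->
  </property>
  <property>
    <name>fetcher.max.exceptions.per.queue</name>
    <value>25</value> <!-- drop a queue after 25 exceptions; -1 (default) = no limit -->
  </property>

Note that byIP mode resolves each host's IP when the fetch queues are built,
which itself adds DNS traffic, so byDomain may be preferable if DNS is already
a bottleneck.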

HTH

Julien



RE: General question about subdomains

Markus Jelsma
In reply to this post by Joseph Naegele
Hello Joseph,

The only feasible method, as I see it, is to detect these kinds of spam sites, as well as domain park sites, which produce lots of garbage too. Once you detect them, you can choose not to follow their outlinks, or mark them in a domain-blacklist urlfilter.
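
For example, with the stock regex-urlfilter plugin you can add a reject rule to
conf/regex-urlfilter.txt above the final catch-all "+." line (the domain below
is a placeholder, not a real offender):

  # reject a known spam/parked domain and all of its subdomains
  -^https?://([a-z0-9-]+\.)*spam-example\.com(/|$)

The rules are applied top-down and the first match wins, so the reject line has
to come before any pattern that would otherwise accept these URLs.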

We have seen examples like these as well and they caused similar problems, but we lost track of them and those domains don't exist anymore. Can you send me the domains that are causing you trouble? We could use them for our classification training sets.

Regards,
Markus
 

RE: General question about subdomains

Joseph Naegele
In reply to this post by Julien Nioche
Thanks Julien,

The subdomains do, in fact, point to the same IP address. In the end the issue was that our DNS service flagged our DNS traffic since we were resolving millions of subdomains using the domain's authoritative nameservers (we use OpenDNS to filter inappropriate content).

Partitioning and fetching by IP is definitely a step in the right direction.

---
Joe Naegele
Grier Forensics


RE: General question about subdomains

Joseph Naegele
In reply to this post by Markus Jelsma
Markus,

Interestingly enough, we do use OpenDNS to filter undesirable content, including parked content. In this case, however, the domain in question isn't tagged in OpenDNS and is therefore "allowed", along with all its subdomains.

This particular domain is "hjsjp.com". It's Chinese-owned, and the URLs all appear to point to the same link-filled content, possibly a domain parking site. Example URLs:
- http://e2qya.hjsjp.com/
- http://ml081.hjsjp.com/xzudb
- http://www.ch8yu.hjsjp.com/1805/8371.html

As Julien mentioned, partitioning and fetching by IP would help.

---
Joe Naegele
Grier Forensics


RE: General question about subdomains

Markus Jelsma
In reply to this post by Joseph Naegele
Joseph - thank you very much!

This is exactly the crap we are looking for; now we can train our classifiers to detect at least these bastards.

But how would partitioning by IP really help if they don't all point to the same IP? All the hosts I manually checked are indeed on the same subnet, but many have a different fourth octet.

Regards,
Markus

 
 

RE: General question about subdomains

Joseph Naegele
Markus,

The example URLs I sent all resolve to the same IP address. This isn't always the case, however, so you're correct that partitioning by IP won't help us. Additionally, we'd like to avoid resolving these hosts in the first place, since most of them resolve to the same IP anyway.

We're now finding many webs of these spam/parked domains, all interconnected. Do you have more information on classifying domains? This is something we're now very interested in doing.

I'm still working on putting together a list of "bad" domains.

Thanks
---
Joe Naegele
Grier Forensics


RE: General question about subdomains

Markus Jelsma
In reply to this post by Joseph Naegele
Hello Joseph,

My colleague has not yet started to build a model for these crappy pages, but would still like to. We are going to run into this again soon enough, so any set of distinct crap sites you have would be most helpful. Preferably sites that are not closely interconnected, so we can build and evaluate a model at the same time.

The classifier is a subproject of our custom parser/content detector and extractor, packaged as a Nutch parser plugin. It does hierarchical classification, first detecting the host type and then the page type; the models are built using feature selection via a genetic algorithm to keep them fast and as lightweight as possible. A crap/spam host type is one we'd love to add.

Any set, even small, will do.

Thanks,
Markus
 

RE: General question about subdomains

Joseph Naegele
Thanks Markus. I'll put together a list shortly. Is your classifier plugin open-source or available to share? It sounds interesting and very useful.

---
Joe Naegele
Grier Forensics


RE: General question about subdomains

Markus Jelsma
In reply to this post by Joseph Naegele
Joseph, thank you very much. We can use the data, or we'll stumble upon it ourselves again soon enough.

The work is, unfortunately, not FOSS; it is one of the few things that lets us bear the costs of continuous R&D. You'd have to contact us off-list for further inquiries.

That aside, there may be other list subscribers interested in the set you have, so please share it with the list if you can.

Thanks,
Markus
 
 