
How to Deal with Web Scraping

Hummmm, since I run several galleries, one thing I encounter often is web scrapers. Personally I don’t mind if anybody takes an entire gallery home, or even re-posts it somewhere else; that is fine with me, this is the web. If I wanted the gallery to be private I would have made it so; if it’s public and free, then go right ahead…



However, the content itself is not the problem. The problem is that the vast majority of web scrapers have bad default settings, or their users configure them too aggressively; it’s not uncommon for the load on the server to go from 0.10 to 1 in a heartbeat, or for the server to go down entirely. I know it’s partly my fault: I like to restrict the server and the software as little as possible (I could use several methods to limit connections or to ban IPs that open too many), but because I don’t, sometimes I get into trouble. So this is what I normally do.

First of all, I have this in the .htaccess (mod_rewrite); it helps block most scraping software (unless it spoofs as a browser hehehe):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
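A quick way to check that the block works is to fake one of those user agents with curl and look for a 403 (example.com standing in for your own site):

curl -I -A "HTTrack" http://example.com/

If the rules are in effect, the response should come back HTTP/1.1 403 Forbidden, while a normal browser user agent still gets a 200.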

I monitor the server load; if it spikes for more than a couple of minutes, I check the Apache log to see if there are lots of connections to the same site from the same IP.
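A quick way to spot the worst offender is a one-liner over the access log (a rough sketch, assuming the default combined log at /var/log/apache2/access.log; adjust the path for your setup):

awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head

It prints the request count per IP, highest first. If one IP clearly stands out, I ban it with .htaccess, adding these lines: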

Order allow,deny
Deny from 100.100.100.100
Allow from all

(100.100.100.100 is the IP from the logs.) Then I check the load again after a couple of minutes; if it’s down, fine. If they jumped IPs, I’ll do one of two things: if they keep to the same IP range, I’ll just block that whole range, like so:

Order allow,deny
Deny from 100.100.100.
Allow from all

If they aren’t, I limit the number of simultaneous connections Apache will serve to 10. Note that MaxClients is a server-wide setting, so it can’t be scoped to just the cache or image directory; I know it will hurt all users, but it’s better than nothing, and it’s just one line:

MaxClients 10
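To be explicit about where it goes: MaxClients belongs in the main httpd.conf, not in a per-directory .htaccess; something like this, assuming the prefork MPM on Apache 2.x:

<IfModule mpm_prefork_module>
    MaxClients 10
</IfModule>

Apache needs a restart (or graceful reload) for it to take effect.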

If it still persists, I’ll just close the image directory, adding this line to the .htaccess of the cache or image directory (depending on the type of image gallery software you are using):

deny from all

So the site stays up, as well as the thumbnails; just the full images won’t be accessible for a while. All of these are temporary measures, but for now they do the trick for me. Most of the time banning the IP is enough of a cure, and those bans I always leave in the .htaccess; the other options I normally remove the next day, after the connection storm has passed. The bottom line is: if you want to scrape, instead of bombing the server for an hour, make the download go slowly over a couple of hours. It makes a big difference and everyone gets what they want.
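And if you’re the one doing the scraping, being polite is easy; wget, for example, can be throttled (just a sketch, with example.com/gallery/ as a stand-in URL):

wget --recursive --wait=5 --random-wait --limit-rate=50k http://example.com/gallery/

The --wait and --limit-rate flags spread the download out over time instead of slamming the server all at once.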

ImageBoard Spam List

Since I run several successful anonymous imageboards, one of the ways I prevent spam is a sort of spam list (a blacklist, really) of all the domains/URLs that cannot be posted on the board (since one of the main reasons to spam the boards is to post links to phishing or illegal sites); this prevents them from being posted by bots altogether.

So it’s an essential tool for keeping the board clean. Even if more automated tools like Akismet or Defensio are added later on, this is still a nice, clean, and fast way to keep most spam and idiotic posts off the site. The file is a simple spam.txt in the root of the site, one domain/site per line. This is of course most useful to other imageboard hosts that use the same system (Wakaba or Kusaba clone software), so here is our very own custom spam.txt list hehehe, for free yayy: 1131 domains (I’m actually thinking about making our own system, so anyone can add links and fetch the latest version for their board, so their system is always protected heheheh).
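For reference, the format couldn’t be simpler: one domain per line, nothing else, something like this (hypothetical entries, the real list is in the download):

cheap-pills-pharma.example
warez-keygen-site.example
phishing-bank-login.example

The board software (Wakaba/Kusaba style) then refuses any post containing a listed domain.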

UPDATE (from tentakle and neechan):

Uhh, since the last post we have added a bunch more, so there ya go, an updated spam.txt. This new list has 1372 domains that are/were used in spam posting on imageboards (the old link is up as well hehehe). You can use something like WinMerge or TextOpus to help merge your existing spam file with my awesome one ^^

How to Change Hosts Files

The hosts file is a computer file used to store information on where to find a node on a computer network. It maps hostnames to IP addresses (for example, www.google.com points to 10.0.0.1). The hosts file is used as a supplement to (or a replacement of) the Domain Name System (DNS) on networks of varying sizes. Unlike DNS, the hosts file is under the control of the local computer’s administrator (as in you). It has no extension and can be edited with most text editors.

The hosts file is loaded into memory (cached) at startup; Windows then checks the hosts file before it queries any DNS servers, which lets it override addresses in the DNS. This prevents access to the listed sites by redirecting any connection attempt back to the local (your) machine (IP address 127.0.0.1). Another feature of the hosts file is its ability to block other applications from connecting to the Internet, provided the entry exists.

So you can use a hosts file to block ads, banners, third-party cookies, third-party page counters, web bugs, and even most hijackers. Here are some instructions on how to do so, plus some sites with ready-made hosts files (you just overwrite your own hosts file with them):

The hosts file location is:
Windows XP at c:\Windows\System32\Drivers\etc
Windows Vista at c:\Windows\System32\Drivers\etc
Windows 2000 at c:\Winnt\System32\Drivers\etc
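One note: on Windows Vista that folder is protected by UAC, so you’ll need to edit the file as administrator; one easy way is to open it from an elevated command prompt:

notepad c:\Windows\System32\Drivers\etc\hosts

On XP and 2000 any administrator account can edit it directly.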

There you will find a file named hosts (no extension). Like we said above, you can edit it with any text editor, and its function is simple: you map IP addresses to hostnames, so the file will look mostly like this…

127.0.0.1    localhost
127.0.0.1    www.bad-spyware-site.com
127.0.0.1    www.site-with-virus.com
127.0.0.1    www.publicity-ads-site.com

If you want to block a domain, just add a new line pointing it at 127.0.0.1, the localhost address (this way, when that domain comes up in the browser, the browser will look for it on your own computer instead of online, because the hosts file told it to), so for example:

127.0.0.1    localhost
127.0.0.1    www.bad-spyware-site.com
127.0.0.1    www.site-with-virus.com
127.0.0.1    www.publicity-ads-site.com
127.0.0.1    google.com

So now, if I put google.com in the address bar of the browser, it will give me a blank page and google.com won’t work anymore. If you want to delete an entry, just delete the line or put a # in front to comment it out:

127.0.0.1    localhost
127.0.0.1    www.bad-spyware-site.com
127.0.0.1    www.site-with-virus.com
127.0.0.1    www.publicity-ads-site.com
#127.0.0.1    google.com (google.com will work now)
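One last note: since Windows caches hosts entries in the DNS resolver cache, you can flush it after editing so the changes take effect immediately, from a command prompt:

ipconfig /flushdns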

So the idea is to use the hosts file to block unwanted or bad sites ^-^ clean and easy hehehe

Here are some sites that provide awesome hosts files ^_^ .oO (choose one of them)

Hostman : An automated hosts file updating tool
Host File : A pretty cool and clean hosts file
Someone Who Cares : A comprehensive hosts file
MVPS : A hosts file geared towards blocking unwanted stuff

Shared Hosting VS Cloud Shared Hosting VS Virtual Private Server

Shared hosting, cloud shared hosting, and virtual private servers (VPS) are some of the most popular options for hosting websites and applications. While they all serve the purpose of making your content accessible on the internet, there are significant differences in performance, control, and scalability. Here on Hostcult we use all three, so I think a write-up comparing each, with pros and cons, is in order; so let’s check this out.

Shared Hosting

Shared hosting is a hosting environment where multiple websites are hosted on a single server. It is a cost-effective solution, making it suitable for small businesses, personal websites, and entry-level projects. In this setup, server resources such as CPU, RAM, and storage are shared among the websites hosted on the server. This sharing of resources allows hosting providers to offer affordable plans to a large number of customers. The hosting software stack is shared as well, which makes it super easy to run almost anything.

Pros of Shared Hosting

  • Cost-effective: Shared hosting plans are generally one of the most affordable options, making them ideal for individuals and small businesses with limited budgets.
  • Easy to manage: The hosting provider handles server maintenance, security updates, and technical support, relieving users of the burden of server management.
  • User-friendly: Shared hosting often comes with a user-friendly control panel that simplifies website management, domain setup, and email configuration.

Cons of Shared Hosting

  • Limited resources: Since resources are shared among multiple websites, the performance of your website can be affected by the activities of other users on the server. If one website experiences a sudden surge in traffic, it can impact the overall server performance, potentially slowing down your site.
  • Limited customisation: Shared hosting environments typically have limitations on software installations and configurations since they aim to provide a standardised setup for all users.
  • Security concerns: As multiple websites share the same server, if one site is compromised, there is a potential risk of other sites on the server being affected as well.

Cloud Shared Hosting

Cloud shared hosting builds upon the shared hosting model by utilising cloud infrastructure. Instead of relying on a single physical server, cloud hosting distributes resources across multiple servers in a network. This offers improved scalability and reliability compared to traditional shared hosting.

Pros of Cloud Shared Hosting

  • Scalability: Cloud hosting allows for easy scaling of resources, ensuring that your website can handle sudden traffic spikes without performance degradation.
  • Reliability: With multiple servers in a network, if one server fails, your website can be instantly migrated to another server, minimising downtime.
  • Flexibility: Cloud hosting often provides more advanced features, such as load balancing and automatic backups, to enhance website performance and data protection.

Cons of Cloud Shared Hosting

  • Cost variation: While cloud shared hosting can be cost-effective for moderate traffic, the usage-based pricing model can result in higher costs if your website experiences significant traffic or resource usage.
  • Technical complexity: Cloud hosting may require more technical knowledge and expertise to set up and manage compared to traditional shared hosting.
  • Unique infrastructures: a cloud infrastructure can be set up anywhere from pretty simple, with a couple of servers, to incredibly complex, with thousands; that makes it hard to compare the benefits of each cloud provider, since one can have better performance, another better reliability, another better connections, and another better replication.

Virtual Private Server (VPS)

A VPS is a hosting environment where a physical server is divided into multiple virtual servers, each acting as an independent server environment. Each VPS has dedicated resources allocated to it, providing more control and performance compared to shared hosting.

Pros of VPS Hosting

  • Dedicated resources: With a VPS, you have guaranteed resources that are not shared with other users, ensuring consistent performance for your website and application.
  • Customisation and control: VPS hosting grants you root access, allowing you to install and configure software as per your requirements. You have more control over server settings and can tailor the environment to suit your specific needs.
  • Scalability: VPS hosting offers scalability options, allowing you to easily adjust your resource allocation as your website’s traffic and demands grow.

Cons of VPS Hosting

  • Cost: VPS hosting tends to be more expensive than shared hosting due to the dedicated resources and increased control it provides. It may not be the most cost-effective option for websites with low traffic or limited budgets.
  • Server management: While VPS hosting grants more control, it also requires a higher level of technical expertise to manage the server effectively. Users are responsible for tasks like server maintenance, security updates, and software installations.
  • Performance limitations: Although VPS hosting provides dedicated resources, the overall performance can still be affected by the physical server’s hardware limitations. If the physical server is overloaded, it can impact the performance of all VPS instances hosted on it.

So shared hosting, including cloud shared hosting, is suitable for entry-level websites and projects with budget constraints. It offers cost-effective pricing and user-friendly management, but resource limitations and potential security concerns should be considered. Cloud shared hosting provides enhanced scalability and reliability compared to traditional shared hosting, but can be costlier and requires more technical expertise to run.

On the other hand, VPS hosting offers dedicated resources, increased customisation, and scalability, making it a preferable option for websites with higher traffic and specific requirements. It provides more control and performance, but at a higher cost and with additional server management responsibilities.

To sum it up, here on Hostcult we use shared hosting for small sites, placeholders, and testbeds; cloud shared hosting for bigger sites and production; and VPS for specific services or file/image/video hosting that needs specific software or performance to run.

How S2R uses Open Source Software

Well, we here at s2r use a lot of open source software. That is not to say we don’t use some proprietary or custom-made software from time to time, but that’s mostly when you get into that kind of situation where, if you want your project to come to life and there is no alternative, you have to make it on your own. For the most part, though, the logic is that using a stable free platform not only saves the time of not doing it ourselves, but more importantly gives you the freedom to work on the frontside, the community, and the service, and not so much on what and how it works… like… you want to bake a cake: you could buy it ready-made, but making it and fiddling with the dosages and making the recipe your own is way more fun and more pleasurable… right ^_^ But you don’t want to go all out and plant the grain and the sugar cane, make the flour and sugar, and keep chickens for the eggs… that is awesome in its own way, but I like my balance of awesomeness hehehe. Soo, what are our favorite and most used open source software, and why do we use them?

Used extensively

WordPress > There is a reason why it’s so widely used: it’s a solid blog/light CMS that can be configured to do 1001 different things

SMF > My favorite forum software, just ’cause it’s rock solid and feature-complete

Drupal > By far my personal favorite CMS. It’s still not perfect, but it works reliably and is very moddable; still, it’s not that user-friendly and it takes a long time to make stuff shine

Zenphoto > Don’t know… I just like how it works: simple, to-the-point gallery software. If it doesn’t have a feature, it’s kinda easy to build it, and it just works

Not used extensively but still used and recommended

bbPress > I wish it were more solid; still a good and simple forum software (I’m always torn between bbPress and Vanilla)

Joomla > Second best to Drupal, a very good CMS; it’s just not as mature and it sometimes breaks hard

Elgg > Social networking software; it’s very plain and still needs a lot of work

FluxBB > A PunBB fork; since PunBB was sold to a commercial company, it kinda seems to be going in another direction now…

Formerly used but dropped for some reason

phpBB > Even with the new version it’s still a mess of a software, and it needs constant attention and care to make it work as it should

Gallery > Too many options, but not the ones I wished for; I’d kinda like it more if it were a lightweight Flickr, but it’s more of an oversized personal gallery

PHP-Nuke > Basically one developer, with security breaches from time to time; too clunky and unworkable

Movable Type > It was pretty cool before it went full commercial, and now that it’s kinda open source again it’s much better, but I just don’t trust my sites to a company that flip-flops all the time ^^

PunBB > I don’t like the direction they’re going in; besides, it’s now run by a commercial entity (not that a lot of other open source projects don’t do the same; it’s mostly ’cause I don’t like the direction hehehe)