Web scraping – part 1

Web scraping

When a hacker has identified a vulnerability in a system or platform, he can choose to either report it (as a white hat hacker would) or scan more systems for the same security hole. One way to do the latter is “web scraping”, which means scanning a website for specific content and/or extracting that information. Scraping a website can be compared with antivirus software, which keeps an archive of virus signatures and then uses them to identify malicious software on a server, computer or phone.

I have seen people mix up web scraping with web crawling, most likely because they are often used together. A web crawler's mission is to collect URLs, or alternatively to download the source code of pages that seem interesting to the user, and then hand them over to the web scraper for further investigation.

In this guide we will look into the web scraper. If you find the article useful, please let us know and we will write a follow-up where we code our web crawler.

Target platforms

I googled for vulnerabilities and thought that WordPress would be a good example, since it’s the most used web platform. At the top of the WordPress vulnerability results was WooCommerce, so I decided to include it as well, since I would expect security to be in focus for such a component.


Images borrowed from firsttracksmarketing and www.cms2cms.com


In this context, footprints are the code, design and files that can identify which version of an application, platform or plugin is in use. This is not in any way limited to the web: in the past, hackers would throw questions at a server in order to get information such as its location, operating system and platform. Hacking Exposed covers some “historical” methods if you haven’t read it 😉

Methods for detecting footprints

Here are the first three ideas that I will use to try to identify the version of WordPress and WooCommerce. Feel free to share if you have more ideas.

  1. Is the version printed anywhere on the front page or on a version page?
  2. Check the changelog. Are there new files introduced that can be accessed?
  3. Most updates contain design improvements, which often means modified CSS files.
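Idea 3 could be sketched roughly like this in Python: hash a stylesheet and look the hash up in a table built from releases you have downloaded yourself. This is only a minimal sketch under assumptions. The function names, the `releases` table and the CSS snippets are made-up placeholders, not real WordPress or WooCommerce files, and fetching the stylesheet from a live site is left out.

```python
import hashlib


def fingerprint(css_bytes):
    """Return the SHA-256 hex digest of a stylesheet's raw bytes."""
    return hashlib.sha256(css_bytes).hexdigest()


def build_signature_table(releases):
    """Map fingerprint -> version.

    `releases` is a dict of version string -> stylesheet bytes, which you
    would fill from release archives you downloaded yourself (assumption).
    """
    return {fingerprint(css): version for version, css in releases.items()}


def identify_version(css_bytes, table):
    """Return the matching version, or None if the stylesheet is unknown."""
    return table.get(fingerprint(css_bytes))


# Placeholder data: these are NOT real release files, just stand-ins.
releases = {
    "1.0": b"body { color: #000; }",
    "1.1": b"body { color: #111; }",
}
table = build_signature_table(releases)

if __name__ == "__main__":
    print(identify_version(b"body { color: #111; }", table))
    print(identify_version(b"body { color: #fff; }", table))
```

An exact hash only matches an untouched file, so a site that minifies or customizes its CSS will come back as unknown. If two versions ship an identical stylesheet, the table can only remember one of them.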

Searching for WordPress blogs took me here: http://blog.us.playstation.com/

<meta name="generator" content="WordPress 4.1" />

Searching for WooCommerce sites took me here: http://jerseybasement.com/

<meta name="generator" content="WooCommerce 2.1.12" />
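The two generator tags above are exactly what idea 1 is looking for. As a minimal sketch, assuming the page source has already been downloaded into a string, Python's standard `html.parser` module can pull them out. The class and function names here are my own, and the sample HTML is just the two tags from this article pasted together.

```python
from html.parser import HTMLParser


class GeneratorMetaParser(HTMLParser):
    """Collects the content of every <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "generator":
            self.generators.append(attr_map.get("content", ""))


def find_generators(html):
    """Return the content strings of all generator meta tags in `html`."""
    parser = GeneratorMetaParser()
    parser.feed(html)
    return parser.generators


def split_name_version(generator):
    """Split 'WordPress 4.1' into ('WordPress', '4.1').

    Returns (generator, None) when no trailing version number is present.
    """
    text = generator.strip()
    name, _, version = text.rpartition(" ")
    if name and version[:1].isdigit():
        return name, version
    return text, None


# Sample source built from the two tags shown in this article.
SAMPLE = """<html><head>
<meta name="generator" content="WordPress 4.1" />
<meta name="generator" content="WooCommerce 2.1.12" />
</head><body></body></html>"""

if __name__ == "__main__":
    for generator in find_generators(SAMPLE):
        print(split_name_version(generator))
```

Keep in mind that a site owner can remove or change the generator tag, so a missing tag does not prove anything and a present one is only a hint.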

In Part 2 we will create a tool that can go through the source code of a website and tell us whether it’s vulnerable.
