Technical Information
User agent string / webcrawler information
The Changewatching web crawler identifies itself with the user agent string "Changewatching". The full string also carries a version number, the site URL, and the other information usually found in user agent strings.
The crawler will obey robots.txt directives. If it doesn't, something has gone wrong and you should probably get in touch to let me know, so that I can put a stop to its terrifying rampage.
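For illustration, here is how a Python crawler can honour robots.txt directives using the standard library's robotparser. The rules and URLs below are invented examples; only the "Changewatching" user agent token comes from the text above.

```python
from urllib.robotparser import RobotFileParser

# Parse an example robots.txt that disallows one directory for this crawler.
rp = RobotFileParser()
rp.parse([
    "User-agent: Changewatching",
    "Disallow: /private/",
])

# A polite crawler checks before fetching each page.
print(rp.can_fetch("Changewatching", "https://example.com/private/page"))  # False
print(rp.can_fetch("Changewatching", "https://example.com/public/page"))   # True
```

In a real crawler the rules would be fetched from the target site (e.g. via `rp.set_url(...)` and `rp.read()`) rather than supplied inline.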
How the change sensitivity settings work
The highest sensitivity setting takes everything found between the HTML body tags (if present and correct) and compares one version to another with a simple 'are these identical?' check. This is useful if you want to be alerted to ANY change, no matter how small - even tiny alterations to the underlying HTML (which may not be visible when the page is viewed in a web browser) will be reported.
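A minimal sketch of that high-sensitivity check, assuming a simple regular-expression grab of the body content (the actual implementation is not published):

```python
import re

def body_of(html):
    # Grab everything between <body> and </body>; fall back to the whole
    # document if the tags are absent or malformed.
    m = re.search(r"<body[^>]*>(.*)</body>", html, re.S | re.I)
    return m.group(1) if m else html

def changed_high(old, new):
    # Highest sensitivity: any difference at all counts as a change,
    # including invisible whitespace or attribute tweaks.
    return body_of(old) != body_of(new)

print(changed_high("<body>Hello</body>", "<body>Hello </body>"))  # True
```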
The medium setting tries to extract all the text from a page using a complex HTML parser, stripping out all the underlying HTML, then performs a fuzzy comparison on the text that remains.
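The text-extraction step can be sketched with Python's built-in html.parser. This is an illustrative stand-in, not Changewatching's own parser (which the author wrote from scratch):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collect only the text nodes, discarding tags plus script/style bodies.
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def extract_text(html):
    p = TextExtractor()
    p.feed(html)
    # Normalise whitespace so the later comparison sees clean word sequences.
    return " ".join(" ".join(p.parts).split())

print(extract_text("<p>Hello <b>world</b></p><script>var x = 1;</script>"))  # Hello world
```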
The low setting also tries to extract the text, but instead of using an HTML parser it guesses what is text and what is HTML using a simple linguistic/statistical method. The result is then passed through a very simple HTML parser that takes an aggressive approach to removing anything that looks like HTML; this also removes a small portion of the readable text. Finally, the result is compared using a fuzzy equality check that works on sequences of words and allows for a small amount of variation.
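The final word-sequence comparison could look something like this sketch using difflib from the standard library. The similarity threshold is an assumption for illustration, not the service's actual tolerance:

```python
import difflib

def roughly_equal(old_text, new_text, threshold=0.9):
    # Compare the two pages as sequences of words, tolerating a small
    # amount of variation (reordered punctuation, a changed timestamp, etc.).
    ratio = difflib.SequenceMatcher(None, old_text.split(), new_text.split()).ratio()
    return ratio >= threshold
```

A change alert would fire only when `roughly_equal` returns False, which is what keeps small incidental edits from generating false positives.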
In general, the high setting is only useful if you wish to be notified about technical as well as textual changes to a website. The medium and low settings suit most use cases, but the low setting gives fewer false positives, particularly against pages such as newspapers' live / running-commentary columns.
Because the medium setting is heavily geared towards HTML structures, it will not work well against other file types such as MS Word documents, PDFs or GIFs. The low or high setting is a better choice for such files.
The complex HTML parser used by the medium setting should also work effectively against anything that looks like HTML, including RSS feeds and other XML. Occasionally it may struggle with very complex or broken HTML, in which case the software automatically falls back to the filter used by the low setting.
System description
Changewatching crawls individual web pages at regular intervals, at the request of users. The maximum crawl frequency is set by the user, but the time of each crawl is adjusted by functions that mask the number of users and disperse the crawl workload across 12-hour periods. When changes are detected, email alerts are sent to those users. Changes are detected by comparing the current page to previously stored versions using a variety of filters and comparison methods. The Changewatching web crawler does not function like a search engine spider - it loads only the page of interest, and does not follow any links it finds. Spidered content is not displayed to users, nor is it republished in any form. Changewatching is primarily designed for textual comparison. It does not interpret Javascript, so it will not work well on sites that deliver their content exclusively via Javascript. No website should do that, but some do.
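The workload-dispersal idea can be illustrated with a deterministic per-page offset. This is purely a sketch of one way to spread crawls over a 12-hour window; it is not the actual scheduling code, and the hash-based approach is an assumption:

```python
import hashlib

HALF_DAY = 12 * 60 * 60  # seconds in a 12-hour window

def crawl_offset(page_url):
    # Derive a stable offset within the 12-hour window from the page URL,
    # so crawls are spread out rather than clustered at the same instant,
    # and the schedule reveals nothing about how many users are watching.
    digest = hashlib.sha256(page_url.encode()).digest()
    return int.from_bytes(digest[:4], "big") % HALF_DAY
```

Because the offset is a pure function of the URL, a page is always visited at the same point in its window, while different pages land at effectively random points across the whole period.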
Changewatching runs on one webserver, a database server, and a combined backend database/data processor server. It is written in Python. MySQL provides the databases. The front end is pure HTML and CSS with no Javascript or cookie use. There are no web frameworks involved - it's built from the ground up. (Get off my lawn, etc.) User data sent between the webserver and backend is secured by SSL, while commands sent from the backend to the front end are authenticated by IP address and by a rolling key based on a shared secret, which is compared using a salted hash.
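A rolling key derived from a shared secret, verified via a salted hash, might be sketched like this. The window length, the salt handling, and the HMAC-based key derivation are all assumptions for illustration; the real scheme is not published:

```python
import hashlib
import hmac
import time

WINDOW = 300  # assumed key-rotation period, in seconds

def rolling_key(secret, now=None):
    # The key changes every WINDOW seconds, derived from the shared secret,
    # so a captured key is only useful briefly.
    counter = int((now if now is not None else time.time()) // WINDOW)
    return hmac.new(secret, str(counter).encode(), hashlib.sha256).digest()

def salted_digest(key, salt):
    # Store/compare a salted hash rather than the key itself.
    return hashlib.sha256(salt + key).digest()

def verify(presented_key, salt, stored_digest):
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(salted_digest(presented_key, salt), stored_digest)
```

Combined with the IP-address check described above, a command is accepted only when it arrives from the expected host and carries the key for the current window.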
For convenience and anti-spamming reasons, the user interface is split between the web front end that handles new requests and subscription settings changes, and an email parser that handles status requests, deactivations, identity confirmations and other miscellaneous user commands.
The web interface is designed to display correctly on all current versions of popular desktop and mobile web browsers, and also versions of Internet Explorer as old as version 8.
Changewatching uses 3 different HTML parsers of varying complexity, all written by the author. All Python was written in SPE. All HTML and CSS was written manually in Kwrite, apart from the blog which is powered by the Sparo CMS (which was also written by the author).
The main Changewatching program consists of around 5,000 lines of Python and was written in under 4 months. About 1,000 of those lines are tests.