I recently wrote a simple web crawler bot in Python. Given a base URL, it crawls the site (provided a robots.txt file can be found) and generates an XML sitemap. I did this mainly for fun, although I do plan on using it to periodically generate a sitemap for this blog/website.

I built crawlie.py into its own executable, crawlie, using pyinstaller.
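
The build is roughly this one-liner (the flags are PyInstaller's standard --onefile and --name options; treat it as indicative rather than my exact invocation):

pyinstaller --onefile --name crawlie crawlie.py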

Crawlie is a well-behaved bot which respects the rules given in robots.txt.

Crawlie is launched via the CLI with:

crawlie [ <base URL> </path/to/sitemap/and/robots/dir> [-f] ] [-v]

Where:

-f --force-sitemap  Generates a sitemap even if there are no new links since the last crawl.
-v --version        Displays the program version and exits.
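
For example, to crawl this site and force regeneration of the sitemap (the directory argument here is illustrative, not my actual server path):

crawlie https://stpettersen.xyz /var/www/stpettersen.xyz -f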

If we run crawlie on https://something.com, a site without a robots.txt file, we will get the following output:

  ____                    _ _
 / ___|_ __ __ ___      _| (_) ___
| |   | '__/ _` \ \ /\ / / | |/ _ \
| |___| | | (_| |\ V  V /| | |  __/
 \____|_|  \__,_| \_/\_/ |_|_|\___|

xyz.stpettersen.crawlie/v0.1

[2024-04-01 16:59:31]

There is no robots.txt file at 'https://something.com'!
Crawlie is a well-behaved bot who obeys robots.txt.
Crawlie has stopped as there is no robots.txt at the provided base URL.

The site is not crawled, as there was no robots.txt for crawlie to follow.
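
That initial check is simple: fetch robots.txt from the base URL and stop if it isn't there. A minimal sketch of that step in Python (an illustration, not Crawlie's actual code) might look like this:

from urllib import error, request

def has_robots_txt(base_url: str) -> bool:
    # Fetch <base URL>/robots.txt; a 404 (or any other HTTP/URL error)
    # means there are no rules for a well-behaved bot to obey, so stop.
    try:
        with request.urlopen(base_url.rstrip('/') + '/robots.txt') as resp:
            return resp.status == 200
    except error.URLError:
        return False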

If we run crawlie on https://something.net, which does have a robots.txt file, we get the following:

  ____                    _ _
 / ___|_ __ __ ___      _| (_) ___
| |   | '__/ _` \ \ /\ / / | |/ _ \
| |___| | | (_| |\ V  V /| | |  __/
 \____|_|  \__,_| \_/\_/ |_|_|\___|

xyz.stpettersen.crawlie/v0.1

[2024-04-01 17:01:17]

Crawlie is reading and obeying 'https://something.net/robots.txt'

Parsing robots file:
------------------------------------------------------------
User-agent: Googlebot  [Crawlie does not match the user agent]
User-agent: Bingbot    [Crawlie does not match the user agent]
User-agent: Slurp      [Crawlie does not match the user agent]
User-agent: Baiduspider [Crawlie does not match the user agent]
Crawl-delay: 10        [Crawlie does not match the user agent]
Disallow:              [Crawlie does not match the user agent]
User-agent: *          [Crawlie matches the user agent]
Disallow: /            [Crawlie will not crawl '/']
------------------------------------------------------------

Whitelist = []
Blacklist = ['/']

Crawlie will not crawl 'https://something.net' for links.

No new links, skip generating sitemap.

With the robots.txt file parsed, Crawlie (whose user agent is xyz.stpettersen.crawlie/v0.1) matched the wildcard user agent (*) and obeys the Disallow: / rule, so it cannot crawl the site. Again, the site is not crawled.
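
Crawlie uses its own parser (that is the annotated dump above), but the same yes/no decision can be sketched with Python's standard urllib.robotparser, using the user agent string shown in the banner:

from urllib import robotparser

USER_AGENT = 'xyz.stpettersen.crawlie/v0.1'

rp = robotparser.RobotFileParser()
rp.set_url('https://something.net/robots.txt')
rp.read()

# With 'User-agent: *' and 'Disallow: /', can_fetch() returns False,
# so the crawl stops before it starts.
print(rp.can_fetch(USER_AGENT, 'https://something.net/'))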

I can run crawlie on https://stpettersen.xyz and get the following output:

  ____                    _ _
 / ___|_ __ __ ___      _| (_) ___
| |   | '__/ _` \ \ /\ / / | |/ _ \
| |___| | | (_| |\ V  V /| | |  __/
 \____|_|  \__,_| \_/\_/ |_|_|\___|

xyz.stpettersen.crawlie/v0.1

[2024-04-01 17:06:09]

Crawlie is reading and obeying 'https://stpettersen.xyz/robots.txt'

Parsing robots file:
------------------------------------------------------------
User-agent: *          [Crawlie matches the user agent]
Allow: /               [Crawlie will crawl '/']
Disallow: /403.html    [Crawlie will not crawl '/403.html']
Disallow: /404.html    [Crawlie will not crawl '/404.html']
Disallow: /img         [Crawlie will not crawl '/img']

Sitemap: https://stpettersen.xyz/sitemap.xml
[Crawlie found an existing sitemap]
EOF
------------------------------------------------------------

Whitelist = ['/']
Blacklist = ['/403.html', '/404.html', '/img']

Crawlie will crawl 'https://stpettersen.xyz' for links...
Adding link 'https://stpettersen.xyz/'.
Adding link 'https://stpettersen.xyz/about-me/'.
Adding link 'https://stpettersen.xyz/ip-locator/'.
Adding link 'https://stpettersen.xyz/playbooks/'.
Adding link 'https://stpettersen.xyz/reading-list/'.
Adding link 'https://stpettersen.xyz/blog/2024/03/27
/void-linux-installation-scripts.html'.

Adding link 'https://stpettersen.xyz/blog/2024/03/25
/locale-info-from-ip.html'.

Crawlie will crawl 'https://stpettersen.xyz/' for links...

Crawlie will crawl 'https://stpettersen.xyz/about-me/' for links...

Crawlie will crawl 'https://stpettersen.xyz/ip-locator/' for links...

Crawlie will crawl 'https://stpettersen.xyz/playbooks/' for links...

Crawlie will crawl 'https://stpettersen.xyz/reading-list/' for links...

Crawlie will crawl 'https://stpettersen.xyz/blog/2024/03/27
/void-linux-installation-scripts.html' for links...

Crawlie will crawl 'https://stpettersen.xyz/blog/2024/03/25
/locale-info-from-ip.html' for links...

Generated sitemap ('./sitemap.xml').
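
The sitemap itself is just the standard sitemaps.org format. A rough sketch of that final step (again illustrative, not Crawlie's actual code) writes each collected link into a <url><loc> entry:

import xml.etree.ElementTree as ET

def write_sitemap(links, path='./sitemap.xml'):
    # One <url><loc> entry per crawled link, under the standard
    # sitemaps.org namespace.
    urlset = ET.Element('urlset',
                        xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')
    for link in links:
        ET.SubElement(ET.SubElement(urlset, 'url'), 'loc').text = link
    ET.ElementTree(urlset).write(path, encoding='utf-8',
                                 xml_declaration=True)

write_sitemap(['https://stpettersen.xyz/',
               'https://stpettersen.xyz/about-me/'])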

This is what I will use to update the sitemap.xml as new posts are added to this site. I have installed the application on my server and set up cron jobs to invoke it at 15:00 and 18:00 daily. Fingers crossed that it works :)
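
The crontab entries boil down to something like this (the install path and output directory are illustrative, not the exact paths on my server):

0 15,18 * * * /usr/local/bin/crawlie https://stpettersen.xyz /var/www/stpettersen.xyz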