URL-Lists for Google sitemaps

Google Sitemaps needs a list of URLs to optimize crawling. Usually, this is no problem, since Google supplies a script you can run on your server to build that list.

That fails, however, if the content of your site is stored not in HTML files but in TXT files, as DokuWiki does. So here is what I did to build that list of URLs:

Find them

We do all of this in the directory where DokuWiki stores the pages:

cd /home/hdocs/beta.linuxbasics.org/data
find ./ -iname "*.txt"

gives us

./wiki/syntax.txt
./wiki/dokuwiki.txt
./wiki/playground.txt
./start.txt
./tutorials/pre/start.txt
... 

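which is almost the list of URLs, except that the leading “./” has to be replaced with the base URL and the trailing “.txt” has to be removed.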

SED

The stream editor 'sed' can help us with those replacements. Perl's s/// operator is modelled on sed's s command, so if you know Perl, this will look familiar:

sed -e 's#^./#http://LinuxBasics.org/#g ; s/.txt$//g'

  • “sed -e” runs the given commands on every line of standard input and prints the result to standard output.
  • “s#^./#http://LinuxBasics.org/#g” replaces the “./” at the beginning of the line ('^') with the base URL followed by a slash.

This uses '#' as the delimiter instead of '/'. Why? Because it looks much better than the version with slashes: “s/^.\//http:\/\/LinuxBasics.org\//g”

  • “s/.txt$//g” replaces the “.txt” at the end of the line ('$') with the empty string, which removes “.txt”.
  • “;” separates the two commands.
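To try the two substitutions without touching the wiki directory, a few sample paths can be piped through sed by hand. This is only a sketch: the helper name to_urls is made up for illustration, and unlike the command above it escapes the dots ('\.') so that they match a literal dot only, which is a small hardening and not part of the original command.

```shell
# Hypothetical helper (name is an assumption): reads find-style paths on
# standard input and writes the matching URLs to standard output.
to_urls() {
  # First -e: replace the leading "./" with the base URL.
  # Second -e: strip the trailing ".txt".
  # Both patterns are anchored, so the g flag is not needed here.
  sed -e 's#^\./#http://LinuxBasics.org/#' -e 's#\.txt$##'
}

# Two sample paths taken from the find output shown earlier.
printf '%s\n' './wiki/syntax.txt' './start.txt' | to_urls
```

For these two sample lines it should print http://LinuxBasics.org/wiki/syntax and http://LinuxBasics.org/start.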

Putting it together

find ./ -iname "*.txt" | sed -e 's#^./#http://LinuxBasics.org/#g ; s/.txt$//g'

gives us what we want:

http://LinuxBasics.org/wiki/syntax
http://LinuxBasics.org/wiki/dokuwiki
http://LinuxBasics.org/wiki/playground
http://LinuxBasics.org/start
http://LinuxBasics.org/tutorials/pre/start
http://LinuxBasics.org/tutorials/pre/md5sum
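
If the list should end up in a file, the pipeline can be wrapped in a small function and its output redirected. The following is a rough sketch under assumptions: the function name build_url_list, its two parameters and the temporary demo directory are all invented for illustration, and the base URL must not contain '#' or '&', because it is pasted directly into the sed expression.

```shell
# Hypothetical helper (not part of DokuWiki or of Google's tools).
# $1 = directory holding the page files, $2 = base URL without trailing slash.
build_url_list() {
  # Run find inside the page directory so that every path starts with "./".
  (cd "$1" && find . -type f -name '*.txt') |
    sed -e "s#^\./#$2/#" -e 's#\.txt$##' |
    sort
}

# Demo on a throwaway directory instead of the real wiki data.
demo=$(mktemp -d)
mkdir -p "$demo/wiki"
touch "$demo/start.txt" "$demo/wiki/syntax.txt"

# Write the list outside the page directory so find does not pick it up.
urllist=$(mktemp)
build_url_list "$demo" 'http://LinuxBasics.org' > "$urllist"
cat "$urllist"

rm -rf "$demo" "$urllist"
```

Sorting is not required, but it keeps the list stable between runs, which makes changes easier to spot.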
/home/www/LinuxBasics.org/data/pages/tutorials/advanced/realworld/url-lists_for_google.txt · Last modified: 2008/07/20 21:08 (external edit)