=================
smallwp user docs
=================

.. highlights:: This is a little program that gives you an offline
   version of the wikipedia in as little space as possible (well,
   probably you can do better, but for now, it should do).  It does not
   do images.  It can work off both HTML dumps and XML dumps with wiki 
   markup.


Obtaining the data
------------------

Up front: There are some language-dependent aspects of the software.
Right now, Asturian, German, English, and languages that use namespaces
like English or German are supported.  See README.HACKING for how to add
support for others.

At this point, you need to decide whether you want to use an XML-based
or an HTML-based dump.

HTML dumps have the advantage that rendering is basically like
wikipedia does it.  Math images are not supported, and HTML dumps take
quite a bit more space than XML dumps.  Images can be retrieved from
wikipedia when you have a net connection.

XML dumps are smaller, can do math images and are generally "the right
thing", but rendering is frequently shaky, in particular if you enable
template rendering -- the mediawiki markup is a nightmare.  Also, they
don't do Categories.

XML dumps are available from
http://dumps.wikimedia.org/backup-index.html.  Follow the link to
<language code>wiki, e.g., enwiki for the English wikipedia or eswiki
for the Spanish one.  Get the pages-articles.xml.bz2 flavor.

HTML dumps can be obtained at http://static.wikipedia.org/.

Then say::

	sudo splitWp <name of the dump>

and wait for quite a while; the large wikipedias will take on the order
of hours, depending on the machine you are running the split on.  The
good news is that the resulting directory can later be transferred to
slower machines.

If you cannot sudo or do not want to, change the basedir in ~/.smallwp
(see Configuration_ below) to something you can write to and leave out 
the sudo.
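
As a rough sketch, such a ~/.smallwp could look like the following.  The
[general] section and the basedir key are the ones listed under
Configuration_; the path is a made-up example, and the key = value
spelling is an assumption about the usual INI syntax::

  [general]
  # hypothetical writable location; pick any directory you own
  basedir = /home/you/wikipedia

With that in place, splitWp <name of the dump> should run without sudo
and write below the directory given.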


Moving the data
---------------

The data in /var/share/wikipedia (or whatever you changed your basedir
to) should be machine-independent, so you can split the stuff on your
big server and copy the resulting data to your PDA.  This should just
work.

Given enough interest, I could set up torrent feeds for some of the more
popular wikipedias and distribute pre-split data.  This would save quite
some bandwidth and computing time, but clearly only if enough people
used this program (which they don't, right now).

Usage
-----

Run wpGui (or wpTui, if you don't run X) and point your browser 
to http://localhost:8780 [#headless]_ .

You can browse around in the index or use the right-hand search field
with regular expressions (which is currently quite slow, though).  The
most convenient access method probably is the Title Search, though: Just
enter a few characters (case-sensitively!), and you'll get a list of up
to 40 titles that start with the letters you entered.

For now, you should run wpGui from a terminal -- you'll see tracebacks
and the like, which right now may explain quite a bit...


Configuration
-------------

Smallwp is configured through an INI-style configuration file ~/.smallwp
(if you don't like this location, you can change it using the SMALLWPCONF
environment variable).  To see what you can set in there, say::

  splitWp -H


This will output something like this:

Section [general]
.................

Configuration of smallwp

* address: string; 
  defaults to '127.0.0.1' --
  Address the server should listen on
* basedir: string; 
  defaults to '/var/share/wikipedia' --
  Path to root of wikipedia data
* doTemplates: boolean; 
  defaults to 'False' --
  Attempt to render templates? (slow, buggy, but probably worth it)
* mathOutputResolution: integer; 
  defaults to '110' --
  Resolution (in dpi) of math images
* maxTitleMatches: integer; 
  defaults to '1000' --
  Maximal number of matches in title search
* mediaBase: string; 
  defaults to 'http://upload.wikimedia.org/wikipedia/commons/' --
  URL fragment to prepend to image links and the like in HTML dumps
* port: integer; 
  defaults to '8780' --
  Port the server should listen on
* sitename: string; 
  defaults to 'Small Wikipedia Server' --
  Name of wikipedia instance for display purposes

I do not recommend changing the address unless you know what you're
doing and trust my code.  I believe there are no obvious security holes
in here, but then again I haven't really checked.  The default only
opens up access to your own machine (which may still be a security issue
if I messed up and you have evil users on your machine).

An important setting is doTemplates.  By default, it's False since
the code that supports templates is not finished yet.  On the other
hand, templates work for many pages.  They are slow though, because for
every template, the system has to decompress about 1.5 megabytes on
average.  Still, try setting it to True and see if you like what you
see.

Another setting you may want to change is mathOutputResolution -- the
PNGs for math get larger as you increase this.  Also, this would be an
obvious candidate for having a GUI control...
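
For illustration, a ~/.smallwp that switches on template rendering and
raises the math resolution might look like this.  The keys and the
[general] section are taken from the listing above; the value 150 is an
arbitrary example, and the key = value spelling is an assumption about
the usual INI syntax::

  [general]
  # try template rendering despite its rough edges
  doTemplates = True
  # larger value, larger math PNGs (default is 110)
  mathOutputResolution = 150

Settings you leave out should keep the defaults shown in the listing.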

.. [#headless] It wouldn't be hard to provide a headless server if you want
   to run the thing all the time.  It just happens that I don't.  If you
   want this, ask (or do it yourself...).

.. vi:tw=72:et: