The core of this project is ancient.  When I started it, I wanted a
quick one-file thing for myself and didn't really think of publishing
it at some point.  Thus, the code is quite ugly in places, has quite a
few pre-descriptor-era remnants, and could do with an overhaul.  I'll
do this if someone actually wants to help out with development.

Principles
==========

If you develop smallwp, consider the following points that guide *me*
(you can always fork if you think it's worth it):

* minimal dependencies (I'd like to get rid of the extension
  module...)
* no db server required (i.e., if at all, nothing more than sqlite)
* minimal mass storage footprint

I18N
====

All language-specific settings should be in wpprefs.  To adapt for
other languages, it should be sufficient to amend the _wikiLangData
dictionary.

Tests
=====

Well, there are no useful unit tests, and I've even neglected the few
doctests that are in there.  However, before submitting patches, I'd
appreciate it if you could run tests/roughtest.sh.  To be able to use
it, you need to download both the XML and the HTML versions of the
Asturian (I believe:-) wikipedia (language code ast).  Enter their
file names at the top of the script and let it run.  Kill all smallwp
servers before doing that.

The script may talk a bit, but in the end it should say "All tests
run.".  If it doesn't, it will emit some URL whose output was not what
it expected.  The server that produced it still runs after such an
error message, so you can immediately try the URL.  On the other hand,
you will have to kill the server manually before trying again.

Hacking Projects
================

Here's a list of things that would be cool to do:

* move the inline files from wpprefs to smallwp/resources.
* expand coverage of wikitext syntax.  Don't overdo it.  In
  particular, don't even think of trying to use a "real" parser to do
  that.
  I'm not sure where wikitext sits in the Chomsky hierarchy, but I'm
  quite sure it's a nightmare to parse with common context-free
  parsers.
* do an analysis of whitespace handling in the template code.  I
  estimate that at least half of the template woes result from my
  utter inability to work out the rules for how whitespace survives
  or is deleted in template expansion.
* fiddle in support for categories for XML dumps.
* fiddle in the possibility to get images from the wikipedia main
  page in XML dumps, possibly into a quasi-persistent cache.
* do away with the Looker extension.  Really, for almost all of the
  machines I'm targeting, mmapping the index and then working with it
  like it's a string should be fine.  As an added bonus, fast RE
  searches in the titles would be possible.

API
===

To access the wikipedia data from your scripts, say::

  from smallwp import wpdata

  wpData = wpdata.getWpData()

and then use

wpData.getArticle(title)
  to get the raw wikitext of the article title

wpData.iterPagenames()
  to iterate over all known page names

wpData.getSearchResult(someRe)
  to (slowly) get all page titles matching someRe.  You get back a
  SearchResult instance that you can iterate over, or use its
  getResults method to get a list of matched articles.

wpData.getMatches(key, numMatches=40)
  to quickly get up to numMatches titles beginning with key

In general, you pass in unicode strings and you get back unicode
strings.

.. vi:et:tw=72
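To show the API calls above in context, here is a minimal sketch.  The
WpData class below is a tiny in-memory stand-in written for this
illustration only: the method names and the unicode-in/unicode-out
convention come from this document, while the class body, the sample
articles, and the plain list returned by getSearchResult (in place of
a real SearchResult instance) are made up.

```python
import re


class WpData:
    """In-memory stand-in mimicking the interface sketched above."""

    def __init__(self, articles):
        # maps article title -> raw wikitext
        self.articles = articles

    def getArticle(self, title):
        return self.articles[title]

    def iterPagenames(self):
        return iter(self.articles)

    def getSearchResult(self, someRe):
        # slow path: scan every title against the regular expression;
        # the real API returns a SearchResult instance, this stand-in
        # just returns the matching titles as a list
        pat = re.compile(someRe)
        return [t for t in self.articles if pat.search(t)]

    def getMatches(self, key, numMatches=40):
        # fast prefix lookup in real smallwp; here a simple scan
        hits = [t for t in sorted(self.articles) if t.startswith(key)]
        return hits[:numMatches]


wpData = WpData({
    "Asturies": "'''Asturies''' ye...",
    "Astronomia": "La '''astronomia''' ye...",
})

wikitext = wpData.getArticle("Asturies")
prefixed = wpData.getMatches("Ast", numMatches=1)
searched = wpData.getSearchResult("turi")
```

Note that everything is plain unicode strings on both sides of the
interface, matching the convention stated above.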
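The mmap idea in the last Hacking Projects item can be illustrated as
follows.  This is a sketch under assumptions: the one-title-per-line
layout and the temporary file are invented for the example and are not
smallwp's actual index format; the point is only the technique of
mmapping a file and running RE searches over it as if it were a
string, with no extension module involved.

```python
import mmap
import os
import re
import tempfile

# Invented stand-in for an on-disk title index: one title per line.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"Asturies\nAstronomia\nLlingua asturiana\n")
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The re module accepts the mmap object directly (it supports the
    # buffer protocol), so the search runs over the file's pages
    # without reading it into memory as a whole.
    titles = re.findall(rb"(?m)^Ast\w+$", mm)
    mm.close()

os.unlink(path)
```

Once the index is mapped this way, arbitrary RE searches over all
titles come essentially for free, which is the "added bonus" mentioned
above.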