The core of this project is ancient.  When I started it, I wanted a
quick one-file thing for myself and didn't really think of publishing
it at some point.  Thus, the code is quite ugly in places, has quite a
few pre-descriptor-era remnants, and could do with an overhaul.  I'll
do this if someone actually wants to help out with development.

Principles
==========

If you develop smallwp, consider the following points that guide *me*
(you can always fork if you think it's worth it):

* minimal dependencies (I'd like to get rid of the extension
  module...)
* no db server required (i.e., if at all, nothing more than sqlite)
* minimal mass storage footprint

I18N
====

All language-specific settings should be in wpprefs.  To adapt for
other languages, it should be sufficient to amend the _wikiLangData
dictionary.

Tests
=====

Well, there are no useful unit tests, and I've even neglected the few
doctests that are in there.  However, before submitting patches, I'd
appreciate it if you could run tests/roughtest.sh.  To be able to use
it, you need to download both the XML and the HTML versions of the
Asturian (I believe:-) wikipedia (language code ast).  Enter their
file names at the top of the script and let it run.  Kill all smallwp
servers before doing that.

The script may talk a bit, but in the end it should say "All tests
run.".  If it doesn't, it will emit some URL whose output was not what
it expected.  The server that produced it still runs after such an
error message, so you can immediately try the URL.  On the other hand,
you will have to kill the server manually before trying again.

Hacking Projects
================

Here's a list of things that would be cool to do:

* move the inline files from wpprefs to smallwp/resources.
* expand coverage of wikitext syntax.  Don't overdo it.  In
  particular, don't even think of trying to use a "real" parser to do
  that.
  I'm not sure where wikitext sits in the Chomsky hierarchy, but I'm
  quite sure it's a nightmare to parse with common context-free
  parsers.
* do an analysis of whitespace handling in the template code.  I
  estimate that at least half of the template woes result from my
  utter inability to work out the rules for how whitespace survives
  or is deleted in template expansion.
* fiddle in support for categories for XML dumps.
* fiddle in the possibility to get images from the wikipedia main
  page in XML dumps, possibly into a quasi-persistent cache.
* do away with the Looker extension.  Really, for almost all of the
  machines I'm targeting, mmapping the index and then working with it
  like it's a string should be fine.  As an added bonus, fast RE
  searches in the titles would be possible.

API
===

To access the wikipedia data from your scripts, say::

  from smallwp import wpdata

  wpData = wpdata.getWpData()

and then use

wpData.getArticle(title)
  to get the raw wikitext of the article title

wpData.iterPagenames()
  to iterate over all known page names

wpData.getSearchResult(someRe)
  to (slowly) get all page titles matching someRe.  You get back a
  SearchResult instance that you can iterate over, or use its
  getResults method to get a list of matched articles.

wpData.getMatches(key, numMatches=40)
  to quickly get up to numMatches titles beginning with key

In general, you pass in unicode strings and you get back unicode
strings.

.. vi:et:tw=72
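To show the API calls above in context, here is a minimal sketch.  The
WpData class below is a tiny in-memory stand-in written for this
illustration only: the method names and the unicode-in/unicode-out
convention come from this document, while the class body, the sample
articles, and the plain list returned by getSearchResult (in place of
a real SearchResult instance) are made up.

```python
import re


class WpData:
    """In-memory stand-in mimicking the interface sketched above."""

    def __init__(self, articles):
        # maps article title -> raw wikitext
        self.articles = articles

    def getArticle(self, title):
        return self.articles[title]

    def iterPagenames(self):
        return iter(self.articles)

    def getSearchResult(self, someRe):
        # slow path: scan every title against the regular expression;
        # the real API returns a SearchResult instance, this stand-in
        # just returns the matching titles as a list
        pat = re.compile(someRe)
        return [t for t in self.articles if pat.search(t)]

    def getMatches(self, key, numMatches=40):
        # fast prefix lookup in real smallwp; here a simple scan
        hits = [t for t in sorted(self.articles) if t.startswith(key)]
        return hits[:numMatches]


wpData = WpData({
    "Asturies": "'''Asturies''' ye...",
    "Astronomia": "La '''astronomia''' ye...",
})

wikitext = wpData.getArticle("Asturies")
prefixed = wpData.getMatches("Ast", numMatches=1)
searched = wpData.getSearchResult("turi")
```

Note that everything is plain unicode strings on both sides of the
interface, matching the convention stated above.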
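The mmap idea in the last Hacking Projects item can be illustrated as
follows.  This is a sketch under assumptions: the one-title-per-line
layout and the temporary file are invented for the example and are not
smallwp's actual index format; the point is only the technique of
mmapping a file and running RE searches over it as if it were a
string, with no extension module involved.

```python
import mmap
import os
import re
import tempfile

# Invented stand-in for an on-disk title index: one title per line.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"Asturies\nAstronomia\nLlingua asturiana\n")
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The re module accepts the mmap object directly (it supports the
    # buffer protocol), so the search runs over the file's pages
    # without reading it into memory as a whole.
    titles = re.findall(rb"(?m)^Ast\w+$", mm)
    mm.close()

os.unlink(path)
```

Once the index is mapped this way, arbitrary RE searches over all
titles come essentially for free, which is the "added bonus" mentioned
above.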