Baxil [bakh-HEEL'], n. (Tomorrowlands)

December 18th, 2004
05:07 pm

Requests for volunteers.
1. Like last year, I'm heading up the Further Confusion newsletter "team" -- where "team" is currently defined as the set of people containing the elements "me". (Someone remind me after the con is over to stop volunteering for jobs that require scrounging up a staff. :-p)
1a: Is there anyone out there who would enjoy the chance to play reporter and provide convention event coverage? And/or page layout support? And/or photo support? You would earn Massive Baxil Karma. All of this year's busloads plus the leftovers from 2004's con. (Slightly more details here.)

1b: For that matter, I still haven't even made hotel arrangements for the con yet. @_@; An awesome manager is me! Is anyone looking for a con roommate, who would be willing to put up with me and my computer and people occasionally trickling in and out as four different issues of the at-con newsletter get thrown together from scratch?

If you know a lot of people going to FurCon, forwarding the volunteer request would earn a <3 or two as well.

(One month to go. I can do this ... I can do this ...)

2. Are there any PERL hackers out there on my friends list who would like the chance to fiddle around with a wiki and program in support for categories? I'll be rolling out the project to which I link in the Near Future, but I want to add "categories" functionality first so that I can more easily organize the content. (Backlinks [a] don't behave the way I want, and [b] nuke the processor as the wiki scales.) Suffice to say that my free programming time at the moment is nearly nil.

Current Mood: worried


Comments
 
From: kshandra
Date: December 18th, 2004 05:34 pm (UTC)
For that matter, I still haven't even made hotel arrangements for the con yet. @_@;

Good luck; I was at the all-hands this morning, and the only open rooms left in the hotel now are smoking rooms....
From: krasnayath
Date: December 18th, 2004 08:29 pm (UTC)
Well, you can browse through the photos I may happen to take, if you'd like, to use in the newsletters.. (Oh my, that must be horrid grammar).. But I want to go to the con to actually experience it, not work there..
From: baxil
Date: December 19th, 2004 01:00 am (UTC)
That would still be handy, yes. :) Having another photographer's shots for the big photo issue helped a lot last year. Thank you!
From: kistaro
Date: December 18th, 2004 09:28 pm (UTC)
Gack. Sorry for dropping the ball on the programming thing - I got too caught up in finals. But now they're over!

Of course, I still don't know PERL, but I have thought of an ideal design.

The basic problem faced with a category engine is bottlenecking on disk access. Is it correct to assume that all category data files must remain on disk, or is the intent that all the data stays loaded in RAM at all times? The answer seriously changes the optimal structure.

The most obvious way to do a category engine - a pile of data files, one for each category, and when you update a page, check all categories to see if that page needs to be removed from them - is, predictably, a terrible way to do it. Opening every single category file - and for them to be Wiki pages, they need to be separate files - is extremely costly on time, relatively speaking; whenever a program will be accessing the disk, almost all other time considerations go out the window, because disk access is several orders of magnitude slower than other operations. Opening and scanning every file is Bad.
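As a rough illustration (a hypothetical sketch only; the one-file-per-category layout and the subroutine name are made up, not the actual wiki's code), the brute-force version looks something like this in PERL:

use strict;
use warnings;

# Naive update: on every page save, open and rewrite *every* category file
# to see whether this page belongs in it.  Cost grows with the total number
# of categories, which is exactly the disk-access problem described above.
my $cat_dir = "categories";    # assumed layout: one plain-text file per category

sub naive_update {
    my ($page, @new_cats) = @_;
    my %wanted = map { $_ => 1 } @new_cats;

    opendir(my $dh, $cat_dir) or die "can't read $cat_dir: $!";
    for my $cat (grep { !/^\./ } readdir $dh) {
        open(my $in, '<', "$cat_dir/$cat") or next;
        chomp(my @lines = <$in>);
        close $in;

        my @pages = grep { $_ ne $page } @lines;    # drop any old entry
        push @pages, $page if $wanted{$cat};        # re-add if it belongs now

        open(my $out, '>', "$cat_dir/$cat") or die "can't write $cat: $!";
        print $out "$_\n" for sort @pages;
        close $out;
    }
    closedir $dh;
}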

Keep one category file for each category (from which the Wiki can easily construct a category index), containing a sorted list of the names of the pages within that category. That's the obvious part; list-sorting is best done by simple insertion-sort (just put each name in the right spot and hope nothing messes with it), and while binary searching could be used, there's no reason to bother since we're hitting the disk anyway.

Simultaneously, keep a set of index files, implemented as a multi-level hash table on the disk. One file - the indexfile - is loaded first. A page name is translated into a hash code, that code is looked up in the indexfile, and that points to another file where everything with the same hash is kept in sorted order. A record in the second-level index file (if it isn't yet another index because that bucket got too large) is the page name (the key) and the list of categories that page is in.

When a page is created, it is added to the index and to its new categories; when it is modified, its new category list is checked against its old list. Only the category files corresponding to the changed categories need be opened at this point! By adding the overhead of two files - technically log(n) files for n distinct pages, but the base of the logarithm is very large (still a minimum of two file opens for multi-level hashing), and your Wiki is not likely to grow to the hundreds of thousands of pages that would require an extra level of indexing - you switch from opening every category to just the required subset of them. In practice, this seriously decreases running time.
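A sketch of how that update path might look in PERL. All of it is hypothetical (the bucket count, file names, record format, and helper names are my own assumptions, not the real wiki's internals); the point it illustrates is that an edit opens one index bucket plus only the category files whose membership actually changed:

use strict;
use warnings;

my $cat_dir = "categories";
my $idx_dir = "index";
my $BUCKETS = 64;                                  # assumed bucket count
mkdir $_ for grep { !-d } ($cat_dir, $idx_dir);    # sketch convenience

# Map a page name to one bucket file of the on-disk hash table.
sub bucket_file {
    my ($page) = @_;
    my $h = 0;
    $h = ($h * 31 + ord) % $BUCKETS for split //, $page;   # toy string hash
    return sprintf("%s/%02d.idx", $idx_dir, $h);
}

# Each bucket holds lines of "pagename<TAB>cat1,cat2,...".
sub read_bucket {
    my ($file) = @_;
    my %recs;
    if (open(my $fh, '<', $file)) {
        while (my $line = <$fh>) {
            chomp $line;
            my ($name, $cats) = split /\t/, $line, 2;
            $recs{$name} = [ split /,/, (defined $cats ? $cats : '') ];
        }
        close $fh;
    }
    return \%recs;
}

# Rewrite one category file so $page is listed in it iff $wanted is true.
sub set_membership {
    my ($cat, $page, $wanted) = @_;
    my %pages;
    if (open(my $in, '<', "$cat_dir/$cat")) {
        chomp(my @old = <$in>);
        $pages{$_} = 1 for @old;
        close $in;
    }
    if ($wanted) { $pages{$page} = 1 }
    else         { delete $pages{$page} }
    open(my $out, '>', "$cat_dir/$cat") or die "can't write $cat: $!";
    print $out "$_\n" for sort keys %pages;        # category list stays sorted
    close $out;
}

# Called on page save: one bucket read, then only the changed categories.
sub update_page_categories {
    my ($page, @new_cats) = @_;
    my $bucket = bucket_file($page);
    my $recs   = read_bucket($bucket);

    my %old = map { $_ => 1 } @{ $recs->{$page} || [] };
    my %new = map { $_ => 1 } @new_cats;

    set_membership($_, $page, 1) for grep { !$old{$_} } keys %new;   # added
    set_membership($_, $page, 0) for grep { !$new{$_} } keys %old;   # removed

    $recs->{$page} = [ sort keys %new ];           # rewrite just this one bucket
    open(my $out, '>', $bucket) or die "can't write $bucket: $!";
    print $out join("\t", $_, join(',', @{ $recs->{$_} })), "\n" for sort keys %$recs;
    close $out;
}

# Example: save "DragonEssay" with two category tags.
update_page_categories("DragonEssay", "dragons", "essays");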

The cross-hashing scheme I've listed above may be even more complex than what is needed. If the Wiki is expected to remain extremely small (less than 1000 pages is definitely safe for this), there's no reason to mess with the hash tables at all. Just use an index directly; it's no different from having zero levels of hash indexes before reaching data.
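Under that simplification the whole on-disk hash collapses to one flat file; a minimal sketch (the index.txt name and the tab-separated format are my own assumptions, not the wiki's):

use strict;
use warnings;

# Flat-index variant: one file of "pagename<TAB>cat1,cat2,..." lines, no buckets.
sub old_categories_for {
    my ($page) = @_;
    open(my $fh, '<', 'index.txt') or return ();   # no index yet => no old cats
    while (my $line = <$fh>) {
        chomp $line;
        my ($name, $cats) = split /\t/, $line, 2;
        if ($name eq $page) {
            close $fh;
            return split /,/, (defined $cats ? $cats : '');
        }
    }
    close $fh;
    return ();
}

my @old = old_categories_for("DragonEssay");
print "old categories: @old\n";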

The downside to hashing: if the hash file gets corrupted, the program will make wrong decisions (if it's desynchronized from the categories). This can be fixed with a program that sniffs down every category and creates the hash file anew, but that is a slow operation - although not a whole lot slower than doing a single modification on the brute-force "check everything" method.
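A recovery pass along those lines might look like this (same caveats as above; this rebuilds the flat index.txt variant by walking every category file once):

use strict;
use warnings;

my $cat_dir = "categories";

# Scan every category file and regenerate the page -> categories index.
# Slow by design, since it touches the whole category directory, but it is
# only needed when the index is suspected of being out of sync.
sub rebuild_index {
    my %cats_for;                                   # page name => [ categories ]
    opendir(my $dh, $cat_dir) or die "can't read $cat_dir: $!";
    for my $cat (grep { !/^\./ } readdir $dh) {
        open(my $fh, '<', "$cat_dir/$cat") or next;
        while (my $page = <$fh>) {
            chomp $page;
            push @{ $cats_for{$page} }, $cat if length $page;
        }
        close $fh;
    }
    closedir $dh;

    open(my $out, '>', 'index.txt') or die "can't write index.txt: $!";
    print $out join("\t", $_, join(',', sort @{ $cats_for{$_} })), "\n"
        for sort keys %cats_for;
    close $out;
}

rebuild_index();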

A B-tree is way more complex than you need for this, because you're not going to be dealing with the millions of records that would make it necessary, and lookups will be directly by name rather than by range (so you don't need the sorting a B-tree is guaranteed to give you). Compound hashing is faster than a B-tree for this usage, and it's a lot simpler as well. (Especially for deleting records. It's not as bad as red-black tree deletion, but it's still brainexplodey.)

Anyway, sorry about losing track of that. Now that I'm out of finals, I might be able to write a C++ backend for a PERL script - the only task of the PERL script would be to call the C++ program. However, that is inherently slower and harder to maintain than keeping it in one language - especially since PERL is optimized for string manipulation.
From: baxil
Date: December 21st, 2004 03:08 am (UTC)
Briefly --

I wasn't even considering an index file -- basically just what amounts to pointers embedded in the wiki page! (The "{cat:foo}" syntax itself.) To the best of my knowledge, the current page is loaded in memory (or is easily accessible) when the script with the editing action is run -- so at the time the new (or edited) page is saved, you can access both what categories are requested for the page, and what categories (if any) the page used to be in, trivially. Same functionality, one fewer disk call. Obviously, only changed cats would be loaded, as you describe.
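Something like this, roughly -- a sketch only, assuming the {cat:foo} syntax above and making up the subroutine names; the real save hook would live inside the wiki's edit routine:

use strict;
use warnings;

# Pull the set of categories named by {cat:foo} tags out of a page's text.
sub cats_in {
    my ($text) = @_;
    my %cats;
    $cats{$1} = 1 while $text =~ /\{cat:([^}]+)\}/g;
    return \%cats;
}

# At save time both versions of the page are already in memory, so the old
# and new category sets can be diffed with no index file and no extra reads.
sub changed_categories {
    my ($old_text, $new_text) = @_;
    my ($old, $new) = (cats_in($old_text), cats_in($new_text));
    my @add    = grep { !$old->{$_} } keys %$new;    # categories newly requested
    my @remove = grep { !$new->{$_} } keys %$old;    # categories the page left
    return (\@add, \@remove);    # only these category files need rewriting
}

# Example: the page keeps {cat:essays}, gains {cat:dragons}, drops {cat:drafts}.
my ($add, $remove) = changed_categories(
    "Old text {cat:drafts} {cat:essays}",
    "New text {cat:essays} {cat:dragons}",
);
print "add to:      @$add\n";       # dragons
print "remove from: @$remove\n";    # drafts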

Lack of an index does raise the possibility of desynchronization, but barring a runtime error I can't see what would cause one; short of manually changing a file on the server from the command line, there's no way to change a page without going through the edit routine that would contain all the calls to the category code.

For the record, in terms of performance hits: most current wiki search functions -- even for backlinks to a given page -- grind through every single file! O_o That's just not going to cut it for the filing complexity I want. Even just getting people to sift pages via {cat} and {index} instead of the built-in search will save some serious disk access, even if the solution is inelegant.

(Yes, there are wikis which handle it more elegantly. But I didn't want the hassle of ANOTHER SQL setup ...)

Basically, what I'm looking for is the most trivial good solution, not the most elegant. So, yeah, keeping it in PERL would be ideal so I can just diff the changes into the script. The hack of the individual-category-files-only-opened-when-changed is exactly what I was thinking of. Implementation should be extremely straightforward -- I just don't have much time for it at the moment.