|
http://www.teamliquid.net/forum/search.php
As many of you may have noticed, there are times when pages on TL sit there and take upwards of 20 or 30 seconds to load. This is due to the crappy MySQL search we currently use: while a search is running, any reads from the DB that arrive after a write are blocked. Obviously this problem would only get worse as the amount of content increased, so I rewrote the search feature.
Our new search engine is based on Sphinx and completely eliminates any dependency on the database (after indexing), making searches have minimal impact on site performance. It's also a hell of a lot better than MySQL fulltext.
New Features:
- Much faster searching, no extra load on the TL DB.
- No restrictions on search terms - common words, short words, etc. are all allowed.
- Full unicode support - searching in Hangul or Chinese should find any relevant posts.
- Word forms - multiple words that map to the same term, e.g. searching for "muta" will also find "mutalisk", "fbh" will include "firebathero", etc. The TL staff have built a small list, and we may take suggestions for additional wordforms.
- Special characters [ and ] are now part of words to help search for names with clan tags as well as all the TL topic tags such as [MSL] or [G].
- A few exceptions to special characters have been added to allow you to search for some common terms that contain them, e.g. D+, C- (all iccup ranks) and a few other random things used on the forums.
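To illustrate the word forms and special-character handling described in the list above, here is a minimal Python sketch. It is not the Sphinx configuration TL uses. The `WORDFORMS` table contains only the two examples given above, and the token pattern (`TOKEN_RE`) and function name are assumptions made for illustration.

```python
import re

# Hypothetical wordform table; only these two entries are mentioned in the post.
WORDFORMS = {"muta": "mutalisk", "fbh": "firebathero"}

# Assumed token pattern: word characters plus [ and ], with an optional
# trailing + or - so rank-style terms such as "D+" or "C-" stay in one piece.
TOKEN_RE = re.compile(r"[\w\[\]]+[+-]?")

def tokenize(text):
    """Lowercase the text, split it into tokens, and map each token
    to its canonical wordform if one is defined."""
    tokens = TOKEN_RE.findall(text.lower())
    return [WORDFORMS.get(tok, tok) for tok in tokens]
```

Running the same function over both posts and queries is what would make a search for "muta" match posts containing "mutalisk", since both end up as the same term.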
Known Issues / Disadvantages:
- Updates are no longer real-time. Search is primarily used to help locate older posts, and Sphinx does not support index updates, so the search index will be updated roughly once an hour (depending on server load). We are discussing alternatives for those of you who are search-happy with your own username.
- Once a post is indexed, it's indexed. If a post is edited after the search engine has indexed it, any changes in the edit will not be searchable. Get your post right the first time!
- A maximum of 1000 results are returned for any search.
Please post any bugs or feedback in this thread. If you complain about something in the Known Issues section, something bad might happen!
|
ouch at the edit part
still a huge update with great features
tyty TL.net
|
Ya too bad on the edits part, I have a few questions mainly out of curiosity (I also know nothing about Sphinx).
How long did the initial indexing take? Too long to do an occasional complete reindex to catch edits (this is a pretty inelegant solution either way so I don't blame you if you don't want to do it)?
Also how are posts stored internally? I'm guessing there's no "last_modified" column or something that would allow you to reindex edited posts more conveniently (and even then an unindexed last_modified would still require table scans which you might want to avoid).
But if posts have IDs, maybe you could have a separate new log of the IDs of posts as they are edited and then rollup the log for reindexing on occasion?
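The edit-log idea above could be sketched roughly like this in Python. It is only an illustration of the suggestion, not how TL or Sphinx works; the `EditLog` class and its method names are made up for the example.

```python
class EditLog:
    """Hypothetical append-only log of the IDs of posts that were edited."""

    def __init__(self):
        self._entries = []

    def record_edit(self, post_id):
        # Called whenever a post is edited; duplicates are fine at this stage.
        self._entries.append(post_id)

    def rollup(self):
        """Return each edited post ID once (in first-seen order) and clear the log."""
        unique_ids = list(dict.fromkeys(self._entries))
        self._entries.clear()
        return unique_ids
```

A periodic job would call `rollup()` and fetch only those posts by ID, which avoids scanning the whole posts table for an unindexed last_modified value.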
|
w00t! The limitations of the old search have time and again stumped me. Thanks heaps!
TL skates again :D
|
Good job R1CH. This sounds like a big improvement.
|
A full reindex takes about 30 minutes at present, which is something we'd rather avoid: it essentially kills TL during the process because it needs so much data from the DB. There is a last modified column; however, the problem is that the data format used by Sphinx for indexes cannot be changed - to change one post in the index would require a full index rebuild. Additionally, Sphinx requires unique document IDs - if an edited post were marked to be indexed by the update index, Bad Things™ would happen.
I really don't think it's as big an issue as you think. Most edits are made within a few minutes of a post being posted, and there's a low chance it would get indexed in that timeframe.
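As a rough illustration of one way duplicate IDs across a main index and an update index could go wrong, here is a toy Python model. It uses plain dictionaries, not Sphinx, and says nothing about what Sphinx actually does internally; the post texts and function names are invented for the example.

```python
def build_index(posts):
    """Build a toy inverted index mapping each word to the set of post IDs containing it."""
    index = {}
    for post_id, text in posts.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(post_id)
    return index

def search(word, main_index, update_index):
    """Union the hits from both indexes, as a naive two-index search would."""
    return main_index.get(word, set()) | update_index.get(word, set())

# Main index built during the full run.
main = build_index({1: "zerg rush", 2: "mutalisk harass"})

# Post 2 is later edited and no longer mentions mutalisks; the edited
# version lands in the update index under the same ID, next to new post 3.
update = build_index({2: "corsair harass", 3: "reaver drop"})

# The old copy of post 2 in the main index still matches the removed word.
stale_hits = search("mutalisk", main, update)
```

In this toy model, post 2 is returned for "mutalisk" even though the edited text no longer contains it, because the old copy in the main index cannot be changed without rebuilding it.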
|
Nice.
Btw it still doesnt search common words.
|
Oh wow, nice!!! Good job!!
|
On December 23 2009 16:33 Highways wrote: Nice.
Btw it still doesnt search common words.
Such as?
|
R1CH, alternatively you could implement a data layer in between, where you stuff all new and changed posts into a separate database or plain files, and let full reindexing loose on the offline data. That way the TL database only gets hit where necessary and Sphinx can do its thing. There's still the issue of processor time if that's a concern, though.
|
On December 23 2009 16:36 R1CH wrote: On December 23 2009 16:33 Highways wrote: Nice.
Btw it still doesnt search common words.
Such as?
Lol my bad
|
Works fine for me, looks like you didn't read the OP.
|
Fucking awesome. Thanks r1ch!
|
Awesome work R1CH <3
The edit part is kind of sad, but this is still an awesome update.
|
Mystlord
United States, 10264 Posts
The indexing part is what scares me about this new search feature. If it's on an hourly basis and indexing takes ~30 minutes, that's a lot of time used to index everything. When you were indexing everything today, the slowdown was immense. TL took ages to respond.
I'm ambivalent about this. If there's a way to reduce the indexing time, then I'd be all for it.
|
Mystlord, only a full index run takes half an hour. An update with new items should take considerably less time.
R1CH, the matching algorithm is very strict. There is no room for spelling errors and there is no partial matching (a partial match could be seen as a lighter form of spelling mistake). Are there any options in Sphinx to enable that? Or do you think that feature is not worthwhile?
|