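Haystack is an open-source search framework for Django. It supports the Solr, Elasticsearch, Whoosh, and Xapian search engines, and you can switch between them without changing your code. This walkthrough uses Whoosh, a full-text search engine written in pure Python with no binary dependencies: it is small and simple to configure, though its performance is naturally somewhat lower. Chinese word segmentation is handled by Jieba. Whoosh's built-in tokenizer targets English and handles Chinese poorly, so Whoosh's tokenizing component is replaced with jieba. Other requirements: Python 2.7 or 3.4.4, Django 1.8.3 or later, Debian 4.2.6_3.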







The model used in the examples below:

    from django.db import models
    from django.contrib.auth.models import User


    class Note(models.Model):
        user = models.ForeignKey(User)  # Django 2.0+ also requires on_delete
        pub_date = models.DateTimeField()
        title = models.CharField(max_length=200)
        body = models.TextField()

        def __str__(self):
            return self.title

1. Install the tools

    pip install whoosh django-haystack jieba

2. Add Haystack to Django's INSTALLED_APPS


    INSTALLED_APPS = [
        'django.contrib.admin',
        'django.contrib.auth',
        'django.contrib.contenttypes',
        'django.contrib.sessions',
        'django.contrib.sites',

        # Added: list haystack first,
        'haystack',
        # then your usual apps; your own apps must come after haystack.
        'blog',
    ]

3. Edit your settings file to configure the engine


    import os

    HAYSTACK_CONNECTIONS = {
        'default': {
            'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
            'PATH': os.path.join(os.path.dirname(__file__), 'whoosh_index'),
        },
    }

As the names suggest, ENGINE is the engine to use and is required. When the engine is Whoosh, PATH is required as well: it is the directory in which Whoosh stores its index files.




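Next, create a search index for the model. Haystack discovers indexes in a module named search_indexes.py inside the app:

    import datetime
    from haystack import indexes
    from myapp.models import Note


    # By convention the class is named after the indexed model plus "Index";
    # Note is indexed here, hence NoteIndex.
    class NoteIndex(indexes.SearchIndex, indexes.Indexable):
        text = indexes.CharField(document=True, use_template=True)  # the text field
        author = indexes.CharField(model_attr='user')  # the author field
        pub_date = indexes.DateTimeField(model_attr='pub_date')  # the pub_date field

        def get_model(self):  # required override
            return Note

        def index_queryset(self, using=None):
            """Used when the entire index for model is updated."""
            return self.get_model().objects.filter(pub_date__lte=datetime.datetime.now())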


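Every index must have exactly one field with document=True. It tells Haystack and the search engine to use that field's content as the primary field that searches run against. The other fields are auxiliary: they are stored alongside the document and can be used to filter or display results, but they are not the text being searched. I finished an entire search feature without ever using them, so I simply deleted them. While learning, you can comment them out and ignore them.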


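In addition, Haystack lets you set use_template=True on the text field, so that a data template decides what goes into the index document. Put simply, the template lists what the index should contain. If it includes the Note's title field, Notes can be found by their title: a search for python returns the Notes whose title contains python.

The data template lives at templates/search/indexes/yourapp/note_text.txt, which in this project is templates/search/indexes/blog/note_text.txt. I recommend creating a templates directory in the project root and registering it in settings.py so that Django looks there, although any templates directory Django can find will do. The file name must be the lowercased name of the indexed model followed by _text.txt. Its content is: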

    {{ object.title }}
    {{ object.user.get_full_name }}
    {{ object.body }}

This data template builds the index document from Note.title, Note.user.get_full_name, and Note.body, so a search is matched against the full text of all three.
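To make the role of the data template concrete, here is a minimal pure-Python sketch of what it produces for one Note. The namedtuple stand-ins, full_name, and render_document are hypothetical helpers written for this illustration, not Haystack or Django APIs, and the substring check only stands in for the engine's real tokenized matching.

```python
from collections import namedtuple

# Hypothetical stand-ins for the Django models; only the attributes that
# the data template touches are modeled.
User = namedtuple("User", ["first_name", "last_name"])
Note = namedtuple("Note", ["title", "user", "body"])


def full_name(user):
    """Mimic User.get_full_name: first and last name joined by a space."""
    return ("%s %s" % (user.first_name, user.last_name)).strip()


def render_document(note):
    """Join the three templated values, one per line, as note_text.txt does."""
    return "\n".join([note.title, full_name(note.user), note.body])


note = Note(title="Learning python", user=User("Ada", "Lovelace"), body="Haystack notes")
document = render_document(note)

# Crude stand-in for a full-text match: the query may hit any of the three parts.
print("python" in document.lower())    # matches through the title
print("lovelace" in document.lower())  # matches through the author's name
```

Anything left out of the template never reaches the document field, so it cannot be found by a plain search.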



Add the search URL to the project's URLconf:

    url(r'^search/', include('haystack.urls')),

haystack.urls itself contains little more than this:

    from django.conf.urls import url
    from haystack.views import SearchView

    urlpatterns = [
        url(r'^$', SearchView(), name='haystack_search'),
    ]




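The search template (the old-style SearchView renders search/search.html by default), with its markup as in Haystack's tutorial:

    <h2>Search</h2>

    <form method="get" action=".">
        <table>
            {{ form.as_table }}
            <tr>
                <td>&nbsp;</td>
                <td>
                    <input type="submit" value="Search">
                </td>
            </tr>
        </table>

        {% if query %}
            <h3>Results</h3>

            {% for result in page.object_list %}
                <p>
                    <a href="{{ result.object.get_absolute_url }}">{{ result.object.title }}</a>
                </p>
            {% empty %}
                <p>No results found.</p>
            {% endfor %}

            {% if page.has_previous or page.has_next %}
                <div>
                    {% if page.has_previous %}<a href="?q={{ query }}&amp;page={{ page.previous_page_number }}">{% endif %}&laquo; Previous{% if page.has_previous %}</a>{% endif %}
                    |
                    {% if page.has_next %}<a href="?q={{ query }}&amp;page={{ page.next_page_number }}">{% endif %}Next &raquo;{% if page.has_next %}</a>{% endif %}
                </div>
            {% endif %}
        {% else %}
            {# Show some example queries to run, maybe query syntax, something else? #}
        {% endif %}
    </form>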


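Now a few words about this file. The template uses three variables: form, query, and page. form is the search form, query is the string the user searched for, and page is the current page of results (a Django paginator page, which is why the template can use page.object_list, page.has_previous, and page.has_next).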





    # Number of results shown per page; the default is 20, change it as you like
    HAYSTACK_SEARCH_RESULTS_PER_PAGE = 8
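As a rough illustration of what the page size controls, the sketch below does the slicing arithmetic by hand. Haystack's SearchView actually delegates this to Django's Paginator; get_page and page_count are hypothetical helpers used only for this example.

```python
import math

RESULTS_PER_PAGE = 8  # mirrors the HAYSTACK_SEARCH_RESULTS_PER_PAGE value above


def get_page(results, page_number, per_page=RESULTS_PER_PAGE):
    """Return the slice of results shown on a 1-based page number."""
    start = (page_number - 1) * per_page
    return results[start:start + per_page]


def page_count(results, per_page=RESULTS_PER_PAGE):
    """Number of pages needed to show every result."""
    return int(math.ceil(len(results) / float(per_page)))


results = ["note-%d" % i for i in range(1, 21)]  # 20 fake hits
print(page_count(results))   # pages needed for 20 hits at 8 per page
print(get_page(results, 3))  # the hits left over for the last page
```

With 20 hits and 8 per page, the last page holds only the remaining 4, which is when page.has_next becomes false in the template.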



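That raises a question. A search page usually needs more context of its own, so what then? My first idea was to patch Haystack's source to add the extra context, and perhaps you have had the same thought. That approach is clumsy: you have to keep track of environments and dependencies, and when the server changes you would have to copy your patched Haystack along with it. Then it occurred to me that if I should not modify the source, I can still reuse it. The official documentation confirmed it: subclass SearchView and override its context. The documentation offers two versions of SearchView. I started with the newer one, hit an error, did not bother tracking down the cause, and went back to the older one; both are installed with Haystack. So create another file in the myapp directory (its name and location are up to you) to hold your own search view. For example: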

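    from haystack.views import SearchView
    from .models import *


    class MySeachView(SearchView):
        def extra_context(self):  # override extra_context to add extra context
            context = super(MySeachView, self).extra_context()
            # NOTE: Topic and kind are hypothetical placeholder names;
            # substitute your own model and field here.
            side_list = Topic.objects.filter(kind='major').order_by('add_date')[:8]
            context['side_list'] = side_list
            return context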


Then point the search URL at the new view (the module above is imported here as search_views):

    url(r'^search/', search_views.MySeachView(), name='haystack_search'),
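The override pattern can be tried without Django installed. The sketch below uses a greatly reduced stand-in for the old-style SearchView, whose extra_context returns an empty dict by default. SketchBaseView, SketchSearchView, and the sample data are assumptions made for illustration.

```python
class SketchBaseView(object):
    """Hypothetical stand-in for haystack.views.SearchView, heavily reduced."""

    def extra_context(self):
        # By default there is nothing extra to add.
        return {}

    def get_context(self):
        # The base view builds its own context, then merges in the extras.
        context = {"query": "python", "page": None}
        context.update(self.extra_context())
        return context


class SketchSearchView(SketchBaseView):
    def extra_context(self):
        # Start from whatever the parent provides, then add our own keys.
        context = super(SketchSearchView, self).extra_context()
        context["side_list"] = ["intro", "faq"]  # placeholder data
        return context


context = SketchSearchView().get_context()
print(sorted(context))  # the base keys plus side_list
```

Because the extras are merged into the base context, the template sees side_list next to form, query, and page without any change to Haystack itself.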

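With the context variables covered, let's turn to template tags. Haystack provides two, {% highlight %} and {% more_like_this %}; I will only cover highlight in detail.

Would you like matched text in your results to be highlighted, the way Baidu does it? {% highlight %} gives you exactly that. (It is not the only option: there is also a Highlighter class, which I did not dig into; see the documentation.)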


    {% highlight <text_block> with <query> [css_class "class_name"] [html_tag "span"] [max_length 200] %}

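Roughly: the tag wraps each occurrence of query inside text_block in html_tag with the class css_class, and max_length is the length of the text finally returned, much like a cut. From the tag's source, the defaults are span for html_tag, highlighted for css_class, and 200 for max_length. You can then style the matches with CSS. With the defaults, for example: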

    span.highlighted {
        color: red;
    }


    {# Use the defaults #}
    {% highlight result.summary with query %}

    {# Wrap every match of the query in result.summary in a <div> whose class is highlight_me_please, so you can style the matches with your own CSS #}
    {% highlight result.summary with query html_tag "div" css_class "highlight_me_please" %}

    {# Limit the length of the highlighted result.summary that is returned #}
    {% highlight result.summary with query max_length 40 %}
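The wrapping step can be sketched in a few lines of plain Python. This is a simplified illustration rather than Haystack's implementation: it ignores max_length and the windowing logic, and the highlight function here is a hypothetical helper that only reuses the tag's default values.

```python
import re


def highlight(text, query, html_tag="span", css_class="highlighted"):
    """Wrap each case-insensitive occurrence of a query word in an HTML tag."""
    # Words starting with "-" are exclusions and are never highlighted.
    words = [re.escape(w) for w in query.split() if not w.startswith("-")]
    if not words:
        return text
    pattern = re.compile("|".join(words), re.IGNORECASE)
    opening = '<%s class="%s">' % (html_tag, css_class)
    closing = "</%s>" % html_tag
    return pattern.sub(lambda m: opening + m.group(0) + closing, text)


default_markup = highlight("Learning Python with python notes", "python")
custom_markup = highlight("a python", "python", html_tag="div", css_class="highlight_me_please")
print(default_markup)
print(custom_markup)
```

The matched text keeps its original capitalization, and only the wrapper element and class change between the two calls.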



Build the index by running the rebuild_index management command, or use update_index.



    # Update the index automatically
    HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'


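1. Copy Haystack's Whoosh backend module (the one behind haystack.backends.whoosh_backend, found under <python path>/lib/python2.7.5/site-packages/haystack/backends/) into your app, and rename it so that it can be imported as blog.whoosh_cn_backend, that is, place it under blog/. Then make two changes in the copy:

    from jieba.analyse import ChineseAnalyzer  # add at the top

    # Find the existing assignment below and modify it; do not add a second one.
    schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=ChineseAnalyzer(), field_boost=field_class.boost, sortable=True)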

2. Switch the engine in your settings, as follows:

    import os

    HAYSTACK_CONNECTIONS = {
        'default': {
            # blog.whoosh_cn_backend is the file you just added
            'ENGINE': 'blog.whoosh_cn_backend.WhooshEngine',
            'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
        },
    }
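To see why a tokenizer built for English falls short on Chinese, compare splitting on whitespace with a dictionary-based segmenter. The sketch below (Python 3) uses forward maximum matching over a toy vocabulary. This is explicitly not jieba's algorithm, which is considerably more sophisticated; the vocabulary and both helper functions are assumptions made for illustration.

```python
def whitespace_tokens(text):
    """What a space-based tokenizer sees: Chinese text has no spaces to split on."""
    return text.split()


def forward_max_match(text, vocabulary, max_word_len=4):
    """Greedy dictionary segmentation: take the longest known word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no vocabulary word matches.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


vocabulary = {"高等", "数学", "搜索", "引擎"}  # toy dictionary, an assumption
unsegmented = whitespace_tokens("高等数学搜索引擎")
segmented = forward_max_match("高等数学搜索引擎", vocabulary)
print(unsegmented)  # a single undivided token
print(segmented)    # four separate words
```

With only the undivided token in the index, a search for 数学 has nothing to match against, which is the gap the ChineseAnalyzer fills.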



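Not bad, right? Sharp-eyed readers will ask why 高等 in the title was not replaced by ..., while everything before 数学 in the paragraph text was. A title is short to begin with. Imagine 高等数学 (advanced mathematics) displayed as ...数学: the most important word would be lost, and that clearly will not do. So what now? I did not go to the documentation, although its Highlighter class may be meant for exactly this. I read the source of the highlight tag instead, and in the end I got it working.

What we do is copy the source and modify the copy to create a tag of our own, rather than editing Haystack's source in place. Here it is. Add two files under myapp/templatetags/: the tag module, and the highlighter module that the tag module imports as .highlighting. Their contents follow (the originals live under haystack/templatetags/ and haystack/utils/ respectively):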

    # encoding: utf-8
    from __future__ import absolute_import, division, print_function, unicode_literals

    from django import template
    from django.conf import settings
    from django.core.exceptions import ImproperlyConfigured
    from django.utils import six

    from haystack.utils import importlib

    register = template.Library()


    class HighlightNode(template.Node):
        def __init__(self, text_block, query, html_tag=None, css_class=None, max_length=None, start_head=None):
            self.text_block = template.Variable(text_block)
            self.query = template.Variable(query)
            self.html_tag = html_tag
            self.css_class = css_class
            self.max_length = max_length
            self.start_head = start_head

            if html_tag is not None:
                self.html_tag = template.Variable(html_tag)

            if css_class is not None:
                self.css_class = template.Variable(css_class)

            if max_length is not None:
                self.max_length = template.Variable(max_length)

            if start_head is not None:
                self.start_head = template.Variable(start_head)

        def render(self, context):
            text_block = self.text_block.resolve(context)
            query = self.query.resolve(context)
            kwargs = {}

            if self.html_tag is not None:
                kwargs['html_tag'] = self.html_tag.resolve(context)

            if self.css_class is not None:
                kwargs['css_class'] = self.css_class.resolve(context)

            if self.max_length is not None:
                kwargs['max_length'] = self.max_length.resolve(context)

            if self.start_head is not None:
                kwargs['start_head'] = self.start_head.resolve(context)

            # Handle a user-defined highlighting function.
            if hasattr(settings, 'HAYSTACK_CUSTOM_HIGHLIGHTER') and settings.HAYSTACK_CUSTOM_HIGHLIGHTER:
                # Do the import dance.
                try:
                    path_bits = settings.HAYSTACK_CUSTOM_HIGHLIGHTER.split('.')
                    highlighter_path, highlighter_classname = '.'.join(path_bits[:-1]), path_bits[-1]
                    highlighter_module = importlib.import_module(highlighter_path)
                    highlighter_class = getattr(highlighter_module, highlighter_classname)
                except (ImportError, AttributeError) as e:
                    raise ImproperlyConfigured("The highlighter '%s' could not be imported: %s" % (settings.HAYSTACK_CUSTOM_HIGHLIGHTER, e))
            else:
                from .highlighting import Highlighter
                highlighter_class = Highlighter

            highlighter = highlighter_class(query, **kwargs)
            highlighted_text = highlighter.highlight(text_block)
            return highlighted_text


    @register.tag
    def myhighlight(parser, token):
        """
        Takes a block of text and highlights words from a provided query within that
        block of text. Optionally accepts arguments to provide the HTML tag to wrap
        highlighted word in, a CSS class to use with the tag and a maximum length of
        the blurb in characters.

        Syntax::

            {% highlight <text_block> with <query> [css_class "class_name"] [html_tag "span"] [max_length 200] %}

        Example::

            # Highlight summary with default behavior.
            {% highlight result.summary with request.query %}

            # Highlight summary but wrap highlighted words with a div and the
            # following CSS class.
            {% highlight result.summary with request.query html_tag "div" css_class "highlight_me_please" %}

            # Highlight summary but only show 40 characters.
            {% highlight result.summary with request.query max_length 40 %}
        """
        bits = token.split_contents()
        tag_name = bits[0]

        if not len(bits) % 2 == 0:
            raise template.TemplateSyntaxError(u"'%s' tag requires valid pairings arguments." % tag_name)

        text_block = bits[1]

        if len(bits) < 4:
            raise template.TemplateSyntaxError(u"'%s' tag requires an object and a query provided by 'with'." % tag_name)

        if bits[2] != 'with':
            raise template.TemplateSyntaxError(u"'%s' tag's second argument should be 'with'." % tag_name)

        query = bits[3]

        arg_bits = iter(bits[4:])
        kwargs = {}

        for bit in arg_bits:
            if bit == 'css_class':
                kwargs['css_class'] = six.next(arg_bits)

            if bit == 'html_tag':
                kwargs['html_tag'] = six.next(arg_bits)

            if bit == 'max_length':
                kwargs['max_length'] = six.next(arg_bits)

            if bit == 'start_head':
                kwargs['start_head'] = six.next(arg_bits)

        return HighlightNode(text_block, query, **kwargs)

    # encoding: utf-8

    from __future__ import absolute_import, division, print_function, unicode_literals

    from django.utils.html import strip_tags


    class Highlighter(object):
        # Defaults
        css_class = 'highlighted'
        html_tag = 'span'
        max_length = 200
        start_head = False
        text_block = ''

        def __init__(self, query, **kwargs):
            self.query = query

            if 'max_length' in kwargs:
                self.max_length = int(kwargs['max_length'])

            if 'html_tag' in kwargs:
                self.html_tag = kwargs['html_tag']

            if 'css_class' in kwargs:
                self.css_class = kwargs['css_class']

            if 'start_head' in kwargs:
                self.start_head = kwargs['start_head']

            self.query_words = set([word.lower() for word in self.query.split() if not word.startswith('-')])

        def highlight(self, text_block):
            self.text_block = strip_tags(text_block)
            highlight_locations = self.find_highlightable_words()
            start_offset, end_offset = self.find_window(highlight_locations)
            return self.render_html(highlight_locations, start_offset, end_offset)

        def find_highlightable_words(self):
            # Use a set so we only do this once per unique word.
            word_positions = {}

            # Pre-compute the length.
            end_offset = len(self.text_block)
            lower_text_block = self.text_block.lower()

            for word in self.query_words:
                if not word in word_positions:
                    word_positions[word] = []

                start_offset = 0

                while start_offset < end_offset:
                    next_offset = lower_text_block.find(word, start_offset, end_offset)

                    # If we get a -1 out of find, it wasn't found. Bomb out and
                    # start the next word.
                    if next_offset == -1:
                        break

                    word_positions[word].append(next_offset)
                    start_offset = next_offset + len(word)

            return word_positions

        def find_window(self, highlight_locations):
            best_start = 0
            best_end = self.max_length

            # First, make sure we have words.
            if not len(highlight_locations):
                return (best_start, best_end)

            words_found = []

            # Next, make sure we found any words at all.
            for word, offset_list in highlight_locations.items():
                if len(offset_list):
                    # Add all of the locations to the list.
                    words_found.extend(offset_list)

            if not len(words_found):
                return (best_start, best_end)

            if len(words_found) == 1:
                return (words_found[0], words_found[0] + self.max_length)

            # Sort the list so it's in ascending order.
            words_found = sorted(words_found)

            # We now have a denormalized list of all positions were a word was
            # found. We'll iterate through and find the densest window we can by
            # counting the number of found offsets (-1 to fit in the window).
            highest_density = 0

            if words_found[:-1][0] > self.max_length:
                best_start = words_found[:-1][0]
                best_end = best_start + self.max_length

            for count, start in enumerate(words_found[:-1]):
                current_density = 1

                for end in words_found[count + 1:]:
                    if end - start < self.max_length:
                        current_density += 1
                    else:
                        current_density = 0

                    # Only replace if we have a bigger (not equal density) so we
                    # give deference to windows earlier in the document.
                    if current_density > highest_density:
                        best_start = start
                        best_end = start + self.max_length
                        highest_density = current_density

            return (best_start, best_end)

        def render_html(self, highlight_locations=None, start_offset=None, end_offset=None):
            # Start by chopping the block down to the proper window.
            # text_block is the content; start_offset is where the first match
            # begins and end_offset is where the length cut-off falls.
            text = self.text_block[start_offset:end_offset]

            # Invert highlight_locations to a location -> term list
            term_list = []

            for term, locations in highlight_locations.items():
                term_list += [(loc - start_offset, term) for loc in locations]

            loc_to_term = sorted(term_list)

            # Prepare the highlight template
            if self.css_class:
                hl_start = '<%s class="%s">' % (self.html_tag, self.css_class)
            else:
                hl_start = '<%s>' % (self.html_tag)

            hl_end = '</%s>' % self.html_tag

            # Copy the part from the start of the string to the first match,
            # and there replace the match with a highlighted version.
            # matched_so_far ends up at the end of the last match in text.
            highlighted_chunk = ""
            matched_so_far = 0
            prev = 0
            prev_str = ""

            for cur, cur_str in loc_to_term:
                # This can be in a different case than cur_str
                actual_term = text[cur:cur + len(cur_str)]

                # Handle incorrect highlight_locations by first checking for the term
                if actual_term.lower() == cur_str:
                    if cur < prev + len(prev_str):
                        continue

                    # Append the text between the previous match and this one,
                    # then the highlighted match itself.
                    highlighted_chunk += text[prev + len(prev_str):cur] + hl_start + actual_term + hl_end
                    prev = cur
                    prev_str = cur_str

                    # Keep track of how far we've copied so far, for the last step
                    matched_so_far = cur + len(actual_term)

            # Don't forget the chunk after the last term
            # (whatever follows the last match).
            highlighted_chunk += text[matched_so_far:]

            # Only add the leading dots when the head is not kept (not start_head).
            if start_offset > 0 and not self.start_head:
                highlighted_chunk = '...%s' % highlighted_chunk

            if end_offset < len(self.text_block):
                highlighted_chunk = '%s...' % highlighted_chunk

            # Up to here the chunk excludes everything before start_offset, i.e.
            # the text before the first match (text_block[:start_offset]).
            # Prepend it when start_head is True.
            if self.start_head:
                highlighted_chunk = self.text_block[:start_offset] + highlighted_chunk
            return highlighted_chunk

With these two files in place you can use your own {% myhighlight %} tag. Remember to load it before use!


    {% myhighlight <text_block> with <query> [css_class "class_name"] [html_tag "span"] [max_length 200] [start_head True] %}

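As you can see, I only added one option, start_head. It defaults to False; when it is set to True, the text before the first match is no longer elided.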
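The effect of start_head is easiest to see with the highlighting markup left out. The sketch below mirrors only the truncation logic at the end of render_html; render_window is a hypothetical helper for illustration, and the English title stands in for 高等数学.

```python
def render_window(text, start_offset, max_length, start_head=False):
    """Cut a window out of text and mark what was dropped with '...'."""
    end_offset = start_offset + max_length
    chunk = text[start_offset:end_offset]

    if start_offset > 0 and not start_head:
        chunk = "..." + chunk  # the text before the first match is dropped
    if end_offset < len(text):
        chunk = chunk + "..."  # the text after the window is dropped
    if start_head:
        chunk = text[:start_offset] + chunk  # keep the leading text instead
    return chunk


title = "Advanced Mathematics"
match_at = title.index("Mathematics")  # where the first match starts

elided = render_window(title, match_at, 200)
kept = render_window(title, match_at, 200, start_head=True)
print(elided)  # leading word replaced by dots
print(kept)    # full title preserved
```

For short fields such as titles, start_head True is usually what you want; for long bodies the default keeps the snippet compact.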

