Search Engine

This article describes how to add a search engine to your website. We will use the extensions indexed_search (preinstalled with TYPO3) and crawler for indexing content, and the extension macina_searchbox for the search mask that will be placed on each page.

PLEASE NOTE: You should consider reading a much more detailed version of this article that I wrote more recently: Indexed Search & Crawler - The missing manual.

Installing

The crawler extension is used to manage cron jobs. It allows us to choose what should be indexed and when the indexing should run. This is useful because, out of the box, indexed_search indexes content (pages and associated documents) while visitors are browsing your website; that’s not very fair to them, as indexing adds to the time needed to “render” the page.

Let’s install these two extensions:

The next step is to configure the extension indexed_search. The following screenshot shows the options to configure. We will need a few external tools:

  • pdftotext and pdfinfo (package xpdf-utils under Debian) to index PDF files
  • unzip to index the content of ZIP archives
  • catdoc to index MS Word documents
  • xlhtml to index MS Excel spreadsheets
  • ppthtml to index MS PowerPoint presentations
  • unrtf to index RTF files
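
Before entering the tool paths in the indexed_search configuration, it may help to check which of these tools are already installed on the server. The following is a minimal shell sketch; the function name missing_tools is made up for this example, and it only checks that each command can be found on the PATH, not that indexed_search is configured to use it.

```shell
#!/bin/sh
# Print the name of every given command that is not found on the PATH.
# missing_tools is a hypothetical helper name, not part of TYPO3.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Tools listed above; empty output means all of them are available.
missing_tools pdftotext pdfinfo unzip catdoc xlhtml ppthtml unrtf
```

Any name that is printed still has to be installed with your distribution’s package manager before the corresponding file type can be indexed.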

The other important configuration options are to deactivate document indexing from the frontend, as we will configure cron jobs for this task, and to specify that external files (PDF, …) should be indexed in an additional process, separate from the one indexing the related web page.

Configuring

Before creating the cron job, we will configure our website to be index-ready. In the Setup part of our template, let’s add:

page.config.index_enable = 1
page.config.index_externals = 1

Next, in the Page TSconfig of our homepage (that is, the root of our website), let’s add:

tx_crawler.crawlerCfg.paramSets {
	tt_content = &L=[0-1]
	tt_content.procInstrFilter = tx_indexedsearch_reindex
	# if extension cachemgm is available too:
	# ... = tx_indexedsearch_reindex, tx_cachemgm_recache
	tt_content.baseUrl = http://www.domain.tld/
}

Using the Web > List module, we should create a new record of type Indexing configuration at our website root (homepage). Choose the type page tree and select the root page of the website. The indexing depth may be set to 1 if all of our pages are reachable from the root page (that’s often the case). Now save the configuration and check that it’s active (a red question mark on the configuration icon means that the configuration is hidden and thus deactivated).

Now click on Web > Info, select the root page and then, in the information screen, choose crawler in the drop-down list.

Make sure that the site will be crawled on all sublevels (infinite), select the Re-indexing processing instruction, click on Update and then on Crawl URLs. Our site will then start to be indexed. We can check the indexing status by choosing Indexed search in the drop-down list of the information screen.

New User

For the crawler extension to work, we have to create a new user named _cli_lowlevel in Tools > User Admin. Neither its password nor its access rights matter.

That’s it! All that remains is to create the cron job:

* * * * * www-data   php /path/to/typo3/cli_dispatch.phpsh crawler
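
Since this job is started every minute, a run that takes longer than that could overlap with the next one. Whether the crawler guards against this by itself is not covered here, so the following is only a hedged sketch of a simple lock around the command. The helper name run_exclusive and the lock path are assumptions made for this example.

```shell
#!/bin/sh
# run_exclusive LOCKDIR COMMAND [ARGS...]
# Runs COMMAND only if LOCKDIR can be created; mkdir is atomic, so two
# concurrent callers cannot both obtain the lock.
# run_exclusive is a hypothetical helper, not part of TYPO3 or crawler.
run_exclusive() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "previous run still active, skipping" >&2
        return 0
    fi
    "$@"
    status=$?
    rmdir "$lockdir"
    return "$status"
}

# Usage from a wrapper script called by cron (lock path is an assumption):
# run_exclusive /tmp/typo3-crawler.lock php /path/to/typo3/cli_dispatch.phpsh crawler
```

Note that if the process is killed before rmdir runs, the lock directory stays behind and has to be removed by hand.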

Search Mask

Let’s start by creating a page that will contain the search result list. In the page properties, we should hide it in menus, as we will shortly add a search mask to each and every page of our website. Now we may add the indexed_search plugin as a content element:

To create the search mask itself, we may install the extension macina_searchbox just as we did for the two extensions related to content indexing. Then, in our TypoScript template, we have to add the code below in order to include the plugin:

plugin.tx_macinasearchbox_pi1 {
	pidSearchpage = 12
	templateFile = fileadmin/templates/search_template.html
}
 
lib.searchbox < plugin.tx_macinasearchbox_pi1

Parameter pidSearchpage is the ID of the page containing the indexed_search plugin. Parameter templateFile defaults to the file EXT:macina_searchbox/pi1/template.htm. You may create a local copy of it and customize it according to your needs.
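
Creating the local copy can be done from the command line. The sketch below assumes that the extension is installed locally, so that EXT:macina_searchbox resolves to typo3conf/ext/macina_searchbox under the site root; adjust the source path if your installation differs. The helper name copy_template is made up for this example.

```shell
#!/bin/sh
# copy_template SRC DST
# Copies SRC to DST, creating the target directory if needed, but refuses
# to overwrite an existing DST so local customizations are not lost.
# copy_template is a hypothetical helper name.
copy_template() {
    src=$1
    dst=$2
    if [ -e "$dst" ]; then
        echo "$dst already exists, not overwriting" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dst")" && cp "$src" "$dst"
}

# Example, run from the site root (source path is an assumption):
# copy_template typo3conf/ext/macina_searchbox/pi1/template.htm \
#     fileadmin/templates/search_template.html
```

The target path matches the templateFile value used in the TypoScript above.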

Further Reading

You may read the documentation of the extension indexed_search if you wish to gain more control over the search results and the indexing options.
