Posted by Małgorzata Brożyna on Jul 7, 2016 6:34:17 AM
Many search queries entered in the RoyaltyStat search form contain only one keyword or a short phrase, such as injection, inject*, mold*, form*, utility*, or wireless equipment. Single-keyword or single-phrase queries are combined using the simple operators AND, OR, and AND NOT to join the initial keywords, e.g. inject* OR mold* OR form* in a single query. Initially, users enter a prototypical word of a category or simply a word or phrase they know. They don't develop their search strategies further and don't appear to use our search guide. This behavior is known as the paradox of the active user.

Users can get better results by including all the morphological variants of the keywords in their queries. Instead, they tend to use a bare keyword, e.g. injection or wireless equipment, relying on stemming or lemmatization tools, which reduce a keyword to its stem or lemma. These tools do not give satisfactory results for homonyms or for words that begin with the same string of letters and could therefore be reduced to the same stem, e.g. plan and plane. When users do add a wildcard (*) at the end of a word or stem, they tend to put it in the wrong place, e.g. utility* instead of the more rewarding expression utilit*, which also matches utilities. In some cases it is better to list morphological variants than to use (*), because the wildcard matches too much, e.g. form* = {forms, formula, formulaic, formulate, formulation, former, formidable, etc.}. Furthermore, users decide to add morphological or lexical variants relatively late in a search. Of course, this behavior may be explained by users' habits when searching the Internet: Google provides AutoComplete and Bing offers its new decision engine in order to assist users and provide satisfactory results immediately. Not surprisingly, users expect to receive relevant results within a second when they search any kind of specialty database. Other examples show that users make language mistakes in their queries, e.g. desgin* and furiture*, or in formulating phrases, e.g. using different pre-modifiers in a single noun phrase, such as industry design and industrial design.
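
The effect of wildcard placement can be illustrated with a short Python sketch. It is not the RoyaltyStat search engine: the vocabulary list and the function name wildcard_matches are invented for illustration, and Python's fnmatch module stands in for the (*) operator.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary, invented for illustration only.
VOCABULARY = [
    "utility", "utilities", "utilitarian", "utilize",
    "form", "forms", "formula", "formulation", "former", "formidable",
]


def wildcard_matches(pattern, vocabulary=VOCABULARY):
    """Return the vocabulary words matched by a wildcard pattern such as 'utilit*'."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]


# Wildcard placed after the full word: the plural is missed.
print(wildcard_matches("utility*"))  # ['utility']

# Wildcard placed after the shared stem: the plural and a derived form are found.
print(wildcard_matches("utilit*"))   # ['utility', 'utilities', 'utilitarian']

# A very short stem matches unrelated words as well.
print(len(wildcard_matches("form*")))  # 6
```

In this toy vocabulary, utility* returns a single word, while form* also returns former and formidable, which is why an explicit list of variants can be the better choice.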

This short analysis of users' queries and behavior confirms that users need assistance in their searches. For this reason, we added several features to the RoyaltyStat keyword search engine. Together they constitute an interior search mechanism (a process of searching for additional entities that are not visible to users) with the following dimensions:

  • morphological – providing morphological forms of a word;
  • lexical – providing lexical variants (e.g. British English and American English) of a word or a term;
  • thematic – grouping words into clusters with reference to specific industries.

Thanks to this interior search mechanism, RoyaltyStat users entering a bare keyword can receive results containing all the morphological, spelling and lexical variants of the word, e.g. progestogens = {progestogens, progestagens, gestagens}, Hansen's disease = {Hansen's disease, Leprosy}.
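
The idea of expanding a bare keyword into its variants can be sketched as follows. This is only an illustration under stated assumptions, not the actual RoyaltyStat mechanism: the VARIANTS table holds just the two examples above, the function names expand and to_or_query are hypothetical, and quoting multi-word terms in the generated query is an assumed convention.

```python
# Hypothetical variant table; the two groups are the examples given in the text.
VARIANTS = [
    {"progestogens", "progestagens", "gestagens"},
    {"hansen's disease", "leprosy"},
]


def expand(keyword):
    """Return every known variant of the keyword, or the keyword itself if none is known."""
    key = keyword.strip().lower()
    for group in VARIANTS:
        if key in group:
            return sorted(group)
    return [key]


def to_or_query(keyword):
    """Join the variants with OR; multi-word terms are quoted (an assumed syntax)."""
    terms = expand(keyword)
    return " OR ".join(f'"{t}"' if " " in t else t for t in terms)


print(to_or_query("gestagens"))
# gestagens OR progestagens OR progestogens
print(to_or_query("Leprosy"))
# "hansen's disease" OR leprosy
```

The user types one word, and the expansion happens behind the scenes, which is what makes the mechanism "interior".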



What is more, the interior mechanism enables users to run more extensive searches, e.g. they can search for all words starting with the prefixes hemo- or haemo-, ending with the suffix -prazole, or containing an interfix, e.g. -virus-.
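
A prefix, suffix, and interfix search of this kind can be sketched in a few lines of Python. The term list and the function name affix_search are hypothetical, and an interfix is treated here as a string that occurs strictly inside a word, touching neither its first nor its last letter; the real mechanism may define it differently.

```python
# Hypothetical term list, invented for illustration only.
TERMS = ["hemoglobin", "haemophilia", "omeprazole", "pantoprazole",
         "adenoviruses", "aspirin"]


def affix_search(terms, prefixes=(), suffixes=(), interfixes=()):
    """Return the terms that match any given prefix, suffix, or interfix."""
    hits = []
    for term in terms:
        t = term.lower()
        if (any(t.startswith(p) for p in prefixes)
                or any(t.endswith(s) for s in suffixes)
                # Interfix: must occur inside the word, not at either edge.
                or any(i in t[1:-1] for i in interfixes)):
            hits.append(term)
    return hits


print(affix_search(TERMS, prefixes=("hemo", "haemo")))
# ['hemoglobin', 'haemophilia']
print(affix_search(TERMS, suffixes=("prazole",)))
# ['omeprazole', 'pantoprazole']
print(affix_search(TERMS, interfixes=("virus",)))
# ['adenoviruses']
```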



This interior search mechanism has been developed to save time and to correct typos and other language mistakes. It is much easier and faster to select a word than to type it in the search box. Moreover, including lexical variants and prefix, interfix, and suffix search increases the number of results, which helps users find relevant information “at full speed”.

Małgorzata Brożyna has a Ph.D. in Linguistics and a post-graduate diploma in Natural Language Processing (NLP).
She is a research associate at RoyaltyStat, managing pipeline research and quality control (QC), and an academic teacher at the Pedagogical University of Cracow (Poland). She can be contacted at:

