Interactive query expansion for professional search applications

Knowledge workers (such as healthcare information professionals, patent agents and recruitment professionals) undertake work tasks where search forms a core part of their duties. In these instances, the search task is often complex and time-consuming and requires specialist expert knowledge to formulate accurate search strategies. Interactive features such as query expansion can play a key role in supporting these tasks. However, generating query suggestions within a professional search context requires that consideration be given to the specialist, structured nature of the search strategies they employ. In this paper, we investigate a variety of query expansion methods applied to a collection of Boolean search strategies drawn from real-world professional search tasks. The results demonstrate the utility of context-free distributional language models and the value of using linguistic cues to optimise the balance between precision and recall.


Introduction
Many knowledge workers rely on the effective use of search applications in the course of their professional duties (Verberne et al., 2019). For example, healthcare information professionals perform systematic searching of published literature sources as the foundation of evidence-based medicine (Russell-Rose & Chamberlain, 2017). Likewise, patent agents rely on prior art search as the foundation of their due diligence process (Lupu et al., 2011). Similarly, recruitment professionals use Boolean search as the foundation of the candidate sourcing process (Russell-Rose & Chamberlain, 2016a).
However, systematic literature reviews can take years to complete (Bastian et al., 2010), and new research findings may be published in the interim, leading to a lack of currency and potential for inaccuracy (Shojania et al., 2007). Likewise, patent infringement suits have been filed at a rate of more than 10 a day due to the later discovery of prior art that the original search missed (Gibbs, 2006). And recruitment professionals report that finding candidates with appropriate skills and experience continues to be their primary concern (Russell-Rose & Chamberlain, 2016b). Each of these domains has its own expert competencies and communities of practice, but conceptually they share a need to execute searches that are comprehensive, transparent and reproducible (Mullins et al., 2014). It is this common need that motivates the work described in this paper.
There is another motivation for our work, and that is the discrepancy between academic research in information systems and actual industry use cases. It has been pointed out that evaluation in academic projects tends to focus on idealised tasks that are less complex than those found in industry, and that investigating realistic use cases is a fundamental step in bridging the gap between academia and industry (Karlgren, 2019). We see our work as a contribution toward this goal. The traditional solution to structured search problems is to use form-based query builders such as that shown in Figure 1. The output of these tools is typically a series of Boolean expressions consisting of keywords, operators and ontology terms, which are combined to form a multi-line artefact known as a search strategy (Figure 2). In this paper, we review the role of query expansion within the context of professional structured search strategies. We investigate a number of techniques for generating interactive query suggestions, and evaluate them using a variety of real-world data. The guiding principle in our evaluation is to provide replicable experiments that will also serve as a benchmark for future investigations.

Background

Professional search
The term 'professional search' refers to search for information in a work context, which often involves complex information needs, the use of multiple repositories and the incorporation of domain-specific taxonomies or vocabularies (Verberne et al., 2018) or a combination of different relevance criteria (Jiaming Qu et al., 2020). Various authors have provided descriptive and behavioral definitions of the term (see (Russell-Rose et al., 2018) for an overview). One of the earliest definitions was proposed by Koster et al. (2009), whereby professional search:
• Is performed by a professional for financial compensation;
• Is within a particular domain and/or area of expertise;
• Has a specified brief, which is typically well defined but complex;
• Has a high value outcome where the results will reduce risk, provide assurances, etc.;
• Has budgetary constraints such as time and money.
A key distinction between professional search tasks and other kinds of search tasks, such as casual search (Elsweiler et al., 2012) and web search (Broder, 2002), is that the latter:
• Are typically performed on a discretionary basis;
• Are not necessarily performed by an expert searcher or domain expert;
• Do not place at stake the professional reputation of the searcher.
There is a long history of study into how professionals search in Boolean environments (e.g. Hersh et al., 2001). However, professional search gained renewed momentum around a decade ago with the introduction of the TREC Legal Track, which focused on e-discovery (Baron et al., 2006), followed later by the TREC Total Recall Track (Grossman et al., 2016). In recent years there has been renewed interest in systematic literature searching, from both a theoretical (Scells et al., 2020) and a practical perspective (Scells & Zuccon, 2018).
Given the complexity of professional search tasks and their reliance on specialist terminology, query expansion offers a natural approach to assisting the searcher (Liu et al., 2011). Query expansion is the process of reformulating or augmenting a user's query in order to increase its effectiveness (Manning et al., 2008). Many web search engines, for example, offer query expansion in the form of auto-complete suggestions. Ruthven found, however, that searchers can have difficulty in identifying useful terms for effective expansion (Ruthven, 2003). Despite this, query suggestions can still be useful, as they can help in the search process even if they are not actively selected (Kelly et al., 2009).

Query Expansion
The primary methods for query expansion are referred to as either local (based on documents retrieved by the query) or global (using resources independent of the query).Selection of suggested expansion terms can be either automated (applied without explicit user interaction) or interactive (guided by the user).
Global methods involve the use of resources such as thesauri, controlled vocabularies or ontologies to identify related terms in the form of synonyms, hypernyms, hyponyms, etc. (Aggarwal & Buitelaar, 2012). Such resources may be either manually curated or generated from text corpora using distributional methods. Automated global methods can increase recall significantly but may also reduce precision by adding irrelevant or out-of-domain terms to the query (Manning et al., 2008).
Ontologies are more useful for query expansion when they are specific to the task domain. Generic resources such as WordNet are considered less useful and may not distinguish class concepts from instances (Bhogal et al., 2007). Some ontologies offer an additional source of related terms in the form of words occurring in the term definitions (Navigli & Velardi, 2003). In the biomedical domain, expanding queries with related MeSH terms has been shown to be useful (Rivas et al., 2014), while adding synonyms from the more comprehensive UMLS has been found to improve recall (Griffon et al., 2012) at the expense of precision (Zeng et al., 2012). Query expansion in this context can benefit from incorporating a range of domain-specific knowledge bases rather than tapping into a single source, e.g. (Balaneshinkordan & Kotov, 2019).
The development of efficient distributional methods has revolutionized unsupervised natural language processing techniques for finding related terms (Collobert et al., 2011; T. Mikolov et al., 2013). Consequently, a number of researchers have considered the utility of word embeddings for query expansion. Kuzi et al. (2016), Roy et al. (2016) and Diaz et al. (2016) all used local embeddings trained on TREC corpora, with differing results. While Kuzi et al. (2016) found that local word embeddings outperformed the standard RM3 relevance model, Roy et al. (2016) found the opposite. Diaz et al. (2016) compared local embeddings (TREC corpus) with global embeddings (generic Gigaword corpus) and found that local embeddings provided significantly better results for query expansion. More recently, contextual embeddings, such as those based on BERT, have transformed the state of the art not only in natural language processing (Devlin et al., 2019) but also in information retrieval (Lin, 2019; Mitra & Craswell, 2018). However, given the nature of our research, where we expand query terms on an individual basis, we focus on context-free embeddings in our experiments.
A fundamental problem with most query expansion techniques is that queries may be harmed as well as improved (Xiong & Callan, 2015). In addition, with fully automated techniques the user may be unable to control how the expansion terms are applied. Moreover, Cao et al. (2008) argue that previous work considers only the effect of a complete set of expansion terms on retrieval, and ignores the issue of how to distinguish useful expansion terms from non-useful ones, e.g. as explored in (Gooda Sahib et al., 2010). We address these issues by treating query expansion as a recommendation task, i.e. given a query term entered by the user, can we recommend further relevant terms? Framing the task in this way is significant, since an interactive approach allows the user to exercise a more informed judgement regarding both term selection and application within a structured search strategy. More broadly, this approach aligns with the goal of offering state-of-the-art query support in professional search while preserving transparency and interpretability (J. Qu et al., 2021).
This approach also reflects a broader evolution in search systems, from supporting simple lookup tasks to more complex, exploratory information-seeking behaviours (White & Roth, 2009; White, 2016).

Application Context
Query suggestions are a common feature of many web search engines, and have served as the focus of many research studies, e.g. (Efthimiadis, 1996; Tahery & Farzi, 2020). Since search queries on the web typically consist of short sequences of keywords with little or no linguistic structure (Beitzel et al., 2007; Kumar et al., 2020), term suggestions can offer immediate value as either an addition to the current query or as a wholesale replacement (Kruschwitz et al., 2013).
Although there have been studies investigating query expansion within a professional search context, e.g. Kim et al. (2011) and Verberne et al. (2014, 2016), examples of commercial systems in production are relatively rare. This may be due in part to the challenges presented by the structured nature of the queries themselves. For example, when sourcing candidates for a client brief, recruiters might use a structured query such as that shown in Figure 3.
Java AND (Design OR develop OR code OR Program) AND ("* Engineer" OR MTS OR "* Develop*" OR Scientist OR technologist) AND (J2EE OR Struts OR Spring) AND (Algorithm OR "Data Structure" OR PS OR "Problem Solving")

For a query such as this, it is no longer sufficient to offer suggested terms as simple additions or as wholesale replacements. Instead, term suggestions must not only be relevant, but also specific to the structured nature of the query and the individual subexpressions it contains. In the above example, query suggestions relevant to the first subexpression would be quite inappropriate for the second subexpression.
In addition, the professional search context introduces a number of other important considerations regarding the evaluation process:
• Many use cases are oriented towards high-recall, set retrieval tasks (Tait, 2014), so evaluation methods based on the relevance ranking of search results are less appropriate.
• The suggested terms are scoped to a subexpression within a larger search strategy, so the evaluation must consider the specific context of each subexpression.
• Professional searchers may wish to select and apply expansion terms individually, so the evaluation should consider the contribution of each term individually rather than the effect of a candidate set as a whole.
We have therefore structured our evaluation using an approach based on previous query suggestion studies (Albakour et al., 2011; Adeyanju et al., 2012), in which existing, human-generated resources are treated as a 'gold standard'. In this context, the task of the query suggestion system is to predict steps in a sequence, e.g. queries submitted by a user in the context of past interactions. Gold standard resources for this are often sampled from query logs (where available). This is in principle similar to evaluating chatbot responses, e.g. (Y. Wu et al., 2019), or news recommendation systems, e.g. (F. Wu et al., 2020), using logged interactions for evaluation purposes. In our case, a gold standard exists in the form of published search strategies. In this context, the evaluation process measures the extent to which terms found in those strategies can be predicted. For example, given the term rodent in line 2 of the strategy of Figure 2, we measure the extent to which the related terms rat, rats, mouse, and mice can be predicted. This particular example contains five such disjunctions (lines 2, 3, 6, 7 and 10), so it offers five opportunities for evaluation. Moreover, since we use publicly available sources (rather than, for example, proprietary log data), our experiments can be more easily replicated by others.

Arguably, an ideal test collection for such an evaluation would contain search strategies curated specifically for the purpose. However, whilst such a resource may prove necessary, it may not be sufficient. For example, an ideal test collection should also include:
• Search strategies which are actively maintained and updated by the professional community (as opposed to purely archival collections);
• Search strategies from more than one domain, to allow investigation of the extent to which domain-specific resources will generalise to other domains.
There is no single collection that meets all three criteria of being curated, in current use, and cross-domain. For our test collection we therefore aggregated samples from the following resources:

1. The CLEF 2017 eHealth Lab (Goeuriot et al., 2017) is an evaluation initiative which includes a curated set of 20 topics for Diagnostic Test Accuracy (DTA) reviews. Each of these topics includes a manually constructed search strategy created by subject matter experts. The 20 search strategies in this collection yielded 102 disjunctions containing 898 terms (i.e. a mean of 8.80 terms per disjunction). Each term consists of a mean of 1.40 tokens.

2. The SIGN search filters is an actively maintained collection of "pre-tested strategies that identify the higher quality evidence from the vast amounts of literature indexed in the major medical databases". It covers six study types: Systematic reviews, Randomised controlled trials, Observational studies, Diagnostic studies, Economic studies, and Patient issues. We also consulted the InterTASC Information Specialists' Sub-Group, who maintain a 'Search Filter Resource' as a 'collaborative venture to identify, assess and test search filters designed to retrieve research by study design or focus'. On their advice, we augmented our collection with two further strategies on the topics of Diagnostic Studies and Economic Evaluations which had been the subject of expert reviews (J. Glanville, personal communication, November 1, 2017). This resulted in a total of eight actively maintained strategies, consisting of 47 disjunctions containing 355 terms (i.e. a mean of 7.55 terms per disjunction). Each term consists of a mean of 1.70 tokens.

3. A collection of recruitment search strategies collected by Glen Cathey to address a specific recruitment brief. After deduplication, these two sources in combination yielded a total of 46 search strategies, containing 80 disjunctions with 571 terms (a mean of 7.15 terms per disjunction). Each term consists of a mean of 1.38 tokens.
In aggregate, these three sources represent data that is curated, actively maintained, and specific to more than one domain. In sum they contain a total of 74 expert search strategies consisting of 229 disjunctions and 1,824 individual query terms. To the best of our knowledge, our experiments represent the first study of this scale and coverage.

An example search strategy from the recruitment domain is shown in Figure 3 above. Examples from the CLEF data set and the SIGN data set are shown in Figures 4 and 5 respectively.

Materials and methods
As discussed above, in our experimental setup we investigate the extent to which different methods can predict gold standard data in the form of human-generated search strategies.
We consider a variety of methods, as follows:
1. Noun phrases extracted from the result snippets of a commercial search engine (as a baseline)
2. Cluster labels generated by automatically clustering search result snippets
3. Related terms extracted from manually curated ontologies
4. Terms extracted from abstracts found within manually curated ontologies
5. Terms generated using context-free distributional language models trained on text corpora
We also investigate combining the above results in a variety of configurations.

Search result snippets
As a baseline, we hypothesize that the top matching results from a commercial search engine will provide a useful source of query suggestions, e.g. (Kruschwitz et al., 2009; Song et al., 2014). We extracted noun phrases from the top ten snippets returned by a popular Web search engine, in this case Google, restricting the search to domain-specific sites, e.g. PubMed (for healthcare data) and Indeed (for recruitment data). We identify phrases using the noun phrase extraction API of TextBlob, which in turn utilizes methods provided by the Natural Language Toolkit (Loper & Bird, 2002).
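To illustrate the underlying idea, the following is a minimal, self-contained sketch of extracting candidate phrases from result snippets. It uses a simple stopword-split heuristic as a stand-in for the TextBlob noun phrase API used in the paper; the snippets and stopword list are illustrative assumptions:

```python
import re

# Illustrative stopword list; a real implementation would use a fuller set.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "or", "in", "on", "to",
             "with", "is", "are", "was", "by", "from", "that", "as", "at"}

def candidate_phrases(snippet):
    """Split a snippet on stopwords and punctuation, returning the
    remaining contiguous word runs as candidate phrases."""
    words = re.split(r"[^A-Za-z0-9*'-]+", snippet.lower())
    phrases, current = [], []
    for w in words:
        if not w or w in STOPWORDS:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(" ".join(current))
    return phrases

# Illustrative snippets standing in for search engine results.
snippets = [
    "Systematic review of randomised controlled trials in primary care.",
    "A controlled vocabulary for indexing biomedical literature.",
]
# Keep only multi-word phrases as term suggestions.
suggestions = {p for s in snippets for p in candidate_phrases(s) if " " in p}
```

A production system would replace `candidate_phrases` with a proper noun phrase chunker (e.g. TextBlob's, backed by NLTK POS tagging), but the snippet-to-suggestion flow is the same.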

Cluster labels
Clustering tools may be used to generate query suggestions in the form of cluster labels generated from search result snippets. We used a popular, freely available clustering tool, Carrot2 (Stefanowski & Weiss, 2003), configured with the default settings for number of results and minimum cluster size, and then queried PubMed (for the healthcare data) and Wikipedia (for the recruitment data) to generate cluster labels using three different clustering algorithms (kMeans, Lingo and suffix tree clustering). (Carrot2 also offers additional search feeds through a partnership with the etools.ch metasearch engine, but these impose IP-based blocking and rate limiting that preclude systematic testing.) Evidently, there is scope to customize this process further, but our intent at this stage is to explore the underlying principle and provide a comparative baseline.

Ontological relations
Query suggestions can be generated by querying manually curated ontological resources to identify related terms in the form of hypernyms, hyponyms etc. Many such resources are hosted on the web as Linked Open Data, and support access via structured query languages such as SPARQL. Some are structured as formal ontologies (modelling subsumption and other relations), others as controlled vocabularies and thesauri. We investigated a variety of such resources, of both a general-purpose and domain-specific nature. Given their wide coverage and generic nature, the first two resources may be considered general-purpose, and the latter four as domain-specific (to healthcare):
1. DBpedia is a project aiming to extract structured content from Wikipedia (Gangemi et al., 2018). The DBpedia data set describes 4.58 million entities, of which 4.22 million are classified in a consistent ontology.
2. WebISA (Seitner et al., 2016) is a publicly available database containing hypernymy relations extracted from the CommonCrawl web corpus. The LOD version contains 11.7 million hypernymy relations, each provided with rich provenance information and confidence estimates.
3. Medical Subject Headings (MeSH) is a controlled vocabulary for the purpose of indexing documents in the life sciences. It contains a total of 25,186 subject headings, which are accompanied by a short description or definition, links to related descriptors, and a list of synonyms or very similar terms.
4. RxNorm is a terminology that contains all medications available on the US market. It has concepts for drug ingredients, clinical drugs and dose forms.
5. The British National Formulary (BNF) is a pharmaceutical reference that contains information about medicines available on the UK National Health Service (NHS).
6. The DrugBank database is an online database containing information on drugs and drug targets. The latest release of DrugBank contains 11,683 drug entries, including 1,117 approved biotech drugs, 128 nutraceuticals and over 5,505 experimental drugs.
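As a sketch of how such Linked Open Data resources can be queried over SPARQL, the following builds a query for terms related to a given label. The endpoint URL and the use of rdfs:label/skos:broader are illustrative assumptions; each resource models its relations differently:

```python
# Public DBpedia SPARQL endpoint (illustrative; any LOD endpoint works).
ENDPOINT = "https://dbpedia.org/sparql"

def related_terms_query(label, lang="en"):
    """Build a SPARQL query retrieving labels of concepts linked to
    `label` by skos:broader in either direction, i.e. candidate
    hypernyms and hyponyms for query expansion."""
    return f"""
    SELECT DISTINCT ?relatedLabel WHERE {{
      ?c rdfs:label "{label}"@{lang} .
      {{ ?c skos:broader ?r }} UNION {{ ?r skos:broader ?c }}
      ?r rdfs:label ?relatedLabel .
      FILTER (lang(?relatedLabel) = "{lang}")
    }} LIMIT 20
    """

query = related_terms_query("Machine learning")
# The query string can then be sent to ENDPOINT with any SPARQL client;
# the returned labels become interactive term suggestions.
```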

Context-free distributional language models
Word embeddings, a class of techniques in which individual words or phrases are represented as real-valued dense vectors in a predefined vector space, have become the de facto representation standard in many NLP applications (Jurafsky & Martin, 2020). Since they model the distributional patterns of words, they can be used to generate query suggestions in the form of related terms. Word embeddings can be learned from text corpora using a variety of techniques, e.g. word2vec (T. Mikolov et al., 2013), GloVe (Pennington et al., 2014), FastText (Bojanowski et al., 2017), BERT (Devlin et al., 2019), etc. A number of publicly available, pre-built embedding models exist, trained on sources such as Wikipedia (Pennington et al., 2014), GoogleNews (Tomas Mikolov et al., 2013), and PubMed (Chiu et al., 2016). Given that our evaluation approach considers query terms in isolation, we do not deploy contextual embeddings (such as BERT) but investigate the following context-free embeddings:
• Word2vec trained on Google news (Tomas Mikolov et al., 2013)
• GloVe trained on Wikipedia + Gigaword5 (Pennington et al., 2014)
• FastText trained on Wikipedia (Bojanowski et al., 2017)
• Word2vec trained on PubMed articles, with different window sizes (2 and 30) (Chiu et al., 2016)
We also built bespoke models using the PubMed Open Access full text snapshot from September 2017, which consisted of 944,672 full-text articles. Using an initial test set we identified the optimal parameter settings as dimensions=300, window size=5, min word count=10. We created two bespoke Word2vec models: one which consisted solely of unigrams, and a second which also included bigrams and trigrams.
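To illustrate how a context-free embedding model yields query suggestions, here is a minimal nearest-neighbour lookup by cosine similarity. The toy three-dimensional vectors are illustrative stand-ins for a trained word2vec/GloVe/FastText model:

```python
import math

# Toy context-free embeddings; in practice these come from a model such
# as word2vec trained on PubMed (dimensions=300 in our bespoke setup).
vectors = {
    "rat":   [0.90, 0.10, 0.00],
    "rats":  [0.88, 0.12, 0.02],
    "mouse": [0.80, 0.20, 0.05],
    "java":  [0.00, 0.10, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(term, k=3):
    """Rank all other vocabulary terms by cosine similarity to `term`,
    mirroring the most_similar query offered by word2vec toolkits."""
    q = vectors[term]
    ranked = sorted(
        ((cosine(q, v), w) for w, v in vectors.items() if w != term),
        reverse=True,
    )
    return [w for _, w in ranked[:k]]
```

With these vectors, `most_similar("rat")` ranks `rats` and `mouse` ahead of the out-of-domain `java`, which is exactly the behaviour we rely on when offering expansion terms for a disjunction.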

Results
Our overall evaluation approach was as follows: for every strategy in our test collection, we iterate over each disjunction and calculate precision, recall and F score for each term, based on the overlap between the suggested term set and the gold standard. Although search strategies may also contain conjunctions and other expressions, these are in general not a useful part of the gold standard data as they do not represent sets of synonyms or closely related terms. We then repeat this process for each method, and report performance in terms of average (arithmetic mean) precision, recall and F score. We test for significance using one-way ANOVA, and report values where p < 0.01. Although these figures may appear low in absolute terms, they are in line with the findings of similar studies applying the same methodology to digital libraries (Kruschwitz et al., 2009) and local websites and intranets (Adeyanju et al., 2012). This reflects the difficulty of predicting query suggestions based on a ground truth of nothing more than the terms found in existing disjunctions. Moreover, they represent a likely underestimate of performance, since some of the terms identified as false positives may transpire to be acceptable in a real task scenario (see Discussion).
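The per-disjunction scoring described above can be sketched as follows (the suggested and gold term sets are illustrative; the gold set mirrors the rodent example from Figure 2):

```python
def prf(suggested, gold):
    """Set-overlap precision, recall and F1 between a suggested term set
    and the gold-standard terms of one disjunction."""
    suggested, gold = set(suggested), set(gold)
    tp = len(suggested & gold)                 # true positives
    p = tp / len(suggested) if suggested else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Gold terms as in line 2 of the strategy in Figure 2.
gold = {"rat", "rats", "mouse", "mice"}
# Illustrative suggestions from some expansion method.
suggested = {"rats", "mouse", "hamster", "gerbil", "murine"}
p, r, f = prf(suggested, gold)
```

Averaging these scores over all 229 disjunctions, per method, gives the mean P, R and F figures reported below.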

Cluster labels
Carrot2 supports three clustering algorithms: Lingo, suffix tree clustering (STC), and kMeans. The means of P, R and F for these three algorithms are shown in the corresponding results table. The results for the language models are shown in Table 6. Overall, these scores are generally higher than those of the previous methods. Comparing F scores shows that the choice of model has a significant effect on performance, although the pattern is inconsistent: the bespoke PubMed unigram model performs best on CLEF, F(6, 6279) = 27.49, p < 0.01, while the bespoke PubMed trigram model performs best on SIGN, F(6, 2478) = 6.19, p < 0.01. Their performance is comparable to that of Word2vec+PubMed (win30) (Chiu et al., 2016), which provides some evidence for the reproducibility of these results. Comparing the three generic models on recruitment data, GloVe+Wikipedia performs best, F(2, 1710) = 19.78, p < 0.01.
These results illustrate the value of using domain-specific models (the lower half of the table) rather than generic models (the upper half). The fact that the two bespoke models outperformed the pre-trained models is also interesting (although for CLEF this difference is not significant). One possible explanation may be that the bespoke models were created using a relatively clean corpus which included only body text (i.e. no figures, headers, footers, etc.) and excluded numbers, punctuation and non-alphabetic characters.

Combining sources
A primary motivation for the work in this paper is to facilitate the development of practical applications (as opposed to adopting a purely academic perspective).With this in mind, the following section explores how to make optimal use of different resources in a variety of combinations.
For example, it may be possible to improve performance (particularly in terms of recall) by combining results from two or more sources.Evidently, the nature of that improvement will depend on the particular services being combined and the way in which their respective result sets intersect.Not only does this present an interesting theoretical question, but it also offers the prospect of significant impact on a large proportion of the professional search community.In this section we investigate the effects of combining the best performing curated resources with the best performing language models.

Simple aggregation
The simplest form of aggregation is to combine two term suggestion sets as a 'bag of words' (note that since the evaluation is based on set overlap their order is not significant).Table 7 shows the results of applying a combination of the DBpedia ontology and the GloVe+Wikipedia language model to recruitment data (also showing the results for each method in isolation).
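Simple aggregation amounts to a set union of the individual suggestion lists; a minimal sketch (with illustrative term sets standing in for the DBpedia and GloVe outputs):

```python
def aggregate(*suggestion_sets):
    """'Bag of words' aggregation: the union of the suggestion sets.
    Order is irrelevant because evaluation is based on set overlap."""
    combined = set()
    for s in suggestion_sets:
        combined |= set(s)
    return combined

# Illustrative suggestion sets for one query term.
ontology_terms = {"business analyst", "systems analyst"}
embedding_terms = {"analyst", "consultant"}
agg1 = aggregate(ontology_terms, embedding_terms)
```

The union can only add true positives (helping recall), but every extra non-gold term dilutes precision, which is exactly the trade-off observed in Table 7.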
In this instance, combining the two sources improves recall, but at the expense of precision, with a decrease in F score (compared to GloVe in isolation). Comparing F scores shows that aggregation has a significant effect on performance, F(2, 1710) = 20.14, p < 0.01. One possible explanation for the positive effect of aggregation on recall is that language models tend to learn robust representations for frequent terms, which tends to favour unigrams. By contrast, manually curated ontologies tend to provide better coverage of higher-order ngrams (bigrams and above), reflecting their focus on named entities and other specialist terminology. To test this hypothesis, we implemented two further combinations which exploited ngram order in finding related terms. Both approaches represent back-off algorithms of the sort that have long been popular in a variety of NLP applications where data sparsity is an issue (Manning et al., 2008):

What these approaches have in common is that curated resources are only used for higher-order ngrams (bigrams and above). Where they differ is that in the second variation the language model is only used if the curated ontology returned no results or if the term is a unigram. Table 9 shows the results of this approach, along with the results from the approaches above (repeated here for convenience): the best performing curated ontology (MeSH for healthcare, and DBpedia for recruitment); the best performing language model (PubMed trigram for healthcare, GloVe for recruitment); and simple aggregation (shown here as 'Agg1'). The lower two rows show the results for 'loose pipelining' (Agg2) and 'strict pipelining' (Agg3). The results show that simple aggregation (Agg1) consistently produces the highest recall, which reflects the undifferentiated, broader nature of a combined suggested terms list. Conversely, 'strict pipelining' (Agg3) consistently produces the highest precision, which supports the hypothesis that ngram order can be exploited when finding related terms. Moreover, the F scores show that it is possible to combine suggestions from different sources using strict pipelining to deliver a more effective balance of precision and recall.
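A minimal sketch of the 'strict pipelining' (Agg3) back-off, assuming `ontology_lookup` and `embedding_lookup` are functions returning suggestion sets for a term (the lookup tables below are illustrative):

```python
def strict_pipeline(term, ontology_lookup, embedding_lookup):
    """'Strict pipelining' back-off: query the curated ontology for
    higher-order ngrams (bigrams and above); fall back to the language
    model only for unigrams or when the ontology returns nothing."""
    if len(term.split()) > 1:
        suggestions = ontology_lookup(term)
        if suggestions:
            return suggestions
    return embedding_lookup(term)

# Illustrative lookup tables standing in for DBpedia and GloVe.
_ONTOLOGY = {"business analyst": {"systems analyst", "BA"}}
_EMBEDDING = {"java": {"j2ee", "scala"},
              "business analyst": {"analyst"}}

def onto(term):
    return _ONTOLOGY.get(term, set())

def emb(term):
    return _EMBEDDING.get(term, set())
```

Here the bigram "business analyst" is answered by the ontology alone, while the unigram "java" goes straight to the embedding model, keeping the precision-oriented behaviour described above.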

Discussion
We will approach the discussion from a number of different angles representing different variables in our experimental setup. First of all we frame the discussion by reviewing some of the key assumptions behind this type of study and how it differs from prior studies. It is important to recognise that although the use of query expansion has been the subject of many studies, relatively few have focused explicitly on the professional search context. Moreover, the few that have done so are generally predicated on the assumption that users will adopt a simplistic approach based on unstructured keyword queries, e.g. (Lu et al., 2009). To the best of our knowledge this is the first study of this scale to evaluate interactive expansion within the context of structured queries using publicly available, human-generated search strategies.

Turning to the results themselves, we may make a few general observations. First, although some of the results may appear low in absolute terms (e.g. a maximum F-score of 0.086), the key observation is that relative differences are statistically significant and generalisable. Moreover, despite the ostensibly modest absolute values, the potential impact on professional search practice could be significant: with patent search tasks taking a median of 12 hours to complete (Russell-Rose et al., 2018), even a 10% saving due to improved query formulation would translate to 1.2 hours of billable time per task. Likewise, librarians spend an average aggregated time of 26.9 hours on systematic reviews, most of which is spent on search strategy development and translation (Bullers et al., 2018). Query expansion is known to be highly valued by healthcare information professionals, so the adoption of even imperfect query suggestion techniques could lead to considerable impact.
Comparing the different techniques, we see that the use of language models outperforms methods based on manually curated resources. This includes both ontological relations and terms extracted from abstracts or definitions. It is possible, of course, that other human-curated resources may offer improved performance, e.g. ConceptNet, Wikidata, etc. However, the six sources investigated in this study offer a reasonable basis for comparison, and the investigation of additional resources is suggested as an area for further work.
In addition to the above, the practice of combining sources offers the prospect of further improvement, with simple aggregation having a consistently positive and significant effect on recall across all data sets. Moreover, it is possible to deliver a better balance between precision and recall by utilizing ngram order in the combination, e.g. using strict pipelining to optimise for precision.
It is also important to recognise that the results represent a lower bound on potential performance, since some of the terms identified as false positives may transpire to be true positives in a live task scenario. For example, consider the first disjunction in the recruitment data set. Arguably, the suggested terms 'BA', 'Software business analyst', 'Business systems analyst' and 'Analyst' are all true positives. However, due to the offline evaluation process, all of them apart from 'Analyst' are labelled as false positives, resulting in a precision of 0.1 instead of 0.4. Moreover, had the term 'BA' (a common abbreviation for 'business analyst') been included in the original disjunction, the recall would be 0.333 instead of 0.2.
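The arithmetic of this example can be reproduced in a few lines. The filler terms (`other1`, `term3`, etc.) and the exact make-up of the disjunction are hypothetical, chosen only so that the counts match the figures above (a five-term disjunction and ten suggestions):

```python
def precision_recall(suggested, relevant):
    """Offline scoring: a suggestion is a true positive iff it appears
    (case-insensitively) among the relevant terms; recall is measured
    against the full set of relevant terms."""
    relevant = {t.lower() for t in relevant}
    tp = sum(1 for t in suggested if t.lower() in relevant)
    return tp / len(suggested), tp / len(relevant)

# Ten suggestions, four of which are arguably relevant; counting all
# four as true positives would give a precision of 4/10 = 0.4.
suggested = ["BA", "Software business analyst", "Business systems analyst",
             "Analyst", "other1", "other2", "other3", "other4", "other5", "other6"]

# Offline, only terms present in the original five-term disjunction count.
disjunction = ["Business analyst", "Analyst", "term3", "term4", "term5"]
p, r = precision_recall(suggested, disjunction)             # p = 0.1, r = 0.2

# Had 'BA' been included in the original disjunction:
p2, r2 = precision_recall(suggested, disjunction + ["BA"])  # r2 = 2/6 ≈ 0.333
```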
This observation brings us naturally to the limitations of this study. Although the test data represents a sizable collection of search strategies, there is no guarantee that they are optimal, i.e. that they represent an 'ideal' articulation of the underlying information needs. Indeed, the very fact that they were created without access to the type of query formulation techniques proposed in this paper implies that they are less than 'perfect'. However, this does not mean they are without value: the majority are drawn from hand-curated, published and publicly maintained sources, and represent the work of trained experts. They may not be ideal, but they are representative of a broader population, and in this respect we believe they are a valid approximation of professional search behaviour.
Evidently, to accurately evaluate how real users would react in a live task scenario, it is necessary to set up a user study involving representative human participants. This is of course more expensive and time-consuming, and user studies can be more challenging to scale and replicate. In this respect, the value of this study lies in investigating a diverse set of techniques using human-generated search strategies as a proxy for human behaviour. As such, it offers a scalable and reproducible approach which allows more expensive online studies to be better focused on specific issues and tasks.
A further limitation of this study is that we have treated disjunctions ('OR' clauses) in the data as the primary unit of analysis. Evidently, search strategies contain other types of construction (e.g. conjunctions, negations, etc.) and these may offer additional evaluation possibilities. Finally, our use of live, publicly available LOD endpoints facilitates transparency and reproducibility, but at the expense of occasional latency issues and timeouts. To mitigate this issue, all runs were replicated at least once to ensure consistency and reproducibility.

Conclusions and further work
In this paper, we review the role of query suggestions within the context of professional search strategies used in real-world expert search tasks. We investigate a number of techniques for generating query suggestions, and evaluate them using a variety of data sources. We now draw conclusions in relation to the original research questions set out in Section 3.

1. To what extent can methods based on manually curated ontologies provide suitable query suggestions for professional search?
We found that the ontological relations in generic, manually curated resources such as DBpedia outperformed the baseline of search results snippets for healthcare search strategies but not for recruitment search strategies. Even when using domain-specific resources, the performance was poorer than that of extracting cluster labels from the search results snippets.
The use of terms extracted from abstracts and definitions was not shown to be consistently effective. When using generic resources (e.g. DBpedia), the results were an improvement over the baseline for healthcare but not for recruitment. Terms extracted from domain-specific resources consistently performed worse than the baseline.

2. To what extent can methods based on context-free distributional language models provide suitable query suggestions for professional search?
We found that context-free distributional language models outperformed the baseline for all data sets. They also outperformed the use of manually curated resources (whether used as a source of ontological relations or as a source of terms in abstracts/definitions). We also found that our own bespoke PubMed model outperformed the best of the third-party pre-built models on healthcare data. The best performing model on recruitment data was GloVe+Wikipedia.
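Mechanically, generating suggestions from a context-free model reduces to a nearest-neighbour search in embedding space. The sketch below illustrates this with a tiny hand-made vocabulary of toy 3-dimensional vectors; in practice one would load pretrained GloVe or word2vec vectors, and the words and numbers here are purely illustrative.

```python
import numpy as np

# Toy embedding table; a real system would load GloVe/word2vec vectors.
vocab = {
    "nurse":     np.array([0.9, 0.1, 0.0]),
    "midwife":   np.array([0.8, 0.2, 0.1]),
    "physician": np.array([0.7, 0.3, 0.0]),
    "patent":    np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def suggest(term, k=2):
    """Return the k vocabulary terms closest to `term` by cosine similarity."""
    q = vocab[term]
    scored = [(w, cosine(q, v)) for w, v in vocab.items() if w != term]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:k]]
```

Here `suggest("nurse")` returns the semantically nearby terms before the unrelated one, which is precisely the behaviour exploited when using such models as a source of query suggestions.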

3. To what extent can combining the above methods improve on the performance of either method in isolation?
We found that simple aggregation consistently produced higher recall than any method in isolation. This gave rise to a higher F score for both healthcare data sets, but not for the recruitment data, where the highest average F score remained that of GloVe+Wikipedia alone.
The use of aggregate methods showed that it is possible to exploit ngram order when finding related terms. 'Strict pipelining' consistently produced the highest precision and the highest overall F score, which demonstrates that it is possible to combine suggestions from different sources to deliver a better overall balance of precision and recall.

Future work
This work provides a benchmark set of results in an under-explored area for future experiments. A valuable next step would be to scale the work horizontally, e.g. to other curated resources (such as ConceptNet and Wikidata) or to other distributional models and frameworks. The NLP field is growing rapidly and new distributional approaches are continually being developed, so it may also be productive to explore other bespoke models, e.g. for recruitment data. Given the effectiveness of the context-free embeddings in our experiments and the impact of contextualised embeddings such as BERT (Devlin et al., 2019) across a variety of NLP tasks, a further next step may be to explore contextual embeddings, for example using neighbouring disjunction terms as context.
A further form of scaling is to investigate other domains: in this study we focused on healthcare and recruitment, aligning with two professions known to be among the heaviest users of complex Boolean queries. It would be interesting to extend this work to other professions such as patent search, competitive intelligence, and media monitoring (Russell-Rose et al., 2018).
Another possibility is to revisit the test data and explore constructs other than disjunctions (e.g. operators such as AND, ADJ, etc.). These were deemed out of scope due to their inconsistent semantics, but it is possible that other consistent types of relation may be identified which could form an additional focus for evaluation.
Finally, a further area for future work is to compare these findings with human judgements, as might be elicited via a user study. Such work could explore the degree to which our findings align with those of naturalistic use, and determine the extent to which false positives identified in our study may actually transpire to be true positives in live, interactive usage.

Fig 1. The World Health Organisation's Clinical Trials Search Portal

Fig. 2 An example patent search strategy

Fig 3. An example recruitment search strategy

There is no standard test collection for recruitment search; in fact, very little data of this type is made available publicly. However, there are various community initiatives to collect Boolean strings for recruitment, notably:
a. The Boolean Search Strings Repository: a communal collection of recruitment search strings curated by Irina Shamaeva
b. The Boolean Search String Experiment: a collection of Boolean strings

Table 1 shows the arithmetic mean of precision (P) and recall (R) and the F score (F) for the noun phrases extracted from Google snippets, with the highest F value highlighted in bold.

Table 1: Precision, recall and F for Google snippets

The results are shown in Table 2, with the highest F value in each column highlighted in bold (as in all the following tables).

Table 2: Precision, recall and F for Carrot2 cluster labels

Overall, STC performs best, with F values ranging from 0.05 (Recruitment) to 0.021 (SIGN). kMeans is consistently in second place and Lingo third. Comparing F scores shows that the choice of clustering algorithm has a significant effect on performance for CLEF, F(2, 2691) = 112.38, p < 0.01, for SIGN, F(2, 1062) = 54.05, p < 0.01, and for Recruitment, F(2, 1710) = 79.35, p < 0.01. Interestingly, this result runs contrary to the findings of Carrotsearch's own evaluation of cluster label quality, but this may reflect the difference between generating

Table 3: Precision, recall and F for manually curated resources

The results for the manually curated resources are shown in Table 3. Overall, these results are comparable with those of the Carrot2 cluster labels, with the highest F score being 0.033. Comparing F scores for the general purpose resources (DBpedia vs. WEBISA) shows a significant difference in favour of the former on all three data sets, particularly Recruitment, F(1, 1140) = 59.20, p < 0.01.

Table 4: Precision, recall and F for terms extracted from DBpedia abstracts

The results for keywords extracted from DBpedia abstracts are shown in Table 4. Overall, the scores are slightly lower than those of the manually curated terms. Comparing F scores shows that the keyword extraction algorithm has a significant effect on performance, with Textacy textrank returning the highest F measure (or equal highest) across all datasets: CLEF

Table 5: Precision, recall and F for terms extracted from MeSH descriptions

The results for keyword extraction applied to MeSH descriptions (using healthcare data) are shown in Table 5. In almost all cases, these scores are lower than the equivalent returned by DBpedia abstracts. Comparing F scores shows that the keyword extraction algorithm has a significant effect on performance for CLEF, with NCF regex performing best, F(5, 5382) = 10.46, p < 0.01. It also performs joint best on SIGN, although this effect is not significant.

Table 6: Precision, recall and F for distributional models

Table 7: Precision, recall and F for simple aggregation of terms from DBPEDIA and GloVe

Table 8 shows the results of combining the MeSH ontology with the word2vec PubMed trigram language model for healthcare (also showing the results for each method in isolation). In this instance, the combination offers improvements in both recall and F score for both data sets. Comparing F scores shows that the use of aggregation has a consistently positive and significant effect on performance on both CLEF, F(2, 2691) = 78.57, p < 0.01, and SIGN, F(2, 1062) = 5.36, p < 0.01.

Table 9: Precision, recall and F for combinations using backoff approaches