Research article
First published February 1992

The automatic identification of stop words

Abstract

A stop word may be identified as a word that has the same likelihood of occurring in those documents not relevant to a query as in those documents relevant to the query. In this paper we show how the concept of relevance may be replaced by the condition of being highly rated by a similarity measure. Thus it becomes possible to identify the stop words in a collection by automated statistical testing. We describe the nature of the statistical test as it is realized with a vector retrieval methodology based on the cosine coefficient of document-document similarity. As an example, this technique is then applied to a large MEDLINE subset in the area of biotechnology. The initial processing of this database involves a 310-word stop list of common non-content terms. Our technique is then applied and 75% of the remaining terms are identified as stop words. We compare retrieval with and without the removal of these stop words and find that, of the top twenty documents retrieved in response to a random query document, seventeen are the same on average for the two methods. We also examine the differences and conclude that where the user prefers one method over the other, the new method with the reduced term set is favored about three times out of four.
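The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not specify the statistical test, so the sketch substitutes a simple rate comparison. Document pairs whose cosine coefficient reaches a threshold stand in for "relevant" pairs, and a term is flagged when its presence in one document of such a pair makes it no more likely to appear in the other than it is to appear in a document overall. The function names, `sim_threshold`, `margin`, whitespace tokenization, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine coefficient between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def occurrence_rates(docs, sim_threshold=0.5):
    """For each term, return (rate in the similar partner, overall rate).

    The first value is the fraction of highly similar pairs in which the
    term, present in one document, also occurs in the other. The second
    is the fraction of all documents containing the term.
    """
    vectors = [Counter(d.lower().split()) for d in docs]
    n = len(vectors)
    # Pairs rated highly by the similarity measure (assumed threshold).
    similar = [(i, j) for i, j in combinations(range(n), 2)
               if cosine(vectors[i], vectors[j]) >= sim_threshold]
    doc_freq = Counter(t for v in vectors for t in v)
    rates = {}
    for term, df in doc_freq.items():
        hits = total = 0
        for i, j in similar:
            for src, dst in ((i, j), (j, i)):
                if term in vectors[src]:
                    total += 1
                    hits += term in vectors[dst]
        if total:  # skip terms never seen in any similar pair
            rates[term] = (hits / total, df / n)
    return rates


def stop_word_candidates(rates, margin=0.1):
    """Flag terms whose rate among similar partners does not exceed
    the overall rate by more than `margin` (a stand-in for a real test)."""
    return {t for t, (related, overall) in rates.items()
            if related - overall <= margin}


DOCS = [
    "the gene cloning vector",
    "the gene cloning plasmid",
    "the protein folding energy",
    "the protein folding pathway",
]
print(sorted(stop_word_candidates(occurrence_rates(DOCS))))
```

In this toy corpus, "the" occurs everywhere, so finding it in a similar document says nothing and it is flagged, whereas "gene" recurs in similar documents far more often than its overall rate and is kept. Terms that occur only once are flagged too, since they never recur in a similar document. A real implementation would replace the fixed margin with a proper significance test and a weighted vector model.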



Article first published: February 1992
Issue published: February 1992


History

Published online: February 1, 1992

Authors

Affiliations

W. John Wilbur
National Center for Biotechnology Information, Bethesda, MD, USA
Karl Sirotkin
National Center for Biotechnology Information, Bethesda, MD, USA


This article was published in Journal of Information Science.


