CHANGELOG.md

Corpus data and word lists

📅 2024/3/28 Created

Chinese stop words: stopwords-zh

More stop words: stopwords-misc

  • 6-stopwords-all.json

  • iso-stopwords-en.txt

  • iso-stopwords-iso.json

  • igorbrigadir

    | File / source | Size (words) | Description |
    | --- | --- | --- |
    | None | 0 | No stop word removal. |
    | Sphinx | 0 | Sphinx is an open source search server. The top Google result for "sphinx stopwords" also leads to two manually compiled lists (http://astellar.com/2011/12/stopwords-for-sphinx-search/), which are based on the blog author's posts. |
    | EBSCOhost | 24 | The stop words used in the EBSCOhost medical databases MEDLINE and CINAHL. |
    | CoreNLP (Hardcoded) | 28 | Hardcoded in src/edu/stanford/nlp/coref/data/WordLists.java and the same in src/edu/stanford/nlp/dcoref/Dictionaries.java. |
    | Ranks NL (Google) | 32 | “The short stopwords list below is based on what we believed to be Google stopwords a decade ago, based on words that were ignored if you would search for them in combination with another word. (ie. as in the phrase "a keyword").” |
    | Lucene, Solr, Elasticsearch | 33 | (Note: some config files have extra 's' and 't' as stopwords.) “An unmodifiable set containing some common English words that are not usually useful for searching.” |
    | MySQL (InnoDB) | 36 | “A word that is used by default as a stopword for FULLTEXT indexes on InnoDB tables. Not used if you override the default stopword processing with either the innodb_ft_server_stopword_table or the innodb_ft_user_stopword_table option.” |
    | Ovid (Medical information services) | 39 | “Words of little intrinsic meaning that occur too frequently to be useful in searching text are known as "stopwords." You cannot search for the following stopwords by themselves, but you can include them within phrases.” |
    | Bow (libbow, rainbow, arrow, crossbow) | 48 | Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering. Short list hardcoded. Also includes the 524-word SMART-derived list, same as MALLET. See http://www.cs.cmu.edu/~mccallum/bow/rainbow/ |
    | LingPipe | 76 | “An EnglishStopTokenizerFactory applies an English stop list to a contained base tokenizer factory.” |
    | Vowpal Wabbit (doc2lda) | 83 | Stopwords used in the LDA example. |
    | Text Analytics 101 | 85 | Minimal list compiled by Kavita Ganesan consisting of determiners, coordinating conjunctions and prepositions. http://text-analytics101.rxnlp.com/2014/10/all-about-stop-words-for-text-mining.html |
    | LexisNexis® | 100 | “The following are 'noise words' and are never searchable: EVER HARDLY HENCE INTO NOR WERE VIZ. Others are 'noisy keywords' and are searchable by enclosing them in quotes.” |
    | Okapi (gsl.cacm) | 108 | CACM-specific stoplist from Okapi. |
    | TextFixer | 119 | From textfixer.com. Linked from the Wiki page on stop words. |
    | DKPro | 127 | Postgresql (Snowball derived). |
    | Postgres | 127 | “Stop words are words that are very common, appear in almost every document, and have no discrimination value.” |
    | PubMed Help | 133 | Listed in PubMed Help pages. |
    | CoreNLP (Acronym) | 150 | A set of words that should be considered stopwords for the acronym matcher. |
    | NLTK | 153 | According to email: Van Rijsbergen (1979) "Information retrieval" (Butterworths, London). It's slightly expanded from postgres postgresql.txt, which was borrowed from snowball presumably. |
    | Spark ML lib | 153 | (Note: same as NLTK.) “They were obtained from postgres. The English list has been augmented.” |
    | MongoDB | 174 | Commit says 'Changed stop words files to the snowball stop lists'. |
    | Quanteda | 174 | Has SMART and Snowball default lists. Source |
    | Ranks NL (Default) | 174 | (Note: same as the default Snowball stoplist, but Ranks NL is frequently cited as the source.) “This list is used in [Ranks NL] Page Analyzer and Article Analyzer for English text, when you let it use the default stopwords list.” |
    | Snowball (Original) | 174 | Default Snowball stoplist. |
    | Xapian | 174 | (Note: uses Snowball stopwords.) “It has been traditional in setting up IR systems to discard the very commonest words of a language - the stopwords - during indexing.” |
    | R tm | 174 | The R tm package uses the snowball list and also has SMART. |
    | 99webTools | 183 | “Stop Words are words which do not contain important significance to be used in Search Queries. Most search engine filters these words from search query before performing search, this improves performance.” |
    | Deeplearning4J | 194 | DL4J stopwords are in 2 places: stopwords and stopwords.txt. Probably derived from snowball. Some unusual entries, e.g. ----s. |
    | Reuters Web of Science™ | 211 | “Stopwords are common, frequently used words such as articles (a, an, the), prepositions (of, in, for, through), and pronouns (it, their, his) that cannot be searched as individual words in the Topic and Title fields. If you include a stopword in a phrase, the stopword is interpreted as a word placeholder.” |
    | Function Words (Cook 1988) | 221 | “This list of 225 items was compiled for practical purposes some time ago as data for a computer parser for student English.” Paper |
    | Okapi (gsl.sample) | 222 | This Okapi is the BM25 Okapi. (Note: the included stopword text file is from all “F” and “H” terms, as defined by defs.h.) “The GSL file contains terms that are to be dealt with in a special way by the indexing process. Each type is defined by a class code.” |
    | Snowball (Expanded) | 227 | (Note: this includes the extra words mentioned in comments.) “An English stop word list. Many of the forms below are quite rare (e.g. 'yourselves') but included for completeness.” |
    | DataScienceDojo | 250 | Used in a real-time sentiment AzureML demo for a meetup. |
    | CoreNLP (stopwords.txt) | 257 | Note: "a", "an", "the", "and", "or", "but", "nor" are hardcoded in StopList.java; the list also includes punctuation (!!, -lrb- …). |
    | OkapiFramework | 262 | THIS IS NOT the Okapi of BM25! (At least I don't think so.) This list is used in the Okapi FRAMEWORK; this Okapi is the Localization and Translation Okapi. |
    | Azure Gallery | 310 | Slightly modified Glasgow list. |
    | ATIRE (NCBI Medline) | 313 | NCBI wrd_stop stop word list of 313 terms extracted from Medline. Its use is unrestricted. The list can be downloaded from here |
    | Go | 317 | Go stopwords library. This is the Glasgow list without 'computer', 'i', 'thick'; it has 'thickv'. |
    | scikit-learn | 318 | Uses the Glasgow list, but without the word “computer”. |
    | Glasgow IR | 319 | Linguistic resources from the Glasgow Information Retrieval group. Lots of copies and edits of this one. E.g. xpo6 has mistakes: it has a quote instead of 'lf' (herse" instead of herself), yet comes up as one of the top results in Google search. |
    | xpo6 | 319 | Used in the Humboldt Digital Library and Network and documented in a blog post. Likely derived from the Glasgow list. |
    | spaCy | 326 | Improved list from Stone, Denis, Kwantes (2010). Paper |
    | Gensim | 337 | Same as spaCy (improved list from Stone, Denis, Kwantes (2010)). |
    | Okapi (Expanded gsl.cacm) | 339 | Expanded CACM list from Okapi. |
    | C99 and TextTiling | 371 | UIMA wrapper for the Java implementations of the segmentation algorithms C99 and TextTiling, written by Freddy Choi. |
    | Galago (inquery) | 418 | The core/src/main/resources/stopwords/inquery list is the same as the Indri default. |
    | Indri | 418 | Part of the Lemur Project. |
    | Onix & Lextek | 429 | “This stopword list is probably the most widely used stopword list. It covers a wide number of stopwords without getting too aggressive and including too many words which a user might search upon. This wordlist contains 429 words.” |
    | GATE (Keyphrase Extraction) | 452 | Stopwords used in the GATE Keyphrase Extraction Algorithm. |
    | Zettair | 469 | “Zettair is a compact and fast text search engine designed and written by the Search Engine Group at RMIT University. It was once known as Lucy.” |
    | Okapi (Expanded gsl.sample) | 474 | Same as okapi_sample.txt but with “I” terms (not default Okapi behaviour, but may be useful). |
    | Taporware | 485 | TAPoRware Project, McMaster University. Modified Glasgow list; includes numbers 0 to 100 and 1990 to 2020 (for dates presumably), and also punctuation. |
    | Voyant (Taporware) | 488 | Voyant uses the taporware list by default; includes extra thou, thee, thy, presumably for the Shakespeare corpus. The Trombone repo also has Glasgow and SMART in resources. |
    | MALLET | 524 | Default MALLET stopword list. (Based on SMART I think.) See Docs |
    | Weka | 526 | Like Bow (Rainbow, which is SMART) but with extra ll, ve added to avoid words like you'll, I've etc. Almost exactly the same as mallet.txt. |
    | MySQL (MyISAM) | 543 | MyISAM and InnoDB use different stoplists. Taken from SMART but modified. |
    | Galago (rmstop) | 565 | Includes some punctuation, utf8 characters, www, http, org, net, youtube, wikipedia. |
    | Kevin Bougé | 571 | Multilang lists compiled by Kevin Bougé. English is SMART. |
    | SMART | 571 | SMART (System for the Mechanical Analysis and Retrieval of Text) is an information retrieval system developed at Cornell University in the 1960s. |
    | ROUGE | 598 | Extended SMART list used in the ROUGE 1.5.5 Summary Evaluation Toolkit; includes extra words: reuters, ap, news, tech, index, 3-letter days of the week and months. |
    | tonybsk_1.txt | 635 | Unknown origin - I lost the reference. |
    | Sphinx Search Ultimate | 665 | An extension for Sphinx has this list. |
    | Ranks NL (Large) | 667 | A very long list from ranks.nl. |
    | tonybsk_6.txt | 671 | Unknown origin - I lost the reference. |
    | Terrier | 733 | Terrier Retrieval Engine. “Stopword list to load can be loaded from the stopwords.filename property.” |
    | ATIRE (Puurula) | 988 | Included in ATIRE. See Paper |
    | Alir3z4 | 1298 | List of common stop words in various languages. The English list looks like it was merged from several sources. |
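
As a rough illustration of how lists like the ones above are typically used, the sketch below parses a stop word list, filters tokens against it, and diffs two lists (many entries in the table are edits of one another). It assumes a plain-text layout with one word per line, which is not confirmed for every file here. The function names and the tiny in-memory sample lists are hypothetical; the sample words are the ones quoted in the CoreNLP (stopwords.txt) row.

```python
from typing import Iterable, List, Set, Tuple


def parse_stopwords(text: str) -> Set[str]:
    """Parse a stop word list, assuming one word per line (assumed layout)."""
    words: Set[str] = set()
    for line in text.splitlines():
        word = line.strip().lower()
        if word:  # skip blank lines
            words.add(word)
    return words


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Keep only tokens whose lowercase form is not in the stop word set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


def diff_lists(a: Set[str], b: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (words only in a, words only in b), each sorted."""
    return sorted(a - b), sorted(b - a)


# Hypothetical in-memory samples standing in for two real list files.
BASE_LIST = "a\nan\nthe\nand\nor\nbut\nnor\n"
VARIANT_LIST = "a\nan\nthe\nand\nor\nbut\n"

base = parse_stopwords(BASE_LIST)
variant = parse_stopwords(VARIANT_LIST)

filtered = remove_stopwords("The cat and the hat".split(), base)
only_base, only_variant = diff_lists(base, variant)

print(filtered)      # ['cat', 'hat']
print(only_base)     # ['nor']
print(only_variant)  # []
```

Using sets keeps each membership test constant-time, and the same `diff_lists` call is a quick way to check claims such as "same as the Glasgow list but without one word" against the actual files.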

Sentiment analysis: sentiment

Classified dictionaries (thesauri): thesaurus

Censorship word lists (sensitive / banned words): censorship

NOTE

Other

Ruozhiba (Baidu Tieba)

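| Source | File | Type | Size |
| --- | --- | --- | --- |
| Best posts of the year, 2018~2021 | ruozhiba-post-annual.json | Posts | 1.3k |
| Moderator recommendations (as of 2023.04.30) | ruozhiba-title-good.json | Titles | 2.6k |
| Ordinary posts (as of 2023.04.30) | ruozhiba-title-norm.json | Titles | 81.7k |
| Selected interrogative sentences | Tencent Docs Ruozhiba collection https://docs.qq.com/sheet/DUlZ6aURhamdwb1RO?tab=BB08J2 | Titles | 2.4k |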
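
A minimal sketch of reading one of the JSON files above. The schema of these files is not documented here, so the code assumes the top level is a JSON array of records and fails loudly otherwise. `parse_records`, `load_records`, and the placeholder sample are hypothetical and only illustrate the pattern.

```python
import json
from pathlib import Path
from typing import Any, List


def parse_records(raw: str) -> List[Any]:
    """Parse JSON text, assuming a top-level array (unverified assumption)."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array, got {type(data).__name__}")
    return data


def load_records(path: str) -> List[Any]:
    """Read a file as UTF-8 (needed for Chinese text) and parse it."""
    return parse_records(Path(path).read_text(encoding="utf-8"))


# Placeholder content, not real data from the corpus.
SAMPLE = '["placeholder title 1", "placeholder title 2"]'
records = parse_records(SAMPLE)
print(len(records))  # 2
```

For a real file, `load_records("ruozhiba-title-good.json")` would be the equivalent call, provided the array assumption holds for that file.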