IR chapter 1: Boolean retrieval


Information retrieval

meaning

Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).

keywords: unstructured, large scale. Free-text search offers a more natural and acceptable form of human-machine interaction than daunting database-style searching, but it also makes data organization and query processing more challenging. (In fact, almost no data is truly unstructured.)

IR also covers supporting users in browsing or filtering document
collections, or in further processing a set of retrieved documents.

scale

  • web search
    billions of documents stored on millions of computers
    gather the documents to be indexed
    build systems that work efficiently at this scale
    exploit hypertext
    avoid being fooled by site providers who manipulate page content to boost their rankings
  • personal information retrieval
    search built into the operating system, e.g. Spotlight, Instant Search
    email programs, which provide both search and text classification
  • enterprise, institutional, and domain-specific search

An example information retrieval problem

Which plays in Shakespeare's collected works contain the words Brutus AND Caesar AND NOT Calpurnia?

grep

(A linear scan is not enough when we need to process larger collections quickly, support more flexible queries, or return ranked results.)

incidence matrix

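term-document incidence matrix for Shakespeare's collected works
query processing: take the bit vector of each term, complement the vector of any negated term, and combine them with a bitwise AND

extremely sparse: in a realistic collection almost every entry is 0, so it is far better to record only the 1s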
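
The query processing described above can be sketched in a few lines of Python. This is only an illustration: the four documents d1 to d4 and their texts are made-up toy data, not the real plays, and the function names are hypothetical. Each term's row of the matrix is stored as an integer bitmask.

```python
def build_incidence(docs):
    """Return (doc_ids, matrix) where matrix[term] is a bitmask:
    bit i is set if the term occurs in doc_ids[i]."""
    doc_ids = sorted(docs)
    matrix = {}
    for i, doc_id in enumerate(doc_ids):
        for term in set(docs[doc_id].lower().split()):
            matrix[term] = matrix.get(term, 0) | (1 << i)
    return doc_ids, matrix


def brutus_and_caesar_not_calpurnia(doc_ids, matrix):
    """Answer the query Brutus AND Caesar AND NOT Calpurnia."""
    all_docs = (1 << len(doc_ids)) - 1          # a vector of all 1s
    not_calpurnia = all_docs & ~matrix.get("calpurnia", 0)
    hits = matrix.get("brutus", 0) & matrix.get("caesar", 0) & not_calpurnia
    return [d for i, d in enumerate(doc_ids) if (hits >> i) & 1]


# Toy collection (assumed for illustration only).
toy_docs = {
    "d1": "brutus killed caesar",
    "d2": "caesar and calpurnia and brutus",
    "d3": "caesar alone",
    "d4": "brutus praised caesar",
}
doc_ids, matrix = build_incidence(toy_docs)
print(brutus_and_caesar_not_calpurnia(doc_ids, matrix))  # ['d1', 'd4']
```

Because the matrix is so sparse, storing a full row per term wastes space on real collections, which is the motivation for the inverted index.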

terminology

  • boolean retrieval model
    a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms.
  • term
    the indexed unit, usually a word; the smallest element we treat as a member of the vocabulary
  • document
    units we have decided to build a retrieval system over
  • collection/corpus
    the group of documents
  • information need
    the topic about which the user desires to know more.
  • query
    what the user conveys to the computer in an attempt to communicate the information need.
  • relevant
    a document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need.
  • effectiveness
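    the quality of a system's search results
the four outcomes: true positive (TP, relevant and retrieved), false positive (FP, nonrelevant but retrieved), false negative (FN, relevant but not retrieved), true negative (TN, nonrelevant and not retrieved)
  • precision
    TP/(TP+FP): the fraction of the returned results that are relevant
  • recall
    TP/(TP+FN): the fraction of the relevant documents that are returned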
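
As a small worked example of the two formulas, the sketch below computes precision and recall from a set of retrieved document IDs and a set of relevant ones. The function name and the sample IDs are assumptions made for illustration; returning 0.0 for an empty denominator is one possible convention, not the only one.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Four documents returned, three relevant overall, two in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)  # precision is 2/4, recall is 2/3
```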

inverted index/inverted file/index

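part of the inverted index for Shakespeare's collected works
  • vocabulary/lexicon
    the set of terms
  • dictionary
    the data structure that holds the terms
  • posting
    each item in a postings list, recording that a term appeared in a document
  • postings list
    the list of postings for one term
  • postings
    all the postings lists taken together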

a first take at building an inverted index

  1. collect documents to be indexed
  2. tokenize the text, turning each document into a list of tokens
  3. do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
  4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
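notes on step 4
  • storage
    in memory: linked lists or variable-length arrays, or a hybrid (a linked list of fixed-length arrays for each term); on disk: each postings list stored as a contiguous run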
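
The four steps can be sketched roughly as follows. This is a minimal in-memory version under simplifying assumptions: tokenization is a plain regular expression, the only linguistic preprocessing is lowercasing, and each postings list is an ordinary sorted Python list. The function names and the three toy documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Step 2: split a document into tokens (very crude)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(token):
    """Step 3: stand-in for linguistic preprocessing (lowercasing only)."""
    return token.lower()


def build_inverted_index(docs):
    """Step 4: map each term to a sorted list of the doc IDs it occurs in.

    docs is a dict {doc_id: text}, i.e. the collection from step 1.
    """
    # Collect unique (term, doc_id) pairs, then sort by term and doc ID.
    pairs = sorted(
        {(normalize(tok), doc_id)
         for doc_id, text in docs.items()
         for tok in tokenize(text)}
    )
    index = defaultdict(list)
    for term, doc_id in pairs:
        index[term].append(doc_id)
    return dict(index)


toy_docs = {
    1: "Brutus killed Caesar",
    2: "Caesar was ambitious",
    3: "Brutus was noble",
}
index = build_inverted_index(toy_docs)
print(index["brutus"])  # [1, 3]
print(index["caesar"])  # [1, 2]
```

Here the keys of `index` play the role of the dictionary and the lists play the role of the postings. The length of each list is the term's document frequency.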

processing boolean queries

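  • simple conjunctive query
    e.g. Brutus AND Calpurnia: intersect the two postings lists with the merge algorithm, which walks both sorted lists in time linear in their total length
  • query optimization
    process terms in order of increasing document frequency, so that intermediate results stay small
algorithm for conjunctive queries
  • asymmetric: the intermediate result is never longer than the shortest list seen so far, and is often much shorter than the next list
  • when the difference in list lengths is large, alternatives to the linear merge become attractive, such as binary search in the longer list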
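
A rough Python sketch of these ideas is given below. It assumes postings lists are sorted lists of integer doc IDs held in a dict; the function names and the toy postings are made up for illustration. `intersect` is the linear merge, `intersect_terms` processes the shortest lists first, and `intersect_skewed` is one possible alternative for very unequal lengths, using binary search from the standard `bisect` module.

```python
from bisect import bisect_left


def intersect(p1, p2):
    """Linear merge of two sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(index, terms):
    """AND together several terms, shortest postings list first."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        if not result:          # nothing left to match, stop early
            break
        result = intersect(result, postings)
    return result


def intersect_skewed(short, long):
    """Binary-search each posting of the short list in the long list."""
    answer = []
    lo = 0
    for doc_id in short:
        lo = bisect_left(long, doc_id, lo)
        if lo == len(long):
            break
        if long[lo] == doc_id:
            answer.append(doc_id)
    return answer


# Toy postings (assumed for illustration only).
toy_index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect(toy_index["brutus"], toy_index["calpurnia"]))            # [2, 31]
print(intersect_terms(toy_index, ["brutus", "caesar", "calpurnia"]))     # [2]
```

The merge does O(x + y) comparisons for lists of length x and y, while the binary-search variant does roughly x log y work for a short list of length x, which only pays off when x is much smaller than y.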

The extended Boolean model versus ranked retrieval

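  • ranked retrieval model
    such as the vector space model, in which users mostly type free text queries instead of building expressions with operators, and the system decides which documents best satisfy the query
  • the extended Boolean model
    adds further operators to the basic Boolean model, e.g. a proximity operator, which specifies that two terms in a query must occur close to each other in a document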
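
To make the proximity operator concrete, here is a minimal sketch that checks whether two terms occur within k words of each other in a single tokenized document. The function name `within_k` and the sample sentence are assumptions for illustration. A real system would answer this from the index instead of scanning the document text.

```python
def within_k(tokens, term1, term2, k):
    """True if some occurrence of term1 is at most k positions from term2."""
    pos1 = [i for i, t in enumerate(tokens) if t == term1]
    pos2 = [i for i, t in enumerate(tokens) if t == term2]
    return any(abs(i - j) <= k for i in pos1 for j in pos2)


tokens = "the noble brutus hath told you caesar was ambitious".split()
print(within_k(tokens, "brutus", "caesar", 4))  # True  (positions 2 and 6)
print(within_k(tokens, "brutus", "caesar", 3))  # False
```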