Web Information Paper Review (2)

1. Jon Kleinberg: Authoritative Sources in a Hyperlinked Environment, JACM 1999

Links: Link Analysis Algorithms: The HITS Algorithm

In this paper, the author presents a new search mechanism that was later named the HITS algorithm. HITS mainly addresses the problem that when a user queries a broad topic, far too many results are returned and most of them are not relevant enough. In comparison with purely text-based search mechanisms, HITS is a search algorithm based on both textual and link analysis.

The paper first introduces three kinds of query: specific queries, broad-topic queries, and similar-page queries. The issue with specific queries is that the search is too specific, which leads to a small number of results and makes the search engine's job harder. Broad-topic queries have the opposite issue: there are too many results, which lowers the efficiency and quality of the search, and relevance does not always match popularity. So a search engine based only on textual matching may give the user many useless results. The paper proposes two new roles for web pages: hubs and authorities. An authority page contains the keywords the user is searching for and also has high relevance and a strong reputation; for example, when searching for "Honda", Honda's official web page should be the top authority. A hub page contains links pointing to multiple relevant authorities; Google, for example, can be considered a hub page. A good hub is a page that points to many good authorities, and a good authority is a page that is pointed to by many good hubs. The HITS algorithm is built on this mutual reinforcement.
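Concretely, this mutual reinforcement can be written as a pair of update rules (notation mine, but the rules follow the paper): writing $a(p)$ for the authority weight of page $p$ and $h(p)$ for its hub weight,

$$a(p) \leftarrow \sum_{q:\; q \to p} h(q), \qquad h(p) \leftarrow \sum_{q:\; p \to q} a(q).$$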

The first step of HITS is to obtain a small, focused base set of web pages on which to run the link analysis; this set is much smaller than the whole web. HITS extracts such a subgraph as follows (sketched below): (1) First use traditional textual search to get a certain number of results as the root set; the root set has few pages, and most of them are strong authorities. (2) Then, based on this root set, HITS adds both the pages that the root pages point to and a bounded number of the pages that point to the root pages. (3) Finally, filter out intrinsic links (links between pages on the same domain) and keep only transverse links (links between different domains). With this method we create a smaller base collection on which HITS can perform the heavier link-based computation.
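A minimal sketch of this root-set expansion, assuming the root set and the link structure are already available as in-memory dictionaries (the function name and parameters are mine, not the paper's):

```python
def build_base_set(root_pages, out_links, in_links, page_domain, max_in=50):
    """Expand a text-search root set into the HITS base set.

    root_pages:  pages returned by an ordinary text search (step 1)
    out_links:   dict page -> list of pages it links to
    in_links:    dict page -> list of pages linking to it
    page_domain: dict page -> site/domain, used to drop intrinsic links
    """
    base = set(root_pages)
    for p in root_pages:
        base.update(out_links.get(p, []))          # step (2): pages the root points to
        base.update(in_links.get(p, [])[:max_in])  # ... plus a bounded number pointing in
    # step (3): keep only transverse links (different domains); drop intrinsic ones
    edges = {(p, q) for p in base for q in out_links.get(p, [])
             if q in base and page_domain[p] != page_domain[q]}
    return base, edges
```

In the paper, the number of in-linking pages added per root page is bounded by a parameter; `max_in` stands in for that bound here.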

The second step is to iteratively recompute the authority and hub weights. The authority weight of a page is the sum of the hub weights of all pages pointing to it, and the hub weight of a page is the sum of the authority weights of all pages it points to. So a hub that points to many good authorities gets a higher hub weight, and similarly for the authority weight. After each update, the hub and authority weights are normalized. Repeating this for every page in the subgraph, the weights converge to fixed values, so every page ends up with an evaluation score, and the pages are finally returned to the user ranked by authority weight. This procedure effectively pulls the subgraph apart into a bipartite-like picture, with hubs on one side and authorities on the other, so HITS can separate hubs from authorities. The convergence of this iteration is also proved in the paper. HITS can also be applied to the third kind of search, similar-page search, where it returns better results than purely textual search.
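A minimal sketch of this iteration in Python with numpy; the adjacency matrix, iteration count, and toy graph are illustrative assumptions, not from the paper:

```python
import numpy as np

def hits(adj, iterations=50):
    """Run the HITS iteration on a directed graph.

    adj[i, j] = 1 if page i links to page j.
    Returns (authority, hub) weight vectors, each normalized.
    """
    n = adj.shape[0]
    authority = np.ones(n)
    hub = np.ones(n)
    for _ in range(iterations):
        # authority of a page = sum of hub weights of the pages linking to it
        authority = adj.T @ hub
        # hub weight of a page = sum of authority weights of the pages it links to
        hub = adj @ authority
        # normalize so the weights converge instead of growing without bound
        authority /= np.linalg.norm(authority)
        hub /= np.linalg.norm(hub)
    return authority, hub

# Toy example: pages 0, 1, and 3 all link to page 2.
A = np.array([[0, 0, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 1, 0]])
auth, hub = hits(A)
print(auth.round(3), hub.round(3))  # page 2 gets the highest authority weight
```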

HITS also acts as a kind of classifier, separating pages into hubs and authorities, and it can be applied to any other domain that can be modeled as a directed graph. The author mentions several such areas, including social networks and publication citations.

This paper taught me about link-based search algorithms. Compared with objective textual search, link-based ranking reflects the more subjective judgments that page authors encode in their links. The paper also gave me a better understanding of the web: everything on the web can be considered information, so search can be based not only on the actual text but also on the relationships between pages.


2. Scott Deerwester et al.: Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, Vol. 41, No. 6, 1990

Links: Latent Semantic Analysis note, Singular Value Decomposition

This paper presents a new mechanism called Latent Semantic Analysis (LSA) to study the relationships between documents and applies it to web information retrieval. Unlike traditional analysis of text, LSA studies the latent relationships between indexing terms across all documents. This method returns more reliable search results when a query has little overlap with the original indexing terms, and it also reduces noise.

The paper first states the drawbacks of traditional retrieval methods. The major issue is the existence of synonymy and polysemy: the superficial meaning understood by the computer is sometimes different from what a human actually means, so latent semantic relationships become important for helping the computer understand human text. Current search engines have not indexed every word in real-world use, and they are bad at handling synonymy and polysemy in large collections of documents. The paper shows an example demonstrating that the current search mechanism produces many mismatches. The aim of LSA is to capture the latent meaning of what the user is searching for and to exclude cases where the words match but the meaning does not.

LSA uses two-mode factor analysis as the representation model for both documents and queries. Two-mode factor analysis meets three requirements: (1) the representational power is adjustable, (2) it gives an explicit representation of both terms and documents, and (3) it is computable for large databases. The latent analysis itself is based on singular-value decomposition (SVD). SVD decomposes a (possibly non-square) matrix into the product of three matrices; the middle matrix is diagonal and contains the singular values of the original M×N matrix. The singular values capture the dominant structure of the original matrix, and LSA works by analyzing and truncating the singular values of the original representation matrix.
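As a quick illustration of the decomposition (the matrix below is a toy example, not the paper's data), numpy's SVD returns the three factors directly:

```python
import numpy as np

# A small 4x3 term-document matrix (toy values only)
X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

# X = T @ diag(s) @ D^T, with T and D orthonormal and s the singular values
T, s, Dt = np.linalg.svd(X, full_matrices=False)
print(s)                                    # singular values, largest first
print(np.allclose(T @ np.diag(s) @ Dt, X))  # True: the factors reconstruct X
```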

The procedure of LSA is: (1) Construct the term-document matrix from the document collection. The terms should be the frequently used terms, and we record the frequency of each term in each document; in this way we obtain the term-document matrix. We can notice that there are some synonyms in the term collection: for example, "computer" is related to "system", and "human" is related to "user". These pairs follow the assumption of LSA: if a pair of terms tends to appear or not appear together across a collection of documents, the two terms are considered latently related. "Computer" and "system" follow this pattern in 6 of the 9 documents in the paper's example. We can also quantify the similarity between two terms by taking the dot product of their row vectors. (2) Apply SVD to this matrix, discard the smallest singular values in the middle diagonal matrix (keeping only the k largest), and shrink the other two factors accordingly. By reducing the number of singular values we reduce the noise, i.e., the interference in the original matrix. Then we reconstruct a new term-document matrix by multiplying the truncated factors back together. In the reconstructed matrix, latent relationships such as "human"–"user" have been strengthened: in the column vector for document c2, although c2 does not contain the word "human", it still has a weight of 0.40 for "human", because the document contains "user". At this point we have the latent semantic matrix and we can apply the user's query to it. (3) Take the user's query and convert the query string into the same representation as the term-document matrix, i.e., a vector over the indexing terms; the query is folded into the reduced space by multiplying its term-frequency vector by the term matrix and the inverse of the singular-value matrix obtained from the SVD of the original term-document matrix. (4) Finally, compute the angle (cosine similarity) between the converted query vector and each document vector, and return the N documents with the smallest angles as the search results.
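Below is a compact sketch of this pipeline in Python with numpy. The fold-in step follows the paper's description of multiplying the query's term-frequency vector by the term matrix and the inverse singular-value matrix, but the function name, the value of k, and the toy data are my own assumptions:

```python
import numpy as np

def lsa_search(X, query_vec, k=2, top_n=3):
    """Rank documents for a query using truncated SVD (LSA).

    X:         term-document matrix (terms x documents)
    query_vec: term-frequency vector of the query (length = number of terms)
    k:         number of singular values kept
    """
    # (2) SVD and truncation: keep only the k largest singular values
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    Tk, sk, Dk = T[:, :k], s[:k], Dt[:k, :].T        # Dk: documents x k

    # (3) fold the query into the reduced space: q_hat = q^T T_k S_k^{-1}
    q_hat = query_vec @ Tk @ np.diag(1.0 / sk)

    # (4) cosine similarity between the query and each document in the reduced
    # space; both are scaled by S_k before comparison
    q_coord = q_hat * sk
    doc_coords = Dk * sk
    sims = (doc_coords @ q_coord) / (
        np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_coord) + 1e-12)
    return np.argsort(-sims)[:top_n]                 # best-matching document indices

# Toy term-document matrix (terms x documents); values are illustrative only
X = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 1., 1.],
              [0., 0., 1., 2.]])
query = np.array([1., 1., 0., 0.])   # a query mentioning the first two terms
print(lsa_search(X, query))
```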

From this paper I learned about latent semantic analysis, which can reveal the latent relationships between documents. This can help a search engine return better and more user-friendly results, and it gave me a deeper understanding of text analysis.
