SCHOLARLY ARTICLE RECOMMENDATION


The scholarly literature is expanding at a rate that necessitates intelligent algorithms for search and navigation. For the most part, the problem of delivering scholarly articles has been solved. The navigational aspect of scientific search (finding relevant, influential articles that one does not know exist) is still in its early development. Our scholarly article recommendation system suggests articles to users based on citations, venues, authors and document similarity.

MOTIVATION

When we begin research on a project or topic, we need to find in-depth documents in our chosen research field. It is helpful to have a system that recommends papers related to a seed paper. We found this problem interesting because every student gains knowledge from an article and, in most cases, ends up with a few unanswered questions. Our system helps students quickly find related articles and learn more.



ALGORITHMS IMPLEMENTED


Article Ranking Using Author/ Venue and Citation Information

According to this algorithm, citation count is not the only factor that determines the importance of a paper; other information, such as authors and venues, is also relevant. This implementation ranks publications using information about papers, authors, and venues. The fundamental idea is that good authors write good papers, and good papers are not only cited by many others but also published in good journals or conferences.

K Means Clustering

The abstract and title contain the most important information about a paper, so we clustered the documents with the k-means algorithm using the title and abstract text. This lets us recommend papers to users based on document similarity.

Article Recommendation Using Eigen Factor

This algorithm recommends articles based on the citation information available for each article. It is a modified version of PageRank. Papers are recommended based on hierarchical clustering and Article Level Eigen Factor scores.

DATA SET

We use the AMiner-Paper.rar data set.

https://aminer.org/billboard/aminernetwork

This data set contains:
#index ---- index id of this paper
#* ---- paper title
#@ ---- authors (separated by semicolons)
#o ---- affiliations (separated by semicolons, and each affiliation corresponds to an author in order)
#t ---- year
#c ---- publication venue
#% ---- the id of references of this paper (there are multiple lines, with each indicating a reference)
#! ---- abstract
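
The field markers above can be read with a small parser. The following is a minimal sketch, not the project's actual loader: it assumes each field sits on its own line, that a new record starts at every `#index` line, and that missing fields (common in this noisy data set) should default to empty values. The function name `parse_aminer` and the dictionary keys are hypothetical.

```python
def parse_aminer(lines):
    """Yield one dict per paper from AMiner-Paper style lines (assumed layout)."""
    paper = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("#index"):
            if paper is not None:
                yield paper
            # Start a new record with empty defaults for missing fields.
            paper = {"index": line[len("#index"):].strip(), "title": "",
                     "authors": [], "affiliations": [], "year": "",
                     "venue": "", "refs": [], "abstract": ""}
        elif paper is None:
            continue  # skip anything before the first #index line
        elif line.startswith("#*"):
            paper["title"] = line[2:].strip()
        elif line.startswith("#@"):
            paper["authors"] = [a.strip() for a in line[2:].split(";") if a.strip()]
        elif line.startswith("#o"):
            paper["affiliations"] = [a.strip() for a in line[2:].split(";") if a.strip()]
        elif line.startswith("#t"):
            paper["year"] = line[2:].strip()
        elif line.startswith("#c"):
            paper["venue"] = line[2:].strip()
        elif line.startswith("#%"):
            ref = line[2:].strip()
            if ref:
                paper["refs"].append(ref)  # one reference id per #% line
        elif line.startswith("#!"):
            paper["abstract"] = line[2:].strip()
    if paper is not None:
        yield paper


sample = [
    "#index 1",
    "#* A sample title",
    "#@ Alice;Bob",
    "#t 2010",
    "#c Sample Venue",
    "#% 7",
    "#% 9",
    "",
    "#index 2",
    "#* Paper without authors",
]
papers = list(parse_aminer(sample))
```

Records with no author, venue or title simply keep their empty defaults, so later stages can filter them explicitly.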




CHALLENGES

    • The data set is noisy. Articles with no author, publication or title information appeared at every stage and caused jobs to fail. We added handling for each of these cases as we encountered them.

    • The matrix dot product in the citation-based system was challenging because we started with a brute-force approach. Its computation time and memory use were very large, and it failed on the actual data. We then developed an optimised approach to handle it efficiently. The map equation was complex, so we used an existing API to implement it.


IMPLEMENTATION

    Article Ranking Using Author/ Venue and Citation Information


    This implementation ranks articles using information about papers, authors, and venues. The initial paper score is determined using the following equation.
    Next, we compute the scores for each author and venue from the paper scores:
    Ap - the author score, computed by averaging the scores of the author's published papers.
    Vp - the venue score, computed from the scores of the papers published at the venue.
    Av - the author score obtained by averaging the scores of the venues in which the author has published.
    Ar - the refined author score, obtained by averaging the scores of the author's venues and published papers.

    Finally, we compute the paper score based on the following equations, which consider citation, venue and author information.
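
The intermediate scores Ap, Vp, Av and Ar can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: the initial paper scores are taken as given inputs, Vp is assumed to be a plain average, Av is averaged over the distinct venues of an author, and Ar is assumed to be the mean of Ap and Av. The function name and input layout are hypothetical, and the final paper-score combination is not shown.

```python
from collections import defaultdict
from statistics import mean


def author_venue_scores(papers):
    """papers: dict id -> {"score": float, "authors": [str], "venue": str}."""
    scores_by_author = defaultdict(list)
    scores_by_venue = defaultdict(list)
    venues_of_author = defaultdict(set)
    for p in papers.values():
        scores_by_venue[p["venue"]].append(p["score"])
        for a in p["authors"]:
            scores_by_author[a].append(p["score"])
            venues_of_author[a].add(p["venue"])

    # Ap: average score of the author's papers.
    Ap = {a: mean(s) for a, s in scores_by_author.items()}
    # Vp: average score of the papers published at the venue (assumed average).
    Vp = {v: mean(s) for v, s in scores_by_venue.items()}
    # Av: average Vp over the distinct venues the author published in.
    Av = {a: mean([Vp[v] for v in vs]) for a, vs in venues_of_author.items()}
    # Ar: refined author score, assumed here to be the mean of Ap and Av.
    Ar = {a: (Ap[a] + Av[a]) / 2 for a in Ap}
    return Ap, Vp, Av, Ar


example = {
    "p1": {"score": 4.0, "authors": ["A", "B"], "venue": "V1"},
    "p2": {"score": 2.0, "authors": ["A"], "venue": "V2"},
    "p3": {"score": 6.0, "authors": ["B"], "venue": "V1"},
}
Ap, Vp, Av, Ar = author_venue_scores(example)
```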






    RESULTS



    K Means Algorithm

    The abstract and title contain the most important information about a paper, so we clustered the documents with the k-means algorithm using the title and abstract text:
      1. We computed TF-IDF scores for each document.
      2. We selected the initial centers randomly from the document vectors.
      3. We ran k-means until the centers converged, then assigned each document to its cluster.
      4. When a user searches for a document, we locate its cluster and compute the distance between the searched document and the other documents in that cluster.
      5. Finally, we recommend the closest documents to the user.
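
The steps above can be sketched in plain Python. This is a small in-memory illustration, not the project's distributed implementation: it assumes whitespace tokenisation, a simple TF-IDF weighting (term frequency times log(N/df)), and Euclidean distance. All function names are hypothetical, and the `init` parameter exists only so the example can be made repeatable; by default the centers are drawn at random as described.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(t for tokens in tokenized for t in set(tokens))
    return [{t: (c / len(tokens)) * math.log(n / df[t])
             for t, c in Counter(tokens).items()} for tokens in tokenized]


def distance(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def mean_vector(vectors):
    total = Counter()
    for v in vectors:
        for k, w in v.items():
            total[k] += w
    return {k: w / len(vectors) for k, w in total.items()}


def kmeans(vectors, k, init=None, max_iter=100, seed=0):
    """Return a cluster label per vector; centers start at randomly chosen documents."""
    rng = random.Random(seed)
    idx = init if init is not None else rng.sample(range(len(vectors)), k)
    centers = [dict(vectors[i]) for i in idx]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: distance(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable, so centers have converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = mean_vector(members)
    return labels


def recommend(query, vectors, labels, top_n=3):
    """Indices of the closest documents in the same cluster as the query."""
    same = [i for i, l in enumerate(labels) if l == labels[query] and i != query]
    return sorted(same, key=lambda i: distance(vectors[query], vectors[i]))[:top_n]


docs = [
    "citation graph ranking of scholarly papers",
    "ranking papers with a citation graph",
    "protein folding in cell biology",
    "cell biology and protein structure",
]
vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2, init=[0, 2])
```

Restricting the distance computation to one cluster is what keeps a search cheap: only the documents that share the query's cluster are compared.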

    Article Recommendation Using Eigen Factor

    (1) Assemble Citation Network

    The first step is to assemble the citation graph for the whole corpus. In each line, the first column is the citing paper ID and the second column lists the cited paper IDs, separated by ###.
    Format: 1083734 197394###220708###387427
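
Lines in this format can be loaded into an adjacency list as in the following sketch. It assumes the two columns are separated by whitespace and that the cited IDs are joined with `###`, as in the example line; the function names are hypothetical.

```python
def parse_citation_line(line, sep="###"):
    """Split 'citing cited1###cited2...' into (citing_id, [cited_ids])."""
    parts = line.split(None, 1)  # first whitespace separates the two columns
    citing = parts[0]
    cited = [c.strip() for c in parts[1].split(sep) if c.strip()] if len(parts) > 1 else []
    return citing, cited


def build_citation_graph(lines):
    """Return dict citing_id -> list of cited ids, skipping blank lines."""
    graph = {}
    for line in lines:
        if not line.strip():
            continue
        citing, cited = parse_citation_line(line.strip())
        graph.setdefault(citing, []).extend(cited)
    return graph


graph = build_citation_graph(["1083734 197394###220708###387427", ""])
```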

    (2) Rank Node

    Next, we rank each article using the article-level Eigen Factor (ALEF) algorithm. ALEF is a modified version of PageRank: when PageRank approaches are applied to acyclic citation graphs, older papers are weighted excessively. The ALEF algorithm consists of five steps; the main computations are:

    a. Calculating the teleportation weight: the teleportation weight w_i for each node i is calculated by summing its in-citations and out-citations.

    b. Forming the row-stochastic matrix H_ij: the citation matrix Z_ij is row-normalized so that the sum of each row i equals 1. We call the result the row-stochastic matrix H_ij.

    c. Calculating the article-level Eigen Factor: the ALEF scores are calculated by multiplying w_i by H_ij and normalizing the scores by the number of papers, n, in the corpus.
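
The three computations listed here can be sketched as follows. This covers only the teleportation weights, the row normalization and the final product as described in this section, not the complete ALEF algorithm. It assumes Z_ij = 1 when paper i cites paper j, that papers with no outgoing citations keep an all-zero row, and that the weights w_i are raw counts; the function name is hypothetical.

```python
from collections import defaultdict


def alef_scores(graph):
    """graph: dict citing_id -> list of cited ids. Returns dict id -> score."""
    nodes = set(graph)
    for cited in graph.values():
        nodes.update(cited)
    n = len(nodes)

    # Teleportation weight w_i = in-citations + out-citations of node i.
    out_deg = {i: len(set(graph.get(i, []))) for i in nodes}
    in_deg = defaultdict(int)
    for i in graph:
        for j in set(graph[i]):
            in_deg[j] += 1
    w = {i: in_deg[i] + out_deg[i] for i in nodes}

    # H_ij = Z_ij / (row sum of Z_i); accumulate (w * H)_j without a dense matrix.
    scores = {i: 0.0 for i in nodes}
    for i in graph:
        targets = set(graph[i])
        for j in targets:
            scores[j] += w[i] / len(targets)

    # Normalize by the number of papers in the corpus.
    return {i: s / n for i, s in scores.items()}


scores = alef_scores({"1": ["2", "3"], "2": ["3"]})
```

Iterating over the adjacency list instead of building the full n-by-n matrix keeps memory proportional to the number of citations, which matters for a corpus of this size.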

    (3) Clustering the nodes hierarchically

    We cluster the nodes hierarchically using the Map equation API.

    (4) Recommendation Selection

    At this point we have the Eigen Factor score of each node and the clusters produced by the Map equation. Whenever a user searches for a document, we determine the cluster in which the document is located. We then rank the documents in that cluster by their Eigen Factor scores and display them to the user.
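
The selection step can be sketched as a lookup followed by a sort. This is a minimal illustration that assumes a flat cluster label per document (the hierarchical output would need to be cut at one level first) and a precomputed score per document; the function name and both input dictionaries are hypothetical.

```python
def recommend_by_cluster(query, cluster_of, scores, top_n=10):
    """Return the highest-scoring documents that share the query's cluster."""
    target = cluster_of[query]
    peers = [d for d, c in cluster_of.items() if c == target and d != query]
    # Documents without a score are treated as having score 0.0.
    return sorted(peers, key=lambda d: scores.get(d, 0.0), reverse=True)[:top_n]


cluster_of = {"a": 1, "b": 1, "c": 1, "d": 2}
alef = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.7}
top = recommend_by_cluster("a", cluster_of, alef, top_n=2)
```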






    OUR TEAM & CONTRIBUTIONS

    YUVARAJ SUNDARRAJAN (800903707) : K-Means algorithm, search feature to find similar documents, web page design
    LAKSHMI ISWARYA KESARI (800972717) : Citation graph, article score computation based on author, venue and citation information
    VALENTINA PALGHADMAL (800966694) : Implementation of the article level eigen factor algorithm, Map equation, Expert Level Search


    PERFORMANCE

    Article Ranking Using Author/ Venue and Citation Information: The articles are ranked statically and the scores are stored in HDFS. Storing the initial weight of each node required O(n) space. When a user types a search query using the title of a document, the citations of that document are displayed, ranked according to their paper scores.
    K Means Clustering: We cluster the documents and store the result in a mapping file that contains each document name and its cluster. Whenever a user searches by title, we locate the cluster in which the document is located and compute the distance between the searched document and all other documents in that cluster. Finally, we display the most similar documents to the user. The k-means results are not always accurate, since each author has their own style of writing; incorporating semantic information is a possible improvement.

    REFERENCES

    [1] J. D. West, I. Wesley-Smith, and C. T. Bergstrom, “A recommendation system based on hierarchical clustering of an article-level citation network.”

    [2] M. Rosvall and C. T. Bergstrom, “Maps of random walks on complex networks reveal community structure,” Proceedings of the National Academy of Sciences, vol. 105, no. 4, pp. 1118–1123, 2008.

    [3] M. Rosvall, D. Axelsson, and C. T. Bergstrom, “The map equation,” European Physical Journal: Special Topics, vol. 178, no. 1, pp. 13–23, 2009.

    [4] M.-H. Feng, K.-H. Chan, and H.-Y. Chen, “An Efficient Solution to Reinforce Paper Ranking using Author/Venue/Citation Information - The Winner’s Solution for WSDM Cup 2016.”

    [5] F. Fouss and M. Saerens, “Evaluating performance of recommender systems: An experimental comparison.”