NLP IIT information retrieval

Information retrieval (lecture by ARNAB BHATTACHARYA)

  • Retrieval (finding) of information (e.g., documents) that is mostly unstructured (e.g., text) and is relevant to an information need
  • Tokenization is the process of breaking the text into terms.
    • Token normalization finds more documents
    • Increases recall but decreases precision
  • Stemming or lemmatization refers to stripping the word to its root or lemma:
    • e.g. “system”, “systems”, “systematic”
    • Requires morphological analysis and is language-specific
  • tf-idf weighting (term frequency times inverse document frequency): tf-idf(t, d) = tf(t, d) * idf(t)
    • idf(t) is high when t appears in a small number of documents
    • idf(t) is low when t appears in many documents
    • tf(t, d) is high when t appears many times in d
    • tf(t, d) is low when t appears only a few times in d
    • tf(t, d) is zero if t does not appear at all in d
  • Scalability
    • Filter query terms with very low idf
    • Cache scores of champion lists
    • Take the union of the top-r lists of every query term and pick the top-K from it
    • Build layers; each layer holds the docs whose tf for the term exceeds a threshold
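
The tf-idf weighting from the notes above can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the names `tokenize` and `tf_idf`, the lowercase regex tokenizer, the raw-count tf, and the idf variant log(N / df(t)) are all assumptions chosen for brevity, and the three sample documents are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, then split on non-alphanumerics.
    # Lowercasing is one simple form of token normalization.
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed idf variant: idf(t) = log(N / df(t)).
    idf = {t: math.log(n_docs / df[t]) for t in df}
    # Assumed tf variant: raw count of t in d.
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


# Hypothetical toy collection.
docs = [
    "the retrieval system ranks documents",
    "the system indexes the documents",
    "the cat sat",
]
weights = tf_idf(docs)
```

In this toy collection, "the" occurs in every document, so its idf (and therefore its weight) is zero, while "retrieval" occurs in only one document and gets a higher weight there than "system", which occurs in two.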

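The scalability ideas above (dropping low-idf query terms, and taking the union of per-term top-r champion lists before picking the top-K) can be sketched as below. This is only one possible reading of the notes: the function names, the `min_idf` cutoff, the choice to rank champions by raw tf, and the sample postings and idf values are all assumptions made for illustration.

```python
import heapq


def build_champion_lists(tf, r):
    """tf: {term: {doc_id: term frequency}} -> {term: r doc_ids with highest tf}."""
    return {
        term: heapq.nlargest(r, postings, key=postings.get)
        for term, postings in tf.items()
    }


def top_k(query_terms, tf, idf, champions, k, min_idf=0.0):
    # Filter out query terms whose idf is at or below the (assumed) cutoff.
    terms = [t for t in query_terms if idf.get(t, 0.0) > min_idf]
    # Candidates: union of the champion lists of the remaining terms.
    candidates = set()
    for t in terms:
        candidates.update(champions.get(t, []))
    # Score only the candidates with the sum of tf * idf over the terms.
    scores = {
        doc: sum(tf[t].get(doc, 0) * idf[t] for t in terms)
        for doc in candidates
    }
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical postings (term -> doc -> tf) and idf values.
tf = {
    "retrieval": {"d1": 3, "d2": 1, "d3": 2},
    "index": {"d2": 4, "d3": 1, "d4": 3},
    "the": {"d1": 5, "d2": 5, "d3": 5, "d4": 5},
}
idf = {"retrieval": 1.0, "index": 2.0, "the": 0.0}

champions = build_champion_lists(tf, r=2)
result = top_k(["the", "retrieval", "index"], tf, idf, champions, k=2)
```

Because only documents in the champion lists are scored, the result is an approximation: a document that is just outside every term's top-r list is never considered, even if its combined score would have been high.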