NLP IIT information retrieval
31 Dec 2017
Information retrieval (lecture by ARNAB BHATTACHARYA)
- Retrieval (finding) of information (e.g., documents) that is mostly unstructured (e.g., text) and is relevant to
- Tokenization is the process of breaking the text into terms.
- Token normalization maps variant forms of a token to a single term, so a query matches more documents. This increases recall but can decrease precision.
- Stemming or lemmatization refers to reducing a word to its root (stem) or lemma:
- e.g. “system”, “systems”, “systematic”
- Requires morphological analysis and is language-specific
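The three preprocessing steps above can be sketched in a few lines. This is only a toy illustration: the regular expression, the `SUFFIXES` list, and the function names are assumptions made for this example, and the suffix stripper is a crude stand-in for a real stemmer, not an implementation of any published algorithm.

```python
import re

# Hypothetical suffix list, chosen only to cover the example words above.
SUFFIXES = ("atic", "s")

def tokenize(text):
    """Break the text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(token):
    """Token normalization, here reduced to case folding."""
    return token.lower()

def crude_stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def terms(text):
    """Tokenize, normalize, then stem every token."""
    return [crude_stem(normalize(t)) for t in tokenize(text)]
```

With this sketch, `terms("System, systems; SYSTEMATIC")` maps all three words to the same term, which is exactly why such steps raise recall: a query for one form now matches documents containing the others.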
- Inverse document frequency (idf) and the tf-idf weight:
tf-idf(t, d) = tf(t, d) * idf(t)
- idf(t) is high when t appears in a small number of documents
- idf(t) is low when t appears in many documents
- tf(t, d) is high when t appears many times in d
- tf(t, d) is low when t appears only a few times in d
- tf-idf(t, d) is zero if t does not appear at all in d
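A minimal sketch of the tf-idf weight follows. The notes do not give a formula for idf, so this example assumes the common choice idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t; tf is taken as the raw count. The function names are illustrative.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Raw term frequency: how many times term occurs in the document."""
    return Counter(doc_tokens)[term]

def idf(term, docs):
    """Assumed form: log(N / df). Returns 0.0 for a term in no document."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc_tokens, docs):
    """tf-idf(t, d) = tf(t, d) * idf(t)."""
    return tf(term, doc_tokens) * idf(term, docs)
```

For example, with `docs = [["a", "b", "a"], ["a", "c"], ["a", "d"]]`, the term `"a"` occurs in every document, so its idf (and hence its tf-idf) is zero everywhere, while `"b"` gets a positive weight in the first document and zero in the others.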
- Scalability
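- Filter out query terms with very low idf
- Cache scores in champion lists (for each term, the top-r documents by weight)
- Take the union of the top-r lists of all query terms and pick the top-K from it
- Build layers (tiers): each layer holds the documents whose tf for the term exceeds that layer's threshold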
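The idf filter and the champion-list steps above might look roughly like the sketch below. It assumes the per-term document weights (for instance tf-idf) are already computed and stored as `{term: {doc_id: weight}}`; that layout, the `min_idf` cutoff, and the function names are hypothetical, and a document's score is taken to be the sum of its weights over the kept query terms.

```python
import heapq

def build_champion_lists(weights, r):
    """For each term, keep the r doc ids with the highest weight."""
    return {
        term: heapq.nlargest(r, postings, key=postings.get)
        for term, postings in weights.items()
    }

def top_k(query_terms, weights, champions, idf, k, min_idf=0.0):
    """Approximate top-k: score only the union of the champion lists."""
    # Drop query terms whose idf is too low to discriminate between docs.
    kept = [t for t in query_terms
            if t in champions and idf.get(t, 0.0) > min_idf]
    # Candidate set: union of the top-r lists of the remaining terms.
    candidates = set()
    for t in kept:
        candidates.update(champions[t])
    # Score each candidate as the sum of its weights over the kept terms.
    scores = {d: sum(weights[t].get(d, 0.0) for t in kept)
              for d in candidates}
    return heapq.nlargest(k, scores, key=scores.get)
```

The result is approximate by design: a document outside every champion list is never scored, so r has to be chosen large enough that the union usually contains at least K documents.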