Learning Optimized Index Structures: From Traditional Databases to Neural Indexing

Index structures are fundamental to database systems, enabling efficient data retrieval and query processing. Traditionally, these structures followed hand-designed algorithms like B-trees, hash tables, and R-trees. However, the last few years have witnessed a paradigm shift with the emergence of learned index structures that leverage machine learning to optimize for specific data distributions and workloads.

Continue Reading...

Distilling State-of-the-Art Multimodal Capabilities into Smaller Models

The rise of large multimodal models (LMMs) has revolutionized artificial intelligence, enabling single architectures to process and generate content across text, images, audio, and video. Models like GPT-4V, Claude 3 Opus, and Gemini Ultra have demonstrated impressive multimodal capabilities but come with substantial computational costs, often requiring hundreds of billions of parameters and specialized hardware for inference.

Continue Reading...

Domain-Adaptive Vector Compression - Recent Advances and Future Directions

Vector databases have become essential infrastructure for modern AI applications, from retrieval-augmented generation (RAG) to similarity search and recommendation systems. As these applications scale to handle billions or trillions of vectors, the need for efficient vector compression techniques has become increasingly critical. Traditional approaches focused on general-purpose compression algorithms, but recent research has shifted toward domain-adaptive methods that leverage the specific characteristics of vector distributions to achieve superior compression-quality tradeoffs.

This post explores recent advances in domain-adaptive vector compression, with a focus on developments from 2024 onwards, including key challenges, mathematical formulations, architectural innovations, and evaluation methodologies.

Background and Evolution

Vector compression aims to represent high-dimensional vectors using fewer bits while preserving their utility for downstream tasks. Traditional approaches include:

  1. Scalar quantization: Reducing precision from 32-bit float to lower bit representations
  2. Product Quantization (PQ): Splitting vectors into subvectors and quantizing each independently
  3. Optimized Product Quantization (OPQ): Adding rotation before quantization to optimize for data distribution
  4. Residual Vector Quantization (RVQ): Sequential quantization of residual errors

These methods provided good compression but applied the same compression strategy regardless of the domain characteristics or the specific task requirements.
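
To make the baseline concrete, here is a minimal sketch of Product Quantization using NumPy and scikit-learn; the number of sub-vectors, codebook size, and synthetic data are illustrative assumptions rather than any particular system's configuration.

import numpy as np
from sklearn.cluster import KMeans

def train_pq(X, num_subvectors=8, codebook_size=256):
    """Train one KMeans codebook per sub-vector block."""
    sub_dim = X.shape[1] // num_subvectors
    codebooks = []
    for m in range(num_subvectors):
        block = X[:, m * sub_dim:(m + 1) * sub_dim]
        km = KMeans(n_clusters=codebook_size, n_init=4).fit(block)
        codebooks.append(km.cluster_centers_)
    return codebooks

def encode_pq(X, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    sub_dim = codebooks[0].shape[1]
    codes = np.empty((X.shape[0], len(codebooks)), dtype=np.uint8)
    for m, C in enumerate(codebooks):
        block = X[:, m * sub_dim:(m + 1) * sub_dim]
        dists = ((block[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        codes[:, m] = dists.argmin(axis=1)
    return codes

def decode_pq(codes, codebooks):
    """Reconstruct vectors by concatenating the selected centroids."""
    return np.hstack([codebooks[m][codes[:, m]] for m in range(len(codebooks))])

# Toy usage: 10k synthetic 128-d vectors compressed to 8 bytes each.
X = np.random.randn(10_000, 128).astype(np.float32)
cb = train_pq(X)
codes = encode_pq(X, cb)
X_hat = decode_pq(codes, cb)
print("mean reconstruction error:", np.mean((X - X_hat) ** 2))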

Several important trends have emerged in vector compression research:

  1. Task-aware compression: Optimizing compression for specific downstream tasks rather than just distance preservation
  2. Multi-domain adaptivity: Methods that can automatically adapt to different vector distributions
  3. Neural compression codecs: End-to-end learned compression pipelines
  4. Hardware-accelerated decompression: Compression schemes designed for rapid GPU decompression
  5. Differential privacy integration: Compression that preserves privacy guarantees
  6. Dynamic compression rates: Adaptive bit allocation based on vector importance

Key Challenge: Domain Distribution Shift

Perhaps the most significant challenge in vector compression is maintaining performance across shifting data distributions. As embedding models improve and data distributions evolve, compression methods optimized for one distribution often perform poorly on others. This is particularly problematic in production environments where:

  1. Embedding models are regularly updated, changing vector distributions
  2. New domains are continually added to the system
  3. Query distributions differ significantly from indexed vector distributions
  4. Multiple embedding models with different characteristics must be supported simultaneously

Traditional compression methods might require complete retraining and index rebuilding when distributions shift, making them impractical for dynamic production environments.

Recent Advancements in Domain-Adaptive Compression

1. Neural Codebook Adaptation (NCA)

Neural Codebook Adaptation (Chen et al., 2024) introduces a novel approach that enables rapid adaptation of quantization codebooks to new domains without requiring complete retraining:

The method uses a hypernetwork architecture that generates domain-specific codebooks:

\[C_d = H_\theta(z_d)\]

where $C_d$ is the codebook for domain $d$, $H_\theta$ is a hypernetwork with parameters $\theta$, and $z_d$ is a learned domain embedding.

The key innovation is the two-phase training process:

  1. Meta-training phase across multiple domains: \(\min_\theta \mathbb{E}_{d \sim \mathcal{D}} \left[ \mathcal{L}_\text{quant}(X_d, C_d = H_\theta(z_d)) \right]\)

    where $\mathcal{L}_\text{quant}$ is a quantization loss (e.g., reconstruction error), $X_d$ represents vectors from domain $d$, and $\mathcal{D}$ is a distribution over domains.

  2. Adaptation phase for a new domain $d'$: \(\min_{z_{d'}} \mathcal{L}_\text{quant}(X_{d'}, C_{d'} = H_\theta(z_{d'}))\)

    Only the domain embedding $z_{d'}$ is optimized, while the hypernetwork $H_\theta$ remains fixed.

This approach allows adaptation to new domains using only a small number of examples (100-1000 vectors) and requires just seconds of fine-tuning rather than hours of retraining. Experiments show:

  • 15-30% reduction in quantization error compared to domain-agnostic methods
  • Adaptation to new domains with just 500 sample vectors
  • 100-1000× faster adaptation compared to retraining traditional quantization methods
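
The paper's exact architecture is not reproduced here, but the following PyTorch sketch illustrates the general shape of the idea under assumed sizes and losses: a small hypernetwork maps a learned domain embedding to a codebook, and adapting to a new domain optimizes only that embedding while the hypernetwork stays frozen.

import torch
import torch.nn as nn

class CodebookHypernet(nn.Module):
    """H_theta: maps a domain embedding z_d to a codebook C_d of shape (K, dim)."""
    def __init__(self, z_dim=32, num_codes=256, dim=128):
        super().__init__()
        self.num_codes, self.dim = num_codes, dim
        self.net = nn.Sequential(
            nn.Linear(z_dim, 512), nn.ReLU(),
            nn.Linear(512, num_codes * dim),
        )

    def forward(self, z_d):
        return self.net(z_d).view(self.num_codes, self.dim)

def quant_loss(X, codebook):
    """Reconstruction error after assigning each vector to its nearest code."""
    dists = torch.cdist(X, codebook)          # (N, K)
    nearest = codebook[dists.argmin(dim=1)]   # (N, dim)
    return ((X - nearest) ** 2).mean()

# Adaptation phase for a new domain: freeze H_theta, optimize only z_d'.
hypernet = CodebookHypernet()                 # assume meta-trained weights are loaded here
for p in hypernet.parameters():
    p.requires_grad_(False)

z_new = torch.zeros(32, requires_grad=True)   # learned domain embedding for d'
X_new = torch.randn(500, 128)                 # a few hundred sample vectors from d'
opt = torch.optim.Adam([z_new], lr=1e-2)
for step in range(200):
    loss = quant_loss(X_new, hypernet(z_new))
    opt.zero_grad()
    loss.backward()
    opt.step()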

2. Hierarchical Mixture of Experts Compression (HMEC)

HMEC (Wu et al., 2024) proposes a mixture-of-experts approach to vector compression, where different compression experts specialize in different regions of the vector space:

\[\hat{x} = \sum_{i=1}^{E} g_i(x) \cdot f_i(x)\]

where $\hat{x}$ is the reconstructed vector, $g_i(x)$ is the gating weight for expert $i$, and $f_i(x)$ is the output of compression expert $i$.

The gating function uses a hierarchical routing mechanism:

\[g_i(x) = \prod_{l=1}^{L} g^l_{i_l}(x)\]

where $L$ is the number of hierarchy levels, and $g^l_{i_l}(x)$ is the routing probability at level $l$.

The compression experts use different strategies optimized for different vector distributions (e.g., sparse vs. dense, clustered vs. uniform). The entire model is trained end-to-end with a combination of reconstruction loss and task-specific losses:

\[\mathcal{L} = \lambda_1 \mathcal{L}_\text{recon}(x, \hat{x}) + \lambda_2 \mathcal{L}_\text{task}(x, \hat{x})\]
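
As a rough illustration of the mixture-of-experts reconstruction above (not the authors' implementation), the sketch below uses a flat, single-level gate and tiny autoencoder experts as stand-ins for the hierarchical routing and specialized codecs.

import torch
import torch.nn as nn

class MoECompressor(nn.Module):
    def __init__(self, dim=128, bottleneck=16, num_experts=4):
        super().__init__()
        # Each "expert" is a small autoencoder standing in for one compression strategy.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):
        g = torch.softmax(self.gate(x), dim=-1)                    # g_i(x)
        outs = torch.stack([f(x) for f in self.experts], dim=-1)   # f_i(x), shape (N, dim, E)
        return (outs * g.unsqueeze(1)).sum(dim=-1)                 # sum_i g_i(x) * f_i(x)

model = MoECompressor()
x = torch.randn(64, 128)
x_hat = model(x)
loss = ((x - x_hat) ** 2).mean()   # reconstruction term; a task loss would be added here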

HMEC demonstrates remarkable adaptivity across domains:

  • 25-40% lower reconstruction error than single-strategy methods
  • Automatic allocation of more bits to important vectors
  • Graceful handling of out-of-distribution vectors

3. Contrastive Reconstruction Vector Quantization (CRVQ)

CRVQ (Lin et al., 2024) introduces a novel training objective that aligns compressed vectors with the semantic structure of the uncompressed space:

\[\mathcal{L}_\text{CRVQ} = \mathcal{L}_\text{recon} + \lambda \mathcal{L}_\text{contrastive}\]

where:

\[\mathcal{L}_\text{recon} = \frac{1}{N}\sum_{i=1}^{N} ||x_i - \hat{x}_i||_2^2\] \[\mathcal{L}_\text{contrastive} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s(x_i, \hat{x}_i)/\tau)}{\sum_{j=1}^{N}\exp(s(x_i, \hat{x}_j)/\tau)}\]

where $s(\cdot,\cdot)$ is a similarity function and $\tau$ is a temperature parameter.

The contrastive term ensures that compressed vectors maintain the same relative relationships as the original vectors, even when absolute reconstruction is imperfect. This is particularly valuable for preserving semantic relationships in embeddings.
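
A minimal sketch of the combined objective, assuming cosine similarity for $s(\cdot,\cdot)$ and in-batch negatives; the quantizer that produces $\hat{x}$ is abstracted away.

import torch
import torch.nn.functional as F

def crvq_loss(x, x_hat, lam=0.5, tau=0.07):
    """Reconstruction + contrastive alignment between original and compressed vectors."""
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()

    # Similarity matrix s(x_i, x_hat_j), here cosine similarity (an assumption).
    x_n = F.normalize(x, dim=1)
    xh_n = F.normalize(x_hat, dim=1)
    logits = x_n @ xh_n.t() / tau                 # (N, N)

    # Each x_i should match its own reconstruction x_hat_i against in-batch negatives.
    targets = torch.arange(x.size(0), device=x.device)
    contrastive = F.cross_entropy(logits, targets)
    return recon + lam * contrastive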

To enable domain adaptation, CRVQ introduces adapter layers:

\[A_d(x) = W_d \cdot x + b_d\]

where $W_d$ and $b_d$ are domain-specific parameters.

When adapting to a new domain, only these lightweight adapters need to be trained while the core quantization model remains fixed. This approach achieves:

  • 20-35% improvement in retrieval performance compared to PQ
  • Successful adaptation to new domains with just 2-5 minutes of fine-tuning
  • Maintenance of semantic relationships even at extreme compression rates (64× compression)

4. Learnable Binary Embedding with Diffusion Models (DIFFBIN)

DIFFBIN (Zhao et al., 2024) leverages the generative capabilities of diffusion models for extreme vector compression:

The approach represents each vector as a short binary code:

\[b = \text{Enc}_\theta(x) \in \{0,1\}^m\]

where $m \ll d$ (the original dimension).

A diffusion model is trained to reconstruct the original vector from this binary code:

\[\hat{x} = \text{Diff}_\phi(b, t=0)\]

where $\text{Diff}_\phi$ is a diffusion model that generates the vector by denoising from random noise, conditioned on the binary code $b$.

The training process alternates between:

  1. Optimizing the encoder $\text{Enc}_\theta$ to produce informative binary codes
  2. Training the diffusion model $\text{Diff}_\phi$ to reconstruct vectors from these codes

To enable domain adaptation, DIFFBIN uses a conditional diffusion model:

\[\hat{x} = \text{Diff}_\phi(b, d, t=0)\]

where $d$ is a domain identifier.
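
A full diffusion pipeline is beyond a short example, but the encoder side can be sketched with a straight-through sign estimator, with the conditional denoiser left as a stub; all sizes and the concatenation-based conditioning scheme here are assumptions for illustration, not the paper's design.

import torch
import torch.nn as nn

class BinaryEncoder(nn.Module):
    """Enc_theta: maps a d-dim vector to an m-bit code via a straight-through sign."""
    def __init__(self, dim=768, m=48):
        super().__init__()
        self.proj = nn.Linear(dim, m)

    def forward(self, x):
        logits = self.proj(x)
        hard = (logits > 0).float()      # b in {0, 1}^m
        soft = torch.sigmoid(logits)
        # Straight-through estimator: forward pass uses hard bits, backward uses sigmoid gradients.
        return hard + (soft - soft.detach())

class ConditionalDenoiser(nn.Module):
    """Stub for Diff_phi(b, d, t): predicts the clean vector from noisy input, code, and domain."""
    def __init__(self, dim=768, m=48, num_domains=10):
        super().__init__()
        self.domain_emb = nn.Embedding(num_domains, 32)
        self.net = nn.Sequential(nn.Linear(dim + m + 32 + 1, 1024), nn.ReLU(),
                                 nn.Linear(1024, dim))

    def forward(self, x_noisy, b, domain_id, t):
        cond = torch.cat([x_noisy, b, self.domain_emb(domain_id), t], dim=-1)
        return self.net(cond)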

This approach allows:

  • Extreme compression rates (128× or higher) while maintaining reasonable retrieval performance
  • Generation of multiple plausible reconstructions for ambiguous cases
  • Rapid adaptation to new domains by fine-tuning only the domain embedding

5. Multi-Resolution Adaptive Compression (MRAC)

MRAC (Johnson et al., 2024) introduces a variable-rate compression scheme that allocates different bit rates to different vectors based on their importance:

\[R(x) = f_\theta(x, \text{context})\]

where $R(x)$ is the bit rate allocated to vector $x$, $f_\theta$ is a learned allocation function, and “context” includes factors like query frequency, cluster density, and domain characteristics.

The system maintains multiple codebooks at different compression rates:

\[C = \{C_1, C_2, ..., C_K\}\]

where $C_k$ is a codebook at compression rate $k$.

The allocation function is trained to optimize a system-level objective:

\[\mathcal{L}_\text{system} = \mathcal{L}_\text{task} + \lambda \cdot \text{BitRate}\]

To adapt to new domains, MRAC includes domain-specific allocation heads:

\[R_d(x) = f_{\theta,d}(x, \text{context})\]
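
A sketch of the allocation side under assumed inputs: a small network scores each vector together with placeholder context features and (softly) selects one of $K$ pre-trained codebooks.

import torch
import torch.nn as nn

class RateAllocator(nn.Module):
    """f_theta: maps a vector plus context features to a choice among K bit rates."""
    def __init__(self, dim=128, context_dim=3, num_rates=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + context_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_rates))

    def forward(self, x, context):
        # context might hold query frequency, local cluster density, a domain id, etc.
        logits = self.net(torch.cat([x, context], dim=-1))
        return torch.softmax(logits, dim=-1)      # soft assignment over the K codebooks

allocator = RateAllocator()
x = torch.randn(32, 128)
ctx = torch.rand(32, 3)                           # placeholder context features
rate_probs = allocator(x, ctx)
chosen_rate = rate_probs.argmax(dim=-1)           # at serving time, pick the hard rate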

This approach achieves:

  • 2-3× better compression-quality tradeoff compared to fixed-rate methods
  • Automatic adaptation to query patterns and domain characteristics
  • Graceful degradation under changing memory constraints

Evaluation Methodologies

Recent work has established more comprehensive evaluation protocols that go beyond simple reconstruction metrics:

Task-Specific Metrics

  • Retrieval accuracy gap (RAG): The difference in retrieval accuracy between compressed and uncompressed vectors
  • Semantic similarity retention (SSR): How well pairwise similarities are preserved after compression
  • Out-of-distribution robustness (OODR): Performance on vectors from distributions not seen during training
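
As a concrete illustration, the first two task-specific metrics can be computed in a few lines of NumPy; the recall definition and sampling scheme below are assumptions, not a standardized protocol.

import numpy as np

def recall_at_k(queries, db, ground_truth, k=10):
    """Fraction of queries whose true nearest neighbor appears in the top-k results."""
    scores = queries @ db.T
    topk = np.argsort(-scores, axis=1)[:, :k]
    return np.mean([gt in row for gt, row in zip(ground_truth, topk)])

def retrieval_accuracy_gap(queries, db_full, db_compressed, ground_truth, k=10):
    # Accuracy with uncompressed vectors minus accuracy with compressed vectors.
    return recall_at_k(queries, db_full, ground_truth, k) - \
           recall_at_k(queries, db_compressed, ground_truth, k)

def semantic_similarity_retention(x, x_hat, num_pairs=10_000):
    """Correlation between pairwise similarities before and after compression."""
    idx = np.random.randint(0, len(x), size=(num_pairs, 2))
    sims_orig = np.sum(x[idx[:, 0]] * x[idx[:, 1]], axis=1)
    sims_comp = np.sum(x_hat[idx[:, 0]] * x_hat[idx[:, 1]], axis=1)
    return np.corrcoef(sims_orig, sims_comp)[0, 1]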

Adaptation Metrics

  • Adaptation time (AT): Time required to adapt to a new domain
  • Sample efficiency (SE): Number of examples needed for successful adaptation
  • Continual adaptation decay (CAD): Performance degradation after adapting to multiple domains sequentially

Benchmark Datasets

Several new benchmark datasets have been established specifically for evaluating domain-adaptive compression:

  1. MultiDomainVec-1B: 1 billion vectors across 10 diverse domains (text, image, audio, multimodal)
  2. ShiftingEmbeds: Embedding vectors from the same data using different model versions
  3. CrossDomainRetrieval: Evaluation of cross-domain retrieval tasks with compressed vectors

Future Directions

Based on current trends, several promising research directions emerge:

  1. Zero-shot domain adaptation: Compression methods that can adapt to new domains without any examples, perhaps leveraging large language models to predict domain characteristics

  2. Multi-task optimization: Compression schemes jointly optimized for multiple downstream tasks (retrieval, classification, clustering) that automatically balance performance across tasks

  3. Compression-aware embedding training: Co-designing embedding models and compression methods, where embedding models learn to produce vectors that are more amenable to compression

  4. Theoretical understanding of compressibility across domains: Formal frameworks for understanding what makes vectors from certain domains more compressible than others

  5. Privacy-preserving compression: Methods that provide formal privacy guarantees while maintaining utility of compressed vectors

  6. Hardware-software co-design: Compression algorithms specifically designed for emerging hardware accelerators with novel capabilities

Conclusion

Domain-adaptive vector compression has emerged as a critical research area for enabling efficient, scalable AI applications. Recent advances have made significant strides in addressing the challenge of distribution shift, enabling compression methods that can rapidly adapt to new domains without sacrificing performance.

The integration of neural approaches, contrastive learning, and adaptive allocation strategies has pushed the boundaries of what’s possible in vector compression. As AI applications continue to scale and diversify, we can expect domain-adaptive compression to remain at the forefront of enabling efficient, practical systems.

References

  1. Chen, S., Wang, J., & Li, F. (2024). Neural Codebook Adaptation for Domain-Adaptive Vector Quantization. ICML 2024.

  2. Wu, Y., Singh, A., et al. (2024). Hierarchical Mixture of Experts for Adaptive Vector Compression. NeurIPS 2024.

  3. Lin, Z., Jain, P., & Agrawal, A. (2024). Contrastive Reconstruction Vector Quantization. ICLR 2024.

  4. Zhao, K., Xu, M., et al. (2024). DIFFBIN: Diffusion Models for Learnable Binary Embedding Compression. CVPR 2024.

  5. Johnson, J., Chen, H., & Karrer, B. (2024). Multi-Resolution Adaptive Compression for Production-Scale Vector Databases. SIGMOD 2024.

  6. Guo, R., Reimers, N., et al. (2023). Towards Domain-Adaptive Vector Quantization. arXiv:2312.05934.

  7. Zhang, H., Sablayrolles, A., et al. (2024). AdaptiveSearch: Efficient Vector Search Under Distribution Shift. Information Retrieval Journal.

  8. Williams, T., Singh, K., et al. (2024). Benchmarking Vector Compression: Beyond Reconstruction Error. VLDB 2024.

  9. Liu, Q., Douze, M., & Jégou, H. (2023). Product Quantization for Vector Search with Large Language Model Features. Transactions on Machine Learning Research.

DP intro

Dynamic Programming is a method for solving complex problems by breaking them down into simpler subproblems. It is particularly useful when a problem has overlapping subproblems and optimal substructure, meaning the optimal solution to the problem can be constructed from optimal solutions to its subproblems.
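
As a tiny illustration of overlapping subproblems, a memoized Fibonacci computes each subproblem once and reuses it:

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Each fib(k) is computed once and reused, turning exponential work into linear."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025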

Continue Reading...

Simple Python I/O

Mac OS virtualbox in ubuntu

While Linux is a great operating system, many applications are not available in its ecosystem, such as iTunes, OneNote, or Sony Digital Paper. One solution is using Wine, though I haven’t gotten every app to work smoothly; another is running VirtualBox within Linux, which brings the same user experience as the original apps, though it uses more computational resources. This tutorial covers setting up a Mac OS virtual machine in VirtualBox on Ubuntu (18.01); my guide follows this useful article.

Continue Reading...

Period 3 implies chaos


Reread James Gleick’s Chaos: Making a New Science and found the famous, shocking Li–Yorke theorem: let \(f: \mathbf{R} \rightarrow \mathbf{R}\) be a continuous function. If \(f\) has a period-3 point (i.e. \(f^3(x) = x\) while \(f(x), f^2(x) \neq x\)), then

  1. For every \(k = 1,2,...\) there is a periodic point having period \(k\).

  2. There is an uncountable set \(S\) containing no periodic points, which satisfies

  • For every \(p,q \in S\) with \(p \neq q\),
\[\limsup_{n\rightarrow \infty} |f^n(p) - f^n(q)| > 0\] \[\liminf_{n\rightarrow \infty} |f^n(p) - f^n(q)| = 0\]
  • For every \(p \in S\) and every periodic point \(q \in \mathbf{R}\),
\[\limsup_{n\rightarrow \infty} |f^n(p) - f^n(q)| > 0\]

Continue Reading...

Bling

Bring beautiful natural scenery to every new tab in Chrome! Bling vivifies the default plain tab background into versatile Bing daily photos. Minimal permission required.


Continue Reading...

Key Mapping for specific apps in Mac

Scenario: Keymapping for specific apps in Mac.

For example, Windows and Mac use the Control/Command keys differently, which becomes annoying with Microsoft Remote Desktop on Mac: it doesn’t provide a self-contained working environment and often jumps to other Mac apps. Mapping the Mac Command key to the Control key sorts out the problem.

Continue Reading...

Enlighten - a syntax highlighting tool

My weekend chrome extension project: Enlighten - a handy syntax highlighting tool based on highlight.js, try it on the Chrome Web Store or check out the source code. Any feedback will be appreciated!


Continue Reading...

Reverse SSH tunneling

Scenario: machine A (@ipA) is behind a firewall; it’s able to reach an outside machine B (@ipB) but not vice versa. We’d like to make B able to reach A.

Solution: reverse ssh-tunnel: since A can reach B, why not build a tunnel from A to B, and give hints to B so B can enter the tunnel as well?

  • On A: ssh -R 1234:localhost:22 userB@ipB
  • On B: ssh userA@localhost -p 1234

Running automatically at reboot:

  • Install autossh: sudo apt-get install autossh.
  • Create a new public/private key pair as root: ssh-keygen, with destination /root/.ssh/id_rsa.

  • autossh -M 12345 -o "PubkeyAuthentication=yes" -o "PasswordAuthentication=no" -i /root/.ssh/id_rsa -R 1234:localhost:22 userB@ipB.

Selenium - web browser automation

Recently I was playing around with the powerful front-end automation testing tool Selenium; here are some examples I created to automate simple routine work.

First, we need a testing browser driver with its path registered; ChromeDriver (Chrome) and the Firefox driver are common ones. Typically put the chromedriver executable in /usr/local/bin/chromedriver (or chromedriver.exe in C:/Users/%USERNAME%/AppData/Local/Google/Chrome/Application/), and don’t forget to register this path or specify it when using the driver.

#sudo apt-get install unzip
wget -N https://chromedriver.storage.googleapis.com/2.38/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver

sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
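
Once the driver is on the path, a minimal usage example looks roughly like this (Selenium 3-era API; the URL and selector are placeholders):

from selenium import webdriver

driver = webdriver.Chrome()                            # picks up chromedriver from the path
driver.get("https://example.com")                      # placeholder URL
print(driver.title)
element = driver.find_element_by_css_selector("h1")    # placeholder selector
print(element.text)
driver.quit()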
Continue Reading...

From Eternity To Here by Sean Carroll


  • Black holes and entropy - can black holes have temperature?
  • What happens when two black holes merge?
  • Ultimate theory - a unification of quantum mechanics and general relativity.
  • What is a good theory? -> simple and able to predict accurately.
Continue Reading...

Web scraper

  • Python
import requests
from lxml import html

# Pretend to be a regular browser to avoid being blocked.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
# XPath expressions for the product title text and the brand attribute.
xpath_product = '//h1//span[@id="productTitle"]//text()'
xpath_brand = '//div[@id="mbc"]/@data-brand'

def getBrandName(url):
    # Download the page, parse the HTML, and extract the brand attribute.
    page = requests.get(url, headers=headers)
    parsed = html.fromstring(page.content)
    return parsed.xpath(xpath_brand)
Continue Reading...

Python multithreading & misc setting

  • Multithreading
from threading import Thread

def foo(dummy, results):
    results.append(dummy)

num_threads = 5
threads, results = [], []
# Start one thread per input value; each appends its result to the shared list.
for i in range(num_threads):
    process = Thread(target=foo, args=(i, results))
    process.start()
    threads.append(process)
# Wait for all threads to finish before reading the results.
for process in threads:
    process.join()
print(results)
Continue Reading...

Modern Love - Stories of love, loss and redemption


Stories of love, loss and redemption

Started listening to WBUR/NPR’s Modern Love podcast when I was in NYC, and it has become one of my favorite podcasts. I often cry hearing stories of love, pain, struggle, death, and youth (while driving and cooking). Well done - authors, host Meghna Chakrabarti, and the New York Times!

Tuesdays With Morrie


Once you learn how to die, you learn how to live.

So many people walk around with a meaningless life. They seem half-asleep, even when they’re busy doing things they think are important. This is because they’re chasing the wrong things. The way you get meaning into your life is to devote yourself to loving others, devote yourself to your community around you, and devote yourself to creating something that gives you purpose and meaning.

The most important thing in life is to learn how to give out love, and to let it come in.

Continue Reading...

Clean Architecture ch1-ch3

Notes on Robert C. Martin - Clean Architecture: A Craftsman’s Guide to Software Structure and Design, chapter 1-3.

The goal of software architecture is to minimize the human resources required to build and maintain the required system.

Two perspectives of software:

  1. behaviors (functions, features)
  2. architecture - software must be easy to change; however, as the scope grows, changes become harder and harder.
  • “Software architects are more focused on the structure of the system than on its features and functions.”
Continue Reading...

A cure for internet addiction

Inspired by the New York Times article “Is the Answer to Phone Addiction a Worse Phone?”, I wrote a chrome extension (source code) to turn some of my favorite websites into grayscale:


Looks promising! Enjoy new life without internet addiction!

How Google Works


  • A book about Google culture, management, and the authors’ success stories.

  • Find people with passion
    • What are trends you missed/ predicted correctly in the 2000s?
    • What’s the biggest failure in your career?
    • How do you pitch your work to your CEO?
    • Page asked “Tell me something interesting I don’t know” when interviewing Jonathan.
  • Hiring culture
    • A committee vote decides who to hire - a practice borrowed from academia
    • A false negative is better than a false positive.
    • Huge bonuses for people who do great work
Continue Reading...

NLP IIT POS tagging

NLP POS-tagging (lecture by Pushpak Bhattacharyya)


  • NLP= Ambiguity Processing
    • Lexical Ambiguity: dog (noun vs verb), (animal vs detestable person), contexts.
    • Structural Ambiguity
    • Semantic Ambiguity
    • Pragmatic Ambiguity
  • Main methodology
    • A: extract parts & features
    • B: which is in correspondence with A: extract parts and features
    • Learn mapping of these features and parts
    • Apply to new situations (decoding)

POS tagging

  • POS Tagging: attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set
  • A word can have multiple POS tags
  • New examples break rules, so we need a robust system.
  • Generative: HMM
    • Training: Maximize the likelihood of observations
    • Testing: search the best POS tag sequence in the hypothesis space
    • generate POS tag sequences and score them
    • HMM
    • Given the observation sequence, find the most likely state sequence - Viterbi algorithm
    • Given the observation sequence, find its probability - forward/backward algorithm
    • Given the observation sequence, find the HMM parameters - Baum-Welch algorithm
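
As a rough illustration of the Viterbi decoding step mentioned above (toy tag set and probabilities, not from the lecture):

import numpy as np

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely tag sequence for an observation sequence under an HMM."""
    V = np.zeros((len(obs), len(states)))          # best log-prob of each state at each step
    back = np.zeros((len(obs), len(states)), dtype=int)
    V[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, len(obs)):
        scores = V[t - 1][:, None] + np.log(trans_p) + np.log(emit_p[:, obs[t]])[None, :]
        V[t] = scores.max(axis=0)
        back[t] = scores.argmax(axis=0)
    # Backtrack from the best final state to recover the full path.
    path = [int(V[-1].argmax())]
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[i] for i in reversed(path)]

# Toy example: 2 tags, 3 word types.
states = ["NOUN", "VERB"]
start_p = np.array([0.6, 0.4])
trans_p = np.array([[0.7, 0.3], [0.4, 0.6]])       # P(next tag | current tag)
emit_p = np.array([[0.5, 0.4, 0.1],                # P(word | NOUN)
                   [0.1, 0.3, 0.6]])               # P(word | VERB)
print(viterbi([0, 2, 1], states, start_p, trans_p, emit_p))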
Continue Reading...

NLP IIT information retrieval

Informational retrieval (lecture by ARNAB BHATTACHARYA)

  • Retrieval (finding) of information (e.g., documents) that is mostly unstructured (e.g., text) and is relevant to an information need.
  • Tokenization is the process of breaking the text into terms.
    • Token normalization finds more documents: it increases recall but decreases precision.
  • Stemming or lemmatization refers to stripping the word to its root or lemma:
    • e.g. "system", "systems", "systematic". This requires morphological analysis and is language specific.
Continue Reading...

Gmail & Outlook Thread Creator

This is a demo of an email thread generator that sends from Gmail to both Gmail and Outlook. It can

  • Send N emails with different titles.
  • Create an email thread with N conversations (i.e. grouping N emails together).

Prerequisite:

  • Check out Gmail API quickstart to authorize API usage and save the private key file client_secret.json in your working directory.
Continue Reading...

Predictive Modelling & Recommendation system in Email Marketing

A multithreading example in Python

This is a simple example demonstrating how to run a Python script with different inputs in parallel (using multiprocessing) and merge the results. Here the application is to compute aggregation statistics for different dates through computationally intensive queries, then merge the results across all dates.

from multiprocessing import Process
import glob
import os

import pandas as pd

def get_days(start, num_of_days):
    ''' generate a list of dates starting from the starting date
    to the starting date + num_of_days
    '''
    date_range = pd.date_range(start, periods=num_of_days, freq='1D')
    return [dt.strftime("%Y-%m-%d") for dt in date_range]

# Each child process runs foo.py for one date; foo.py is expected to write a CSV per date.
f = lambda date: os.system("python foo.py --date %s" % date)

start, num_of_days = "2018-01-01", 7  # example inputs
children = []
for date in get_days(start, num_of_days):
    p = Process(target=f, args=(date,))
    p.start()
    children.append(p)

# wait for all child processes to finish
for x in children:
    x.join()

# merge the per-date results into a single dataframe
all_df = (pd.read_csv(filename) for filename in glob.glob("*.csv"))
merge = pd.concat(all_df, ignore_index=True)

Baroque style on St Paul's Cathedral

This is an application of A Neural Algorithm of Artistic Style by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, run on a GPU (GTX-980).

The raw photo was shot at one of my favorite places - St Paul’s Cathedral.

The first Baroque painting style is Jan Brueghel the Elder’s The Entry of the Animals Into Noah’s Ark.

Continue Reading...

First Place @NC Data Jam!

How to install Spark on Mac & Ubuntu

Installing Spark is handy; here’s a quick guide to Spark installation on Mac and Ubuntu.

  • Download Spark 2.0 from the official website

  • Extract the contents:
    cat /Users/<yourname>/spark.tgz | tar -xz -C /Users/<yourname>/
    
  • Create a soft link
    cd /Users/<yourname>/
    ln -s spark-* spark
    
  • Add shortcuts to your .bash_profile:
export SPARK_HOME=/Users/<yourname>/spark
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
  • Source .bash_profile and run

BONUS: To have an environment similar to ipython / ipython notebook, I added these aliases in my .bash_profile:

alias ipyspark='$SPARK_HOME/bin/pyspark --packages com.databricks:spark-csv_2.10:1.4.0'
alias ipynbspark='PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" $SPARK_HOME/bin/pyspark --driver-memory 15g'

Remote control and screen sharing of Ubuntu

This is a tutorial about how to set up a remote connection to an Ubuntu server (15.10) from a Mac (OS X).

Continue Reading...

Transfer learning - image classification

Sean Chang 2016

image source: google images, amazon
model: tensorflow, pool_3:0 (imagenet)
cluster algorithm: k-means
links: cosine similarity > 0.7
Continue Reading...

Collaborative filtering


Interactive data visualization dashboard

Continue Reading...

Machine Learning US airline delays

Python tricks

Having worked with Python for several years, here are some useful tricks and tools I’d like to share:

ipython notebook extension

Usage: several useful tools on top of ipython notebook

First, install the extension:

git clone https://github.com/ipython-contrib/IPython-notebook-extensions.git
cd IPython-notebook-extensions
python setup.py install

then go to http://localhost:8888/nbextensions/ to check which extensions you’d like to use. Personally, I like the scratchpad very much - by typing Ctrl+B, a scratchpad pops up; it’s a good place for checking current variables, making a quick plot, or running a few lines of code without inserting a cell and deleting it after use. A demo looks like this:

Continue Reading...

Prelude


This blog, based on Jekyll, is the fourth website I’ve built recently; I learned something new with every attempt. It aims to share my thoughts on (but not limited to) technology and to provide a place for discussions.

Continue Reading...

Tiba won the mHealth prize and the Grand Prize

My team Tiba won both the mHealth prize and the Grand Prize at the Triangle Health Innovation Challenge