Index structures are fundamental to database systems, enabling efficient data retrieval and query processing. Traditionally, these structures have relied on hand-designed algorithms such as B-trees, hash tables, and R-trees. However, the last few years have witnessed a paradigm shift with the emergence of learned index structures that leverage machine learning to optimize for specific data distributions and workloads.
The rise of large multimodal models (LMMs) has revolutionized artificial intelligence, enabling single architectures to process and generate content across text, images, audio, and video. Models like GPT-4V, Claude 3 Opus, and Gemini Ultra have demonstrated impressive multimodal capabilities but come with substantial computational costs, often requiring hundreds of billions of parameters and specialized hardware for inference.
Vector databases have become essential infrastructure for modern AI applications, from retrieval-augmented generation (RAG) to similarity search and recommendation systems. As these applications scale to handle billions or trillions of vectors, the need for efficient vector compression techniques has become increasingly critical. Traditional approaches focused on general-purpose compression algorithms, but recent research has shifted toward domain-adaptive methods that leverage the specific characteristics of vector distributions to achieve superior compression-quality tradeoffs.
This post explores recent advances in domain-adaptive vector compression, with a focus on developments from 2024 onwards, including key challenges, mathematical formulations, architectural innovations, and evaluation methodologies.
Vector compression aims to represent high-dimensional vectors using fewer bits while preserving their utility for downstream tasks. Traditional approaches include scalar quantization, product quantization, and binary hashing.
These methods provided good compression but applied the same compression strategy regardless of the domain characteristics or the specific task requirements.
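For reference, here is a minimal sketch of one such traditional method, product quantization; the sub-vector count, codebook size, and the use of scikit-learn's KMeans are illustrative choices, not a tuned implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(X, num_subvectors=4, num_codes=256):
    """Learn one k-means codebook per sub-vector block (classic product quantization)."""
    sub_dim = X.shape[1] // num_subvectors
    codebooks = []
    for s in range(num_subvectors):
        sub = X[:, s * sub_dim:(s + 1) * sub_dim]
        codebooks.append(KMeans(n_clusters=num_codes, n_init=4).fit(sub).cluster_centers_)
    return codebooks

def encode_pq(X, codebooks):
    """Quantize each sub-vector to the index of its nearest code (one byte if num_codes <= 256)."""
    sub_dim = codebooks[0].shape[1]
    codes = []
    for s, cb in enumerate(codebooks):
        sub = X[:, s * sub_dim:(s + 1) * sub_dim]
        dists = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)   # (n, num_codes)
        codes.append(dists.argmin(axis=1))
    return np.stack(codes, axis=1)
```

Note that the same codebooks are used for every vector regardless of where it came from, which is exactly the one-size-fits-all behavior the methods below try to move past.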
Several important trends have emerged in vector compression research, most notably a shift toward methods that adapt to the data distribution and workload at hand.
Perhaps the most significant challenge in vector compression is maintaining performance across shifting data distributions. As embedding models improve and data distributions evolve, compression methods optimized for one distribution often perform poorly on others. This is particularly problematic in production environments, where embedding models are regularly upgraded and the underlying data keeps changing.
Traditional compression methods might require complete retraining and index rebuilding when distributions shift, making them impractical for dynamic production environments.
Neural Codebook Adaptation (Chen et al., 2024) introduces a novel approach that enables rapid adaptation of quantization codebooks to new domains without requiring complete retraining:
The method uses a hypernetwork architecture that generates domain-specific codebooks:
\[C_d = H_\theta(z_d)\]

where $C_d$ is the codebook for domain $d$, $H_\theta$ is a hypernetwork with parameters $\theta$, and $z_d$ is a learned domain embedding.
The key innovation is the two-phase training process:
Meta-training phase across multiple domains: \(\min_\theta \mathbb{E}_{d \sim \mathcal{D}} \left[ \mathcal{L}_\text{quant}(X_d, C_d = H_\theta(z_d)) \right]\)
where $\mathcal{L}_\text{quant}$ is a quantization loss (e.g., reconstruction error), $X_d$ represents vectors from domain $d$, and $\mathcal{D}$ is a distribution over domains.
Adaptation phase for a new domain $d'$: \(\min_{z_{d'}} \mathcal{L}_\text{quant}(X_{d'}, C_{d'} = H_\theta(z_{d'}))\)
Only the domain embedding $z_{d'}$ is optimized, while the hypernetwork $H_\theta$ remains fixed.
This approach allows adaptation to new domains using only a small number of examples (100-1000 vectors) and requires just seconds of fine-tuning rather than hours of retraining; the authors' experiments bear this out across a range of domains.
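A rough sketch of the adaptation phase in PyTorch (my own pseudocode, not the authors' implementation): the hypernetwork stays frozen and only the new domain's embedding receives gradient updates.

```python
import torch

# Toy hypernetwork H_theta: domain embedding -> codebook of shape (num_codes, dim).
num_codes, dim, embed_dim = 64, 16, 8
hypernet = torch.nn.Sequential(
    torch.nn.Linear(embed_dim, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, num_codes * dim),
)

def adapt_new_domain(vectors, steps=200, lr=1e-2):
    """Fit only the domain embedding z_d' on a few hundred vectors from the new domain."""
    z_new = torch.zeros(embed_dim, requires_grad=True)
    opt = torch.optim.Adam([z_new], lr=lr)                # hypernetwork parameters are not optimized
    for _ in range(steps):
        codebook = hypernet(z_new).view(num_codes, dim)   # C_d' = H_theta(z_d')
        dists = torch.cdist(vectors, codebook)            # (n, num_codes)
        nearest = codebook[dists.argmin(dim=1)]           # nearest-code reconstruction
        loss = ((vectors - nearest) ** 2).mean()          # quantization loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z_new.detach()
```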
HMEC (Wu et al., 2024) proposes a mixture-of-experts approach to vector compression, where different compression experts specialize in different regions of the vector space:
\[\hat{x} = \sum_{i=1}^{E} g_i(x) \cdot f_i(x)\]

where $\hat{x}$ is the reconstructed vector, $g_i(x)$ is the gating weight for expert $i$, and $f_i(x)$ is the output of compression expert $i$.
The gating function uses a hierarchical routing mechanism:
\[g_i(x) = \prod_{l=1}^{L} g^l_{i_l}(x)\]

where $L$ is the number of hierarchy levels, and $g^l_{i_l}(x)$ is the routing probability at level $l$.
The compression experts use different strategies optimized for different vector distributions (e.g., sparse vs. dense, clustered vs. uniform). The entire model is trained end-to-end with a combination of reconstruction loss and task-specific losses:
\[\mathcal{L} = \lambda_1 \mathcal{L}_\text{recon}(x, \hat{x}) + \lambda_2 \mathcal{L}_\text{task}(x, \hat{x})\]

HMEC demonstrates remarkable adaptivity across domains:
- 25-40% lower reconstruction error than single-strategy methods
- Automatic allocation of more bits to important vectors
- Graceful handling of out-of-distribution vectors
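The reconstruction rule above can be sketched in a few lines; the single-level softmax gate and linear bottleneck experts here are my own simplifications of HMEC's hierarchical design.

```python
import torch
import torch.nn as nn

class MoECompressor(nn.Module):
    def __init__(self, dim, num_experts=4, code_dim=16):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        # each expert is a small encode/decode bottleneck, i.e. one compression strategy
        self.encoders = nn.ModuleList([nn.Linear(dim, code_dim) for _ in range(num_experts)])
        self.decoders = nn.ModuleList([nn.Linear(code_dim, dim) for _ in range(num_experts)])

    def forward(self, x):                                   # x: (n, dim)
        g = torch.softmax(self.gate(x), dim=-1)             # gating weights g_i(x)
        recons = torch.stack(
            [dec(enc(x)) for enc, dec in zip(self.encoders, self.decoders)], dim=1
        )                                                   # expert outputs f_i(x): (n, E, dim)
        return (g.unsqueeze(-1) * recons).sum(dim=1)        # x_hat = sum_i g_i(x) f_i(x)
```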
CRVQ (Lin et al., 2024) introduces a novel training objective that aligns compressed vectors with the semantic structure of the uncompressed space:
\[\mathcal{L}_\text{CRVQ} = \mathcal{L}_\text{recon} + \lambda \mathcal{L}_\text{contrastive}\]

where:
\[\mathcal{L}_\text{recon} = \frac{1}{N}\sum_{i=1}^{N} ||x_i - \hat{x}_i||_2^2\]

\[\mathcal{L}_\text{contrastive} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s(x_i, \hat{x}_i)/\tau)}{\sum_{j=1}^{N}\exp(s(x_i, \hat{x}_j)/\tau)}\]

where $s(\cdot,\cdot)$ is a similarity function and $\tau$ is a temperature parameter.
The contrastive term ensures that compressed vectors maintain the same relative relationships as the original vectors, even when absolute reconstruction is imperfect. This is particularly valuable for preserving semantic relationships in embeddings.
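In code, the combined objective looks roughly like the following sketch; the choice of cosine similarity for $s(\cdot,\cdot)$ and the temperature value are illustrative assumptions on my part.

```python
import torch
import torch.nn.functional as F

def crvq_loss(x, x_hat, lam=0.5, tau=0.07):
    # reconstruction term: mean squared L2 error
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()
    # contrastive term: each compressed vector should be most similar to its own original
    sim = F.normalize(x, dim=1) @ F.normalize(x_hat, dim=1).T / tau   # s(x_i, x_hat_j) / tau
    targets = torch.arange(x.size(0))                                  # positives on the diagonal
    contrastive = F.cross_entropy(sim, targets)
    return recon + lam * contrastive
```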
To enable domain adaptation, CRVQ introduces adapter layers:
\[A_d(x) = W_d \cdot x + b_d\]

where $W_d$ and $b_d$ are domain-specific parameters.
When adapting to a new domain, only these lightweight adapters need to be trained while the core quantization model remains fixed, which keeps adaptation cost low.
DIFFBIN (Zhao et al., 2024) leverages the generative capabilities of diffusion models for extreme vector compression:
The approach represents each vector as a short binary code:
\[b = \text{Enc}_\theta(x) \in \{0,1\}^m\]

where $m \ll d$ (the original dimension).
A diffusion model is trained to reconstruct the original vector from this binary code:
\[\hat{x} = \text{Diff}_\phi(b, t=0)\]

where $\text{Diff}_\phi$ is a diffusion model that generates the vector by denoising from random noise, conditioned on the binary code $b$.
The training process alternates between:
1. Optimizing the encoder $\text{Enc}_\theta$ to produce informative binary codes
2. Training the diffusion model $\text{Diff}_\phi$ to reconstruct vectors from these codes
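As a sketch of the encoder half of this loop (my own simplification; the diffusion decoder is omitted), a straight-through estimator keeps the binarization step trainable:

```python
import torch
import torch.nn as nn

class BinaryEncoder(nn.Module):
    """Maps a d-dimensional vector to an m-bit code; m is an illustrative choice."""
    def __init__(self, dim, m=32):
        super().__init__()
        self.proj = nn.Linear(dim, m)

    def forward(self, x):
        probs = torch.sigmoid(self.proj(x))
        hard = (probs > 0.5).float()               # b in {0,1}^m
        # straight-through estimator: forward pass uses hard bits, backward uses sigmoid gradients
        return hard + probs - probs.detach()
```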
To enable domain adaptation, DIFFBIN uses a conditional diffusion model:
\[\hat{x} = \text{Diff}_\phi(b, d, t=0)\]

where $d$ is a domain identifier.
This approach allows:
- Extreme compression rates (128× or higher) while maintaining reasonable retrieval performance
- Generation of multiple plausible reconstructions for ambiguous cases
- Rapid adaptation to new domains by fine-tuning only the domain embedding
MRAC (Johnson et al., 2024) introduces a variable-rate compression scheme that allocates different bit rates to different vectors based on their importance:
\[R(x) = f_\theta(x, \text{context})\]

where $R(x)$ is the bit rate allocated to vector $x$, $f_\theta$ is a learned allocation function, and “context” includes factors like query frequency, cluster density, and domain characteristics.
The system maintains multiple codebooks at different compression rates:
\[C = \{C_1, C_2, ..., C_K\}\]

where $C_k$ is a codebook at compression rate $k$.
The allocation function is trained to optimize a system-level objective:
\[\mathcal{L}_\text{system} = \mathcal{L}_\text{task} + \lambda \cdot \text{BitRate}\]

To adapt to new domains, MRAC includes domain-specific allocation heads:
\[R_d(x) = f_{\theta,d}(x, \text{context})\]

This approach achieves:
- 2-3× better compression-quality tradeoff compared to fixed-rate methods
- Automatic adaptation to query patterns and domain characteristics
- Graceful degradation under changing memory constraints
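A toy version of the allocation idea (my own sketch): a learned scorer stands in for $f_\theta(x, \text{context})$ and picks one of $K$ codebooks, from coarse to fine, for each vector.

```python
import torch
import torch.nn as nn

class RateAllocator(nn.Module):
    def __init__(self, dim, codebooks):
        super().__init__()
        self.codebooks = codebooks                  # list of K tensors, each (num_codes_k, dim)
        self.scorer = nn.Linear(dim, len(codebooks))

    def forward(self, x):                           # x: (n, dim)
        rate_idx = self.scorer(x).argmax(dim=-1)    # R(x): which codebook / bit rate to use
        out = torch.empty_like(x)
        for k, cb in enumerate(self.codebooks):
            mask = rate_idx == k
            if mask.any():
                d = torch.cdist(x[mask], cb)        # distances to codes in C_k
                out[mask] = cb[d.argmin(dim=1)]     # quantize with the chosen codebook
        return out, rate_idx
```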
Recent work has established more comprehensive evaluation protocols that go beyond simple reconstruction metrics (e.g., Williams et al., 2024).
Several new benchmark datasets have also been introduced specifically for evaluating domain-adaptive compression.
Based on current trends, several promising research directions emerge:
Zero-shot domain adaptation: Compression methods that can adapt to new domains without any examples, perhaps leveraging large language models to predict domain characteristics
Multi-task optimization: Compression schemes jointly optimized for multiple downstream tasks (retrieval, classification, clustering) that automatically balance performance across tasks
Compression-aware embedding training: Co-designing embedding models and compression methods, where embedding models learn to produce vectors that are more amenable to compression
Theoretical understanding of compressibility across domains: Formal frameworks for understanding what makes vectors from certain domains more compressible than others
Privacy-preserving compression: Methods that provide formal privacy guarantees while maintaining utility of compressed vectors
Hardware-software co-design: Compression algorithms specifically designed for emerging hardware accelerators with novel capabilities
Domain-adaptive vector compression has emerged as a critical research area for enabling efficient, scalable AI applications. Recent advances have made significant strides in addressing the challenge of distribution shift, enabling compression methods that can rapidly adapt to new domains without sacrificing performance.
The integration of neural approaches, contrastive learning, and adaptive allocation strategies has pushed the boundaries of what’s possible in vector compression. As AI applications continue to scale and diversify, we can expect domain-adaptive compression to remain at the forefront of enabling efficient, practical systems.
Chen, S., Wang, J., & Li, F. (2024). Neural Codebook Adaptation for Domain-Adaptive Vector Quantization. ICML 2024.
Wu, Y., Singh, A., et al. (2024). Hierarchical Mixture of Experts for Adaptive Vector Compression. NeurIPS 2024.
Lin, Z., Jain, P., & Agrawal, A. (2024). Contrastive Reconstruction Vector Quantization. ICLR 2024.
Zhao, K., Xu, M., et al. (2024). DIFFBIN: Diffusion Models for Learnable Binary Embedding Compression. CVPR 2024.
Johnson, J., Chen, H., & Karrer, B. (2024). Multi-Resolution Adaptive Compression for Production-Scale Vector Databases. SIGMOD 2024.
Guo, R., Reimers, N., et al. (2023). Towards Domain-Adaptive Vector Quantization. arXiv:2312.05934.
Zhang, H., Sablayrolles, A., et al. (2024). AdaptiveSearch: Efficient Vector Search Under Distribution Shift. Information Retrieval Journal.
Williams, T., Singh, K., et al. (2024). Benchmarking Vector Compression: Beyond Reconstruction Error. VLDB 2024.
Liu, Q., Douze, M., & Jégou, H. (2023). Product Quantization for Vector Search with Large Language Model Features. Transactions on Machine Learning Research.
Dynamic Programming is a method for solving complex problems by breaking them down into simpler subproblems. It is particularly useful when a problem has overlapping subproblems and optimal substructure, meaning the optimal solution to the problem can be constructed from optimal solutions to its subproblems.
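A minimal illustration of both properties (a standard textbook example, added here for concreteness): Fibonacci with memoization, where each overlapping subproblem is solved once and the answer is assembled from optimal sub-answers.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # optimal substructure: fib(n) is built from fib(n-1) and fib(n-2)
    if n < 2:
        return n
    # overlapping subproblems: without the cache, fib(n-2) would be recomputed many times
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # returns instantly, versus exponential time without memoization
```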
While Linux is a great operating system, many applications are not available in its ecosystem, such as iTunes, OneNote, or Sony Digital Paper. One solution is using Wine, though I haven’t gotten every app to work smoothly; another is running VirtualBox within Linux, which brings the same user experience as the original apps but uses more computational resources. This tutorial covers setting up a macOS VirtualBox VM on Ubuntu (18.04); my guide follows this useful article.


Rereading James Gleick’s Chaos: Making a New Science, I found the famous, shocking Li-Yorke theorem. Let \(f: \mathbf{R} \rightarrow \mathbf{R}\) be a continuous function; if \(f\) has a period-3 point (i.e. \(f^3(x) = x\) while \(f(x), f^2(x) \neq x\)), then:
For every \(k = 1,2,...\) there is a periodic point having period \(k\).
There is an uncountable set \(S\) containing no periodic points, which satisfies: for every \(x, y \in S\) with \(x \neq y\), \(\limsup_{n \to \infty} |f^n(x) - f^n(y)| > 0\) and \(\liminf_{n \to \infty} |f^n(x) - f^n(y)| = 0\), and for every \(x \in S\) and every periodic point \(p\), \(\limsup_{n \to \infty} |f^n(x) - f^n(p)| > 0\).
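As a concrete illustration (my own example, not from the book): the logistic map \(f(x) = r\,x(1-x)\) enters a period-3 window just above \(r = 1 + \sqrt{8} \approx 3.8284\), so by the theorem it must have periodic points of every period there, together with an uncountable scrambled set:

\[ r \gtrsim 1 + \sqrt{8} \;\Longrightarrow\; \exists\, x:\; f^3(x) = x,\; f(x) \neq x,\; f^2(x) \neq x. \]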
Bring beautiful natural scenery to every new tab in Chrome! Bling livens up the plain default tab background with versatile Bing daily photos. Minimal permissions required.

Scenario: key mapping for specific apps on Mac.
For example, Windows and Mac use the Control and Command keys differently. This becomes annoying with Microsoft Remote Desktop on Mac, which doesn’t provide a self-contained working environment and often jumps to other Mac apps. Mapping the Mac Command key to the Control key for that app sorts out the problem.
My weekend Chrome extension project: Enlighten - a handy syntax highlighting tool based on highlight.js. Try it on the Chrome Web Store or check out the source code. Any feedback will be appreciated!

Scenario: machine A (@ipA) is behind a firewall; it can reach an outside machine B (@ipB), but not vice versa. We’d like to make B able to reach A.
Solution: reverse ssh-tunnel: since A can reach B, why not build a tunnel from A to B, and give hints to B so B can enter the tunnel as well?
```bash
# on A: forward B's port 1234 back to A's ssh port
ssh -R 1234:localhost:22 userB@ipB
# on B: connect to A through the tunnel
ssh userA@localhost -p 1234
```

To bring the tunnel up automatically at reboot, install autossh with `sudo apt-get install autossh`. You also need to create a new public/private key pair as root (`ssh-keygen`, destination `/root/.ssh/id_rsa`), then run:

```bash
autossh -M 12345 -o "PubkeyAuthentication=yes" -o "PasswordAuthentication=no" \
  -i /root/.ssh/id_rsa -R 1234:localhost:22 userB@ipB
```

Recently I was playing around with the powerful front-end automation testing tool Selenium; here are some examples I created to automate some simple routine work.
First, we need a testing browser whose driver is on the path; ChromeDriver (for Chrome) and geckodriver (for Firefox) are common choices. Typically you put the `chromedriver` executable in `/usr/local/bin/chromedriver` (or `chromedriver.exe` in `C:/Users/%USERNAME%/AppData/Local/Google/Chrome/Application/` on Windows); don’t forget to register this path or specify it explicitly when creating the driver.
```bash
# sudo apt-get install unzip
wget -N https://chromedriver.storage.googleapis.com/2.38/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
```
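With chromedriver on the path, a minimal Selenium session looks like this (a sketch of my own; the URL and selector are placeholders rather than one of the original examples):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                # locates chromedriver via the path set up above
driver.get("https://example.com")          # placeholder URL
print(driver.find_element(By.TAG_NAME, "h1").text)
driver.quit()
```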
Another routine task: scraping product info directly with requests and lxml.

```python
import requests
from lxml import html

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
xpath_product = '//h1//span[@id="productTitle"]//text()'
xpath_brand = '//div[@id="mbc"]/@data-brand'

def getBrandName(url):
    # fetch the product page and extract the brand attribute via XPath
    page = requests.get(url, headers=headers)
    parsed = html.fromstring(page.content)
    return parsed.xpath(xpath_brand)
```

And a toy pattern for fanning work out to several threads and collecting the results:

```python
from threading import Thread

def foo(dummy, results):
    results.append(dummy)

num_threads = 5
threads, results = [], []
for i in range(num_threads):
    process = Thread(target=foo, args=(i, results))
    process.start()
    threads.append(process)
for process in threads:
    process.join()
print(results)
```

Stories of love, loss and redemption
I started listening to WBUR/NPR’s Modern Love podcast when I was in NYC, and it has become one of my favorite podcasts. I often tear up hearing the stories of love, pain, struggle, death, and youth (while driving or cooking). Well done, authors, host Meghna Chakrabarti, and the New York Times!

Once you learn how to die, you learn how to live.
So many people walk around with a meaningless life. They seem half-asleep, even when they’re busy doing things they think are important. This is because they’re chasing the wrong things. The way you get meaning into your life is to devote yourself to loving others, devote yourself to your community around you, and devote yourself to creating something that gives you purpose and meaning.
The most important thing in life is to learn how to give out love, and to let it come in.
Notes on Robert C. Martin - Clean Architecture: A Craftsman’s Guide to Software Structure and Design, chapter 1-3.
The goal of software architecture is to minimize the human resources required to build and maintain the required system.
Two perspectives of software: behavior and structure (architecture).
Inspired by the New York Times article “Is the Answer to Phone Addiction a Worse Phone?”, I wrote a Chrome extension (source code) to turn some of my favorite websites into grayscale:


Looks promising! Enjoy new life without internet addiction!

A book about Google culture, management, and the authors’ success stories.
* NLP = Ambiguity Processing
- Lexical Ambiguity: dog (noun vs. verb; animal vs. detestable person), resolved by context.
- Structural Ambiguity
- Semantic Ambiguity
- Pragmatic Ambiguity

This is a demo of an email thread generator that sends from a Gmail account to Gmail and Outlook recipients.
Prerequisite:
This is a simple example demonstrating how to run a Python script with different inputs in parallel and merge the results. One application is computing aggregation statistics for different dates through computationally intensive queries, then merging across all dates.
```python
import glob
import os
from multiprocessing import Process

import pandas as pd

def get_days(start, num_of_days):
    '''Generate a list of dates from the starting date
    to the starting date + num_of_days.'''
    date_range = pd.date_range(start, periods=num_of_days, freq='1D')
    return [dt.strftime("%Y-%m-%d") for dt in date_range]

# run foo.py once per date in a separate process
f = lambda x: os.system("python foo.py --date %s" % x)

start, num_of_days = "2024-01-01", 7   # example inputs
children = []
for date in get_days(start, num_of_days):
    p = Process(target=f, args=(date,))
    p.start()
    children.append(p)
for x in children:
    x.join()

# merge results (foo.py is assumed to write one CSV per date)
all_df = (pd.read_csv(filename) for filename in glob.glob("*.csv"))
merged = pd.concat(all_df, ignore_index=True)
```
This is an application of A Neural Algorithm of Artistic Style by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, run on a GPU (GTX-980).
The raw photo was shot at one of my favorite places - St Paul’s Cathedral.

The first style is a Baroque painting, Jan Brueghel the Elder’s The Entry of the Animals Into Noah’s Ark.

Congratulations to our #NCDataJam winners NC Food Inspector!!! pic.twitter.com/DovA13G3IM
— NC DataPalooza (@NCDatapalooza) September 24, 2016
Installing Spark is straightforward; here is a quick guide to Spark installation on Mac and Ubuntu.
Download Spark 2.0 from the official website
Extract the contents:
```bash
cat /Users/<yourname>/spark.tgz | tar -xz -C /Users/<yourname>/
```
Create a soft link
```bash
cd /Users/<yourname>/
ln -s spark-* spark
```
Add shortcuts to your .bash_profile:
```bash
export SPARK_HOME=/Users/<yourname>/spark
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
```
Source your .bash_profile and run pyspark to check the installation.

BONUS:
To have an environment similar to ipython / ipython notebook, I added these aliases to my .bash_profile:
```bash
alias ipyspark='$SPARK_HOME/bin/pyspark --packages com.databricks:spark-csv_2.10:1.4.0'
alias ipynbspark='PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" $SPARK_HOME/bin/pyspark --driver-memory 15g'
```

This is a tutorial on how to set up and connect to an Ubuntu server (15.10) from a Mac (OS X).
Having worked with Python for several years, I’d like to share some useful tricks and tools:
Usage: several useful tools on top of ipython notebook
First, install the extension:
```bash
git clone https://github.com/ipython-contrib/IPython-notebook-extensions.git
cd IPython-notebook-extensions
python setup.py install
```
then go to http://localhost:8888/nbextension/ and enable the extensions you’d like to use:
Personally, I like the scratchpad very much: pressing Ctrl+B pops up a scratchpad, which is a good place to check current variables, make a quick plot, or run a few lines of code without inserting a cell and deleting it after use. A demo looks like this:

This blog, based on Jekyll, is the fourth website I have built recently, and I learned something new with each attempt. The blog aims to share my thoughts on (but not limited to) technology and to provide a place for discussion.
My team Tiba won both the mHealth prize and the Grand Prize at the Triangle Health Innovation Challenge.
The Grand Prize goes to Tiba's physical therapy exercise tracker (also the winner of the @validic prize!) #thinc pic.twitter.com/seb5KRJ39c
— THInC (@THINCweekend) September 20, 2015
The @validic mHealth prize goes to Tiba's physical therapy exercise tracker #thinc pic.twitter.com/hFOkgo1lVM
— THInC (@THINCweekend) September 20, 2015