VILA is a visual language model (VLM) pre-trained on interleaved image-text data at scale, enabling multi-image reasoning. VILA is deployable on the edge.
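For intuition, here is a minimal sketch of how interleaved image-text data can be packed into a single training sequence. This is an illustration, not VILA's code; the projector, vocabulary size, and dimensions are assumptions:

```python
# Illustrative sketch (assumed names/dims, not VILA's implementation):
# image patch features are projected into the LLM's embedding space and
# spliced between text embeddings, forming one interleaved sequence.
import torch
import torch.nn as nn

embed_dim = 4096
text_embed = nn.Embedding(32000, embed_dim)   # LLM token embeddings (assumed vocab)
project = nn.Linear(1024, embed_dim)          # vision features -> LLM space (assumed)

def pack_interleaved(segments):
    """segments: list of ('text', LongTensor[ids]) or ('image', Tensor[patches, 1024])."""
    parts = []
    for kind, x in segments:
        parts.append(text_embed(x) if kind == "text" else project(x))
    return torch.cat(parts, dim=0)            # one sequence the LLM attends over

seq = pack_interleaved([
    ("text",  torch.randint(0, 32000, (12,))),
    ("image", torch.randn(256, 1024)),        # first image's patch features
    ("text",  torch.randint(0, 32000, (20,))),
    ("image", torch.randn(256, 1024)),        # a second image in the same context
])
print(seq.shape)  # [12 + 256 + 20 + 256, 4096]
```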
Low-bit weight-only quantization for LLMs: compressing model weights to a few bits reduces memory footprint and accelerates inference.
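As a rough illustration of weight-only quantization, here is a generic group-wise 4-bit scheme (a minimal sketch, not necessarily this project's exact algorithm): each group of weights along the input dimension gets its own scale and zero point.

```python
# Minimal sketch of group-wise 4-bit weight-only quantization (illustrative).
import torch

def quantize_weight_int4(w: torch.Tensor, group_size: int = 128):
    """Quantize a [out, in] weight matrix to 4-bit integers, group-wise."""
    out_dim, in_dim = w.shape
    assert in_dim % group_size == 0
    g = w.reshape(out_dim, in_dim // group_size, group_size)
    w_min = g.min(dim=-1, keepdim=True).values
    w_max = g.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0   # 4 bits -> 16 levels
    zero = (-w_min / scale).round()
    q = (g / scale + zero).round().clamp(0, 15).to(torch.uint8)
    return q, scale, zero

def dequantize_int4(q, scale, zero, shape):
    return ((q.float() - zero) * scale).reshape(shape)

w = torch.randn(4096, 4096)
q, s, z = quantize_weight_int4(w)
w_hat = dequantize_int4(q, s, z, w.shape)
print((w - w_hat).abs().mean())  # small reconstruction error
```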
We enable LLMs to work on infinite-length text streams without compromising efficiency or performance.
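One way such streaming inference can be realized is with a bounded KV cache; the sketch below assumes the cache keeps a few initial "sink" tokens plus a sliding window of recent tokens, so memory stays constant however long the stream grows (an illustration of the idea, not the project's code):

```python
# Minimal sketch of a bounded KV cache for streaming generation (assumed
# mechanism: retain a few initial "sink" tokens + a recent-token window).
import torch

class StreamingKVCache:
    def __init__(self, num_sink: int = 4, window: int = 1024):
        self.num_sink = num_sink
        self.window = window
        self.k = None  # [batch, heads, seq, dim]
        self.v = None

    def append(self, k_new, v_new):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        # Evict the middle tokens once the cache exceeds its budget.
        if self.k.size(2) > self.num_sink + self.window:
            self.k = torch.cat([self.k[:, :, :self.num_sink],
                                self.k[:, :, -self.window:]], dim=2)
            self.v = torch.cat([self.v[:, :, :self.num_sink],
                                self.v[:, :, -self.window:]], dim=2)
        return self.k, self.v
```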
LongLoRA takes advantage of shifted sparse attention to greatly reduce the fine-tuning cost of long-context LLMs.
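A minimal sketch of the shifted-sparse-attention idea: tokens attend only within fixed-size groups, and half the heads operate on a sequence shifted by half a group so information crosses group boundaries. Shapes, names, and the omitted causal mask are simplifications for illustration, not LongLoRA's actual implementation:

```python
# Illustrative sketch of shifted sparse attention (assumes an even head count
# and seq divisible by the group size; causal masking omitted for brevity).
import torch
import torch.nn.functional as F

def shifted_sparse_attention(q, k, v, group: int):
    """q, k, v: [batch, heads, seq, dim]."""
    b, h, s, d = q.shape
    half = h // 2

    def roll(x, shift):
        # Shift only the second half of the heads along the sequence axis.
        return torch.cat([x[:, :half],
                          x[:, half:].roll(shifts=shift, dims=2)], dim=1)

    q, k, v = roll(q, -group // 2), roll(k, -group // 2), roll(v, -group // 2)

    def fold(x):
        # Fold groups into the batch dim: attention stays local to each group.
        return x.reshape(b, h, s // group, group, d).transpose(1, 2) \
                .reshape(b * (s // group), h, group, d)

    out = F.scaled_dot_product_attention(fold(q), fold(k), fold(v))
    out = out.reshape(b, s // group, h, group, d).transpose(1, 2) \
             .reshape(b, h, s, d)
    return roll(out, group // 2)  # undo the shift for the shifted heads
```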
This project introduces PockEngine: a tiny, sparse, and efficient engine for fine-tuning on various edge devices. PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model, yielding measured memory savings and latency reductions while maintaining model quality.
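A minimal PyTorch sketch of the sparse-update idea: gradients are computed and applied only for a chosen subset of parameters, so weight-gradient computation is skipped for frozen layers and updates are sparse. The layer-selection policy below is a placeholder, not PockEngine's actual one:

```python
# Illustrative sparse backpropagation: freeze most parameters, train a subset.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 10))

# Freeze everything, then re-enable only the last layer and all biases
# (a placeholder policy for illustration).
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True
for name, p in model.named_parameters():
    if name.endswith("bias"):
        p.requires_grad = True

# The optimizer only sees trainable parameters -> sparse updates.
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-2)

x, y = torch.randn(32, 512), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()  # no weight gradients are computed for the frozen layers
opt.step()
```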
Running large language models (LLMs) on the edge is of great importance. In this blog, we introduce TinyChat, an efficient and lightweight system for LLM deployment on the edge. It runs Meta's latest LLaMA-2 model at 30 tokens/second on the NVIDIA Jetson Orin and easily supports different models and hardware.