Latest Open Source Projects
chonkie
TLDR: Chonkie is a lightweight, fast chunking library for RAG pipelines that ships several chunkers. Its default install is minimal, and it supports multiple tokenizers. The project reports a smaller install size and faster chunking than alternative libraries.
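To make the idea of chunking for RAG concrete, here is a minimal sketch of fixed-size token chunking with overlap. It is not Chonkie's API: the function name `chunk_tokens`, its parameters, and the whitespace tokenizer are assumptions made for illustration only, and a real pipeline would use a model tokenizer.

```python
def chunk_tokens(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace tokens.

    Hypothetical helper for illustration; neighbouring chunks share
    `overlap` tokens so context is not lost at chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    tokens = text.split()  # assumption: whitespace stands in for a real tokenizer
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Calling `chunk_tokens("a b c d e f g h i j", chunk_size=4, overlap=1)` yields three chunks, each starting with the last token of the previous one.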
browser-use
TLDR: browser-use provides an easy way to connect AI agents to the browser. Features include vision and HTML extraction, multi-tab management, custom actions, and running agents in parallel. It also collects anonymous usage data to guide improvements.
open-computer-use
TLDR: A secure cloud Linux computer powered by E2B Desktop Sandbox and controlled by open-source LLMs. Supports various LLMs such as Meta Llama and OS-Atlas. The LLM operates the computer via keyboard, mouse, and shell commands. New LLMs can be added easily if they adhere to the OpenAI API specification.
BetterWhisperX
TLDR: A fork of WhisperX with improvements. Provides fast automatic speech recognition with word-level timestamps and speaker diarization. Includes features like batched inference, accurate timestamps using wav2vec2 alignment, and VAD preprocessing.
computer_use_ootb
TLDR: Computer Use OOTB is an out-of-the-box desktop GUI agent solution that works with both API-based and locally running models. It supports Windows and macOS, requires no Docker, and offers a user-friendly Gradio interface. Major updates have added local runs, more examples, and support for multiple displays, among others. To get started, users install the prerequisites, clone the repository, install dependencies, and set API keys, then start the interface for remote control. The repository also covers advanced settings for the ShowUI model and a roadmap for further improvement.
text-extract-api
TLDR: A tool for converting images, PDFs, and Office documents to Markdown or JSON with high accuracy. Built with FastAPI, uses Celery for asynchronous tasks and Redis for caching. Supports various OCR strategies and can remove PII. Comes with a CLI tool and has different storage strategies. Also has an online demo and dedicated API clients.
Sana
TLDR: Sana is a text-to-image framework that efficiently generates images at resolutions up to 4096×4096. Its design includes DC-AE, Linear DiT, a decoder-only text encoder, and efficient training and sampling. Sana is competitive with giant diffusion models while being smaller, faster, and deployable on a laptop GPU.
F5-TTS
TLDR: F5-TTS is a text-to-speech repository built on a Diffusion Transformer with ConvNeXt V2 for faster training and inference. It covers several installation methods, inference options such as a Gradio app and a CLI, and training through a Gradio web interface. It also has an evaluation section and acknowledges multiple works. The code is released under the MIT License, while the pre-trained models are under a CC-BY-NC license.
cookiecutter-uv
TLDR: A modern cookiecutter template for Python projects that use uv for dependency management.
Qwen2-VL
TLDR: Qwen2-VL is a vision-language model that understands images and videos of various resolutions and aspect ratios and supports multilingual text in images. The models are open-sourced under different licenses, and the repository provides usage examples and benchmarks. It also supports quantization methods, lists known limitations as areas for further improvement, and provides deployment options and a web UI demo.