Magika, developed by Google, is an AI-powered file type detection tool built on deep learning. It uses a compact, optimized model (a few MB) that identifies a file's type within milliseconds, even on a single CPU. Trained on ~100 million samples across 200+ content types, Magika achieves ~99% average accuracy and performs especially well on textual formats. Google uses it at scale to route Gmail, Drive, and Safe Browsing files to the appropriate security and content policy scanners, and it is also integrated with industry platforms such as VirusTotal. Magika is available as a command-line tool, a Python API, and JavaScript and Go bindings.
This is the official repository for the book 'Build a Large Language Model (From Scratch)' by Sebastian Raschka. It provides a step-by-step guide to developing, pretraining, and finetuning a GPT-like LLM from the ground up. Because it uses plain PyTorch rather than high-level LLM libraries, it exposes the inner workings of components such as attention mechanisms and transformer blocks. The project includes end-to-end pipelines for pretraining on unlabeled data and for finetuning to follow instructions, all designed to run on standard consumer hardware.
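
As a rough illustration of the kind of component the book builds, the sketch below implements causal scaled dot-product attention, the core of a GPT-style attention layer. It is not taken from the repository: it is written in plain Python rather than PyTorch so that it runs without dependencies, it omits the learned query, key, and value projection matrices, and the names `softmax` and `causal_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Scaled dot-product attention in which position i sees only positions 0..i.

    Each argument is a list of equal-length float vectors, one per token.
    Returns (context_vectors, attention_weights).
    """
    d_k = len(keys[0])
    num_tokens = len(keys)
    context, all_weights = [], []
    for i, q in enumerate(queries):
        # Causal mask: score the query only against keys at positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Future positions get zero weight.
        all_weights.append(weights + [0.0] * (num_tokens - i - 1))
        # The context vector is the weighted sum of the visible value vectors.
        context.append([
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ])
    return context, all_weights


# Toy input: three tokens with two-dimensional embeddings (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = causal_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of the printed weight matrix sums to 1, and every entry above the diagonal is zero, which is what keeps a token from attending to later tokens during next-token prediction. A PyTorch version would typically compute all rows at once with matrix multiplication and a triangular mask instead of Python loops.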