An open-source NLP research library, built on PyTorch.
A toolkit for linearizing PDFs for LLM datasets and training.