Comprehensive guide to modern data serialization formats for analytics and machine learning: Parquet, Apache Arrow, Lance, Zarr, Avro, and ORC. Learn compression tradeoffs, partitioning strategies, and when to use each format.
| Format | Type | Best For | Compression | Schema Evolution | Random Access |
| Parquet | Columnar | Analytics, data lakes | ✅ (Snappy, Zstd, LZ4) | ✅ (add/drop) | ✅ (row groups) | | Arrow/Feather | Columnar | In-memory, IPC, ML | ✅ (LZ4, Zstd) | Limited | ✅ (record batches) | | Lance | Columnar | ML pipelines, vectors | ✅ (Zstd, LZ4) | ✅ | ✅ (multi-modal) |
Formats de sérialisation de données modernes : Parquet, Apache Arrow (Feather/IPC), Lance (ML-native), Zarr (tableaux fragmentés), Avro et ORC. Couvre la compression, le partitionnement et la sélection du format. Source : legout/data-platform-agent-skills.