Git LFS: Managing Large Files in Version Control

Version control systems excel at tracking text files. Developers commit code, review diffs, and merge changes seamlessly. But introduce large binary files—machine learning model weights, video assets, compiled binaries—and Git grinds to a halt. Repositories balloon to gigabytes. Clones take hours. Simple operations timeout. Traditional Git stores every version of every file in the repository history. A 100MB file modified ten times consumes 1GB of repository space. Every developer cloning the repository downloads all versions, even if they only need the latest. The distributed nature that makes Git powerful for code becomes a liability for large files. Git Large File Storage (LFS) addresses this problem by replacing large files with small pointer files in the repository. The actual file content lives on a separate server. Developers download only the versions they need. The repository stays small and fast. This approach sounds ideal, but Git LFS introduces complexity, infrastructure requirements, and new failure modes. Understanding when LFS adds value—and when simpler approaches suffice—determines whether it solves problems or creates them. This article explores the technical challenges of large files in Git, examines how Git LFS works, provides guidance on when to adopt it, and offers alternatives for different scenarios.

The Large File Problem

Git’s architecture creates fundamental issues with large binary files.

How Git Stores Files

Git’s storage model optimizes for text:

📝Git Storage Architecture

Object Storage

Every file version is a blob object
Stored in .git/objects/ directory
Compressed but complete copies
Delta compression for similar files Repository Growth
Each commit adds new blobs
History contains all versions
Clone downloads entire history
No way to fetch partial history Text vs Binary
Text: Delta compression works well
Binary: Compression often ineffective
Small text changes: Small deltas
Small binary changes: Full new copy When you commit a 10KB source file, Git stores it efficiently. Modify one line, and Git stores only the difference. But binary files rarely compress well. A 500MB machine learning model modified slightly still requires storing another 500MB.

Real-World Impact

Large files create concrete problems:

❌Repository Bloat

Scenario: ML Model Training Data science team commits model weights after each training run:

Initial model: 500MB
After 20 training iterations: 20 versions
Repository size: 10GB
Clone time: 45 minutes on fast connection Impact:
New team members wait hours to start
CI/CD pipelines timeout
Git operations become slow
Developers avoid pulling updates Cost:
Lost productivity: 2 hours per developer per week
Infrastructure: Larger storage, more bandwidth
Frustration: “Git is broken”

❌Network Bottlenecks

Scenario: Game Asset Development Game studio tracks 3D models and textures in Git:

100 high-resolution textures: 50MB each
50 3D models: 20MB each
6 months of history
Repository size: 15GB Impact:
Remote developers on slow connections can’t work
Push/pull operations take 30+ minutes
Merge conflicts in binary files unresolvable
Team considers abandoning Git Cost:
Remote work becomes impossible
Collaboration breaks down
Version control benefits lost

❌Storage Costs

Scenario: Video Production Video team commits raw footage for version control:

4K video clips: 1GB per minute
100 clips over project lifetime
Multiple versions per clip
Repository size: 500GB Impact:
GitHub/GitLab storage limits exceeded
Self-hosted servers need expensive storage
Backups become expensive and slow
Repository becomes unmaintainable Cost:
Storage: $500/month for cloud hosting
Backup: $200/month
Developer time: 10 hours/month managing issues
Total: $1,200/month for one repository

Git LFS Architecture

Git LFS replaces large files with pointers while storing actual content separately.

How LFS Works

The core mechanism is pointer substitution:

📝LFS Pointer System

Pointer File version https://git-lfs.github.com/spec/v1 oid sha256

size 133742 What Happens

Developer commits large file
LFS uploads file to LFS server
Git stores small pointer file (130 bytes)
Repository stays small On Checkout
Git checks out pointer file
LFS detects pointer
LFS downloads actual file from server
Replaces pointer with real file Benefits

Repository contains only pointers
Clone downloads only current version
History stays lightweight
Large files stored efficiently The pointer file is tiny—around 130 bytes regardless of actual file size. A 5GB model weight becomes a 130-byte pointer in Git history. The repository stays fast.

LFS Server Architecture

LFS requires additional infrastructure:

🏗️LFS Infrastructure

Components

Git repository: Stores pointers
LFS server: Manages large files
Object storage: Stores actual content
Authentication: Controls access Hosting Options
GitHub: 1GB free, paid plans available
GitLab: 10GB free per repository
Bitbucket: 1GB free, paid plans
Self-hosted: Full control, more complexity Requirements
Separate storage from Git
Network bandwidth for uploads/downloads
Authentication integration
Backup strategy Unlike regular Git, which is fully distributed, LFS introduces a centralized component. The LFS server becomes a critical dependency. If it’s down, developers can’t access large files.

When to Use Git LFS

LFS solves specific problems but isn’t always the right choice.

Good LFS Candidates

LFS works well for certain file types:

✅Binary Assets in Active Development

Game studios and design teams work with binary assets that change frequently during development. A 3D character model might go through dozens of iterations as artists refine proportions, textures, and animations. Design files for marketing materials evolve as stakeholders provide feedback. Audio clips get adjusted for timing and mixing. These files are too large for regular Git—a high-resolution texture might be 50MB, a character model 30MB, a Photoshop composition 100MB. Without LFS, the repository would balloon to gigabytes after a few weeks of development. But these assets need version control. Artists need to roll back changes, compare versions, and collaborate without overwriting each other’s work. LFS solves this perfectly. The repository stays small—under 100MB even with hundreds of assets. Artists commit directly without worrying about repository size. Version history is preserved. When conflicts occur, they’re visible in the Git workflow. The team gets all the benefits of version control without the performance penalty. Example: Game Development

Character models: 50MB each
Texture files: 20MB each
Audio clips: 10MB each
Total: 500 files, 15GB of assets
Repository size with LFS: 80MB

✅Machine Learning Model Checkpoints

Data scientists training models need to track experiments. A model trained with different hyperparameters produces different weights. Comparing these versions requires keeping multiple checkpoints. Without version control, teams resort to manual naming schemes—model_v1.bin, model_v2_final.bin, model_v2_final_actually_final.bin—that quickly become unmaintainable. Model weights typically range from 100MB to 4GB. These files are too large for regular Git but perfect for LFS. The key benefit is linking code to models. When you check out a specific commit, you get both the training code and the model weights it produced. This enables true reproducibility—you can verify that a particular model came from specific code and hyperparameters. LFS works well for models up to about 4GB—the size limit chosen to fit most file systems. Beyond that, specialized tools like DVC or Weights & Biases provide better workflows. But for small to medium models, LFS offers the simplest path to version control. Example: Deep Learning Project

Model checkpoints: 200MB - 4GB each
10 experiments, 5 checkpoints each
Total: 50 files, 10GB
Repository size with LFS: 120MB
Benefit: Code and models stay synchronized

✅Documentation Assets

Technical documentation often includes binary assets—video tutorials, architecture diagrams in proprietary formats, PDF exports. These assets should version alongside the code they document. When code changes, documentation updates. Keeping them in sync prevents the common problem of outdated documentation. Documentation assets change less frequently than code, making them ideal for LFS. A video tutorial might be recorded once and updated quarterly. Architecture diagrams evolve with major releases. The moderate file sizes—typically 10MB to 200MB—and infrequent updates mean LFS storage costs stay low. The alternative is storing documentation separately, but this breaks the connection between code and docs. With LFS, checking out a release tag gives you both the code and the documentation that describes it. Writers can commit directly to the repository. The team maintains a single source of truth. Example: Product Documentation

Video tutorials: 100MB each
Diagram sources (Visio, Sketch): 10MB each
PDF exports: 5MB each
Total: 50 files, 2GB
Repository size with LFS: 90MB

When LFS Is Wrong

Many scenarios don’t benefit from LFS:

⚠️Truly Large Files (> 4GB)

LFS doesn’t support partial file downloads. When you check out a file, you download the entire thing. This makes LFS impractical for files larger than about 4GB—the size limit chosen to fit most file systems including FAT32. A 50GB raw video file takes hours to download on typical connections. A 100GB dataset is simply too large for the LFS workflow. Even if your network can handle it, the LFS server storage costs become prohibitive. Ten versions of a 50GB file consume 500GB of LFS storage. For these files, external storage with references works better. Store the file in S3 or similar object storage. Commit a small metadata file to Git with the storage location and checksum. Download the large file only when needed. This approach supports any file size, enables partial downloads, and costs less at scale. Example: Video Production

4K raw footage: 50GB per file
20 clips over project lifetime
Total: 1TB
LFS cost: Prohibitive
Better: S3 with manifest in Git

⚠️Build Artifacts

Compiled binaries, packaged applications, and other build outputs shouldn’t be in version control at all. These are generated files—outputs of the build process, not source inputs. Version control is for sources. Committing build artifacts creates problems. The repository grows with every build. Developers download artifacts they don’t need. The history fills with noise. When you need a specific build, you can’t tell which source code produced it. Artifact repositories like Artifactory or Nexus solve this properly. They store build outputs with metadata linking them to source commits. You can retrieve any build and trace it back to exact source code. Storage is optimized for binaries. Old artifacts can be automatically cleaned up. This is the right tool for the job. Example: Application Releases

Compiled binary: 200MB
Daily builds: 365 per year
Total: 73GB per year
Wrong: LFS or Git
Right: Artifactory with Git tags

⚠️Frequently Changing Large Files

LFS storage grows with every version. A 1GB file modified daily creates 365GB of LFS storage per year. Database dumps, log files, and cache files that change frequently become expensive to store and provide little value. These files don’t benefit from version control. You rarely need to compare yesterday’s database dump to today’s. Log files are better analyzed with log management tools. Cache files are temporary by nature. Tracking their history wastes storage and provides no benefit. The solution is simple: don’t version these files. Add them to .gitignore. Store them locally or in appropriate systems—databases for data, log aggregators for logs, temporary storage for caches. Version control is for files where history matters. Example: Development Database

Database dump: 2GB
Updated daily during development
30 days: 60GB of LFS storage
Value: Minimal (only need latest)
Better: Local file, regenerate as needed

⚠️No LFS Server Available

LFS requires infrastructure beyond Git. Some corporate networks block LFS endpoints. Some Git hosting providers don’t support LFS. Self-hosting requires maintaining an LFS server and object storage. Without LFS infrastructure, you can’t push or pull large files. The repository becomes unusable for team members who need those files. This infrastructure dependency is a real limitation—unlike regular Git, which is fully distributed, LFS introduces a centralized component that must be available. If LFS infrastructure isn’t available or reliable, use alternative approaches. External storage with references works without special infrastructure. Specialized tools like DVC can use any S3-compatible storage. Sometimes the simplest solution is keeping large files out of version control entirely.

Practical Implementation

Using LFS effectively requires understanding its workflow and limitations.

Setting Up Git LFS

Basic setup is straightforward:

📝LFS Setup Steps

Installation Install LFS: git lfs install Track file types: git lfs track “.psd” git lfs track “.bin” git lfs track “models/*.h5” Commit tracking configuration: git add .gitattributes git commit -m “Configure LFS tracking” What Gets Created .gitattributes file: *.psd filter=lfs diff=lfs merge=lfs -text .bin filter=lfs diff=lfs merge=lfs -text models/.h5 filter=lfs diff=lfs merge=lfs -text Using LFS Add and commit as normal: git add model.bin git commit -m “Add trained model” Push sends to both Git and LFS: git push origin main Clone automatically fetches LFS files: git clone https://github.com/user/repo.git The .gitattributes file tells Git which files to handle with LFS. Once configured, LFS works transparently for most operations.

Common Workflows

Different scenarios require different approaches:

📝LFS Workflows

Selective Checkout Clone without downloading LFS files: GIT_LFS_SKIP_SMUDGE=1 git clone repo.git Download specific files later: git lfs pull —include=“models/production/” Pruning Old Versions Remove old LFS files from local cache: git lfs prune Keep only recent versions: git lfs prune —verify-remote —recent Migrating Existing Files Convert existing files to LFS: git lfs migrate import —include=“.psd” Rewrite history (careful!): git lfs migrate import —include=“*.bin” —everything Checking LFS Status See which files are tracked: git lfs ls-files Check LFS storage usage: git lfs env

Troubleshooting Common Issues

LFS introduces new failure modes:

⚠️Common LFS Problems

“This exceeds GitHub’s file size limit”

Cause: File committed without LFS tracking
Solution: Configure .gitattributes before committing
Prevention: Set up LFS tracking early “Error downloading object”
Cause: LFS server unreachable or file missing
Solution: Check network, verify LFS server status
Workaround: Skip LFS files temporarily “Encountered X file(s) that should have been pointers”
Cause: Files committed before LFS was configured
Solution: Use git lfs migrate to fix history
Prevention: Configure LFS before first commit Slow Clone/Pull
Cause: Downloading many large LFS files
Solution: Use GIT_LFS_SKIP_SMUDGE=1 for selective download
Alternative: Fetch only needed files

Alternatives to Git LFS

Many scenarios have better solutions than LFS.

External Storage with References

For truly large files, store references instead:

💡Reference-Based Approach

Architecture

Store files in S3, GCS, or similar
Commit metadata and references in Git
Download files as needed
Version through object storage Example Structure repo/ ├── models/ │ ├── config.yaml # In Git │ └── download.sh # In Git └── data/ ├── manifest.json # In Git └── fetch_data.py # In Git Benefits
No LFS infrastructure needed
Supports any file size
Flexible storage options
Lower costs at scale
Partial downloads possible This approach works well for datasets, large models, and video files. The repository stays small and fast. Storage costs are lower. Teams have more flexibility.

Specialized Tools

Different domains have purpose-built solutions:

🛠️Domain-Specific Tools

Machine Learning

DVC (Data Version Control): Git-like for data/models
Weights & Biases: Experiment tracking
MLflow: Model registry
Hugging Face: Model hosting Game Development
Perforce: Designed for large binary files
Plastic SCM: Handles large assets well
Unity Collaborate: Built for Unity projects Media Production
Frame.io: Video collaboration
Dropbox: Simple file sync
Resilio Sync: P2P file sync Build Artifacts
Artifactory: Universal artifact repository
Nexus: Maven/npm/Docker registry
Docker Hub: Container images These tools solve specific problems better than general-purpose version control. They understand domain requirements and optimize accordingly.

Conclusion

Git LFS solves real problems for teams working with binary assets that need version control. It keeps repositories fast while preserving history for files that would otherwise bloat Git. But LFS isn’t a universal solution. It requires infrastructure, adds complexity, and has size limitations. For files larger than 4GB, specialized storage with metadata references works better. For build artifacts, dedicated artifact repositories are more appropriate. For massive datasets, tools like DVC provide better workflows. The key is matching the tool to the problem. Use LFS for binary assets in active development—3D models, design files, small ML models. Use external storage for large static files. Use specialized tools for domain-specific needs. Use nothing for files that don’t need version control. Git LFS is powerful when applied correctly. Understanding its strengths and limitations ensures it solves problems rather than creating them.

💡Decision Framework

Use Git LFS when:

Binary files need version history
Files are 10MB - 2GB
Team collaborates on assets
LFS infrastructure available Use external storage when:
Files exceed 4GB
Don’t need detailed history
Partial downloads needed
Very frequent updates Use specialized tools when:
Domain-specific requirements
Advanced features needed
Team already uses them
Better workflow fit Use nothing when:
Files are generated artifacts
Temporary or cache files
Can be recreated easily
No collaboration needed

Comments

Please accept the "Functionality" cookie category to view and post comments.