This article covers best practices for organizing projects, cleaning up storage, and managing large files in JupyterLab.
(1) Organizing Folders and Files in JupyterLab
A structured folder system improves efficiency and makes projects easier to navigate.
Recommended Folder Structure
/my_project
├── data/        # Raw and processed datasets
├── notebooks/   # Jupyter notebooks (.ipynb)
├── scripts/     # Python scripts (.py)
├── output/      # Exported reports, images, models
├── docs/        # Documentation files
├── env/         # Virtual environment (optional)
└── README.md    # Project overview
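If you are starting from scratch, the same layout can be created with one shell command (a minimal sketch using bash brace expansion; the folder names simply mirror the tree above):
mkdir -p my_project/{data,notebooks,scripts,output,docs}
touch my_project/README.md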
Move Notebooks into a Dedicated Folder
mv *.ipynb notebooks/
Moves all Jupyter notebooks in the current directory into the notebooks/ directory.
Rename Files for Better Clarity
mv Untitled.ipynb data_analysis.ipynb
Renames the default Untitled notebook to something meaningful.
Create Subfolders for Large Datasets
If you work with multiple datasets, categorize them into subfolders:
mkdir data/raw data/processed data/models
This structure separates raw, cleaned, and model-related data.
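As a small sketch of how this separation is used in practice (the file and column names below are only placeholders):
import pandas as pd

raw = pd.read_csv("data/raw/sales.csv")                        # untouched source data
cleaned = raw.dropna()                                         # example cleaning step
cleaned.to_csv("data/processed/sales_clean.csv", index=False)  # cleaned copy lives in its own subfolder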
(2) Checking Disk Usage in JupyterLab
Find Large Files
!du -sh *
Displays the size of all files and folders in the current directory.
Note: the leading "!" is needed when you run Linux shell commands inside a JupyterLab notebook cell.
List Top 10 Largest Files
!find . -type f -exec du -h {} + | sort -rh | head -n 10
Helps identify space-consuming files.
Check Available Disk Space
!df -h
Shows overall disk usage and free space.
Other useful commands:
# list all files in a folder with human-readable sizes
!ls -lah /home/shared/
# sorted by size (largest last)
!ls -lahSr /home/shared
# check the first 100 lines of a file
!head -n 100 /home/shared/data.csv
# check the last 100 lines of a file
!tail -n 100 /home/shared/data.csv
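If you prefer to stay in Python, a small pathlib sketch gives a similar overview of the largest files (the /home/shared path is just the example folder used above):
from pathlib import Path

folder = Path("/home/shared")
# collect (path, size) pairs for every file under the folder
files = [(f, f.stat().st_size) for f in folder.rglob("*") if f.is_file()]
# print the 10 largest files in MB
for f, size in sorted(files, key=lambda x: x[1], reverse=True)[:10]:
    print(f"{size / 1e6:10.1f} MB  {f}")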
(3) Deleting Unnecessary Files
Delete All .ipynb_checkpoints Folders
JupyterLab automatically creates checkpoint files, which can take up space. To remove them:
!find . -type d -name ".ipynb_checkpoints" -exec rm -rf {} +
Clear Large Unused Files
First, find the big files:
# files larger than 1MB
!find . -type f -size +1M
# list the top 5 files by size
!ls -S /path/to/folder | head -5
Then delete files larger than 100MB:
!find . -type f -size +100M -exec rm -i {} +
The -i flag prompts for confirmation before each deletion.
Clear Unused Output in Notebooks
from IPython.display import clear_output
clear_output()
Calling clear_output() inside a cell clears that cell's displayed output; save the notebook afterwards so the cleared output is also removed from the .ipynb file on disk and stops taking up space.
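To strip outputs from notebooks that are already saved on disk, nbconvert can clear them from the command line (the notebooks/*.ipynb path is just an example):
!jupyter nbconvert --clear-output --inplace notebooks/*.ipynb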
(4) Managing Large Files Efficiently
Use Compressed Formats for Data Storage
Instead of saving raw .csv files, store them in compressed formats:
import pandas as pd
df = pd.read_csv("data.csv")
df.to_csv("data.csv.gz", index=False, compression="gzip") # Save compressed
This significantly reduces storage space while keeping data accessible.
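Reading the compressed file back works the same way, since pandas infers gzip compression from the .gz extension:
df = pd.read_csv("data.csv.gz")  # compression is inferred from the file extension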
Load Large Files in Chunks
If a dataset is too large to fit in memory:
chunk_size = 100000
for chunk in pd.read_csv("data.csv", chunksize=chunk_size):
process(chunk) # Replace with actual processing logic
This avoids memory overload issues.
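A common pattern is to reduce each chunk to something small and combine the results at the end; here is a minimal sketch that keeps only rows matching a condition (the "value" column is purely illustrative):
import pandas as pd

chunk_size = 100000
filtered_parts = []
for chunk in pd.read_csv("data.csv", chunksize=chunk_size):
    filtered_parts.append(chunk[chunk["value"] > 0])  # keep only the rows you need
result = pd.concat(filtered_parts, ignore_index=True)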
(5) Managing Temporary & Cache Files
Delete Python Cache (__pycache__)
I recommend first checking this thread: Python3 project remove __pycache__ folders and .pyc files
find . -type d -name "__pycache__" -exec rm -rf {} +
Removes compiled Python bytecode caches, which are regenerated automatically when needed.
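The same thread also covers stray .pyc files that live outside __pycache__ folders; those can be removed with:
find . -name "*.pyc" -delete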
(6) Using External Storage for Large Files
Move Large Files to External Storage
If working with large datasets, move them to an external drive or cloud storage:
mv large_dataset.csv /mnt/external_drive/
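If notebooks still expect the file at its old path, one option is to leave a symbolic link behind (assuming the file was moved to /mnt/external_drive as above):
ln -s /mnt/external_drive/large_dataset.csv large_dataset.csv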
Use Cloud Storage (Google Drive, AWS, etc.)
For Google Drive integration in JupyterLab, use gdown (https://pypi.org/project/gdown/):
pip install gdown
gdown https://drive.google.com/uc?id=FILE_ID
This downloads large files directly into JupyterLab.
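gdown can also be called from Python inside a notebook; a minimal sketch (FILE_ID and the output file name are placeholders):
import gdown

url = "https://drive.google.com/uc?id=FILE_ID"
gdown.download(url, "large_file.csv", quiet=False)  # downloads the file into the current directory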
For AWS S3, there is the boto3 Python SDK (https://pypi.org/project/boto3/); the copy below uses the AWS CLI (installable with pip install awscli):
pip install boto3
aws s3 cp s3://my-bucket/large_file.csv .
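The same download can be done from Python with boto3; a minimal sketch assuming your AWS credentials are already configured (bucket and file names are placeholders):
import boto3

s3 = boto3.client("s3")
s3.download_file("my-bucket", "large_file.csv", "large_file.csv")  # bucket, key, local filename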
Conclusion
Keeping JupyterLab organized and managing disk space efficiently ensures better performance and easier navigation. By structuring folders, compressing files, clearing caches, and using external storage, you can avoid clutter and improve productivity.