This article covers best practices for organizing projects, cleaning up storage, and managing large files in JupyterLab.

(1) Organizing Folders and Files in JupyterLab

A structured folder system improves efficiency and makes projects easier to navigate.

/my_project
├── data/        # Raw and processed datasets
├── notebooks/   # Jupyter notebooks (.ipynb)
├── scripts/     # Python scripts (.py)
├── output/      # Exported reports, images, models
├── docs/        # Documentation files
├── env/         # Virtual environment (optional)
└── README.md    # Project overview
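If you prefer to set this structure up from a notebook rather than the file browser, a minimal sketch with Python's pathlib (the folder names are simply the ones from the layout above) could look like this:

import pathlib

project = pathlib.Path("my_project")
for folder in ["data", "notebooks", "scripts", "output", "docs", "env"]:
    (project / folder).mkdir(parents=True, exist_ok=True)  # create folders, skip if they already exist
(project / "README.md").touch()  # empty project overview file to fill in later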

Move Notebooks into a Dedicated Folder

mv *.ipynb notebooks/

Moves all Jupyter notebooks to the notebooks/ directory.

Rename Files for Better Clarity

mv Untitled.ipynb data_analysis.ipynb

Renames a default untitled notebook to something meaningful.

Create Subfolders for Large Datasets

If you work with multiple datasets, categorize them into subfolders:

mkdir data/raw data/processed data/models

This structure separates raw, cleaned, and model-related data.

(2) Checking Disk Usage in JupyterLab

Find Large Files

!du -sh *

Displays the size of all files and folders in the current directory.

Note: the "!" prefix is needed when you run Linux (shell) commands from a notebook cell in JupyterLab.

List Top 10 Largest Files

!find . -type f -exec du -h {} + | sort -rh | head -n 10

Helps identify space-consuming files.
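A rough Python equivalent, in case you want to reuse the list of large files inside a notebook (a small sketch using pathlib, scanning the current directory):

from pathlib import Path

files = [(p.stat().st_size, p) for p in Path(".").rglob("*") if p.is_file()]
for size, path in sorted(files, reverse=True)[:10]:
    print(f"{size / 1e6:10.1f} MB  {path}")  # size shown in megabytes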

Check Available Disk Space

!df -h

Shows overall disk usage and free space.
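The same check is available from Python through shutil.disk_usage, which is handy if you want to verify free space programmatically before writing a large file (the "/" path is just an example mount point):

import shutil

total, used, free = shutil.disk_usage("/")  # path of the mount point you care about
print(f"Free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")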

Other useful commands:

# list all files in a folder with human-readable sizes
!ls -lah /home/shared/

# the same listing, sorted by size (largest last)
!ls -lahSr /home/shared

# check the first 100 lines of a file
!head -n 100 /home/shared/data.csv

# check the last 100 lines of a file
!tail -n 100 /home/shared/data.csv
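To preview a large CSV without loading the whole file into memory, you can also read just the first rows with pandas (same example path as above):

import pandas as pd

preview = pd.read_csv("/home/shared/data.csv", nrows=100)  # load only the first 100 rows
preview.head()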

(3) Deleting Unnecessary Files

Delete All .ipynb_checkpoints Folders

JupyterLab automatically creates checkpoint files, which can take up space. To remove them:

!find . -type d -name ".ipynb_checkpoints" -exec rm -rf {} +
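If you prefer to do the same cleanup from Python instead of a shell command, a minimal sketch with pathlib and shutil:

import shutil
from pathlib import Path

for checkpoint_dir in Path(".").rglob(".ipynb_checkpoints"):
    if checkpoint_dir.is_dir():
        shutil.rmtree(checkpoint_dir)  # remove the checkpoint folder and all its contents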

Clear Large Unused Files

First, locate the big files:

# find files bigger than a given size (here, 1 MB)
!find . -type f -size +1M

# list the 5 largest files in a folder
!ls -S /path/to/folder | head -5

Then delete files larger than 100 MB:

!find . -type f -size +100M -exec rm -i {} +

The -i flag prompts for confirmation before each deletion.

Clear Unused Output in Notebooks

from IPython.display import clear_output
clear_output()

clear_output() clears the output of the cell it is called in, so long-running cells do not accumulate huge output areas. Save the notebook afterwards so the cleared output is also removed from the .ipynb file on disk.
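For example, in a loop it can keep only the latest progress message instead of appending thousands of lines (a small illustrative sketch):

from IPython.display import clear_output

for i in range(1000):
    clear_output(wait=True)  # replace the previous output instead of appending to it
    print(f"Processed {i + 1} / 1000 items")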

(4) Managing Large Files Efficiently

Use Compressed Formats for Data Storage

Instead of saving raw .csv files, store them in compressed formats:

import pandas as pd

df = pd.read_csv("data.csv")
df.to_csv("data.csv.gz", index=False, compression="gzip")  # Save compressed

This significantly reduces storage space while keeping data accessible.
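Reading the compressed file back works the same way; pandas infers gzip compression from the .gz extension:

import pandas as pd

df = pd.read_csv("data.csv.gz")  # compression is detected from the file extension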

Load Large Files in Chunks

If a dataset is too large to fit in memory:

import pandas as pd

chunk_size = 100000  # number of rows per chunk
for chunk in pd.read_csv("data.csv", chunksize=chunk_size):
    process(chunk)  # Replace with actual processing logic

This avoids memory overload issues.
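For instance, a minimal sketch that sums a (hypothetical) "amount" column across chunks without ever holding the full file in memory:

import pandas as pd

total = 0
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    total += chunk["amount"].sum()  # "amount" is a hypothetical column name
print(total)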

(5) Managing Temporary & Cache Files

Delete Python Cache (__pycache__)

I recommend first checking this thread: Python3 project remove pycache folders and .pyc files

!find . -type d -name "__pycache__" -exec rm -rf {} +

Removes compiled Python files that are not needed.

(6) Using External Storage for Large Files

Move Large Files to External Storage

If working with large datasets, move them to an external drive or cloud storage:

mv large_dataset.csv /mnt/external_drive/

Use Cloud Storage (Google Drive, AWS, etc.)

For Google Drive integration in JupyterLab - https://pypi.org/project/gdown/:

pip install gdown
gdown https://drive.google.com/uc?id=FILE_ID

This downloads large files directly into JupyterLab.
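The same download can also be done from Python through gdown's API (FILE_ID and the output filename below are placeholders):

import gdown

# FILE_ID is a placeholder; replace it with the real Google Drive file ID
gdown.download("https://drive.google.com/uc?id=FILE_ID", "large_file.csv", quiet=False)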

For AWS S3 - https://pypi.org/project/boto3/:

pip install boto3

# copy a file from S3 with the AWS CLI (a separate tool from the boto3 library)
aws s3 cp s3://my-bucket/large_file.csv .
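Alternatively, roughly the same download from Python with boto3 (the bucket name, object key, and local filename are placeholders, and AWS credentials must already be configured):

import boto3

s3 = boto3.client("s3")
# bucket name, object key, and local filename below are placeholders
s3.download_file("my-bucket", "large_file.csv", "large_file.csv")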

Conclusion

Keeping JupyterLab organized and managing disk space efficiently ensures better performance and easier navigation. By structuring folders, compressing files, clearing caches, and using external storage, you can avoid clutter and improve productivity.