
Inference time gradual slow down #1965

Open
vincentw1997 opened this issue Jan 8, 2025 · 0 comments

Branch

main branch (mmpretrain version)

Describe the bug

Prior Checklist

  • I already searched the existing issues but could not find a similar one.
  • Isolated the problem in a Minimal Reproducible Example, since the full setup is a downstream task (combining other Python libraries) for solving another problem.
  • Tested on different hardware.
  • Tried to check memory usage; it appears stable.

Describe the problem
I am using this library to train a ConvNeXt model for a classification task on my private dataset. The evaluation metrics look fine. However, a problem appears when this library is used in conjunction with other libraries for a downstream task. We managed to isolate the problem in a Minimal Reproducible Example that uses only this library and some standard Python libraries.

To simulate the use case, the Minimal Reproducible Example moves images from a main folder containing all the images (around 10,000) into a temporary folder that holds only 5 images at a time (simulating a multi-camera setup). The 5 images are then used as input for the classification prediction. These steps are repeated until the main folder is empty. We cannot process the images in larger batches, because this is the setup we will have in the downstream task.

The Minimal Reproducible Example code can be seen here:

from mmpretrain import ImageClassificationInferencer
import os
import time
import shutil
from pathlib import Path
import logging
import sys
from datetime import datetime
import tracemalloc  # imported so log_memory_snapshot below works when enabled
import gc
import signal

def classify_images(image_folder, intermediate_folder, inferencer, batch_size=5):
    """
    Perform inference on batches of images from a main folder.

    Args:
        image_folder (str): Path to the main folder containing all images.
        intermediate_folder (str): Path to the intermediate folder for batch processing.
        inferencer (function): A function that performs inference on a batch of images.
        batch_size (int): Number of images to process at a time.

    Returns:
        None
    """
    # Ensure intermediate folder exists and is empty
    Path(intermediate_folder).mkdir(parents=True, exist_ok=True)
    for file in os.listdir(intermediate_folder):
        os.remove(os.path.join(intermediate_folder, file))
    try:
        while True:
            # Run garbage collection explicitly
            gc.collect()  # Optional: can help in tight memory situations
            uncollectable = len(gc.garbage)
            if uncollectable > 0:
                logging.warning(f"Uncollectable objects detected: {uncollectable}")

            # Get all images from the main folder
            all_images = [f for f in os.listdir(image_folder) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]

            if len(all_images) == 0:
                print("No more images to process. Exiting...")
                break

            # Take the first `batch_size` images from the main folder
            batch = all_images[:batch_size]

            # Move batch of images to the intermediate folder
            for image in batch:
                src_path = os.path.join(image_folder, image)
                dest_path = os.path.join(intermediate_folder, image)
                shutil.move(src_path, dest_path)

            # print(f"Processing batch: {batch}")
            # logging.info(f"This is the processed batch: {batch}")

            # Perform inference on the batch
            batch_images = [os.path.join(intermediate_folder, img) for img in batch]
            results = inferencer(batch_images)
            # print(f"Inference results: {results}")

            # Remove processed images from the intermediate folder
            for image in batch:
                os.remove(os.path.join(intermediate_folder, image))

            print(f"Batch processed and cleared. Waiting for the next batch...")
            # log_memory_snapshot("After Processing Loop")
    except Exception as e:
        logging.error("An error occurred: %s", e)


# Mock inferencer function for demonstration
def mock_inferencer(image_list):
    """Wrapper around the global inferencer that logs per-batch inference time."""
    current_inference_time = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    logging.info(f"This is the starting time: {current_inference_time}")
    time_start_inference = time.time()
    results = inferencer(image_list)  # uses the global ImageClassificationInferencer
    # print(f"\n This is the classification result: \n {results}")
    # logging.info(f"This is the classification result: {results}")
    time_end_inference = time.time()
    time_inference = time_end_inference - time_start_inference
    print(f"\ntotal time for inference:\n {time_inference}")
    logging.info(f"This is the total classification inference time needed: {time_inference}")
    return results

def signal_handler(sig, frame):
    logging.info("Script interrupted by user.")
    sys.exit(0)

# Function to take and log memory snapshots
def log_memory_snapshot(snapshot_label="Snapshot"):
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')  # Group by line number
    logging.info(f"Memory snapshot: {snapshot_label}")
    logging.info("[ Top 10 memory usage ]")
    for stat in top_stats[:10]:  # Log the top 10 memory-using locations
        logging.info(stat)

# Main setup
if __name__ == "__main__":

    path = r'path\to\folder'
    config = os.path.join(path, r'path\to\config\file')            # edited config file
    checkpoint = os.path.join(path, r'path\to\pretrained\model')   # this is the tiny model
    samples_path = os.path.join(path, 'Bilder', 'Samples_compressed_2')           # main folder
    save_folder = os.path.join(path, 'Bilder', 'offline_minimum_classification')  # the temporary folder
    
    log_folder = r'path\to\log\folder'

    # create the log folder if it does not exist
    os.makedirs(log_folder, exist_ok=True)

    # Generate a filename with the current date, hour, minute, and second
    log_filename = os.path.join(log_folder, datetime.now().strftime("output_%Y-%m-%d_%H-%M-%S.log"))

    # init logging
    logging.basicConfig(
        filename=log_filename,  # Log to this file
        level=logging.INFO,      # Set the logging level
        format='%(asctime)s - %(levelname)s - %(message)s'
    )
    
    # Create a stream handler to also print to console
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    console_handler.setFormatter(formatter)

    # Add the console handler to the root logger
    logging.getLogger().addHandler(console_handler)
    
    # Register the signal handler for Ctrl+C
    signal.signal(signal.SIGINT, signal_handler)

    # Log start of the script
    logging.info("Logging started.")

    inferencer = ImageClassificationInferencer(model=config, pretrained=checkpoint)

    # # Start tracemalloc before the main program starts
    # tracemalloc.start()

    # Place garbage collection monitoring in the while loop
    gc.enable()  # Enable automatic garbage collection

    # log_memory_snapshot("Before while True Loop")

    classify_images(samples_path, save_folder, mock_inferencer)

Inference is performed only through the ImageClassificationInferencer class from mmpretrain. The model is saved as a .pth file, and the config file is edited to match our private dataset.
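
For reference, here is a minimal sketch of how the inferencer is constructed and called (the paths are placeholders, and the explicit device argument is an addition here; the full example above relies on the default):

from mmpretrain import ImageClassificationInferencer

# Minimal sketch; config and checkpoint paths are placeholders.
inferencer = ImageClassificationInferencer(
    model='path/to/config.py',       # config edited for the private dataset
    pretrained='path/to/model.pth',  # trained checkpoint (.pth)
    device='cpu',                    # all inference runs on the CPU
)
# Calling the inferencer with a list of image paths returns one
# result dict per image (pred_label, pred_score, pred_class, ...).
results = inferencer(['img_0.jpg', 'img_1.jpg'])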

All inference is done on the CPU, while the training was done on a GPU.
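
(Loading the GPU-trained checkpoint on a CPU-only machine is handled internally by mmengine; the sketch below only illustrates the underlying PyTorch mechanism, with a placeholder checkpoint path.)

import torch

# Illustrative only: remap the CUDA tensors of a GPU-trained
# checkpoint onto the CPU at load time.
state = torch.load('path/to/model.pth', map_location='cpu')
# mmengine checkpoints usually nest the weights under a
# 'state_dict' key (assumption); fall back to the raw dict.
weights = state.get('state_dict', state)
print(f"{len(weights)} tensors loaded on CPU")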

The repeated inference starts fast but gradually slows down. In our tests, a batch took around 0.6 seconds during the first minute, but about 1.0 second after five more minutes. We implemented a logging function to record the inference times and plotted them, which shows the gradual increase. We also tested on different CPU-only devices (with both slower and faster specs), and the gradual increase in inference time still occurs, so it is most likely not an individual hardware problem, and a faster CPU does not solve it since the same pattern appears there as well. We tried tracking memory usage with the tracemalloc library, and the top 10 memory consumers appear stable (enabling tracemalloc also slows inference significantly, which is why it is commented out in the example).
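
For completeness, a small sketch of the kind of script we use to plot the logged inference times (it assumes the log format produced by the example above; the log filename is a placeholder, and matplotlib is an extra dependency):

import re
import matplotlib.pyplot as plt

# Extract the per-batch inference times written by mock_inferencer
# from the log file (placeholder filename).
pattern = re.compile(r'inference time needed: ([0-9.]+)')
times = []
with open('output_2025-01-08_12-00-00.log') as f:
    for line in f:
        match = pattern.search(line)
        if match:
            times.append(float(match.group(1)))

plt.plot(times)
plt.xlabel('batch index')
plt.ylabel('inference time (s)')
plt.title('Per-batch inference time over the run')
plt.show()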

If anyone has a suggestion, or a document or other material I can refer to, it would be very helpful.

Environment

{'sys.platform': 'win32',
'Python': '3.10.14 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:44:50) [MSC v.1916 64 bit (AMD64)]',
'CUDA available': False,
'MUSA available': False,
'numpy_random_seed': 2147483648,
'GCC': 'n/a',
'PyTorch': '2.1.2+cpu',
'TorchVision': '0.16.2+cpu',
'OpenCV': '4.10.0',
'MMEngine': '0.10.4',
'MMCV': '2.1.0',
'MMPreTrain': '1.1.1+'}

Other information

  1. Code modifications are mainly for fitting our private image dataset for training; there are no other notable modifications.
  2. The pipeline is optimized for batch inferencing, not for this "stop and go" use case.