Prior Checklist
I already searched the existing issues, but could not find a similar one.
I isolated the problem in a Minimal Reproducible Example, since the full setup is a downstream task that combines other Python libraries to solve another problem.
I tested on different hardware.
I tried to check the memory usage, which seems stable.
Describe the problem
I am using this library to train a ConvNeXt model for a classification task on my private dataset. The metric results look fine. However, a problem appears when the library is used together with other libraries in a downstream task. We managed to isolate the problem in a Minimal Reproducible Example that uses only this library and some standard Python libraries.
To simulate the use case, the Minimal Reproducible Example moves images from a main folder containing all the images (around 10,000) into a temporary folder that holds only 5 images at a time (simulating a multi-camera setup). The 5 images are then used as input for the classification prediction. These steps are repeated until the main folder is empty. We cannot process larger batches, because this is the setup we will have in the downstream task.
The Minimal Reproducible Example code is shown below:
from mmcv import image
from mmpretrain import ImageClassificationInferencer
import os
import time
import shutil
from pathlib import Path
import logging
import sys
from datetime import datetime
# import tracemalloc
import gc
import signal


def classify_images(image_folder, intermediate_folder, inferencer, batch_size=5):
    """Perform inference on batches of images from a main folder.

    Args:
        image_folder (str): Path to the main folder containing all images.
        intermediate_folder (str): Path to the intermediate folder for batch processing.
        inferencer (function): A function that performs inference on a batch of images.
        batch_size (int): Number of images to process at a time.

    Returns:
        None
    """
    # Ensure intermediate folder exists and is empty
    Path(intermediate_folder).mkdir(parents=True, exist_ok=True)
    for file in os.listdir(intermediate_folder):
        os.remove(os.path.join(intermediate_folder, file))

    try:
        while True:
            # Run garbage collection explicitly
            gc.collect()  # Optional: can help in tight memory situations
            uncollectable = len(gc.garbage)
            if uncollectable > 0:
                logging.warning(f"Uncollectable objects detected: {uncollectable}")

            # Get all images from the main folder
            all_images = [f for f in os.listdir(image_folder)
                          if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
            if len(all_images) == 0:
                print("No more images to process. Exiting...")
                break

            # Take the first `batch_size` images from the main folder
            batch = all_images[:batch_size]

            # Move batch of images to the intermediate folder
            for image in batch:
                src_path = os.path.join(image_folder, image)
                dest_path = os.path.join(intermediate_folder, image)
                shutil.move(src_path, dest_path)
            # print(f"Processing batch: {batch}")
            # logging.info(f"This is the processed batch: {batch}")

            # Perform inference on the batch
            batch_images = [os.path.join(intermediate_folder, img) for img in batch]
            results = inferencer(batch_images)
            # print(f"Inference results: {results}")

            # Remove processed images from the intermediate folder
            for image in batch:
                os.remove(os.path.join(intermediate_folder, image))
            print("Batch processed and cleared. Waiting for the next batch...")
            # log_memory_snapshot("After Processing Loop")
    except Exception as e:
        logging.error("An error occurred: %s", e)


# Mock inferencer function for demonstration
def mock_inferencer(image_list):
    """Mock inferencer simulating model inference."""
    results_anomaly = None
    current_inference_time = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    logging.info(f"This is the starting time: {current_inference_time}")
    time_start_inference = time.time()
    results = inferencer(image_list)
    # print(f"\n This is the classification result: \n {results}")
    # logging.info(f"This is the classification result: {results}")
    time_end_inference = time.time()
    time_inference = time_end_inference - time_start_inference
    print(f"\ntotal time for inference:\n{time_inference}")
    logging.info(f"This is the total classification inference time needed: {time_inference}")


def signal_handler(sig, frame):
    logging.info("Script interrupted by user.")
    sys.exit(0)


# Function to take and log memory snapshots
def log_memory_snapshot(snapshot_label="Snapshot"):
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')  # Group by line number
    logging.info(f"Memory snapshot: {snapshot_label}")
    logging.info("[ Top 10 memory usage ]")
    for stat in top_stats[:10]:  # Log the top 10 memory-using locations
        logging.info(stat)


# Main setup
if __name__ == "__main__":
    path = r'path\to\folder'
    config = path + r"path\to\config\file"
    checkpoint = path + r"path\to\pretrained\model"  # this is the tiny model
    samples_path = path + "/Bilder/Samples_compressed_2"  # main folder
    save_folder = path + "/Bilder/offline_minimum_classification"  # the temporary folder
    log_folder = r'path\to\log\folder'

    # create the log folder if it does not exist
    os.makedirs(log_folder, exist_ok=True)

    # Generate a filename with the current date, hour, minute, and second
    log_filename = os.path.join(log_folder, datetime.now().strftime("output_%Y-%m-%d_%H-%M-%S.log"))

    # init logging
    logging.basicConfig(
        filename=log_filename,  # Log to this file
        level=logging.INFO,  # Set the logging level
        format='%(asctime)s - %(levelname)s - %(message)s'
    )

    # Create a stream handler to also print to console
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    console_handler.setFormatter(formatter)

    # Add the console handler to the root logger
    logging.getLogger().addHandler(console_handler)

    # Register the signal handler for Ctrl+C
    signal.signal(signal.SIGINT, signal_handler)

    # Log start of the script
    logging.info("Logging started.")

    inferencer = ImageClassificationInferencer(model=config, pretrained=checkpoint)

    # # Start tracemalloc before the main program starts
    # tracemalloc.start()

    # Place garbage collection monitoring in the while loop
    gc.enable()  # Enable automatic garbage collection
    # log_memory_snapshot("Before while True Loop")
    classify_images(samples_path, save_folder, mock_inferencer)
Inference is done only through the ImageClassificationInferencer class from mmpretrain. The model is saved as a .pth file, and the config file is edited to match our private dataset.
All inference runs on the CPU only, while training was done on the GPU.
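For reference, this is roughly how we set up the inferencer for CPU-only use. It is only a sketch: the paths are placeholders, the torch.load call is just a sanity check that the GPU-trained checkpoint loads on a CPU-only machine, and we assume the device argument is forwarded to the underlying mmengine inferencer as in other OpenMMLab inferencers.

import torch
from mmpretrain import ImageClassificationInferencer

config = r"path\to\config\file"           # placeholder: config edited for our dataset
checkpoint = r"path\to\pretrained\model"  # placeholder: .pth checkpoint trained on GPU

# Sanity check that the GPU-trained checkpoint can be loaded on the CPU
state_dict = torch.load(checkpoint, map_location="cpu")

# Build the inferencer on the CPU (assumption: `device` is accepted here,
# as in other OpenMMLab inferencers)
inferencer = ImageClassificationInferencer(model=config,
                                           pretrained=checkpoint,
                                           device="cpu")

results = inferencer([r"path\to\example_image.jpg"])
print(results[0])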
The repeated inference starts fast but gradually slows down. In one test, a batch took around 0.6 seconds during the first minute, and after about five minutes each batch took around 1.0 second. We added a logging function to record the inference time of every batch and plotted the results, which clearly show the gradual increase. We also tested on different CPU-only devices (with both slower and faster specs), and the gradual increase still occurs, so it is most likely not a problem with an individual machine, and a faster CPU does not solve it, since the same pattern appears there as well. We tried to track memory usage with the tracemalloc library, and the top 10 memory consumers appear stable (enabling tracemalloc also significantly slows down inference).
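To show how the slowdown can be measured independently of the folder shuffling, here is a stripped-down sketch of the timing loop: it re-runs the same fixed batch of 5 images and plots the per-batch time. The paths are placeholders, the iteration count is arbitrary, and matplotlib is only used here for the plot; none of this is part of the original script.

import time
from mmpretrain import ImageClassificationInferencer
import matplotlib.pyplot as plt

# Placeholders: same config/checkpoint as in the script above
config = r"path\to\config\file"
checkpoint = r"path\to\pretrained\model"
batch = [r"path\to\img1.jpg", r"path\to\img2.jpg", r"path\to\img3.jpg",
         r"path\to\img4.jpg", r"path\to\img5.jpg"]  # fixed batch of 5 images

inferencer = ImageClassificationInferencer(model=config, pretrained=checkpoint)

durations = []
for i in range(500):  # repeat the same batch to isolate the inferencer itself
    t0 = time.perf_counter()
    inferencer(batch)
    durations.append(time.perf_counter() - t0)

# Plot the per-batch inference time so the gradual increase becomes visible
plt.plot(durations)
plt.xlabel("iteration")
plt.ylabel("inference time per batch [s]")
plt.savefig("inference_time_trend.png")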
If anyone has a suggestion, a document, or other material I can refer to, it would be very helpful.
Branch
main branch (mmpretrain version)
Environment
{'sys.platform': 'win32',
'Python': '3.10.14 | packaged by Anaconda, Inc. | (main, May 6 2024, '
'19:44:50) [MSC v.1916 64 bit (AMD64)]',
'CUDA available': False,
'MUSA available': False,
'numpy_random_seed': 2147483648,
'GCC': 'n/a',
'PyTorch': '2.1.2+cpu',
'TorchVision': '0.16.2+cpu',
'OpenCV': '4.10.0',
'MMEngine': '0.10.4',
'MMCV': '2.1.0',
'MMPreTrain': '1.1.1+'}
Other information