
Decrease overhead of WorkerThreadPool task processing #72716

Closed · wants to merge 1 commit

Conversation

@myaaaaaaaaa (Contributor) commented Feb 4, 2023

Changes threads to accept and process work in batches so that they synchronize less often.
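
The gist of the batching idea can be sketched as follows. This is an illustrative mock-up using a plain std::mutex-guarded queue, not the actual WorkerThreadPool internals; the names and the batch size are invented:

	#include <deque>
	#include <functional>
	#include <mutex>
	#include <utility>
	#include <vector>

	// Illustrative sketch only: workers drain up to BATCH_SIZE tasks per
	// mutex acquisition instead of paying the lock/unlock cost once per task.
	struct BatchedWorker {
		std::mutex queue_mutex;
		std::deque<std::function<void()>> task_queue;

		void drain_once() {
			constexpr size_t BATCH_SIZE = 32; // invented tuning knob
			std::vector<std::function<void()>> batch;
			{
				std::lock_guard<std::mutex> lock(queue_mutex);
				// One lock acquisition fetches a whole batch of tasks.
				while (!task_queue.empty() && batch.size() < BATCH_SIZE) {
					batch.push_back(std::move(task_queue.front()));
					task_queue.pop_front();
				}
			}
			for (auto &task : batch) {
				task(); // run outside the lock so other workers can refill
			}
		}
	};

Running the drained tasks outside the lock is what lets the other workers keep pulling from the queue concurrently.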

As a bonus, this also hoists the loop for add_template_group_task() into the header file so that the loop body can be inlined.
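
For illustration, the inlining benefit comes from the loop being template-instantiated at the call site; a hypothetical header-only helper (not the patch's actual signature) shows the shape:

	#include <cstdint>

	// Hypothetical header-only helper: because the template is instantiated
	// at the call site, the compiler knows the exact instance/method types
	// and can inline (p_instance->*p_method)(...) into the loop body,
	// instead of dispatching through an opaque function pointer per element.
	template <typename C, typename M, typename U>
	void run_group_range(C *p_instance, M p_method, U p_userdata, uint32_t p_from, uint32_t p_to) {
		for (uint32_t i = p_from; i < p_to; i++) {
			(p_instance->*p_method)(i, p_userdata);
		}
	}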

This way, add_template_group_task() and its derivatives (namely, parallel foreach()) can achieve optimal performance, and there is no longer any need for manual batching like that in RaycastOcclusionCull::Scenario::_transform_vertices_thread():

void RaycastOcclusionCull::Scenario::_update_dirty_instance(int p_idx, RID *p_instances) {
	OccluderInstance *occ_inst = instances.getptr(p_instances[p_idx]);
	if (!occ_inst) {
		return;
	}

	Occluder *occ = raycast_singleton->occluder_owner.get_or_null(occ_inst->occluder);
	if (!occ) {
		return;
	}

	int vertices_size = occ->vertices.size();

	// Embree requires the last element to be readable by a 16-byte SSE load instruction, so we add padding to be safe.
	occ_inst->xformed_vertices.resize(vertices_size + 1);

	const Vector3 *read_ptr = occ->vertices.ptr();
	Vector3 *write_ptr = occ_inst->xformed_vertices.ptr();

	if (vertices_size > 1024) {
		TransformThreadData td;
		td.xform = occ_inst->xform;
		td.read = read_ptr;
		td.write = write_ptr;
		td.vertex_count = vertices_size;
		td.thread_count = WorkerThreadPool::get_singleton()->get_thread_count();
		WorkerThreadPool::GroupID group_task = WorkerThreadPool::get_singleton()->add_template_group_task(this, &Scenario::_transform_vertices_thread, &td, td.thread_count, -1, true, SNAME("RaycastOcclusionCull"));
		WorkerThreadPool::get_singleton()->wait_for_group_task_completion(group_task);
	} else {
		_transform_vertices_range(read_ptr, write_ptr, occ_inst->xform, 0, vertices_size);
	}

	occ_inst->indices.resize(occ->indices.size());
	memcpy(occ_inst->indices.ptr(), occ->indices.ptr(), occ->indices.size() * sizeof(int32_t));
}

void RaycastOcclusionCull::Scenario::_transform_vertices_thread(uint32_t p_thread, TransformThreadData *p_data) {
	uint32_t vertex_total = p_data->vertex_count;
	uint32_t total_threads = p_data->thread_count;
	uint32_t from = p_thread * vertex_total / total_threads;
	uint32_t to = (p_thread + 1 == total_threads) ? vertex_total : ((p_thread + 1) * vertex_total / total_threads);
	_transform_vertices_range(p_data->read, p_data->write, p_data->xform, from, to);
}

void RaycastOcclusionCull::Scenario::_transform_vertices_range(const Vector3 *p_read, Vector3 *p_write, const Transform3D &p_xform, int p_from, int p_to) {
	for (int i = p_from; i < p_to; i++) {
		p_write[i] = p_xform.xform(p_read[i]);
	}
}
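
With pool-side batching, the caller above could plausibly shrink to a single fine-grained group task, dropping both the 1024-vertex threshold and the hand-rolled per-thread range split. A hypothetical sketch (_transform_vertex is an invented helper, and the add_template_group_task arguments mirror the call shown above; td's vertex_count/thread_count fields become unnecessary):

	// Hypothetical per-element worker: one vertex per group-task index.
	void RaycastOcclusionCull::Scenario::_transform_vertex(uint32_t p_idx, TransformThreadData *p_data) {
		p_data->write[p_idx] = p_data->xform.xform(p_data->read[p_idx]);
	}

	// Inside _update_dirty_instance(), replacing the vertices_size > 1024 branch:
	TransformThreadData td;
	td.xform = occ_inst->xform;
	td.read = read_ptr;
	td.write = write_ptr;
	WorkerThreadPool::GroupID group_task = WorkerThreadPool::get_singleton()->add_template_group_task(this, &Scenario::_transform_vertex, &td, vertices_size, -1, true, SNAME("RaycastOcclusionCull"));
	WorkerThreadPool::get_singleton()->wait_for_group_task_completion(group_task);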

@myaaaaaaaaa requested review from a team as code owners February 4, 2023 16:11
@Chaosus added this to the 4.x milestone Feb 6, 2023
@Chaosus requested review from RandomShaper and reduz February 6, 2023 04:40
@myaaaaaaaaa changed the title from "Decrease granularity of WorkerThreadPool task processing" to "Decrease overhead of WorkerThreadPool task processing" Feb 16, 2023
@RandomShaper (Member) left a comment

Looks great overall.

Three review threads on core/object/worker_thread_pool.cpp (all resolved; two marked outdated).