Added AVX1 support for salsa and chacha rounds #1

Open: wants to merge 7 commits into base: master
Conversation

kangaderoo

Code is in C for better maintainability. ASM derived from these files
might increase speed slightly.
The current speed increase compared to the SSE routines is about 10%.

Make the config work with the new files
@ghostlander
Owner

Thanks, I plan to add the AVX/XOP assembly code in the future and may use your inline assembly as a reference. SSE2 4-way is also going to be improved.

@kangaderoo
Author

I was wondering where your speed increase from the 4-way code
comes from.
I guess I still have to rewrite the KDF compress function in inline assembly.
It looks like a good candidate for a 4-way, or maybe
8-way, implementation, depending on the XMM register requirements.

The original CpuMiner had a 3-way scrypt and a 4-way SHA256, which
gave the best result when running 12-way on AVX1.
The 3-way scrypt kept 3 'matrices' in XMM registers, leaving 4 XMM
registers free for calculating functions etc.
It seems that XMM-to-XMM operations run 3 times faster than XMM-to-memory
operations.

Because of the mixing behavior of neo-scrypt (4 times a 4x4 matrix), it looks
like 1-way would need the fewest memory moves
for salsa and chacha.

Unfortunately my development environment doesn't have AVX2, but the
inline assembly code could easily be rewritten to
support the 256-bit YMM registers.
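
To make the 1-way layout concrete, here is a minimal sketch (not the code from this PR) of one ChaCha double round with the 4x4 state held in four XMM registers, one row each. It uses SSE2 intrinsics instead of inline assembly, and the names `VSTEP`, `chacha_doubleround_rows` and `chacha_rows_self_test` are made up for illustration. The only data movement between the column round and the diagonal round is three register-to-register shuffles, with no memory traffic.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; baseline on x86-64 */
#include <stdint.h>
#include <string.h>

/* Scalar reference quarter-round, used only to check the vector code. */
#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
#define QR(a, b, c, d) do {               \
    a += b; d ^= a; d = ROTL32(d, 16);    \
    c += d; b ^= c; b = ROTL32(b, 12);    \
    a += b; d ^= a; d = ROTL32(d, 8);     \
    c += d; b ^= c; b = ROTL32(b, 7);     \
} while (0)

static void chacha_doubleround_ref(uint32_t x[16])
{
    QR(x[0], x[4], x[8],  x[12]); QR(x[1], x[5], x[9],  x[13]);
    QR(x[2], x[6], x[10], x[14]); QR(x[3], x[7], x[11], x[15]);
    QR(x[0], x[5], x[10], x[15]); QR(x[1], x[6], x[11], x[12]);
    QR(x[2], x[7], x[8],  x[13]); QR(x[3], x[4], x[9],  x[14]);
}

/* One add-xor-rotate step on all four lanes: x += y; z ^= x; z <<<= n.
   SSE2 has no rotate instruction, so it is built from two shifts and an OR. */
#define VSTEP(x, y, z, n) do {                                   \
    (x) = _mm_add_epi32((x), (y));                               \
    (z) = _mm_xor_si128((z), (x));                               \
    (z) = _mm_or_si128(_mm_slli_epi32((z), (n)),                 \
                       _mm_srli_epi32((z), 32 - (n)));           \
} while (0)

/* Hypothetical helper: the state is four row vectors a, b, c, d. */
static void chacha_doubleround_rows(__m128i *pa, __m128i *pb,
                                    __m128i *pc, __m128i *pd)
{
    __m128i a = *pa, b = *pb, c = *pc, d = *pd;

    /* Column round: lane i of each row is column i, so one vector
       quarter-round handles all four columns at once. */
    VSTEP(a, b, d, 16); VSTEP(c, d, b, 12);
    VSTEP(a, b, d, 8);  VSTEP(c, d, b, 7);

    /* Rotate rows b, c, d by 1, 2, 3 lanes so the diagonals become columns. */
    b = _mm_shuffle_epi32(b, _MM_SHUFFLE(0, 3, 2, 1));
    c = _mm_shuffle_epi32(c, _MM_SHUFFLE(1, 0, 3, 2));
    d = _mm_shuffle_epi32(d, _MM_SHUFFLE(2, 1, 0, 3));

    /* Diagonal round, same instructions as the column round. */
    VSTEP(a, b, d, 16); VSTEP(c, d, b, 12);
    VSTEP(a, b, d, 8);  VSTEP(c, d, b, 7);

    /* Rotate the rows back to the original layout. */
    b = _mm_shuffle_epi32(b, _MM_SHUFFLE(2, 1, 0, 3));
    c = _mm_shuffle_epi32(c, _MM_SHUFFLE(1, 0, 3, 2));
    d = _mm_shuffle_epi32(d, _MM_SHUFFLE(0, 3, 2, 1));

    *pa = a; *pb = b; *pc = c; *pd = d;
}

/* Returns 1 if the vector version matches the scalar reference. */
int chacha_rows_self_test(void)
{
    uint32_t ref[16], out[16];
    for (int i = 0; i < 16; i++)
        ref[i] = 0x9E3779B9u * (uint32_t)(i + 1);  /* arbitrary test data */

    __m128i a = _mm_loadu_si128((const __m128i *)&ref[0]);
    __m128i b = _mm_loadu_si128((const __m128i *)&ref[4]);
    __m128i c = _mm_loadu_si128((const __m128i *)&ref[8]);
    __m128i d = _mm_loadu_si128((const __m128i *)&ref[12]);

    chacha_doubleround_ref(ref);
    chacha_doubleround_rows(&a, &b, &c, &d);

    _mm_storeu_si128((__m128i *)&out[0], a);
    _mm_storeu_si128((__m128i *)&out[4], b);
    _mm_storeu_si128((__m128i *)&out[8], c);
    _mm_storeu_si128((__m128i *)&out[12], d);
    return memcmp(ref, out, sizeof ref) == 0;
}
```

Salsa needs a different lane arrangement because its quarter-round mixes words along diagonals from the start, so this sketch covers the chacha case only.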

John Doering wrote on 2/17/2015 at 5:14 PM:

Thanks, I plan to add the AVX/XOP assembly code in the future and may
use your inline assembly as a reference. SSE2 4-way is also going to
be improved.



Increase hashing speed by running 3 calculations in parallel.
Eliminate SIMD latency stalls by smart instruction sequencing.
~25% speed increase observed.
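
As a rough illustration of the sequencing idea (a sketch under assumptions, not the committed code), the snippet below runs the same add-xor-rotate steps over three independent states and interleaves them, so that each instruction is followed by work that does not depend on its result. The `state4` type and the function names are hypothetical, and the chacha column round stands in for whatever rounds the commit actually interleaves. Three states of four rows each occupy 12 XMM registers, which leaves 4 of the 16 available on x86-64 for temporaries.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; baseline on x86-64 */
#include <string.h>

/* Hypothetical container: one 4x4 state of 32-bit words as four row vectors. */
typedef struct { __m128i a, b, c, d; } state4;

/* One add-xor-rotate step on all four lanes: x += y; z ^= x; z <<<= n. */
#define VSTEP(x, y, z, n) do {                                   \
    (x) = _mm_add_epi32((x), (y));                               \
    (z) = _mm_xor_si128((z), (x));                               \
    (z) = _mm_or_si128(_mm_slli_epi32((z), (n)),                 \
                       _mm_srli_epi32((z), 32 - (n)));           \
} while (0)

/* 1-way: every step depends on the result of the previous one. */
static void column_round_1way(state4 *s)
{
    VSTEP(s->a, s->b, s->d, 16);
    VSTEP(s->c, s->d, s->b, 12);
    VSTEP(s->a, s->b, s->d, 8);
    VSTEP(s->c, s->d, s->b, 7);
}

/* 3-way: the same steps, interleaved across three independent states so
   that consecutive steps have no data dependency on each other. */
static void column_round_3way(state4 *s0, state4 *s1, state4 *s2)
{
    VSTEP(s0->a, s0->b, s0->d, 16);
    VSTEP(s1->a, s1->b, s1->d, 16);
    VSTEP(s2->a, s2->b, s2->d, 16);

    VSTEP(s0->c, s0->d, s0->b, 12);
    VSTEP(s1->c, s1->d, s1->b, 12);
    VSTEP(s2->c, s2->d, s2->b, 12);

    VSTEP(s0->a, s0->b, s0->d, 8);
    VSTEP(s1->a, s1->b, s1->d, 8);
    VSTEP(s2->a, s2->b, s2->d, 8);

    VSTEP(s0->c, s0->d, s0->b, 7);
    VSTEP(s1->c, s1->d, s1->b, 7);
    VSTEP(s2->c, s2->d, s2->b, 7);
}

/* Returns 1 if the interleaved version matches three 1-way runs. */
int interleave_self_test(void)
{
    state4 ref[3], par[3];
    for (int i = 0; i < 3; i++) {
        /* Arbitrary, distinct test data per state. */
        ref[i].a = _mm_set_epi32(i + 1, i + 2, i + 3, i + 4);
        ref[i].b = _mm_set1_epi32(0x01234567 * (i + 1));
        ref[i].c = _mm_set1_epi32(0x0badf00d + i);
        ref[i].d = _mm_set_epi32(7 * i, 11, 13, 17 + i);
        par[i] = ref[i];
    }

    for (int i = 0; i < 3; i++)
        column_round_1way(&ref[i]);
    column_round_3way(&par[0], &par[1], &par[2]);

    return memcmp(ref, par, sizeof ref) == 0;
}
```

With intrinsics the compiler is free to reschedule these instructions, so the source order is only a hint. Hand-written inline assembly fixes the order explicitly.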