[NAND] Handling of bad blocks and ECC errors #43
It seems that LX_NAND_ERROR_CORRECTED is not handled consistently. For example, lx_nand_flash_sector_read tolerates the error, but lx_nand_flash_data_page_copy returns LX_ERROR. I think there should be more error handling in LevelX. I'd say an erase error or program error should result in a new bad block. I'm unsure if handling these errors is even possible at the driver level alone. The low-level driver functions have no contextual information about which operation is being executed in the layer above, and blindly marking blocks as bad without appropriate follow-up steps in the layer above will probably lead to inconsistency. One note about the issue with accumulating bit errors: actively relocating the data is probably the best way of handling this. But at least the wear leveling helps a bit, as long as write operations are performed from time to time. As soon as the wear leveling moves a block, its correctable errors should disappear.
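To illustrate what consistent handling of a corrected ECC error could look like, here is a minimal sketch, not LevelX code. It assumes every read-like path would map the driver status through one helper, so that a corrected error always means "data usable, refresh wanted". The `SKETCH_*` constants, `read_outcome` and `classify_read_status` are hypothetical names, and the numeric values are placeholders; real code would use the definitions from `lx_api.h`.

```c
#include <stdbool.h>

/* Stand-in status codes for this sketch only. The values are placeholders;
   real code would use the LX_* definitions from lx_api.h. */
enum {
    SKETCH_SUCCESS                  = 0,
    SKETCH_ERROR                    = 1,
    SKETCH_NAND_ERROR_CORRECTED     = 6,
    SKETCH_NAND_ERROR_NOT_CORRECTED = 7
};

/* Hypothetical result type: what a caller needs to know after a read or copy. */
typedef struct {
    bool data_usable;    /* the returned data can be trusted */
    bool refresh_wanted; /* the page should be rewritten elsewhere soon */
} read_outcome;

/* Map a driver status to a single, uniform decision. */
static read_outcome classify_read_status(unsigned int status)
{
    read_outcome out = { false, false };

    switch (status) {
    case SKETCH_SUCCESS:
        out.data_usable = true;
        break;
    case SKETCH_NAND_ERROR_CORRECTED:
        /* ECC repaired the data: continue, but flag the page for relocation. */
        out.data_usable = true;
        out.refresh_wanted = true;
        break;
    default:
        /* Uncorrectable or generic error: the operation must fail. */
        break;
    }
    return out;
}
```

With a helper like this, a sector read and a page copy would react to the same status in the same way, instead of one tolerating it and the other failing.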
You're right, error handling cannot happen at the driver level, since LevelX wouldn't know about that, and would misbehave in some way or other. Given these problems, but especially due to the generally poor performance of LevelX (the amount of "physical" memory I/O relative to the "logical" amount of data handled by the application), and the unresponsiveness of the developers, I've moved to LittleFS.
I can't figure out if and how LevelX handles these two error conditions:
A) blocks that go bad (not factory-marked as bad)
B) correctable ECC errors
For bad blocks: when a block must be erased, the driver function LX_NAND_FLASH.lx_nand_flash_driver_block_erase is called.
This function will trigger an erase in the NAND chip, which can fail if the block has gone bad. In this scenario, the block wasn't bad at first, so it is not yet marked as bad, and LevelX still thinks it is good.
The function can return an error, and it seems that all the LevelX functions that call it check its result. These functions are:
lx_nand_flash_block_data_move
lx_nand_flash_driver_block_erase
lx_nand_flash_driver_block_erased_verify
lx_nand_flash_format
lx_nand_flash_metadata_allocate
lx_nand_flash_sector_release
lx_nand_flash_sector_write
Upon error, they all call _lx_nand_flash_system_error, which in turn calls the driver function LX_NAND_FLASH.lx_nand_flash_driver_system_error
However, after this, as the error is reported up the call stack, it seems that one of two things happens:
When the error is ignored, it also seems that this causes corruption of the LevelX internal data: the operation itself is reported as successful, but then LevelX starts misbehaving (e.g. in future operations, it asks the driver to access non-existing block 0xFFFF).
I tried calling _lx_nand_flash_block_status_set from within the driver function lx_nand_flash_driver_system_error, to let LevelX know that the block is bad, but it didn't work; the call seemed to succeed, but LevelX misbehaved anyway.
Also, I don't think I can just mark the block as bad in hardware, as in this case LevelX wouldn't know it.
How can I handle these blocks that go bad? Should I call some LevelX utility, inside the system error function or elsewhere?
Should I just ignore it (just report the error), and LevelX will automatically take care of it during the next operation?
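As a partial illustration only: the driver can at least remember which blocks failed an erase at run time, and reject obviously invalid block numbers such as 0xFFFF before they reach the hardware. The sketch below is a standalone bitmap, not a LevelX facility. `SKETCH_TOTAL_BLOCKS`, `grown_bad_table` and the `grown_bad_*` functions are hypothetical names. As the replies in this thread point out, recording the failure in the driver does not by itself keep LevelX's internal state consistent, so this is bookkeeping for diagnostics or deferred handling, not a fix.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical device geometry for the sketch. */
#define SKETCH_TOTAL_BLOCKS 1024u

/* One bit per block: set when an erase or program failure was seen at run time. */
typedef struct {
    uint8_t  bits[SKETCH_TOTAL_BLOCKS / 8u];
    unsigned count; /* number of distinct blocks marked so far */
} grown_bad_table;

static void grown_bad_init(grown_bad_table *t)
{
    memset(t, 0, sizeof *t);
}

/* Record a failing block. Returns false if the block number is out of range. */
static bool grown_bad_mark(grown_bad_table *t, uint32_t block)
{
    if (block >= SKETCH_TOTAL_BLOCKS) {
        return false;
    }
    uint8_t mask = (uint8_t)(1u << (block % 8u));
    if ((t->bits[block / 8u] & mask) == 0u) {
        t->bits[block / 8u] |= mask;
        t->count++;
    }
    return true;
}

/* Query whether a block was recorded as failing. Out-of-range blocks are not marked. */
static bool grown_bad_is_marked(const grown_bad_table *t, uint32_t block)
{
    if (block >= SKETCH_TOTAL_BLOCKS) {
        return false;
    }
    return (t->bits[block / 8u] & (uint8_t)(1u << (block % 8u))) != 0u;
}
```

A driver's erase routine could call `grown_bad_mark` when the chip reports an erase failure, and the application could inspect the table later. What LevelX itself should then do with such a block is exactly the open question of this issue.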
For correctable ECC errors: when a page is read, and there is an ECC error, and this error is corrected, the data can be used normally. This works. However, to prevent the data from accumulating further errors that would make it uncorrectable (or even let corruption go undetected) in the future, the affected page should be moved to another page, so that the data and ECC code are rewritten, thus restoring the page to a zero-error condition.
LevelX doesn't seem to do that.
The page will eventually be moved elsewhere, thus removing the error, as a consequence of other operations. However, this can be arbitrarily far in the future, especially if the data is mostly read, and rarely written, so there is no guarantee about when the error will disappear, and in the meantime the error might get worse. So, we cannot just let the error be, and wait.
In case of a corrected ECC error, I return the error LX_NAND_ERROR_CORRECTED from these driver functions:
LX_NAND_FLASH.lx_nand_flash_driver_pages_read
LX_NAND_FLASH.lx_nand_flash_driver_pages_copy
The error is then handled by these LevelX functions:
lx_nand_flash_metadata_allocate
lx_nand_flash_open
lx_nand_flash_sector_read
lx_nand_flash_sector_release
lx_nand_flash_sector_write
It is handled by calling lx_nand_flash_driver_system_error, and then continuing with the operation, without a failure.
This is OK, but it seems that LevelX never tries to move the page elsewhere to remove the error.
How can I handle this "repair" of the errored page? Should I call some LevelX utility, inside the system error function or elsewhere?
Should I just ignore it (just report the error)?
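One possible shape for the "repair" bookkeeping, sketched under assumptions: the driver notes every page that needed ECC correction in a small queue, and application code drains the queue later, outside the driver callback. The names `scrub_queue`, `scrub_entry`, `scrub_queue_note` and `scrub_queue_take` are hypothetical and not part of LevelX. How a drained entry would actually be refreshed (for example by rewriting the affected logical sector through the normal sector write path) is not confirmed anywhere in this thread, and the driver alone does not know which logical sector a physical page belongs to.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity; a real system would size this for its workload. */
#define SCRUB_QUEUE_CAPACITY 8u

/* A physical page that returned data only after ECC correction. */
typedef struct {
    uint32_t block;
    uint32_t page;
} scrub_entry;

/* Small FIFO of pages waiting to be refreshed. Zero-initialize before use. */
typedef struct {
    scrub_entry entries[SCRUB_QUEUE_CAPACITY];
    size_t      count;
} scrub_queue;

/* Note a corrected page. Duplicates are ignored. Returns false when full. */
static bool scrub_queue_note(scrub_queue *q, uint32_t block, uint32_t page)
{
    for (size_t i = 0; i < q->count; i++) {
        if (q->entries[i].block == block && q->entries[i].page == page) {
            return true; /* already queued */
        }
    }
    if (q->count == SCRUB_QUEUE_CAPACITY) {
        return false; /* caller may log the overflow and retry later */
    }
    q->entries[q->count].block = block;
    q->entries[q->count].page = page;
    q->count++;
    return true;
}

/* Remove the oldest entry. Returns false when the queue is empty. */
static bool scrub_queue_take(scrub_queue *q, scrub_entry *out)
{
    if (q->count == 0) {
        return false;
    }
    *out = q->entries[0];
    for (size_t i = 1; i < q->count; i++) {
        q->entries[i - 1] = q->entries[i];
    }
    q->count--;
    return true;
}
```

Keeping the queue separate from the read path means the driver callback stays short, and the refresh work is deferred to a context where it is safe to issue further flash operations.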