[NAND] Handling of bad blocks and ECC errors #43

stefano-zanotti · 2024-04-12T08:28:02Z

I can't figure out if and how LevelX handles these two error conditions:
A) blocks that go bad (not factory-marked as bad)
B) correctable ECC errors

For bad blocks: when a block must be erased, the driver function LX_NAND_FLASH.lx_nand_flash_driver_block_erase is called.
This function will trigger an erase in the NAND chip, which can fail if the block has gone bad. In this scenario, the block wasn't bad at first, so it is not yet marked as bad, and LevelX still thinks it is good.
The function can return an error, and it seems that all the LevelX functions that call it check its result. These functions are:
lx_nand_flash_block_data_move
lx_nand_flash_driver_block_erase
lx_nand_flash_driver_block_erased_verify
lx_nand_flash_format
lx_nand_flash_metadata_allocate
lx_nand_flash_sector_release
lx_nand_flash_sector_write
Upon error, they all call _lx_nand_flash_system_error, which in turn calls the driver function LX_NAND_FLASH.lx_nand_flash_driver_system_error.
However, after this, as the error is reported up the call stack, it seems that one of two things happens:

the whole operation fails because of the error
the error is ignored (e.g. when _lx_nand_flash_metadata_write calls _lx_nand_flash_metadata_allocate)

When the error is ignored, it also seems to corrupt the LevelX internal data: the operation itself is reported as successful, but then LevelX starts misbehaving (e.g. in later operations, it asks the driver to access the non-existent block 0xFFFF).
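As a defensive measure against requests like the 0xFFFF one above, a driver can range-check block numbers before touching the hardware. The following is only a minimal sketch: `DEMO_TOTAL_BLOCKS`, the `demo_*` names and the status codes are hypothetical stand-ins, not LevelX identifiers, and the real driver entry points have different signatures. It does not fix the underlying inconsistency; it only keeps a bogus request from reaching the NAND chip.

```c
#include <stdio.h>

/* Hypothetical geometry and status codes, for illustration only. */
#define DEMO_TOTAL_BLOCKS 1024u
#define DEMO_OK           0u
#define DEMO_ERROR        1u

/* Counts how many out-of-range requests were rejected. */
static unsigned long demo_rejected_requests;

/* Returns non-zero when the block index exists on the device. */
static unsigned int demo_block_is_valid(unsigned long block)
{
    return block < DEMO_TOTAL_BLOCKS;
}

/* Stand-in for a driver erase entry point with a range check in front. */
static unsigned int demo_driver_block_erase(unsigned long block)
{
    if (!demo_block_is_valid(block)) {
        demo_rejected_requests++;
        fprintf(stderr, "erase rejected: block 0x%lX is out of range\n", block);
        return DEMO_ERROR;
    }

    /* The real hardware erase command would be issued here. */
    return DEMO_OK;
}
```

With this check, an erase request for block 0xFFFF returns an error and is counted, while a request for an existing block proceeds as before.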

I tried calling _lx_nand_flash_block_status_set from within the driver function lx_nand_flash_driver_system_error, to let LevelX know that the block is bad, but it didn't work; the call seemed to succeed, but LevelX misbehaved anyway.
Also, I don't think I can just mark the block as bad in hardware, as in this case LevelX wouldn't know it.

How can I handle these blocks that go bad? Should I call some LevelX utility, inside the system error function or elsewhere?
Should I just ignore it (just report the error), and LevelX will automatically take care of it during the next operation?

For correctable ECC errors: when a page is read and an ECC error is detected and corrected, the data can be used normally. This works. However, to prevent the page from accumulating more bit errors, which would eventually make them uncorrectable (or even undetectable), the data should be moved to another page, so that both the data and the ECC code are rewritten and the page is back to a zero-error state.
LevelX doesn't seem to do that.
The page will eventually be moved elsewhere, thus removing the error, as a consequence of other operations. However, this can be arbitrarily far in the future, especially if the data is mostly read, and rarely written, so there is no guarantee about when the error will disappear, and in the meantime the error might get worse. So, we cannot just let the error be, and wait.
In case of a corrected ECC error, I return the error LX_NAND_ERROR_CORRECTED from these driver functions:
LX_NAND_FLASH.lx_nand_flash_driver_pages_read
LX_NAND_FLASH.lx_nand_flash_driver_pages_copy
The error is then handled by these LevelX functions:
lx_nand_flash_metadata_allocate
lx_nand_flash_open
lx_nand_flash_sector_read
lx_nand_flash_sector_release
lx_nand_flash_sector_write
It is handled by calling lx_nand_flash_driver_system_error, and then continuing with the operation, without a failure.
This is ok, but it seems that LevelX never tries to move the page elsewhere to remove the error.

How can I handle this "repair" of the errored page? Should I call some LevelX utility, inside the system error function or elsewhere?
Should I just ignore it (just report the error)?
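One workaround that could be tried from the application side is a "refresh on read" wrapper: the driver sets a flag whenever a read needed ECC correction, and the wrapper rewrites the same logical sector when the flag is set. The sketch below is only that, a sketch. All `demo_*` names are hypothetical, the storage is an in-memory mock rather than LevelX, and it rests on the unverified assumption that rewriting a logical sector makes the translation layer program the data into a fresh page. It would also cover only data sectors read through the wrapper, not LevelX's own metadata pages.

```c
#include <string.h>

/* Hypothetical sizes and status codes, for illustration only. */
#define DEMO_SECTOR_SIZE   512u
#define DEMO_SECTOR_COUNT  8u
#define DEMO_OK            0u
#define DEMO_ERROR         1u

/* Set by the (mock) driver when a page read needed ECC correction. */
static volatile unsigned int demo_ecc_corrected_flag;

/* Number of sectors rewritten because of a corrected error. */
static unsigned int demo_rewrite_count;

/* In-memory stand-in for the flash translation layer. */
static unsigned char demo_storage[DEMO_SECTOR_COUNT][DEMO_SECTOR_SIZE];
/* 1 = reading this sector reports a corrected ECC error. */
static unsigned char demo_weak_sector[DEMO_SECTOR_COUNT];

static unsigned int demo_sector_read(unsigned long sector, void *buffer)
{
    if (sector >= DEMO_SECTOR_COUNT) {
        return DEMO_ERROR;
    }
    if (demo_weak_sector[sector]) {
        demo_ecc_corrected_flag = 1u; /* what the driver would signal */
    }
    memcpy(buffer, demo_storage[sector], DEMO_SECTOR_SIZE);
    return DEMO_OK;
}

static unsigned int demo_sector_write(unsigned long sector, const void *buffer)
{
    if (sector >= DEMO_SECTOR_COUNT) {
        return DEMO_ERROR;
    }
    memcpy(demo_storage[sector], buffer, DEMO_SECTOR_SIZE);
    demo_weak_sector[sector] = 0u; /* assumption: a fresh page has no bit errors */
    return DEMO_OK;
}

/* Read a sector; if the read needed ECC correction, rewrite the same data. */
unsigned int demo_read_with_refresh(unsigned long sector, void *buffer)
{
    unsigned int status;

    demo_ecc_corrected_flag = 0u;
    status = demo_sector_read(sector, buffer);
    if (status != DEMO_OK) {
        return status;
    }

    if (demo_ecc_corrected_flag) {
        demo_ecc_corrected_flag = 0u;
        if (demo_sector_write(sector, buffer) == DEMO_OK) {
            demo_rewrite_count++;
        }
    }

    /* The read itself succeeded, so a failed refresh is not reported here. */
    return DEMO_OK;
}
```

In this mock, the first read of a "weak" sector triggers one rewrite and later reads of the same sector do not. Whether the equivalent holds on a real device depends on LevelX internals that are not confirmed here.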

Chabrol · 2025-01-20T13:18:58Z

It seems that LX_NAND_ERROR_CORRECTED is not handled consistently. For example, lx_nand_flash_sector_read tolerates the error, but lx_nand_flash_data_page_copy will return LX_ERROR.
The reaction to erase errors looks inconsistent too. For example, _lx_nand_flash_format checks for this case and marks the sector as bad, but _lx_nand_flash_sector_write will trigger an error.

I think there should be more error handling in LevelX. I'd say an erase error or a program error should result in a new bad block. I'm unsure whether handling these errors is even possible at the driver level alone. The low-level functions have no contextual information about which operation is being executed in the layer above. And blindly marking sectors as bad, without further appropriate steps in the layer above, will probably lead to inconsistency.
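To illustrate what "an erase error should result in a new bad block" could look like in the layer above the driver, here is a minimal, self-contained sketch. It is not LevelX code: the block table, the `demo_*` names and the simulated erase failures are all hypothetical. The point is only that the layer owning the block table can mark the failing block bad and move on to another candidate in one step, which a driver cannot do on its own.

```c
/* Hypothetical geometry and status codes, for illustration only. */
#define DEMO_BLOCK_COUNT 8u
#define DEMO_NO_BLOCK    0xFFFFFFFFu
#define DEMO_OK          0u
#define DEMO_ERROR       1u

typedef enum {
    DEMO_BLOCK_FREE = 0,
    DEMO_BLOCK_IN_USE,
    DEMO_BLOCK_BAD
} demo_block_state;

/* Block table owned by the upper layer; zero-initialized, so all blocks start free. */
static demo_block_state demo_state[DEMO_BLOCK_COUNT];

/* 1 = the simulated hardware erase of this block fails (block has gone bad). */
static unsigned char demo_erase_fails[DEMO_BLOCK_COUNT];

/* Stand-in for the driver erase call. */
static unsigned int demo_hw_erase(unsigned int block)
{
    return demo_erase_fails[block] ? DEMO_ERROR : DEMO_OK;
}

/*
 * Try free blocks in order until one erases successfully.
 * Every block whose erase fails is marked bad and never tried again.
 * Returns the allocated block index, or DEMO_NO_BLOCK if none is usable.
 */
unsigned int demo_allocate_erased_block(void)
{
    unsigned int block;

    for (block = 0u; block < DEMO_BLOCK_COUNT; block++) {
        if (demo_state[block] != DEMO_BLOCK_FREE) {
            continue;
        }
        if (demo_hw_erase(block) == DEMO_OK) {
            demo_state[block] = DEMO_BLOCK_IN_USE;
            return block;
        }
        demo_state[block] = DEMO_BLOCK_BAD; /* newly discovered bad block */
    }

    return DEMO_NO_BLOCK;
}
```

A real implementation would also have to persist the bad-block mark (for example in the spare area) and keep any mapping metadata consistent, which is exactly the contextual work that cannot be done from the driver.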

One note about the issue with accumulating bit errors. Actively relocating the data is probably the best way of handling this. But at least the wear leveling is helping a bit, as long as there are write operations performed from time to time. As soon as the wear leveling is moving a block, its correctable errors should disappear.

stefano-zanotti · 2025-01-20T14:49:30Z

You're right, error handling cannot happen at the driver level, since LevelX wouldn't know about that, and would misbehave in some way or other.
Actively moving data is what should happen in case of LX_NAND_ERROR_CORRECTED, but it can't be done outside of LevelX – the driver can't just do that on its own, and the application code has no way to enforce the move of a specific block, as it has no visibility into such details.

Given these problems, but especially due to the generally poor performance of LevelX (the amount of "physical" memory I/O relative to the "logical" amount of data handled by the application), and the unresponsiveness of the developers, I've moved to LittleFS.
