Reliability testing

tomiata
contributor
Posts: 234
Joined: Sat Aug 02, 2014 2:30 am
Location: Texas

Reliability testing

Post by tomiata »

Here is the setup I made to try to reproduce a config corruption problem. See http://rusefi.com/forum/viewtopic.php?f=3&t=660&start=270#p21437
[Attachment: power-cycle-testing.jpg]
I've done 7000+ power cycles with no problems on just a Discovery board running a similar setup. And I'm working on this sequence next.
As the next step of endurance testing, can we try adding:

set_int 2472 100
writeconfig
set_int 2472 101
writeconfig
set_int 2472 102
writeconfig
AndreyB
Site Admin
Posts: 14292
Joined: Wed Aug 28, 2013 1:28 am
Location: Jersey City
Github Username: rusefillc
Slack: Andrey B

Re: Reliability testing

Post by AndreyB »

Very important work, shame on me for not having it already. At some point I should recreate the exact same setup, so please keep us posted with pictures and sketches!
Very limited telepathic abilities - please post logs & tunes where appropriate - http://rusefi.com/s/questions

Always looking for C/C++/Java/PHP developers! Please help us see https://rusefi.com/s/howtocontribute
puff
contributor
Posts: 2961
Joined: Mon Nov 11, 2013 11:28 am
Location: Moskau

Re: Reliability testing

Post by puff »

but what will you get from it? even if you manage to detect a fault, you won't be able to reproduce it and track it back, will you? what's the current assumption? you get it destroyed when writing down config?

i'd prefer to make some educated guess.

What happens when the board itself is off, but you have some voltage on the ADC pins? Could it be some sort of parasitic power? How is the battery voltage ADC input connected, for instance? The O2 sensor? Anything else that's outside the Frankenso board?

Re: Reliability testing

Post by tomiata »

puff wrote:but what will you get from it? even if you manage to detect a fault, you won't be able to reproduce it and track it back, will you? what's the current assumption? you get it destroyed when writing down config?
We can try different write operations and vary the timing of power off to see what causes corruption. If it can reliably reproduce the corruption, we can add fixes to prevent it and verify that the fixes are good.

Re: Reliability testing

Post by puff »

I think it's too early for such tests. After 7,000 successful tests, it doesn't seem we will see any failure, at least without changing something external.
Is there a sequence on UART to reset the board? Could it be that some noise on UART was interpreted as a reset command?

I think we'd better collect as much data on such incidents as possible, including the initial settings, hardware config prior to the incident, connected USB ports, etc. With such statistics in hand and analysed, I think we'd be able to achieve much more. But this requires a sort of record book of all manipulations with the board (sometimes I can't describe exactly what I did to the board, in the right sequence). :cry:

Re: Reliability testing

Post by tomiata »

What about flash write cycles on the STM32 internal flash? It looks like it is only good for about 10,000 writes. Is there any wear leveling implemented for the space used by writeconfig?

See the app note:
http://www.st.com/content/ccc/resource/technical/document/application_note/ec/dd/8e/a8/39/49/4f/e5/DM00036065.pdf/files/DM00036065.pdf/jcr:content/translations/en.DM00036065.pdf

Re: Reliability testing

Post by AndreyB »

No leveling; the assumption is that in real life any flash write is a human-initiated activity, so the number is not that large.

I will add a flash region offset, at least as a compile-time parameter, so that a custom test version can touch areas we do not normally touch and the test units stay useful for road use.

I wonder how many cycles my continuous integration board has on it.
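
To put the wear question in numbers, here is a rough back-of-the-envelope sketch. The 10,000 figure is the one quoted earlier in this thread; the writes-per-day and seconds-per-cycle values are made-up illustrations, not measurements from any real board.

```python
# Rough flash-wear budget for a single flash region with no wear leveling.
# ENDURANCE_CYCLES is the ~10,000 figure quoted in this thread; the usage
# rates passed in below are hypothetical examples.
ENDURANCE_CYCLES = 10_000


def years_until_worn(writes_per_day: float, endurance: int = ENDURANCE_CYCLES) -> float:
    """Years of use before the write budget of one region is spent."""
    return endurance / (writes_per_day * 365.0)


def rig_hours_until_worn(seconds_per_cycle: float, endurance: int = ENDURANCE_CYCLES) -> float:
    """Hours an automated rig can run, doing one write per power cycle."""
    return endurance * seconds_per_cycle / 3600.0


if __name__ == "__main__":
    print(f"human tuner, 5 writes/day: {years_until_worn(5):.1f} years")
    print(f"test rig, 12 s per cycle: {rig_hours_until_worn(12):.1f} hours")
```

The point of the comparison: a human-driven write rate takes years to reach the budget, while an automated rig that writes on every cycle can reach it within days, which is why a separate flash region for test builds makes sense.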

Re: Reliability testing

Post by AndreyB »

https://sourceforge.net/p/rusefi/tickets/334/

Reliability issues are too infrequent as it is (mostly because of the tiny user base), so the power-off-after-write sweep test is a good balance between building a testing framework and suspecting we have an issue there; we just need to reproduce it.
kb1gtt
contributor
Posts: 3758
Joined: Tue Sep 10, 2013 1:42 am
Location: ME of USA

Re: Reliability testing

Post by kb1gtt »

A less commonly known issue with most flash memory is that it is generally not good long-term storage. Generally it is only good for around 1 to 5 years. After that you start to suffer from occasional bit flips. Granted, after 5 years it's probably only 1 bit, but that can still be a real issue. There are chips out there that offer flash with 10+ year lifetimes, but those are generally kind of expensive and would need to be external chips. The alternative is that you assume the board will be powered, you track the time since it was last written, and you occasionally re-write the flash. I understand that most OS hardware abstraction layers do this automatically when they know it's an SD card, flash, etc. They look at the system time and once a year they simply read, then re-write, the entire memory.

Anyhow, this is not the issue at hand. I just wanted to make note of it, as it does pertain to long-term reliability. I remember a Circuit Cellar article about this which detailed it fairly well. I'll see if I can find it. However, right now I don't have much in the way of citable sources.
Welcome to the friendlier side of internet crazy :)

Re: Reliability testing

Post by puff »

It relates to the physical size of a cell. In less dense memory technologies (e.g. previous generations of SSD), the cell is larger, so it takes longer for the charge to drain, hence the data is retained for a longer time.
Therefore, chips with a relatively small EEPROM size can store data for a long time. In the case of solid-state memory, it is advisable to write at a higher temperature and store devices in the cold.
Manufacturers usually provide specifications on how their devices store data.
They become less reliable after a certain number of writes.

Some good discussion in Russian
https://geektimes.ru/post/250406/
with the link to the original report http://www.jedec.org/sites/default/files/Alvin_Cox%20%5BCompatibility%20Mode%5D_0.pdf

Re: Reliability testing

Post by tomiata »

I've been playing with the test script to drive the power cycle testing.
The Raspberry Pi I'm using is not very fast and my Python is not very efficient, so it sometimes loses some of the console output. I'll try a faster Pi to see if it works better.

The test sequence goes like this:

1. Start with an ON-time of 10 seconds
2. Tell the Arduino to turn on power
3. Wait for rusefi output "Running main loop"
4. Look for "MIATA" in the output to check the config is intact
5. Send command "fl 0" to turn off verbose logging output
6. Wait until CMD-Delay seconds have elapsed from power on
7. Send commands to cause a flash write: "set_int 2472 100" and "writeconfig"
8. Wait for the Arduino to turn off power
9. Adjust ON-time and repeat the cycle

I'm just guessing on the timing of when to send flash write commands.
I ran a 100 iteration loop with the CMD-Delay time set to 6 seconds varying the ON time from 10 to 12 seconds and that worked ok.

I adjusted the CMD-Delay time to 8 seconds and I got it to break and lose the configuration.

on startup, console output usually looks like:
>>> line:20:msg,readFromFlash(),
>>> line:40:msg,Got valid configuration from flash!,

Failed case looks like this:
>>> line:20:msg,readFromFlash(),
>>> line:46:msg,Need to reset flash to default due to CRC,

I'll try again to see if I can get it to break consistently.
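
For anyone recreating the rig, the pass/fail check and the ON-time sweep described above could be sketched roughly like this. It is not the actual test script: the serial and Arduino I/O is left out, the function names are hypothetical, and evenly spaced ON-times are an assumption (the post only says the ON time was varied from 10 to 12 seconds). The two marker strings are copied from the console output quoted above.

```python
from enum import Enum


class BootResult(Enum):
    CONFIG_OK = "config ok"
    CONFIG_LOST = "config lost"
    UNKNOWN = "unknown"


# Marker strings taken from the console output quoted in this post.
VALID_MARKER = "Got valid configuration from flash!"
CRC_RESET_MARKER = "Need to reset flash to default due to CRC"


def classify_boot(console_lines):
    """Classify one boot from the console lines captured after power on."""
    for line in console_lines:
        if CRC_RESET_MARKER in line:
            return BootResult.CONFIG_LOST
        if VALID_MARKER in line:
            return BootResult.CONFIG_OK
    # Neither marker seen, e.g. console output was dropped by a slow reader.
    return BootResult.UNKNOWN


def on_time_sweep(start_s, stop_s, iterations):
    """Evenly spaced ON-times from start_s to stop_s inclusive (an assumption)."""
    if iterations < 2:
        return [start_s]
    step = (stop_s - start_s) / (iterations - 1)
    return [start_s + i * step for i in range(iterations)]


if __name__ == "__main__":
    good = [">>> line:20:msg,readFromFlash(),",
            ">>> line:40:msg,Got valid configuration from flash!,"]
    print(classify_boot(good))
    print(on_time_sweep(10.0, 12.0, 5))
```

Keeping an explicit UNKNOWN result matters here because lost console output should not be counted as either a pass or a corruption.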

Re: Reliability testing

Post by AndreyB »

:roll: https://sourceforge.net/p/rusefi/tickets/335/

This issue should be easy to address - we will write two copies each time we burn, we would need at least one on read. Hopefully power loss would only damage one copy of the configuration.

Re: Reliability testing

Post by tomiata »

russian wrote: This issue should be easy to address - we will write two copies each time we burn, we would need at least one on read. Hopefully power loss would only damage one copy of the configuration.
How about keeping an online copy and an offline copy: only update the offline copy, then swap pointers to make the offline copy the online one? Then you always have the previous state and extend the life of that flash space.

Re: Reliability testing

Post by AndreyB »

tomiata wrote:then swap pointers
Both images would be in flash and the pointer would be in flash as well, so it could fail while we are swapping pointers; I think the pointer would not work.

But I can have a version number in each image so that I will overwrite the older and always have two copies - current and previous.
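
The version-number idea can be modelled in a few lines. This is only a host-side sketch in Python, not the firmware code: the header layout (version plus CRC32), the function names, and the use of `zlib.crc32` are all assumptions made for illustration. Flash slots are simulated as byte strings in a list.

```python
import struct
import zlib

# Hypothetical image header: 32-bit version, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def pack_image(version: int, payload: bytes) -> bytes:
    return HEADER.pack(version, zlib.crc32(payload)) + payload


def unpack_image(raw: bytes):
    """Return (version, payload) if the CRC matches, else None."""
    if len(raw) < HEADER.size:
        return None
    version, crc = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if zlib.crc32(payload) != crc:
        return None
    return version, payload


def load_newest(slots):
    """Pick the valid image with the highest version; None if all are bad."""
    valid = [img for img in map(unpack_image, slots) if img is not None]
    return max(valid, key=lambda img: img[0]) if valid else None


def save(slots, payload: bytes) -> int:
    """Overwrite the older (or invalid) slot; the newest slot is not touched."""
    parsed = [unpack_image(s) for s in slots]
    versions = [p[0] if p else -1 for p in parsed]
    target = versions.index(min(versions))
    slots[target] = pack_image(max(versions) + 1, payload)
    return target


if __name__ == "__main__":
    slots = [b"", b""]
    for cfg in (b"cfg A", b"cfg B", b"cfg C"):
        save(slots, cfg)
    print(load_newest(slots))
```

Because a write only ever touches the older slot, a power loss in the middle of it leaves the newest good image intact, and there is no separate pointer in flash that could itself be corrupted.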

Re: Reliability testing

Post by tomiata »

russian wrote: ..version number in each image so that I will overwrite the older and always have two copies - current and previous.
Yes, sounds good.

Re: Reliability testing

Post by AndreyB »

https://sourceforge.net/p/rusefi/tickets/335/ committed - revision #11504 firmware version 20170214

Re: Reliability testing

Post by tomiata »

russian wrote:https://sourceforge.net/p/rusefi/tickets/335/ committed - revision #11504 firmware version 20170214
Thanks, this should address the problem of corrupted config data from powering off at a bad time. I'll retest to confirm.

We had an email conversation about testing for corruption due to a low-voltage condition or brownout.

One suggestion was to run a similar loop as before while powered from a 12 V battery that slowly discharges: boot up, wait a bit, do a write, then reset, all while the 12 V supply is dropping. It seems this would eventually hit a loop where the 5 V regulated supply loses regulation. What would you expect the 5 V supply to do at that point?
mobyfab
Posts: 139
Joined: Tue Oct 29, 2013 10:09 am
Location: Versailles, France

Re: Reliability testing

Post by mobyfab »

I'm actually surprised you guys didn't take this into account from the beginning, it was bound to happen. (no offense)

I use an SPI EEPROM with CRC and wear leveling: https://github.com/fpoussin/MotoLink/blob/master/code/app/storage.c#L271
Feel free to copy/try it.

Every time you write to the flash, execution stops (including interrupts) unless you are running from RAM. Not a good thing.

Re: Reliability testing

Post by AndreyB »

mobyfab wrote:I'm actually surprised you guys didn't take this into account from the beginning, it was bound to happen. (no offense)
Any other scenarios we might have fixed?

As for internal/external flash that's technically a different subject altogether, for now our workaround is to only write flash while engine is not running.

Re: Reliability testing

Post by AndreyB »

Just committed https://sourceforge.net/p/rusefi/tickets/354/

It has new commands to get and set the BOR (brown-out reset) setting. It turned out that
OB_BOR_OFF ((uint8_t)0x0C) /*!< Supply voltage ranges from 1.62 to 2.10 V */
is the default option.

All available options are:

#define OB_BOR_LEVEL3 ((uint8_t)0x00) /*!< Supply voltage ranges from 2.70 to 3.60 V */
#define OB_BOR_LEVEL2 ((uint8_t)0x04) /*!< Supply voltage ranges from 2.40 to 2.70 V */
#define OB_BOR_LEVEL1 ((uint8_t)0x08) /*!< Supply voltage ranges from 2.10 to 2.40 V */
#define OB_BOR_OFF ((uint8_t)0x0C) /*!< Supply voltage ranges from 1.62 to 2.10 V */
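
If a test script needs to interpret the raw value, a small lookup built from the four values above is enough. This is a hypothetical helper, not part of rusEfi: the output format of the new get command is not shown in this thread, and masking with 0x0C assumes that only bits 2 and 3 carry the level, as the four listed values suggest.

```python
# Lookup built from the four OB_BOR_* values listed above.
# Masking with 0x0C is an assumption: the listed values differ only in bits 2-3.
BOR_LEVELS = {
    0x00: ("OB_BOR_LEVEL3", 2.70, 3.60),
    0x04: ("OB_BOR_LEVEL2", 2.40, 2.70),
    0x08: ("OB_BOR_LEVEL1", 2.10, 2.40),
    0x0C: ("OB_BOR_OFF", 1.62, 2.10),
}


def describe_bor(value: int) -> str:
    """Turn a raw BOR setting into a readable name and voltage range."""
    name, low, high = BOR_LEVELS[value & 0x0C]
    return f"{name}: {low:.2f} to {high:.2f} V"


if __name__ == "__main__":
    for raw in (0x00, 0x04, 0x08, 0x0C):
        print(hex(raw), describe_bor(raw))
```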

Re: Reliability testing

Post by kb1gtt »

Does it work with OB_BOR_LEVEL1, and if so can you set the default to that?

Re: Reliability testing

Post by AndreyB »

kb1gtt wrote:Does it work with OB_BOR_LEVEL1, and if so can you set the default to that?
Why level1 2.1-2.4v and not level3 2.7-3.6?

Re: Reliability testing

Post by kb1gtt »

Level 3 will trigger turning off the MCU under mild ripple. Level 1 will trigger only when there is a risk of corrupted memory. I fear that level 3 would turn the STM off more often than preferred. We probably have a bit of ripple on the power supply, and we don't want false trips, but we do want protected memory. I guess we should check whether it works with all of these settings, or whether we see nuisance trips with any of them.

Re: Reliability testing

Post by mobyfab »

russian wrote:
mobyfab wrote:I'm actually surprised you guys didn't take this into account from the beginning, it was bound to happen. (no offense)
Any other scenarios we might have fixed?

As for internal/external flash that's technically a different subject altogether, for now our workaround is to only write flash while engine is not running.
Okay, I thought you were doing that while the engine is running, for live tuning. It's no big deal then, just a little less convenient (since you have to erase a full page each time).
I think as long as you use wear leveling and CRC you should be fine. BOR is a good idea as well.
STM32 flash endurance is rated for 10k+ cycles at 85 degrees, so probably 50k+ between 10-40 degrees. Not sure about data retention.
(http://hackaday.com/2014/12/04/flash-memory-endurance-testing/)

If you have a spare SPI/I2C you should think about adding an eeprom, it costs almost nothing, and you can write it while the engine is running.

Re: Reliability testing

Post by kb1gtt »

I believe we are using pre-made EEPROM code. I believe it does not have wear leveling. Perhaps we should push ChibiOS, or whoever made the EEPROM code, to offer wear leveling. I seem to recall the EEPROM code was separate from ChibiOS and simply took data and put it in the specified memory location. Either ChibiOS could deal with changing the location, or perhaps the EEPROM code could do it. It would be handy if we didn't have to deal with the index ourselves to find the actual data, or drill down to the low-level memory addresses.

Also, I believe a common feature of wear leveling is to keep a time stamp for the memory, such that once every year a maintenance routine simply copies the EEPROM memory from one location to another. Many people think that EEPROMs are permanent, but they typically start to suffer from bit rot somewhere around 1 to 5 years. So most solid-state drives need to be powered and will occasionally copy the memory around to ensure data integrity. I believe this kind of memory protection is also currently missing. It would probably be good to put on the long-term goals list.
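
The time-stamp check described here is simple to express. A minimal sketch, assuming the firmware has some trustworthy notion of the current date (which an ECU without a real-time clock may not) and that the last-write time is stored alongside the data; the names and the one-year interval are illustrative only.

```python
from datetime import datetime, timedelta

# "Once every year", per the post above; the exact interval is a guess.
REFRESH_INTERVAL = timedelta(days=365)


def needs_refresh(last_written: datetime, now: datetime,
                  interval: timedelta = REFRESH_INTERVAL) -> bool:
    """True when the stored data is old enough that it should be re-written."""
    return now - last_written >= interval


if __name__ == "__main__":
    written = datetime(2017, 2, 14)
    print(needs_refresh(written, datetime(2017, 9, 1)))
    print(needs_refresh(written, datetime(2018, 3, 1)))
```

Note that each refresh is itself a write cycle, so the interval has to be weighed against the endurance budget discussed earlier in the thread.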

Re: Reliability testing

Post by AndreyB »

mobyfab wrote:If you have a spare SPI/I2C you should think about adding an eeprom, it costs almost nothing, and you can write it while the engine is running.
I would like to have both eeprom and RAM on http://rusefi.com/forum/viewtopic.php?f=4&t=1166

Re: Reliability testing

Post by AndreyB »

Looks like setting BOR had unplanned consequences - just wasted three hours tracking the problem down :(

BOR_Set is disabled for now, see https://github.com/rusefi/rusefi/issues/364