Charli3: Incident Report

Charli3
Apr 10, 2024

Charli3 Oracles is focused on high-quality product assurance. Since our mainnet product launch on October 14th, 2022, roughly 1.5 years ago, we have maintained operational uptime and accuracy of 99.99%.

The Charli3 team is dedicated to transparency and continuous improvement. The overall impact on users of the incident described in this report was negligible. However, we recognize its potential severity, and our team has been working tirelessly to implement short- and long-term preventative measures.

The Charli3 team takes full responsibility for the accuracy, reliability, and quality of our services. This incident was preventable, and we have taken measures to ensure it does not happen again.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — -

Date and Time of Incident:

April 6th 2024

Incident Severity:

Major: temporary disruption of some services, and data from an unreliable source posted on-chain

Service Affected:

Price Feeds:

SHEN / USD:

  • Delayed price update

NMKR / ADA:

  • Delayed price update

IUSD / USD:

  • Delayed price update

DJED / USD:

  • Delayed price update
  • Limited number of reliable data sources during the fix
  • Unreliable data from a specific data source (Sundaeswap)

BOOK / USD:

  • Delayed price update

ADA / C3:

  • Delayed price update

Brief Description of Incident:

Datum Mismatch and Misplacement: On April 6th, 2024, the Blockfrost API began exhibiting erratic behaviour. Datum types within our oracle system were mismatched and misplaced. Incorrect associations between datum hashes and inline datums led to the retrieval of unexpected, erroneous data, which severely impacted the integrity of price feed updates.

Specific Incidents of Data Mismatch: The Blockfrost API was found to be returning the aggstate datum for Oracle NFT UTxOs, and vice versa. This broke the processing and validation of data needed to update price feeds accurately. The result was delayed updates for several CNT-focused feeds and, more severely, an unreliable data value being transmitted for the DJED/USD feed.

Impact on DJED/USD Feed: The integrity of the DJED/USD price feed was notably compromised. While we tried to mitigate the data mismatch by removing Blockfrost from the system, some data sources became inaccessible. With fewer sources available, an unreliable data value from an affected data source (Sundaeswap) got through and led to incorrect price updates.
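This description suggests a defensive check: before decoding, confirm that each UTxO carries the datum kind expected for the NFT it holds. The following is a minimal Python sketch built on several assumptions. The token names, the datum-kind labels and the `OracleUtxo` record are hypothetical placeholders, not Charli3's actual identifiers. The sketch also assumes the datum kind has already been determined, for example from its constructor tag.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical mapping: the NFT held by a UTxO -> the datum kind it must carry.
# Real token names and datum types in the Charli3 contracts may differ.
EXPECTED_DATUM_KIND: Dict[str, str] = {
    "OracleFeed": "OracleDatum",
    "AggState": "AggStateDatum",
}


@dataclass
class OracleUtxo:
    tx_ref: str      # e.g. "<tx hash>#<index>"
    nft_name: str    # NFT found in the UTxO's value
    datum_kind: str  # datum kind as reported by the chain-indexing backend


def find_mismatches(utxos: List[OracleUtxo]) -> List[OracleUtxo]:
    """Return UTxOs whose datum kind does not match the NFT they hold.

    UTxOs holding an unknown NFT are returned too, so unexpected data is
    never silently accepted.
    """
    return [u for u in utxos if EXPECTED_DATUM_KIND.get(u.nft_name) != u.datum_kind]


# The swap described above: each UTxO comes back with the other one's datum.
swapped = [
    OracleUtxo("a#0", "OracleFeed", "AggStateDatum"),
    OracleUtxo("b#0", "AggState", "OracleDatum"),
]
healthy = [
    OracleUtxo("a#0", "OracleFeed", "OracleDatum"),
    OracleUtxo("b#0", "AggState", "AggStateDatum"),
]

print([u.tx_ref for u in find_mismatches(swapped)])  # ['a#0', 'b#0']
print(find_mismatches(healthy))                      # []
```

Refusing the batch when any pairing is wrong turns a silent data error into an explicit failure that monitoring can catch.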

Root Cause Analysis, Action Taken, and Timeline

Phase 1: BLOCKFROST ERROR IDENTIFIED ON V3_MAINNET_ADA_CHARLI3

Issue: Blockfrost mainnet issues were detected on V3_MAINNET_ADA_CHARLI3 (April 2, 16:03 UTC-5) by the following alert:

Oracle Alerts

DeserializeException(“Unexpected constructor ID for <class ‘charli3_offchain_core.datums.NodeDatum’>. Expect 122, got 123 instead.”)
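The following note may help readers decode the alert. In the standard CBOR encoding of Plutus Data, constructors 0 to 6 are written with CBOR tags 121 to 127, and constructors 7 to 127 with tags 1280 to 1400. The numbers in the message fall in that range, so we assume they are these tags. On that assumption, the node expected the constructor used by `NodeDatum` (index 1) and received a datum built with constructor index 2, that is, a different datum type. The helper functions below are hypothetical and only illustrate the mapping. They are not part of PyCardano or `charli3_offchain_core`.

```python
from typing import Optional


def tag_for_constructor(index: int) -> Optional[int]:
    """CBOR tag used by the Plutus Data encoding for a constructor index.

    Indices 0-6 use tags 121-127 and indices 7-127 use tags 1280-1400.
    Larger indices use the general tag-102 form, which is not modelled here.
    """
    if 0 <= index <= 6:
        return 121 + index
    if 7 <= index <= 127:
        return 1280 + (index - 7)
    return None


def constructor_from_tag(tag: int) -> Optional[int]:
    """Inverse of tag_for_constructor for the compact tag ranges."""
    if 121 <= tag <= 127:
        return tag - 121
    if 1280 <= tag <= 1400:
        return tag - 1280 + 7
    return None


def explain_mismatch(expected_tag: int, actual_tag: int) -> str:
    """Translate an 'Expect X, got Y' pair of tags into constructor indices."""
    expected = constructor_from_tag(expected_tag)
    actual = constructor_from_tag(actual_tag)
    return (f"expected constructor {expected} (tag {expected_tag}), "
            f"got constructor {actual} (tag {actual_tag})")


print(explain_mismatch(122, 123))
# expected constructor 1 (tag 122), got constructor 2 (tag 123)
```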

  1. Node operator logs (April 2, 16:12 UTC-5):

One of the 5 node-backend operators encountered a failure while attempting to update its node-utxo value using Blockfrost. Within the following minutes, transaction updates from the other nodes associated with V3_MAINNET_ADA_CHARLI3 also began to fail.

Phase 2: ELIMINATION OF BLOCKFROST DEPENDENCIES ON V3_MAINNET_ADA_CHARLI3.

Action 1 (April 2, 16:13 UTC-5): We decided to reduce the system’s dependency on Blockfrost to improve the robustness of ADA_CHARLI3. Specifically, we removed data sources that depend on Blockfrost, such as Minswap and VyFinance.

Result 1 (April 3, 02:49 UTC-5): All node servers were updated to remove the Blockfrost-dependent data sources (VyFinance and Minswap), which halved the number of data sources for CNT feeds. Following this update, all nodes resumed normal operations, with

Phase 2.5: IMPLEMENTING OGMIOS AND KUPO INSTEAD OF BLOCKFROST

Action 1 (April 2, 17:00 UTC-5): We chose to pursue a data acquisition strategy based on Ogmios and Kupo, and ran into some compatibility problems.

Action 2 (April 3, 12:34 UTC-5): Compatibility problems identified. We upgraded our off-chain system from PyCardano v0.9.0 to PyCardano v0.10.0 to address them.

Action 3 (April 4, 11:33 UTC-5): The pull request entered testing.

Action 4 (April 5, 10:40 UTC-5): We began upgrading our internal servers by integrating the Kupo component into 2 out of every 5 nodes for each feed.

  • Everything worked as expected during the rollout on Friday, April 5th, 2024
  • A further 12-hour testing phase was scheduled to ensure that no issues arose on mainnet

Phase 3: KUPO ROLLOUT ISSUE AND AGGREGATION ISSUE

Event 1: Node failures affect 2 out of 5 nodes

  • Issue (April 6th, 2024, UTC-5): Following the configuration change, two nodes on every server began to fail. Kupo consumed all available disk space on the VM, which prevented 2 out of 5 nodes from updating information on the CNT C3 networks. Importantly, aggregation is carried out by 3 nodes, so it continued without impacting the feeds.

Action 1 (April 7th, 17:21 UTC-5): Node operators were required to increase the disk space of each VM from 300 GB to 500 GB.
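Kupo filling the disk was only noticed once nodes started to fail. A simple usage threshold check, run periodically on each node VM, could raise an alert earlier. Below is a minimal Python sketch that uses the standard library's `shutil.disk_usage`. The 80% threshold and the 280 GB index size in the example are hypothetical values chosen for illustration, not figures measured during this incident.

```python
import shutil

GB = 1000 ** 3  # decimal gigabytes, as disk sizes are usually quoted


def usage_fraction(used_bytes: int, total_bytes: int) -> float:
    """Fraction of the disk in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def should_alert(used_bytes: int, total_bytes: int, threshold: float = 0.8) -> bool:
    """True when the disk is at or above `threshold` of its capacity."""
    return usage_fraction(used_bytes, total_bytes) >= threshold


def check_path(path: str = "/", threshold: float = 0.8) -> bool:
    """Check live usage of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return should_alert(usage.used, usage.total, threshold)


if __name__ == "__main__":
    # Hypothetical: a 280 GB index on the old and the new disk size.
    print(should_alert(280 * GB, 300 * GB))  # True  (about 93% full)
    print(should_alert(280 * GB, 500 * GB))  # False (56% full)
```

Keeping the threshold logic separate from the filesystem call makes it easy to test, and lets the same rule back a cron job or an existing monitoring agent.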

INCIDENT 1: Regularly scheduled heartbeats were “delayed” due to corrupted data from Blockfrost, causing the error to propagate across all feeds.

  • Issue: First affected feed: V3_MAINNET_NMKR_ADA, on April 6 at 20:05 (UTC-5). The error arose because Blockfrost decoded the UTxO of the oracle feed associated with V3_MAINNET_ADA_CHARLI3 incorrectly. The incorrectly decoded value then disrupted the dynamic pricing mechanism across all other feeds.

Action (April 7 at 23:50 UTC-5): We decided to exclude data sources that depend on Blockfrost, such as Minswap and VyFinance, from all feeds. As a result, some Charli3 networks, such as DJED, remained operational but with only 50% of their original data sources. This decision allowed us to read the V3_MAINNET_ADA_CHARLI3 feed via Ogmios (dynamic price) and generate aggregations for all affected oracle feeds efficiently.

For more details on the Blockfrost error, see “Datum Mismatch and Misplacement” above.

INCIDENT 2: DJED/USD feed reports a (presumably) bad data value

Issue 1 (April 7, 23:50 UTC-5): Because of unreliable data values from Blockfrost, we removed from the DJED feed (and other CNT feeds) the 2 of its 4 data sources that rely on the service. As a result, the DJED feed operated with 3 of 5 nodes and 2 of 4 data sources.

  • Removed: VyFi and Minswap
  • Remaining data sources: Sundaeswap and Wingriders
  • The team began integrating additional data sources as soon as possible

Issue 2 (April 7, 2:05 UTC):

  • Sundaeswap sent an unreliable data value (0.005735) over a 20-minute span, triggering deviation updates of the feed (April 8 at 10:45 UTC-5)
  • Wingriders sent 0.98339477 during this period
  • The C3 network worked as expected, but only 2 data sources were in use at the time. When node information is submitted to the C3 Network, a median function is applied. When there is an even number of data sources, the off-chain backend randomly selects a value closest to the median and posts it on-chain (node-utxo).
  • With exactly 2 sources, both values are equally far from the median, so the selection was a 50/50 draw. It unfortunately resulted in the corrupt 0.005735 DJED value being posted for use.
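The selection rule described above can be sketched in Python. The sketch shows why two sources produced a coin flip while three or more would not. It is a minimal reconstruction from this report's description, not the actual Charli3 backend code. `closest_to_median` and `pick_node_value` are hypothetical names, and the last two values in the four-source example are hypothetical stand-ins for additional sources.

```python
import math
import random
import statistics
from typing import List, Optional, Sequence


def closest_to_median(values: Sequence[float]) -> List[float]:
    """Return every input value that is (within float tolerance) closest to the median.

    With an odd number of values the median is one of them, so only that value
    (and any duplicates) is returned. With an even number, the median is the
    midpoint of the two middle values, which are then tied.
    """
    if not values:
        raise ValueError("no data sources available")
    med = statistics.median(values)
    distances = [abs(v - med) for v in values]
    best = min(distances)
    # Use a tolerance: two mathematically equal distances can differ in the
    # last bits after floating-point subtraction.
    return [
        v for v, d in zip(values, distances)
        if math.isclose(d, best, rel_tol=1e-9, abs_tol=1e-12)
    ]


def pick_node_value(values: Sequence[float], rng: Optional[random.Random] = None) -> float:
    """Randomly pick one of the values closest to the median."""
    rng = rng if rng is not None else random.Random()
    return rng.choice(closest_to_median(values))


two_sources = [0.005735, 0.98339477]                # Sundaeswap, Wingriders
four_sources = [0.005735, 0.98339477, 0.985, 0.99]  # last two are hypothetical

print(closest_to_median(two_sources))   # [0.005735, 0.98339477]: a 50/50 draw
print(closest_to_median(four_sources))  # [0.98339477, 0.985]: outlier excluded
print(pick_node_value(two_sources))     # either value, at random
```

With two sources, the median is their midpoint, so a wildly wrong value is always a candidate. With three or more sources, a single outlier sits at the edge of the sorted list and can never be the closest to the median.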

Phase 5: IDENTIFICATION AND RESOLUTION OF DJED FEED ISSUE

Since our systems worked as expected and no alerts were triggered, the DJED values in question only became known the next morning.

  • The team identified the unreliable data value on April 8 at 10:45 UTC-5.
  • As of April 8th, 2024, the issue was addressed by increasing the number of data sources for all CNT feeds from 2 to 4, with Coingecko and Coinmarketcap added as temporary sources.

Preventative Measures

Short term

  • Removed dependency on Blockfrost
  • Removed Sundaeswap as a data source from all feeds until more data sources are added
  • Added 2 more data sources (Coingecko and Coinmarketcap) to feeds with fewer than 3

DJED

  • Sundae
  • Wingriders
  • Coingecko
  • Coinmarketcap

SHEN

  • Sundae
  • Wingriders
  • Coingecko
  • Coinmarketcap

IUSD

  • Muesliswap
  • Wingriders
  • Coingecko
  • Coinmarketcap

Started rolling out the server upgrade with Kupo and Ogmios integration

  • Testing phase of 12 hours
  • Testing phase is anticipated to start on April 8 at 17:00 (UTC-5) and conclude on April 10 at 03:00 (UTC-5).
  • Production integration
  • Upgraded servers expected on April 10 at 12:00 (UTC-5)
  • Expected result: Minswap and VyFinance re-added as data sources, and Coingecko and Coinmarketcap removed as primary sources but kept as backups (CG and CMC provide old/delayed data with no security inputs, which makes them risky for primary triangulation or production-level operations)

Long term

  • During our upcoming audit, we will focus on limiting dependencies on external parties
  • Blockfrost has been informed of all issues and is working to reproduce and resolve them
  • Review error handling for all remaining dependencies
  • Review our node software upgrade process to ensure feeds never drop below 3 data sources
  • Publicly flag feeds that run on a low number of data sources (likely most CNTs), increasing transparency to the Cardano community about feeds at higher risk (e.g. fewer data sources, lower robustness)
  • Proposed: enhance the alerting and monitoring system so that different alerts apply when data sources drop to 5 or fewer. For example, our current alert system catches 0/null values (off-chain), while our median and aggregation steps remove extreme values if, and only if, there are 3 or more nodes (on-chain). The C3 team could implement a system that temporarily shuts down feeds with fewer than 3 data sources, or a circuit breaker that triggers only on extreme differences between 2 data sources. Research is required before implementation.
  • Increased team size for incident handling, ensuring someone is always available in every time zone (e.g. backup policies for when regularly scheduled engineers are unavailable due to emergencies such as a death in the family)
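The proposed guard can be sketched in Python. This minimal example is one possible design, not a committed implementation. It combines both options in the proposal. With 3 or more valid sources, it defers to the existing median filtering. With exactly 2, it applies a circuit breaker based on their relative spread. With fewer than 2, it pauses the feed. The 50% spread threshold is a hypothetical placeholder, and `guard_feed` is an illustrative name.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MIN_SOURCES = 3          # median filtering only removes outliers with 3+ inputs
MAX_PAIR_SPREAD = 0.5    # hypothetical: largest relative gap tolerated between 2 sources


@dataclass
class GuardResult:
    publish: bool
    reason: str


def guard_feed(
    values: Sequence[Optional[float]],
    min_sources: int = MIN_SOURCES,
    max_pair_spread: float = MAX_PAIR_SPREAD,
) -> GuardResult:
    """Decide whether a feed update may be published from the given source values."""
    # Drop 0/null readings, mirroring what the existing off-chain alert catches.
    live = [v for v in values if v is not None and v > 0]
    if len(live) >= min_sources:
        return GuardResult(True, "enough sources for median filtering")
    if len(live) == 2:
        low, high = sorted(live)
        spread = (high - low) / high
        if spread > max_pair_spread:
            return GuardResult(False, f"circuit breaker: sources differ by {spread:.0%}")
        return GuardResult(True, "two sources agree within tolerance")
    return GuardResult(False, f"only {len(live)} valid source(s); feed paused")


print(guard_feed([0.005735, 0.98339477]))
# GuardResult(publish=False, reason='circuit breaker: sources differ by 99%')
```

A relative spread, rather than an absolute one, keeps a single threshold meaningful across feeds priced at very different magnitudes.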

Impact

At this time there has been no known impact on our users or on the users of other protocols. Liqwid protocol is known to consume the Charli3 DJED feed, among others (SHEN, iUSD), and the failsafe system on its end prevented any of its users from suffering damages due to the unreliable data source. The Charli3 team takes this incident seriously and has been working non-stop since April 6th, 2024, with everyone on the team focused on implementing the preventative measures.

It is important to note that this is the first major incident in the history of our service. The only previous issue was a single minor delayed price update 12 months earlier. Our uptime, measured as scheduled updates that occur on time with reliable data, still stands at 99.99%+ over 547 days.
