AI4REALNET evaluation protocols

Showing 62 of 62 cards

Ability to anticipate

KPI-AS-001

Task 4.3

Human user experience

special evaluation setup

Description:

“The ability to anticipate. Knowing what to expect or being able to anticipate developments further into the future, such as potential disruptions, novel demands or constraints, new opportunities, or changing operating conditions” (Hollnagel, 2015, p. 4). The human operator’s ability to anticipate further into the future can be measured by calculating the ratio of (proactively) prevented deviations to actual deviations. In addition, the extent to which the anticipatory sensemaking process of the human operator is supported by AI-based assistants can be measured using the Rigor-Metric for Sensemaking (Zelik et al., 2018) or similar. The instrument needs to be further developed and adapted to the AI context.

Objective Description:

This KPI contributes to evaluating Human user experience of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective.

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1-5)

Unit:

Lickert-Scale or similar

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Acceptance

KPI-AS-002

Task 4.3

AI acceptability, trust and trustworthiness

special evaluation setup

Description:

Acceptance of the system by a human user.

Objective Description:

This KPI contributes to evaluating AI acceptability, trust and trustworthiness of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O2 main project objective.

Formula:

Using a TAM model (technology acceptance model) or similar eg. the AI-Acceptance model (KIAM) (Scheuer, 2020)

Unit:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1-5)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Human intervention frequency

KPI-HS-003

Task 4.3

Social-technical decision quality

semi-automated evaluation

Description:

The Human Intervention Frequency KPI measures the proportion of instances in which a human operator intervenes in an automated decision-making process. While this KPI was initially developed for railway traffic control scenarios, it has been generalized to assess the reliability and autonomy of any AI-assisted system. It reflects the trust placed in the AI by quantifying how often human corrections are required during routine operations.

Objective Description:

This KPI contributes to evaluating Social-technical decision quality of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective: - To evaluate the effectiveness of the AI system in operating autonomously. - To provide a performance benchmark for minimizing human interventions across various domains. - To identify areas where the AI may require additional refinement or support.

Formula:

(Number of human interventions / Total AI decision instances) x 100

Unit:

Percentage (%) of AI decisions requiring human intervention.

Modules:

Recommendation module

Domains:

Railway

Power Grid

ATM

Agreement score

KPI-AS-005

Task 4.3

AI acceptability, trust and trustworthiness

special evaluation setup

Description:

This KPI represents human operators’ self-reported agreement with individual AI-generated solutions/decisions on a scale of 0–100.

Objective Description:

This KPI contributes to evaluating AI acceptability, trust and trustworthiness of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O2 main project objective. It is also relevant to protocols and concepts defined in D1.1 such as “Agreement score”.

Formula:

Self-reported agreement with specific solutions on a scale of 0–100.

Unit:

Interval scale response

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

AI co-learning capability

KPI-AS-006

Task 4.3

AI-human learning curves

special evaluation setup

Description:

This KPI represents human operators’ self-reported assessment of the AI ability to adapt to the operators’ preferences measured with a questionnaire.

Objective Description:

This KPI contributes to evaluating AI-human learning curves of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective. It is also relevant to protocols and concepts defined in D1.1 such as “AI co-learning capability”.

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Assistant alert accuracy

KPI-AF-008

Task 4.1

Effectiveness

fully automated evaluation

Description:

Assistant alert accuracy is based on the number of times the AI assistant agent is right about forecasted issues ahead of time. Even if forecasted issues concern all events that lead to a grid state out of acceptable limits (set by operation policy), use cases of the project focus on managing overloads only: this KPI therefore only focuses on alerts related to line overloads. The calculation of KPI relies on simulation of 2 parallel paths (starting from the moment the alert is raised): - Simulation of the “do nothing” path, to assess the truth values - Application of remedial actions to the “do nothing” path, to assess solved cases To calculate the KPI, all interventions by an agent or operator are fixed to a specific plan since every alert is related to a specific plan (e.g. remedial actions). Note: line contingencies for which alerts can be raised are the lines that can be attacked in the environment (env.alertable_line_ids in grid2Op), so this should be properly configured beforehand.

Objective Description:

This KPI contributes to evaluating Effectiveness of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective.

Formula:

The formula to compute the KPI is the confusion matrix (see Calculation Methodology): - TP, True positive cases, forecast alerts were raised by the AI assistant, and overloads did occur on the transmission grid) - FP, False positive cases, forecast alerts were raised by the AI assistant, but no overload occurred on the transmission grid - TN, True negative cases, the AI assistant raised no forecast alert, and no overload occurred on the transmission grid - FN, False negative cases, the AI assistant raised no forecast alert, but overloads occurred on the transmission grid - Starting from True positive cases, TPS, the True positive cases solved, represent the alert effectively solved by the recommendations. The KPI can be computed per episode, across several episodes of one scenario, or even across scenarios.

Unit:

None (counting)

Modules:

Human machine interaction module

Digital environment

Domains:

Power Grid

Assistant disturbance

KPI-AS-009

Task 4.3

Human user experience

special evaluation setup

Description:

Assistant disturbance KPI aims to measure if the AI assistant's notifications are disturbing the human operator's activity.

Objective Description:

This KPI assesses whether the inputs of the operators are according to their real psychophysiology. This can act as a verification methodology but also support the AI to adapt. This KPI contributes to evaluating Human-user experience of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective.

Formula:

For each notification, the score ranges in [0, 5] with: - 0 meaning that the notification was not considered disturbing at all by the human operator - 5 means that the human operator considered the notification as fully disturbing This KPI is still under analysis on how it will be implemented. If with a single manual questionnaire or with a pop-up in the dashboard.

Unit:

None (score)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Carbon intensity

KPI-CF-012

Task 4.1

Solution quality

fully automated evaluation

Description:

Carbon intensity selectivity estimates the overall carbon intensity of the action recommendation provided by the AI assistant to the human operator: goal of carbon intensity KPI is to measure how much the actions will directly contribute to greenhouse gases emission, by focusing on CO2 (which is unfortunately not the only greenhouse gas). which is calculated as the weighted averaged emission factor of generation variation, including: Redispatching actions, Curtailment actions.

Objective Description:

This KPI is calculated to estimate the relative performance compared to a baseline. The main difficulty of evaluating and assessing this KPIs lies in the difficulty to establish a proper deadline: - There is no history of human actions on the digital environments used for evaluation (since they are synthetic ones), - Comparison with KPI calculated on real grid’s operations (TenneT or RTE) is not relevant since each grid has its own generation mix, and each TSO has its own operation policies (and own redispatching offers). This KPI contributes to evaluating Solution quality of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective.

Formula:

See calculation steps

Unit:

kgCO2/MWh

Modules:

Digital environment

Domains:

Power Grid

Comprehensibility

KPI-CS-013

Task 4.3

AI acceptability, trust and trustworthiness

special evaluation setup

Description:

This KPI represents human operators’ self-reported ability to understand and thus make use of the AI-generated solution/decision, measured with a questionnaire.

Objective Description:

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Decision support satisfaction

KPI-DS-015

Task 4.3

Human user experience

special evaluation setup

Description:

This KPI represents human operators’ self-reported satisfaction with the system’s support for their decision-making process when working with the AI assistant measured with a questionnaire.

Objective Description:

This KPI contributes to evaluating Human user experience of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective. It is also relevant to protocols and concepts defined in D1.1 such as “Decision support for the human operator”, “Decision support satisfaction”.

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Delay reduction efficiency

KPI-DF-016

Task 4.1

Effectiveness

fully automated evaluation

Description:

The Delay Reduction Efficiency KPI quantifies the effectiveness of the AI-driven re-scheduling system in reducing overall train delays. By comparing delays before and after AI intervention, this metric provides insight into the system's capability to optimize train schedules and minimize disruptions.

Objective Description:

This KPI contributes to evaluating Effectiveness of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective: - To assess the impact of AI-based re-scheduling on reducing delays in railway operations. - To ensure that AI interventions lead to measurable improvements in punctuality. - To provide a performance benchmark for AI-driven traffic management solutions in railway networks.

Formula:

(Total delay duration before AI implementation - Total delay duration after AI implementation) / Total delay duration before AI implementation.

Unit:

Percentage (%) reduction in total delay time.

Modules:

Digital environment

Domains:

Railway

Human control/autonomy over the process

KPI-HS-018

Task 4.3

AI-human task allocation balance

special evaluation setup

Description:

This KPI represents human operators’ perceived autonomy over the process when working with the AI assistant measured with a questionnaire.

Objective Description:

This KPI contributes to evaluating AI-human task allocation balance of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective. It is also relevant to protocols and concepts defined in D1.1 such as “Decision support for the human operator”, “Human Agency and Oversight”, “Human control/autonomy over the process”.

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Human learning

KPI-HS-021

Task 4.3

AI-human learning curves

special evaluation setup

Description:

Human learning is a complex process that leads to lasting changes in humans, influencing their perceptions of the world and their interactions with it across physical, psychological, and social dimensions. It is fundamentally shaped by the ongoing, interactive relationship between the learner's characteristics and the learning content, all situated within the specific environmental context of time and place and the continuity over time.

Objective Description:

This KPI contributes to evaluating AI-human learning curves of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective.

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1-5)

Unit:

Lickert-Scale or similar

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Human motivation

KPI-HS-022

Task 4.3

Human user experience

special evaluation setup

Description:

“Intrinsic motivation is defined as doing an activity for its inherent satisfaction rather than for some separable consequence. When intrinsically motivated, a person is moved to act for the fun or challenge entailed rather than because of external products, pressures, or rewards” (Ryan & Deci, 2000, p. 56).

Objective Description:

This KPI contributes to evaluating Human-user experience of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective.

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1-5)

Unit:

Lickert-Scale or similar

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Human response time

KPI-HS-023

Task 4.3

Human user experience

semi-automated evaluation

Description:

Human response time KPI evaluates time needed to react to AI advisory/information.

Objective Description:

Formula:

The time should be measured directly from user input and automatically by the system in background (dismiss a window when they feel satisfied after evaluating a scenario): - LOW less than 5 min, - MEDIUM 5-10 min, - HIGH more than 15 minutes. Then it is translated into % across the operator's multiple interactions with AI-generated solutions.  This KPI is still under analysis on how it will be implemented. The objective is that it is transversal to all domains, but this means that an implementation will need to be done in each virtual environment. This implementation is still not defined and will need to be discussed with other Tasks/WPs

Unit:

LOW, MED, HIGH response time %

Modules:

Human machine interaction module

Domains:

Railway

Power Grid

ATM

Network utilization

KPI-NF-024

Task 4.1

Effectiveness

fully automated evaluation

Description:

Network utilization KPI is based on the relative line loads of the network, indicating to what extent the network and its components are utilized.

Objective Description:

This KPI contributes to evaluating Effectiveness of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective.

Formula:

This KPI yield a vector with 6 values, that are calculated over all scenarios’ steps: - the maximum line’s load in N state, - the maximum line’s load in N-1 state, - the average of the maximum line’s load in N state per step, - the average of the maximum line’s load in N-1 state per step, - the share of lines where the line’s load in N state is greater than 90%, - the share of lines where the line’s load in N-1 state is greater than 100%. Line’s load is referred to as rho in Grid2Op and is defined as the observed current flow divided by the thermal limit of the line.

Unit:

Vector of 6 values expressed in percent (decimal number between 0% and 100%)

Modules:

Digital environment

Domains:

Power Grid

Punctuality

KPI-PF-026

Task 4.1

Effectiveness

fully automated evaluation

Description:

Punctuality measures the percentage of trains arriving at their destinations on time (the train doesn’t arrive after planned arrival) and the train didn’t depart before planned departure time. The goal is to maintain a high level of reliability and minimize delays for passengers and freight services.

Objective Description:

This KPI contributes to evaluating Effectiveness of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective: - Improve customer satisfaction by ensuring timely arrivals - Guarantee maximal planned connection - Minimize operational disruptions caused by delays - Meet regulatory and stakeholder benchmarks for punctuality This KPI is linked with project’s Long Term Expected Impacts (LTEI) (LTEI1)KPIS-3: - 10% increase in punctuality in long-range traffic - 5% increase in punctuality in regional traffic (with realistic disturbances)

Formula:

(Number of on-time arrivals / Total number of arrivals) x 100

Unit:

Percentage (%)

Modules:

Digital environment

Domains:

Railway

Reduction in delay

KPI-RF-027

Task 4.1

Effectiveness

fully automated evaluation

Description:

The reduction in delay KPI aims to quantify the time gained overall and for each airplane, with the introduction of AI.

Objective Description:

This KPI aims to quantify the efficiency gains of AI integration by measuring how AI impacts execution time and delays. Specifically, it helps determine whether AI: - Reduces execution time deviations - Minimizes delays - Enhances consistency and reliability in operations. By evaluating these metrics, we can assess the AI’s effectiveness in improving human decision-making, reducing intervention time, and optimizing operational workflows. This KPI contributes to evaluating Effectiveness of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective. This KPI is linked with project’s Long Term Expected Impact (LTEI) (LTEI1)KPIS-4, 3-6% improvement in flight capacity and mile extension.

Formula:

Performance Deviation measures the percentage deviation of actual time from expected time Delay Measurement measures the absolute delay in arrival time These formulas will be applied to both human-only performance and human-AI collaborative performance, resulting in Human performance and Human-AI performance

Unit:

Percentage and seconds

Modules:

Digital environment

Domains:

ATM

AI Response time

KPI-AF-029

Task 4.1

Effectiveness

fully automated evaluation

Description:

The Response Time KPI measures the time taken by the AI-assisted railway re-scheduling system to generate a new operational schedule in response to a disruption. This metric evaluates how quickly the system reacts to unexpected events, ensuring minimal delays and maintaining operational efficiency.

Objective Description:

This KPI contributes to evaluating Effectiveness of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective: - To assess the speed of AI-assisted decision-making in railway operations. - To ensure rapid re-scheduling of trains in response to disturbances, minimizing the impact on passengers and freight. - To compare AI-assisted response times with traditional manual re-scheduling approaches.

Formula:

Average time taken from disruption detection/prediction to suggestion of adjusted schedule(s)

Unit:

Time (minutes or seconds)

Modules:

Digital environment

Domains:

Railway

Significance of human revisions

KPI-SS-030

Task 4.3

Social-technical decision quality

semi-automated evaluation

Description:

This KPI represents human operators’ subjective assessment of necessary revisions for the AI-generated solutions by the human operator, self-reported by the operator with Likert-scale questions.

Objective Description:

This KPI contributes to evaluating Social-technical decision quality of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective. It is also relevant to protocols and concepts defined in D1.1 such as “Significance of human revisions”.

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

Human machine interaction module

Domains:

Railway

Power Grid

ATM

Situation awareness

KPI-SS-031

Task 4.3

Human user experience

special evaluation setup

Description:

“Situation Awareness is the perception of the elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future” (Endsley, 1988).

Objective Description:

This KPI contributes to evaluating Human-user experience of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective.

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1-5)

Unit:

Lickert-Scale or similar

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

System efficiency

KPI-SS-032

Task 4.1

Effectiveness

semi-automated evaluation

Description:

System efficiency measures the efficiency of the system in delivering trustworthy solutions requiring less effort and time to deliver an appropriate response by the operator.

Objective Description:

The System efficiency KPI aims to evaluate the effectiveness of the AI solution in real operational conditions, considering not just its raw response time but also the quality and usability of its assistance. This includes how the AI presents its advice, its ease of use, the accuracy of its recommendations, and how well it integrates with existing data and workflows. The evaluation will measure the AI-human collaboration, focusing on: - Response efficiency: The time taken for the AI to generate advice and for the human operator to act on it. - Advice clarity & usability: How well structured, coherent, and understandable the AI’s suggestions are. - Data integration quality: How seamlessly the AI incorporates relevant information into its recommendations. - Human correction factor: Whether and how often the operator needs to correct or refine the AI’s output. - Decision-making speed: The overall reduction in response time achieved through AI-assisted operation. By considering these factors, the tests aim to assess how well the AI minimizes human intervention while maximizing efficiency, accuracy, and usability in decision-making. This KPI contributes to evaluating Effectiveness of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective.

Formula:

Number of tests where time it takes the human to compute a solution is greater than time it takes for the AI to compute a solution and the human operator to accept the solution

Unit:

Percentage (%)

Modules:

Human machine interaction module

Digital environment

Domains:

ATM

Topological action complexity

KPI-TF-034

Task 4.1

Solution quality

fully automated evaluation

Description:

Topological action complexity KPI quantifies the topological utilization of the grid and gives insights into how many topological actions are utilized: performing too complex or too many topology actions can indeed navigate the grid into topologies that are either unknown or hard to recover from for operators.

Objective Description:

This KPI contributes to evaluating Solution quality of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective.

Formula:

This KPI yields a vector with 6 values, that are calculated over all scenarios’ steps: - The minimum, maximum and average number of topological actions performed by the AI assistant per timestamp, - The minimum, maximum and average share of modified buses per timestamp.

Unit:

Vector of 6 values expressed as: - Number (first 3 values), - Percent (decimal number between 0% and 100%, last 3 values).

Modules:

Digital environment

Domains:

Power Grid

Total decision time

KPI-TS-035

Task 4.1

Effectiveness

semi-automated evaluation

Description:

It is based on the overall time needed to decide, thus including the respective time taken by the AI assistant and human operator. This KPI can be detailed to specifically distinguish the time needed by the AI assistant to provide a recommendation. An assumption is that a Human Machine Interaction (HMI) module is available.

Objective Description:

This KPI addresses the following objectives: 1. Given an alert, how much time is left until the problem occurs? The longer the better since it gives more time to make a decision. 2. Given an alert, how much time does the AI assistant take to come up with its recommendations to mitigate the issue? The shorter the better. 3. Given the recommendations by the AI assistant, how much time does the human operator take to make a final decision? The shorter the better since it indicates that the recommendations are clear and convincing for the human operator. In case there is no interaction possible between the AI assistant and the human operator, this overall split is not possible. Then there is only one overall time needed to decide, starting from the alert and ending with the final decision by the human operator. This KPI contributes to evaluating Effectiveness of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective.

Formula:

See KPI calculation methodology

Unit:

Time (minutes or seconds)

Modules:

Human machine interaction module

Digital environment

Domains:

Power Grid

Operation score

KPI-OF-036

Task 4.1

Solution quality

fully automated evaluation

Description:

The operation score KPI for operating a power grid includes the cost of a blackout, the cost of energy losses on the grid, and the cost of remedial actions.

Objective Description:

This KPI contributes to evaluating Solution quality of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective. This KPI is linked with project’s Long Term Expected Impacts (LTEI): - (LTEI1)KPIS-1, 15%-20% reduction in renewable energy curtailment due to optimal exploration of network flexibility with AI (see “Sum of curtailed RES energy volumes”) - (LTEI1)KPIS-2, 20%-30% avoided electricity demand shedding (see “Sum of remaining energy to be supplied in case of blackout”)

Formula:

This KPI yields a vector with 8 values per episode: - Number topological actions performed by the AI assistant, - Number of redispatching actions (including but not limited to storage) performed by the AI assistant, - Sum of redispatched energy volumes, - Sum of balanced energy volumes, Note: this element is influenced by the actions implemented in the environment to compensate imbalances between loads and generations - Number of RES curtailment actions performed by the AI assistant, Such actions correspond to cases where the agent decreases generation from renewable energy sources (from what would be possible given the current weather) - Sum of curtailed RES energy volumes, - Sum of energy losses (estimated as difference between active power values of generations and loads), - Sum of remaining energy to be supplied in case of blackout.

Unit:

Vector of 8 values expressed as: - Number, - Number, - Energy in MWh, - Energy in MWh, - Number, - Energy in MWh, - Energy in MWh, - Energy in MWh. These values are expressed as raw values and will be possibly normalized during the evaluation to get fixed range values.

Modules:

Digital environment

Domains:

Power Grid

Trust in AI solutions score

KPI-TS-038

Task 4.3

AI acceptability, trust and trustworthiness

special evaluation setup

Description:

This KPI represents human operators’ self-reported trust (attitude) for individual AI-generated solutions measured with a questionnaire.

Objective Description:

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Trust towards the AI tool

KPI-TS-039

Task 4.3

AI acceptability, trust and trustworthiness

special evaluation setup

Description:

(Dis)trust is defined here as a sentiment resulting from knowledge, beliefs, emotions, and other elements derived from lived or transmitted experience, which generates positive or negative expectations concerning the reactions of a system and the interaction with it (whether it is a question of another human being, an organization or a technology)” (Cahour & Forzy, 2009, p. 1261).

Objective Description:

This KPI contributes to evaluating AI acceptability, trust and trustworthiness of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O2 main project objective.

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1-5)

Unit:

Lickert-Scale or similar

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Workload

KPI-WS-040

Task 4.3

Human user experience

special evaluation setup

Description:

Workload KPI is based on the workload assessment of human operators of the AI assistant. After each testing session using the system, the workload of human operators due to the AI assistant will be evaluated to understand in which scenarios (and depending on the AI level of support) it contributes for a higher workload.

Objective Description:

This KPI assesses whether the inputs of the operators are according to their real psychophysiology. This can act as a verification methodology but also support the AI to adapt. This KPI will be analyzed together with the “Impact on workload” KPI-IS-041. This KPI contributes to evaluating Human-user experience of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective.

Formula:

It shall be determined according to the NASA-TLX methodology or similar. This KPI is still under analysis on how it will be implemented. If with a single manual questionnaire or with a pop-up in the dashboard.

Unit:

None (rating scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Impact on workload

KPI-IS-041

Task 4.3

AI-human task allocation balance

special evaluation setup

Description:

Impact on the workload KPI assesses operators’ perception of the system impact on their workload (either positive or negative) 

Objective Description:

This KPI compares if the inputs of the operators are according to their real psychophysiology. This can act as a verification methodology but also support the AI to adapt. This KPI will be analyzed together with the “Workload” KPI-WS-040. This KPI contributes to evaluating AI-human task allocation balance of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective.

Formula:

It is measured directly from user input using a 7-point Likert scale: - From 1 (Huge Increase in workload) - To 7 (Huge decrease of workload). This KPI is still under analysis on how it will be implemented. If with a single manual questionnaire or with a pop-up in the dashboard.

Unit:

Value between 1 and 7

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Network Impact Propagation

KPI-NF-045

Task 4.1

Solution quality

fully automated evaluation

Description:

The Network Impact Propagation KPI measures how disruptions in one part of the railway network affect the overall system, including delay propagation and congestion spillover. This KPI helps evaluate the cascading effects of local disturbances and the efficiency of AI-assisted re-scheduling in mitigating these effects.

Objective Description:

This KPI contributes to evaluating Solution quality of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective. - To assess the ripple effects of disruptions across the railway network. - To quantify how effectively AI-assisted re-scheduling contains and mitigates propagation of delays. - To support decision-making in optimizing re-scheduling strategies for network-wide efficiency.

Formula:

Number of trains affected (or Affected Network Nodes) divided by Total number of trains (or Total Network Nodes)

Unit:

Percentage (%)

Modules:

Digital environment

Domains:

Railway

Cognitive Performance & Stress

KPI-CS-049

Task 4.3

Human user experience

semi-automated evaluation

Description:

Cognitive Performance & Stress KPI performs an implicit assessment of the human cognitive performance status and stress levels along the different task that will be performed. The output provides information about the operator mental status and aims to be used to integrate the AI system to contribute as a reward to better adapt decision system.

Objective Description:

The computation of the metrics will be made on the Human Assessment Module and will be integrated in the system that will Tune the autonomy Level of the system. Taking this into account, the objective is to be able to tune the system autonomy level based on the implicit assessment in real time. For example, higher traffic or hard situations/decisions will be detected with any interference with the human operator, implicitly providing information to be used by the decision system. This KPI will not focus on the final results when this module is integrated, but in the calculation of personalized cognitive and stress metrics of a single human based on an individual assessment protocol. If we are not able to perform such protocol, then this module will be generic and not personalized, removing this KPIs. In the personalization we aim to achieve a 20-30% improvement on performance of the model based for a single individual data, enabling a high level of personalization. This KPI contributes to evaluating Human-user experience of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective.

Formula:

Performance of the model to measure cognitive status and stress of a single user.

Unit:

Percentage (%)

Modules:

Human Assessment module

Domains:

Railway

Power Grid

ATM

AI-Agent Scalability Training

KPI-AF-050

Task 4.1

Scalability

fully automated evaluation

Description:

AI-Agent Scalability Training measures the elapsed time required by an AI-agent to reach a predefined performance threshold. Time measured both as wallclock time (seconds) as well as steps or episodes according to the domain needs. The performance is defined by the native reward formulation defined by the digital environment or by domain experts. The time to threshold is measured across: (i) Different instance complexities; (ii) Different hardware availability. The performance threshold is set empirically and is defined by the cumulative reward formulation specific to the application domain. Note that the reward formulation used to train the agent may differ. For case (i), the type of hardware used should be logged to interpret the wallclock time measurements.

Objective Description:

This KPI contributes to evaluating Scalability of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective.

Formula:

Time taken to achieve a specific performance level during the training phase of an AI-agent, considering varying instance complexities and hardware availability

Unit:

Steps or Episodes and wall-clock time

Modules:

Prediction module

Domains:

Railway

Power Grid

ATM

AI-Agent Scalability Testing

KPI-AF-051

Task 4.1

Scalability

fully automated evaluation

Description:

Compare multiple trained agents, RL-based or not, based on the average inference time to sample one or multiple actions while increasing the complexity of the scenario analysed. Complexity is a domain-relevant concept that must be defined.

Objective Description:

This KPI contributes to evaluating Scalability of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective.

Formula:

Inference time and performance of the trained AI agents as a function of instance complexity on standardized hardware.

Unit:

Time to be measured in seconds Performance to be measured using the environment native reward function or a suitably chosen use-case specific metric. Complexity to be defined in a use-case specific way, e.g., using a sequence of pre-defined scenarios increasing in complexity, such as increasing area, number of vehicles, nodes in the network.

Modules:

Recommendation module

Simulation engine

Domains:

Railway

Power Grid

ATM

Domain shift adaptation time

KPI-DF-052

Task 4.2

Reliability

fully automated evaluation

Description:

The time or number of episodes required for the agent to regain a specific level of performance in the shifted domain after the domain shift has occurred. It can be used to evaluate how quickly an agent can adapt to new environmental conditions.

Objective Description:

Domain adaptation (DA) is a sub-field of transfer learning. DA can be defined as the capability to deploy a model trained in one or more source domains into a different target domain. We consider that the source and target domains have the same feature space. In this sense, it is important for RL based agents to have a reasonable adaptation time to a new domain which may present a slight shift from the source domain. However, the adaptation time should also consider the performance drop into its computation, as a high performance drop after the adaptation could not be tolerated. This KPI contributes to evaluating Reliability of the AI-based assistant when dealing with real-world conditions which may be slightly different from source domain, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

The adaptation time could be computed as the sum of episodes required for an agent to regain a specific level of performance in the shifted domain after the domain shift has occurred.

Unit:

Time, number of time steps, number of episodes

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Domain shift generalization gap

KPI-DF-053

Task 4.2

Reliability

fully automated evaluation

Description:

Domain shift – generalization gap evaluates the absolute difference between the performance (e.g., rewards) in the training domain and the shifted domain. This metrics quantifies the extent of performance loss due to domain shift.

Objective Description:

The objective is to verify to which extent the AI-based assistant performance deteriorates when the target domain presents some changes in comparison to the source domain. If an agent can retain the same performance expectations in shifted domain, it will be qualified as reliable. This KPI contributes to evaluating Reliability of the AI-based assistant when dealing with real-world conditions which may be slightly different from source domain, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

Absolute difference between the rewards) in the training domain and the shifted domain

Unit:

No units

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Domain shift out of domain detection accuracy

KPI-DF-054

Task 4.2

Reliability

fully automated evaluation

Description:

Domain shift – out of domain detection accuracy measures the accuracy with which the agent can detect whether it is operating in a domain that is different from the one it was trained on. It is useful for systems that need to switch strategies or request human intervention when a domain shift is detected. A recent paper proposed by Nasvytis et al. (2024) introduce various approaches for detection of OOD in RL.

Objective Description:

It is crucial for an AI-based assistant to determine whether it can make reliable decisions in a given configuration. AI algorithms tend to be more dependable when they have been trained on similar configurations. Therefore, if the AI assistant can accurately detect out-of-domain configurations, it can seek human feedback to reduce uncertainty, leading to more adapted and reliable decisions in future scenarios. This KPI allows to determine if AI-based system could detect the shift before decision making. This KPI contributes to evaluating Reliability of the AI-based assistant when dealing with real-world conditions which may be slightly different from source domain, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

If a detection algorithm or tool is used, the accuracy of OOD detection is given by: (TP+TN)/( TP+TN+FP+FN) This formula provides a measure of how well the agent can detect domain shifts, balancing both the correct identification of OOD and ID scenarios. It is essential for systems that need to adapt their strategies or seek human intervention when a domain shift is detected. Otherwise, compute a distribution-based distance (e.g. Wasserstein) between source and target domains and if this distance is greater than a predefined threshold, we can validate the hypothesis that there is a shift in the data.

Unit:

Percentage (%) of correctly identified OOD cases

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Domain shift policy robustness

KPI-DF-055

Task 4.2

Reliability

fully automated evaluation

Description:

Domain shift – Policy robustness KPI calculates a ratio of the performance in the shifted domain to the performance in the original domain. A score close to 1 indicates high robustness, while a lower score indicates reduced performance due to the domain shift. It can be used to assess the generalization of a policy learned in a simulated environment when applied to a real-world scenario.

Objective Description:

To evaluate the robustness and generalization capability of a policy by measuring its performance ratio between a shifted domain and the original domain, ensuring that policies trained in simulated environments maintain high effectiveness when applied to real-world scenarios. This KPI contributes to evaluating Reliability of the AI-based assistant when dealing with real-world conditions which may be slightly different from source domain, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

If we present by Rshift the performance or reward obtained in shifted domain and by Roriginal the performance or reward in the source domain, the ratio is computed as: Rshift/Roriginal

Unit:

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Domain shift robustness to domain parameters

KPI-DF-056

Task 4.2

Reliability

fully automated evaluation

Description:

Robustness to domain parameters KPI evaluates the sensitivity of the agent’s performance (e.g., Reward) to changes in specific domain parameters (e.g., generators type including renewables in power grid domain). It helps to identify which environmental factors most affect the agent’s performance.

Objective Description:

To assess the sensitivity of the agent's performance to variations in domain parameters, identifying key environmental factors that significantly impact the agent’s effectiveness and robustness, thereby guiding improvements in adaptability and resilience across different scenarios. This KPI contributes to evaluating Reliability of the AI-based assistant when dealing with real-world conditions which may be slightly different from source domain, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

Calculating the variance or standard deviation of the rewards obtained by the agent after introducing changes in the source domain and comparing it to the standard deviation before the changes, can provide insights into the robustness of the agent's performance under varying domain parameters. To formalize the definition, let: - Rbefore represent the rewards obtained by the agent before introducing changes. - Rafter represent the rewards obtained after introducing changes. - σbefore be the standard deviation of Rbefore - σafter be the standard deviation of Rafter - Δσ be the difference between the two standard deviations. The formula to quantify the change in variability due to domain changes is: Δσ =σafter−σbefore

Unit:

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Domain shift success rate drop

KPI-DF-057

Task 4.2

Reliability

fully automated evaluation

Description:

Domain shift – success rate drop KPI measures drop in the performance of the agent after the occurrence of a shift in the source domain.

Objective Description:

To quantify the decline in the agent's performance after a shift in the source domain, providing insights into the agent's ability to maintain effectiveness under altered conditions. This KPI helps in evaluating the agent's resilience, adaptability, and the robustness of its training, facilitating the identification of weaknesses and the development of strategies to improve its performance in dynamic or unpredictable environments. This KPI contributes to evaluating Reliability of the AI-based assistant when dealing with real-world conditions which may be slightly different from source domain, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

A formula to quantify the drop in performance of the agent after a domain shift could be: Performance drop=Roriginal−RshiftedRoriginal Where the R could be a performance metric of the AI-based agent like the cumulated Reward. This formula yields a ratio representing the relative drop in performance, with a higher value indicating a more significant drop due to the domain shift.

Unit:

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Robustness to operator input

KPI-RS-058

Task 4.2

Robustness

semi-automated evaluation

Description:

The KPI should measure or evaluate how the trained agent behaves in terms of robustness if, during the decision-making process where a human operator makes the final decisions, a human operator occasionally intervenes and significantly overrides the autonomous decisions of the trained agent. For agents trained using machine learning methods, this can cause an offset between the type of states encountered in the training data and during deployment, especially for agents trained using reinforcement learning or similar methods where the agent itself decides which actions to execute. As a consequence of this offset, the agent might make poorer decisions if the human operator does not always follow the proposed actions of the agents. To measure how sensitive the agent is to such offsets, this KPI proposes to use a “simulated operator” that does not fully follow the course of actions suggested by the agents, and instead overwrites certain action variables set by the agents in a fraction of time steps.

Objective Description:

Overall, this KPI contributes to evaluating Robustness of the AI-based assistant when dealing with real-world conditions, as part of Task 4.2 evaluation objectives, and O4 main project objective. The KPI is related to Tasks 3.1 and 3.3. Specifically, it is related to goal (4) of Task 3.1 (“Analysis of the impact of human intervention in the decision process on AI agents developed and trained towards fully autonomous behavior”), goal (1) of Task 3.3 (“Develop and expand order-agnostic network architectures adapted to the RL setting to use human-data or human-like perturbations and ensure the system can also make good decisions in the context where actions are partially chosen by the human partner”) and goal (2) of Task 3.4 (“Detect risks early on and potentially inform human supervisors, e.g. relinquish control to a human supervisor or transition into “safety mode” when necessary”).

Formula:

A simulated operator is defined that deviates from the agent's suggestions in a certain percentage of time steps. If agents have to set multiple variables, this deviation can concern only certain variables. The simulated operator can be based on logged data, or in the absence of such data can be a random agent. The performance of the primary AI agent (e.g., environment native reward function) is then measured in the precence of these deviations.

Unit:

Environment reward, or the unit of measurement of a suitably-chosen use-case specific metric.

Modules:

Recommendation module

Simulation engine

Digital environment

Domains:

Railway

Power Grid

ATM

Assistant adaptation to user preferences

KPI-AS-068

Task 4.1

Solution quality

semi-automated evaluation

Description:

Assistant adaptation to user preferences assesses how the AI assistant adapts to operator’s choices and preferences. The assistant provides several recommendations which represent different trade-offs of different objectives, and the operator eventually makes one single choice. This KPI assume that an estimation of epistemic uncertainty is calculated for each action recommendation, which can be used later by the human to select the action in a multi-objective setting. This KPIs thus aims at measuring: - Whether the choice that the operator makes is in the set of recommendations proposed by the assistant, - How is the recommendation chosen by the operator ranked compared to the other ones, - Whether the recommendation chosen by the operator has a high epistemic uncertainty compared to the other recommendations.

Objective Description:

This KPI contributes to evaluating Solution quality of the AI-based assistant, as part of Task 4.1 evaluation objectives, and O2 main project objective.

Formula:

See calculation steps: for this KPI, raw values are given as lists to allow different possible summary calculations.

Unit:

Vector with 6 values without units, for each step: - the lowest epistemic uncertainty of recommendations - the highest epistemic uncertainty of recommendations - the epistemic uncertainty of the recommendation chosen by the operator - the rank of the recommendation chosen by the operator - the total number of proposed recommendations - whether the choice that the operator makes is in the set of recommendations proposed by the assistant

Modules:

Recommendation module

Domains:

Power Grid

Drop-off in reward

KPI-DF-069

Task 4.2

Robustness

fully automated evaluation

Description:

Drop-off in reward calculates difference in reward between situation with perfect information and imperfect information either through natural malfunctions while measuring data or through intentional perturbations by an attacker.

Objective Description:

This KPI contributes to evaluating Robustness of the AI-based assistant, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

Total reward obtained with perfect information - Total reward obtained with imperfect information

Unit:

Same unit as reward or percentage of reward with perfect information

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Frequency changed output AI agent

KPI-FF-070

Task 4.2

Robustness

fully automated evaluation

Description:

Frequency changed output AI agent calculates the number of times the output of the AI agent (i.e. the action the agent chooses) is changed due to perturbations

Objective Description:

This KPI contributes to evaluating Robustness of the AI-based assistant, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

While running the environment feed the AI agent both the unperturbed and perturbed input, compare the actions the agent chooses and count how many times the actions are different

Unit:

None (number)

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Severity of changed output AI agent

KPI-SF-071

Task 4.2

Robustness

fully automated evaluation

Description:

Severity of changed output AI agent KPI measures similarity of the action chosen by AI agent based on a perturbed input to the action chosen with perfect information. Average pre-defined similarity score per changed action indicating how different the new action is from the original one.

Objective Description:

This KPI contributes to evaluating Robustness of the AI-based assistant, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

Assign similarity score to every pair of actions the AI agent can take and sum up this score for every time the agent's action is changed by perturbations

Unit:

Average similarity score per action change

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Steps survived with perturbations

KPI-SF-072

Task 4.2

Robustness

fully automated evaluation

Description:

Steps survived with perturbations KPI calculates the number of steps the AI agent is able to survive in environment with perturbation agent

Objective Description:

This KPI contributes to evaluating Robustness of the AI-based assistant, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

Count number of steps before a game over in the environment when including a perturbation agent

Unit:

Number of steps

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Vulnerability to perturbation

KPI-VF-073

Task 4.2

Robustness

fully automated evaluation

Description:

Vulnerability to perturbation KPI measures vulnerability of specific value in observed state to perturbations, i.e. how likely it is that perturbing the value will result in a change in action chosen by the AI agent

Objective Description:

This KPI contributes to evaluating Robustness of the AI-based assistant, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

For value x1 in observed state, count how many times x1 is perturbed significantly during the episode and count how many times it is perturbed when the AI agent's chosen action is changed by the perturbations and divide the latter by the former

Unit:

Proportion of times perturbing the value resulted in a changed action

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Area between reward curves

KPI-AF-074

Task 4.2

Resilience

fully automated evaluation

Description:

Area between reward curves calculates area between the curve corresponding to the reward obtained in each step in an environment where the AI agent has perfect information and the curve for an environment where the agent's input is perturbed

Objective Description:

This KPI contributes to evaluating Resilience of the AI-based assistant, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

Use the trapezoidal rule for numerical integration to compute the area underneath the two curves and subtract

Unit:

None (cumulative reward)

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Degradation time

KPI-DF-075

Task 4.2

Resilience

fully automated evaluation

Description:

Number of steps/episodes until reward reaches its lowest point after introducing perturbations to the input of the AI agent

Objective Description:

This KPI contributes to evaluating Resilience of the AI-based assistant, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

Find step/episode where reward is the lowest and get the difference to the step/episode the perturbations were introduced

Unit:

Number of steps/episodes

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Restorative time

KPI-RF-076

Task 4.2

Resilience

fully automated evaluation

Description:

Number of steps/episodes until reward recovers to its highest point after reaching the lowest point after introducing perturbations to the input of the AI agent

Objective Description:

This KPI contributes to evaluating Resilience of the AI-based assistant, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

Find step/episode where reward is the highest and get the difference to the step/episode with the lowest reward from KPI-DF-075

Unit:

Number of steps/episodes

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Similarity state to unperturbed situation

KPI-SF-077

Task 4.2

Resilience

fully automated evaluation

Description:

Similarity state to unperturbed situation KPI measures similarity of the state in an environment where AI agent's input is perturbed to the state in the same context of an environment with perfect information

Objective Description:

This KPI contributes to evaluating Resilience of the AI-based assistant, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

Choose a metric to measure the similarity between states, e.g. cosine similarity, Euclidean distance, etc., and compute similarity between the state in each step of environment with perfect information and one with perturbed input. Plot curve of similarity in each step and evaluate using KPI-AF-074, KPI-DF-075 and KPI-RF-076

Unit:

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Reward per action

KPI-RF-078

Task 4.2

Robustness

fully automated evaluation

Description:

Reward per action KPI calculates average reward obtained for each action performed by the AI agent

Objective Description:

This KPI contributes to evaluating Robustness of the AI-based assistant, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

Total reward obtained / Number of actions performed

Unit:

Same unit as reward

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Explainability Robustness

KPI-EF-086

Task 4.2

Robustness

fully automated evaluation

Description:

The Explainability Robustness KPI evaluates the stability of explanations against small input perturbations, assuming the model’s output remains relatively unchanged. A robust explanation should not fluctuate significantly when the input is slightly modified. The Average Sensitivity Metric quantifies this stability by applying small perturbations to the input data and measuring how much the explanation changes. Since computing sensitivity over all possible perturbations is impractical, Monte Carlo sampling is used to estimate these variations efficiently.

Objective Description:

This KPI ensures that AI-driven explanations remain reliable and aligned with the actual decision-making process of the model. It helps evaluate interpretability methods in AI systems used in critical applications. This KPI contributes to evaluating AI trustworthiness, acceptability and trust of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O4 main project objective.

Formula:

Average of explanation differences computed over multiple runs Explanation differences measure the sensitivity estimate for each sample with p-norm (e.g., L1 or L2 distance) applied to the difference between: - the explanation for the original input - the explanation for the perturbed input

Unit:

Change in explanation values (e.g., L1 or L2 norm difference), Normalized score indicating robustness

Modules:

AI agent

Digital environment

Domains:

Power Grid

Explainability Faithfulness

KPI-EF-087

Task 4.2

Robustness

fully automated evaluation

Description:

The Faithfulness KPI assesses whether the feature importance scores provided by an explanation method accurately reflect the model’s decision-making process. It systematically removes or alters features and measures the impact on the model’s predictions. The assumption is that if a feature is truly important, removing or altering it should significantly affect the model’s output.

Objective Description:

Formula:

Sum of the absolute difference between the model prediction for the original input and the model prediction when a feature is removed, masked, or replaced, over the total number of evaluated samples

Unit:

Change in model confidence score (e.g., probability difference), Normalized score indicating faithfulness

Modules:

AI agent

Digital environment

Domains:

Power Grid

Perceived decision novelty

KPI-PS-089

Task 4.3

Social-technical decision quality

special evaluation setup

Description:

This KPI represents human operators’ self-reported subjective assessment of nontriviality for the AI-generated solutions measured with a questionnaire.

Objective Description:

This KPI contributes to evaluating Social-technical decision quality of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective.

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Domain shift forgetting rate

KPI-DF-090

Task 4.2

Reliability

fully automated evaluation

Description:

The rate at which an agent forgets its performance in the original domain after being exposed to a shifted domain. It helps to measure the extent to which learning in the new domain negatively impacts the agent’s ability to perform in the original domain.

Objective Description:

The objective of computing the Forgetting Rate in Domain Shift is to quantify the decline in an agent's performance on the original domain after being trained or exposed to a shifted domain. This metric helps assess the extent of negative transfer, ensuring that adaptation to the new domain does not excessively degrade prior knowledge. A higher forgetting rate indicates a more significant loss of previously learned knowledge due to domain shift. This KPI contributes to evaluating Reliability of the AI-based assistant when dealing with real-world conditions which may be slightly different from source domain, as part of Task 4.2 evaluation objectives, and O4 main project objective.

Formula:

Let: - P[init/orig] be the agent’s performance (e.g., accuracy, reward, or another metric) in the original domain before exposure to the new domain. - P[post/orig] be the agent’s performance in the original domain after training in the shifted domain. The forgetting rate (FR) can be computed as: FR=( P[init/orig]− P[post/orig])/P[init/orig]

Unit:

Proportion or Percentage

Modules:

Digital environment

Domains:

Railway

Power Grid

ATM

Reflection on operator trust

KPI-RS-091

Task 4.3

Long-term consequences of AI assistants

special evaluation setup

Description:

This KPI represents self-reported human operators’ perception of the changes in their trust for the AI assistant over time (increased/decreased) on a Likert scale.

Objective Description:

This KPI contributes to evaluating Long-term consequences of AI assistants of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective. It is also relevant to protocols and concepts defined in D1.1 such as “Transparency”, “Human Agency and Oversight”, “Credibility and Intimacy”. Furthermore, it is also relevant to the overall project KPI-ET-7 "% of acceptance of human operators regarding AI4REALNET solutions".

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Reflection on operator agency

KPI-RS-092

Task 4.3

Long-term consequences of AI assistants

special evaluation setup

Description:

This KPI represents self-reported human operators’ perception of the changes in their agency working with the AI assistant over time (increased/decreased) on a Likert scale.

Objective Description:

This KPI contributes to evaluating Long-term consequences of AI assistants of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective. It is also relevant to protocols and concepts defined in D1.1 such as “Transparency”, “Decision support for the human operator”, “Human Agency and Oversight”.

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Reflection on operator de-skilling

KPI-RS-093

Task 4.3

Long-term consequences of AI assistants

special evaluation setup

Description:

This KPI represents self-reported human operators’ perception of the changes in their own skills working with the AI assistant over time (increased/decreased) on a Likert scale.

Objective Description:

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Reflection on over-reliance

KPI-RS-094

Task 4.3

Long-term consequences of AI assistants

special evaluation setup

Description:

This KPI represents self-reported human operators’ perception of their potential over-reliance on the AI assistant on a Likert scale.

Objective Description:

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Reflection on additional training

KPI-RS-095

Task 4.3

Long-term consequences of AI assistants

special evaluation setup

Description:

This KPI represents self-reported human operators’ perception of the additional training necessary to adopt the AI assistant on a Likert scale.

Objective Description:

This KPI contributes to evaluating Long-term consequences of AI assistants of the AI-based assistant, as part of Task 4.3 evaluation objectives, and O3 main project objective. It is also relevant to protocols and concepts defined in D1.1 such as “Additional training about AI for human operators” and “Societal and Environmental Well-being”.

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Reflection on biases

KPI-RS-096

Task 4.3

Long-term consequences of AI assistants

special evaluation setup

Description:

This KPI represents self-reported human operators’ perception of biased decisions potentially produced by the AI assistant with respect to gender/ethnicity/age or commercial interests on a Likert scale.

Objective Description:

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

Predicted long-term adoption

KPI-PS-097

Task 4.3

Long-term consequences of AI assistants

special evaluation setup

Description:

This KPI represents predicted adoption of the AI assistant by users, stakeholders, or experts on a Likert scale.

Objective Description:

Formula:

As operationalized by the questionnaire (normally Likert-scales with several items that are rated on a scale of e.g. 1–5 or 1–7).

Unit:

Ordinal data response on a Likert scale (or potentially a similar response on an interval scale)

Modules:

(non applicable)

Domains:

Railway

Power Grid

ATM

AI4REALNET evaluation protocols

Filters

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains:

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains:

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains:

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains:

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains:

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains:

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains:

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains:

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains:

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains:

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains:

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains:

Description:

Objective Description:

Formula:

Unit:

Modules:

Domains: