Conversational agents for mental health apps: now with added artificial empathy


For those new to the area, a conversational agent is a software application in which users have a conversation with a computer via natural language (spoken or written dialogue). This type of agent is generally programmed to help us achieve something. The idea of conversational agents is to remove the need for us to use visual interfaces and manual methods of input, enabling us to seamlessly interact with a computer just by talking to it. If you own a smartphone or smart speaker, you’ll most likely already have a conversational agent such as Apple’s Siri, Android’s Google Now or Amazon’s Alexa.

When it comes to conversational agents for mental health, ELIZA is generally cited as the first. There is some truth to that, but it’s worth digging into exactly why ELIZA was originally created. Weizenbaum built ELIZA in the mid-1960s to demonstrate the superficiality of communication between humans and machines; it was never designed as a therapeutic tool. To Weizenbaum’s surprise, his demonstration backfired when a number of individuals attributed human-like feelings to ELIZA. The program continued to circulate in research through the 1960s and 70s, but little was done to improve or build upon it before it faded away to become a footnote in history.

Some 40 years later we find a re-emergence of the subject, with an array of conversational agents for mental health, including a chatbot therapist study covered by Andres Fonseca in an elf blog a few months ago. I myself have spent the last four years researching the acceptability of a conversational agent for mental health with older adults. Even with this bias, I sit comfortably on the fence as to whether this type of intervention is beneficial, and I am waiting to see what further advances appear. “Why?” you may ask. The big problem I have with conversational agents is how we get them to show appropriate empathy. I don’t believe we’re at a stage where Artificial Intelligence (A.I.) can be easily applied to enable an agent to read a person’s emotional state and respond in an appropriate manner under all circumstances, particularly those that involve discussing mental health problems. If an agent manages to convince you that it’s showing empathy at present, it’s generally due to clever script writing and set paths, not a fancy algorithm designed to read your sentiment and respond accordingly.

While writing up my thesis, I have spent a large amount of time thinking about how to resolve this issue. My conclusion is that not all solutions to the problem require an A.I. capable of reading emotions. It may be possible to create an algorithm that enables the program to give the most appropriate response to a question through the use of peer support community data. This idea helped form the technology behind the NewMind feasibility project I am currently involved in. A recent paper by Morris, Kouddous, Kshirsagar, & Schueller (2018) has shown that my idea wasn’t as unique as I’d thought, but that there may still be room to explore it. So with that in mind, let’s take a look at what was found.

The goal of the work was to take some initial steps towards building a conversational agent that can respond immediately, convincingly and with credible empathy. The work was split into two studies:

  1. Study 1 conducted preliminary testing to assess performance metrics and user perceptions;
  2. Study 2 was a controlled study to examine how users would perceive an empathetic agent if it was capable of performing at the same level as a human peer.

Empathy is a key component of mental health conversations, but can computers ever show sufficient appropriate empathy towards humans?

Study 1 – Preliminary testing

Methods

So how did the system work behind the scenes?

A dataset of previously answered questions was extracted from the Koko platform. The dataset comprised peer interactions: 72,785 posts and 339,983 responses. A back-end system was written to enable the automatic pairing of previously archived responses with incoming posts, and a front-end system was developed to display responses and solicit user feedback. An information retrieval approach was used to automatically return responses to the user. For each incoming post, a search for similar posts was carried out in the peer support interactions dataset. Once a similar post was found, its associated responses were reviewed and the one rated most favourably was returned to the user.
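The paper doesn’t publish its retrieval code, so purely as an illustration of the general approach described above, here is a minimal sketch using TF-IDF similarity. The similarity measure, data, and function names are my assumptions for the sake of the example, not the actual Koko implementation.

```python
# Minimal sketch of the retrieval idea described above (assumptions: TF-IDF
# similarity and a simple "best rated response" rule; the real Koko back end
# may differ considerably).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus: each archived post has one or more rated responses.
archived_posts = [
    "I feel anxious about starting a new job",
    "I can't stop worrying about my exams",
]
responses = {
    0: [("Starting somewhere new is scary, but you got the job for a reason.", 0.9)],
    1: [("Exams are stressful; try breaking revision into small chunks.", 0.8),
        ("Just relax.", 0.2)],
}

vectoriser = TfidfVectorizer()
post_matrix = vectoriser.fit_transform(archived_posts)

def respond(incoming_post: str) -> str:
    """Find the most similar archived post, then return its best-rated response."""
    query = vectoriser.transform([incoming_post])
    similarities = cosine_similarity(query, post_matrix).ravel()
    best_post = similarities.argmax()
    # Pick the response with the highest historical rating.
    return max(responses[best_post], key=lambda pair: pair[1])[0]

print(respond("I'm anxious about starting a new job next week"))
```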

What did users of the system see?

Once a pre-existing response had been retrieved, it was presented to the user as if it had been algorithmically generated by a robot. Users were given no indication that the agent was passing off predefined peer responses as its own. Immediately after a user posted in the chat box, the chat bot would inform them that it might have a response of its own: “While you wait for responses, I may have an idea that might help you…”. After the participants read the response, they were asked to rate it on a three-point Likert scale (good, ok, bad).

Who participated?

Participants included 37,169 individuals who signed up for Koko between mid-August and mid-September of 2016. No demographic data was taken for the study.

Results

User ratings of responses from both the agent and peers were evaluated: 3,770 responses from the bot and 43,596 from peers. The data indicated that responses from peers were significantly more likely to be rated as good than responses from the agent, but it is worth noting that 79.20% of the responses from the system were deemed ok or good, suggesting that most users found the results acceptable.
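For readers unfamiliar with how a difference in rating proportions like this is typically tested, here is a rough sketch using a chi-square test on a 2×2 table. The good/other splits below are placeholders for illustration only, not figures from the paper.

```python
# Illustrative only: testing whether the proportion of "good" ratings differs
# between bot-selected and peer responses. Counts are placeholders, NOT data
# from Morris et al. (2018).
from scipy.stats import chi2_contingency

bot_good, bot_other = 1500, 2270      # placeholder split of the 3,770 bot responses
peer_good, peer_other = 26000, 17596  # placeholder split of the 43,596 peer responses

table = [[bot_good, bot_other],
         [peer_good, peer_other]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p_value:.3g}")
```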

Conclusions

The automated system, while useful within the context of the study, was not a sufficient solution on its own. The results of the study indicated that peer responses were rated significantly higher than the responses selected by the system.

The thing that stood out for me was the confusion the system had with identifying gender: in some cases male users were assumed to be female, and vice versa. The knock-on effect could potentially alienate users and undermine confidence in the system.

When you consider how simplistic the model used to select a response from the dataset was, and how well it scored, it’s not hard to imagine that with some refinement (hopefully not at the cost of response speed) the system may have potential. But before I get too carried away, let’s take a look at the controlled study.

On a three-point Likert scale (good, ok, bad) 79% of computer responses were deemed ok or good by users, which suggests they are generally acceptable. However, peer responses were rated significantly higher than the responses selected by the system.

Study 2 – Controlled study

Methods

At sign-up, a segment of Koko users were randomly assigned to one of two conditions. In both the control and the experimental condition, users were shown responses from their peers as usual; however, users in the experimental condition were told their responses were coming from an artificial agent. To limit the impact on the Koko community, the study allotted 2/3 of users to the control condition and 1/3 to the experimental condition.

The only part of the Koko experience that differed between groups was the notification that preceded the delivery of peer responses. In the control group the bot responded with “Someone replied to your post. Let’s check it out”, and in the experimental group, before returning a response, the bot would say “While you wait for responses, maybe I can help… I’m just a robot and I’m still learning, but here’s a thought”.

The language used was deliberately ambiguous and submissive, in the hope that users would be more likely to forgive the system should it fail. It was also hoped this approach would lower expectations and mitigate disappointment when users experienced the robot’s shortcomings. Again, responses in both conditions were rated on the three-point Likert scale (good, ok, bad).
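To make the design concrete, here is a small sketch of the study 2 set-up as described above: a 2:1 random allocation and a condition-dependent notification shown before the (always peer-written) response. The function and variable names are my own, not from the Koko codebase.

```python
# Sketch of the study-2 design described above: 2/3 of users to control,
# 1/3 to experimental, with only the framing message differing between groups.
import random

CONTROL_MSG = "Someone replied to your post. Let's check it out"
EXPERIMENTAL_MSG = ("While you wait for responses, maybe I can help... "
                    "I'm just a robot and I'm still learning, but here's a thought")

def assign_condition(rng: random.Random) -> str:
    """Allocate roughly 2/3 of new users to control, 1/3 to experimental."""
    return "control" if rng.random() < 2 / 3 else "experimental"

def notification(condition: str) -> str:
    """Both groups receive a peer response; only the stated source differs."""
    return CONTROL_MSG if condition == "control" else EXPERIMENTAL_MSG

rng = random.Random(42)
for user_id in range(3):
    condition = assign_condition(rng)
    print(user_id, condition, notification(condition))
```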

Who participated?

Participants included 1,284 Koko users who joined the platform between January 18 and 23 in 2018.

Results

Users rated responses less favourably when they were told they had come from an agent: 51% rated responses attributed to the agent as good, while 60.6% rated responses attributed to a peer as good.

Conclusions

Unsurprisingly (to me anyway), despite all responses coming from peers, the alleged source of the responses led to distortions in their perceived quality. It is clear that some resistance to the concept of receiving empathetic responses from a computer program exists. However, it is not clear whether this is specific to users of the Koko platform.

Despite all responses actually coming from peers, users rated responses less favourably when they were told they had come from an agent.

Strengths and limitations

Study 1

Volunteers were recruited via registration to a pre-existing platform (Koko), and no demographic information was collected. Age and gender were inferred from a separate survey of Koko users taken in 2017 (N=496), which in my eyes is not an acceptable substitute for the actual demographics.

Only a single three-point Likert scale (good, ok, bad) was used for reporting the quality of responses. The researchers made no attempt to measure empathy, which to me is an oversight given what could have been achieved with the experiment. No therapeutic measures were used, although at this early stage they may not have been appropriate or easily integrated.

The biggest issue for me with study 1 is the use of a dataset comprised of previously answered questions and responses. It’s likely that a number of different personality types and writing styles are represented within the dataset. Delivering responses based purely on ratings may result in reply styles that are not suited to a user’s personality type or communication style, which may in turn result in a lower rating. Again, the use of a single three-point scale means we cannot pull apart and explore the motives behind a rating any further. An open text box, in addition to the rating scale, asking for a little information about the choice of rating, could have been implemented to allow for some qualitative analysis.

Study 2

The study made no attempt to compare other available agents via other platforms to see if effects were visible across multiple platforms. The lack of a follow-up meant that there was no way to assess the credibility of the experimental manipulation or gauge whether participants actually bought into the study’s claim that they were interacting with a machine and not a person.

As with study 1, the single three-point Likert scale for reporting the quality of responses was limited and no other measures were taken.

Matching computerised responses to the personality types and communication styles of users is likely to improve the overall quality of conversational agents.

Implications for practice

This paper has given me food for thought about the future of empathetic conversational agents. To me personally, it indicates that there may be more than one route to replicating empathy in programs of this nature. However, study 2 highlights that even if we ever do reach a point of perfect empathetic replication within a conversational agent, there may be an inbuilt resistance that stops us from ever fully embracing them. Exploring the source of this resistance, and whether it can be combatted, combined with improvements in the delivery of empathetic responses, will decide whether conversational agents have a future in mental health practice.

What do you think? Do conversational agents have a future in mental health practice?

Conflicts of interest

I’ve been studying conversational agents for mental health with older adults for nearly four years as part of my PhD. I am also involved in NewMinds Plus Funded Feasibility Study into Developing an AI Empathy Agent.

Credits

My PhD research into conversational agents has been carried out under the supervision of Dr Abigail Millings, Dr Steven Kellet, Prof Gillian Anderson and Prof Roger Moore.

The NewMinds project is being led by Dr Fuschia Sirois (Sheffield Uni) and carried out in conjunction with Dr Ian Tucker (Uni of East London), Dr Abigail Millings (Sheffield Uni), Dr Rafaela Neiva Ganga (Liverpool John Moores Uni) and Mr Paul Radin (NHS involvement volunteer).

Links

Primary paper

Morris RR, Kouddous K, Kshirsagar R, Schueller SM. (2018) Towards an Artificially Empathic Conversational Agent for Mental Health Applications: System Design and User Perceptions. J Med Internet Res 2018; 20 (6): e10148 DOI: 10.2196/10148
