vol. 21 no. 3, September, 2016

Cancer information seeking in social question and answer services: identifying health-related topics in cancer questions on Yahoo! Answers

Sanghee Oh, Yan Zhang, and Min Sook Park

Introduction. More and more information users turn to social media to seek cancer-related information. Little is known, however, about their cancer information needs in social question and answer sites; that is, community-based online services where users can ask and answer one another about a variety of topics.
Method. The current study investigates cancer-related topics that users seek and share by analysing questions on Yahoo! Answers. A total of 81,434 cancer questions were randomly collected using Yahoo! Answers application program interfaces.
Analysis. We analysed these questions using text mining techniques. First, we extracted terms related to health topics on cancer from the questions. Second, we analysed the terms based on a layered model of contexts for health information searching. The analysis revealed patterns of user needs at the population level.
Results. Of the terms extracted from the collected questions, a total of 420 terms were selected and classified to reveal demographic, cognitive, affective, social, situational, and technical information related to cancer information needs, demonstrating that askers provided rich information about their personal problems, emotions, social relationships, and life situations in questions.
Conclusions. Consumers' cancer information needs expressed in social question and answer sites are multi-dimensional. Health care professionals and system developers should examine consumers' cancer information needs in light of their specific demographic, cognitive, emotional, social, situational, and technical contexts.


Cancer is the second leading cause of death in the U.S. Approximately 14.5 million Americans have a history of cancer, and about 1.7 million new cases were expected to be diagnosed in 2015 (American Cancer Society, 2015). As a serious and complex disease, cancer leads to profound psychological, social, and financial changes in life. To cope with this illness, nearly half of all Americans (48.7%) have looked for cancer-related information. Percentages were higher for those who have been affected by cancer (63.1% of cancer survivors and 54.6% of those with family histories) (Hesse, Arora, Burke and Rutten, 2008). Cancer patients look for information on a wide range of topics, including treatment, diagnosis and prognosis, rehabilitation, interpersonal relationships, and financial and legal information (Adams, Boulton and Watson, 2009; Yetisgen-Yildiz and Pratt, 2006), from many different sources, such as health professionals, the Internet, mass media, family and friends, and health organizations (Hesse et al., 2008; Yetisgen-Yildiz and Pratt, 2006).

In recent years, online cancer support groups have become an important information source for cancer patients. These cancer-specific communities exist in various forms, such as mailing lists, bulletin boards, web-based discussion forums and social networking sites (Eysenbach, Powell, Englesakis, Rizo and Stern, 2004). In these communities, patients and caregivers seek and provide information, express personal opinions, share personal experiences, and offer encouragement (Blank, Schmidt, Vangsness, Monteiro and Santagata, 2010; Ginossar, 2008). The practical and experiential information from peers, such as how to deal with day-to-day challenges and how to face the consequences of being ill, is highly valued, as it makes people feel better informed, better prepared to manage their illness, and less lonely (Nambisan, 2011; Rozmovits and Ziebland, 2004; van Uden-Kraan, et al., 2008). Among the various forms of communities, social question and answer services, comparatively new social media platforms for health information, have received little attention. In particular, little is known about them as a source of cancer information.

The purpose of the current study was to investigate the cancer-related topics that users of these services are interested in and discuss. Social question and answer sites are community-based online services through which users ask and answer questions of one another about a wide range of topics in everyday life, including various health-related topics. We chose to study this venue due to the great variety and the high volume of cancer-related topics on it. The research questions we investigated were:

  1. What are the health topics that users discuss and share in questions about cancer in social question and answer sites?
  2. How frequently have health topics been discussed in cancer questions on such sites?
  3. What are the information contexts of the health topics that we could observe from cancer questions on such sites?

Previous studies have investigated health information seeking on social question and answer sites. For example, Kim, Pinkerton, and Ganesh (2011) analysed questions relating to Influenza A Virus (H1N1) and answers on Yahoo! Answers and identified important topic categories pertaining to H1N1 information such as flu-specific terms, medical and non-medical concerns, and sources of information. Studies also found that major types of information needs that consumers presented in questions on social question and answer sites included facts, explanations, advice, personal stories and emotional support (Westbrook, 2015; Zhang, 2013). Furthermore, earlier studies revealed that questions on these sites also include information that helps contextualize requests. For example, when asking about eating disorders, users provided personal narratives centred on past experiences and effects (Oh, He, Jeng, Mattern and Bowler, 2013). Zhang (2010) further noted that consumers also provide contextual information, such as demographic, social, and environmental information, in order to elicit relevant answers. These previous studies have generated valuable insights into ways in which individual users formulate and express health-related questions, but fell short of revealing possible patterns at the population level because of the small data sets that were used (Oh et al., 2013; Westbrook, 2015; Zhang, 2013).

In the current study, however, we observed user expressions pertaining to cancer from a holistic point of view by collecting and analysing a large number of cancer-related questions posted on a social question and answer site (about 80,000 questions). Our focus was to examine how users convey cancer-related questions, to identify relevant topics that they discuss, and to understand users' cancer-related questioning behaviour. The terms people use in their questions could represent the health topics that concern them. Therefore, we first extracted and analysed the terms from cancer questions to identify the topics. We then counted the number of questions associated with each term to observe the common issues and concerns across the health topics. We further analysed the health topics by classifying them into several categories of information contexts in order to understand users' cognitive, affective, social or other status when seeking and sharing cancer information on these sites.


Health information consumers' questioning behaviour

Acquiring information, or information seeking, in online health communities is often achieved by asking peers questions (Rubenstein, 2015; Wildemuth, de Bliek, Friedman and Miya, 1994). White (2000) reported that about 23% of the messages posted on a colon cancer list contained questions. Ginossar (2008) reported that 17% of the messages on a lung cancer forum and 19% of the messages on a chronic lymphocytic leukaemia forum were questions.

Consumers' questioning behaviour in online health communities has been conceptualised and analysed from several different perspectives. First, it is viewed as a linguistic behaviour and researchers focus on identifying the linguistic characteristics of questions. For example, Smith, Stavri, and Chapman (2002) found that the majority of consumer vocabularies describing the clinical findings and features in questions that consumers submitted to a cancer information service matched with controlled health care terminologies. Slaughter, Soergel, and Rindflesch (2006) characterised the semantic relationships in consumer questions on a medical Website based on the relationship classes and structure of the Unified Medical Language System semantic network. Other linguistic characteristics, such as the length of questioning messages (Zhang, 2010) and the number of sentences contained in questions (Slaughter et al., 2006) were also examined.

Second, consumers' questioning behaviour is analysed in relation to the function of expected answers to a question. For example, White (2000) analysed questions on a colon cancer listserv using Graesser's taxonomy of questions (Graesser, McMahen and Johnson, 1994), and found that there were eighteen types of questions, out of the twenty outlined in the taxonomy. Among them, verification questions accounted for nearly half of the questions, followed by directives, concept completion questions, and judgmental questions.

Third, questions are conceptualized as manifestations of consumer health information needs, enacted to fill gaps in an individual's state of knowledge (Belkin, Oddy and Brooks, 1982). Subsequently, consumers' questioning behaviour is viewed as a manifestation of information needs and a form of information seeking. For example, Weinberg and Schmale (1996) found that questions on a breast cancer community website mainly asked about how peers were doing and if they had similar experiences. Klemm, Reppert, and Visich (1998) found that colorectal cancer patients sought information about treatments, alternative therapies, medications, and new drug delivery systems in an online community. White (2000) found that half of the questions in a colon cancer electronic list were medical-related; the subjects of particular concern were medications, diagnosis, treatments, epidemiology, prognosis, and diet.

Contexts for health information searching

Some studies went further to view questions as manifestations of consumers' current states of knowledge (Belkin et al., 1982) that can help contextualize their specific requests. White (2000) pointed out that the majority of cancer questions were accompanied by some type of contexts and rarely asked for information in complete isolation. Consumers established contexts by referring to previous messages, by describing reasons or purposes, or by telling a story. Gooden and Winefield (2007) observed that some people were unaware of exactly what information they needed, so that they provided incidental information to prompt others to discover potentially helpful questions. Following this line of thinking, Zhang (2010) recently examined health-related questions in social question and answer services and found that, in order to elicit relevant information from peers, consumers provided a variety of contextual information. Building on this analysis as well as the existing information behaviour models (Saracevic, 1997; Wilson, 1999), she further developed a layered model to describe how consumers contextualized their inquiries to acquire personally relevant information (Zhang, 2013).

In this study, we adopted Zhang's (2013) layered model of context for health information searching in order to help frame our exploration of how users convey their cancer information needs in questions. We chose Zhang's model for two reasons. First, it provides a comprehensive and systematic view of how users contextualize their health-related requests. Most widely used health information seeking models primarily focus on the health information searching process (Freimuth, Stein and Kean, 1989; Lenz, 1984) or antecedents to information seeking behaviour (e.g., use of a certain channel for health information (Johnson, 1997)), with little attention to users' articulation of health information needs. Second, it was developed by examining health questions on social question and answer sites, in particular, and by analysing user expressions in depth. In this model, five layers of contextual factors were identified that consumers used to contextualise their requests when asking health questions on such sites.

We used the model as a baseline for classifying the health and other topics identified from our data set, and then further developed and revised the model.


Data collection

A total of 81,434 questions posted from 2009 to 2014 were randomly collected from the cancer category in Yahoo! Answers, using its application programming interface. Yahoo! Answers is one of the most frequently used social question and answer services with approximately 4.2 million visitors a month as of June 2016 (Quantcast, 2016). In our dataset, cancer questions vary in length, ranging from 2 to 1,691 words (mean = 84.1). Long questions often contain detailed descriptions of personal health situations (see Figure 1).

Figure 1: An example of a cancer question posted in Yahoo! Answers

Data analysis

Text mining is a machine-supported process of discovering new patterns of knowledge and information from unstructured textual data by utilizing various techniques and algorithms for information retrieval, information extraction, and natural language processing (Feldman & Dagan, 1995; Hotho, Nürnberger and Paaß, 2005). Text mining software, IBM® SPSS® Modeler Premium (SPSS Modeler), was used to extract terms from the 81,434 cancer questions.

Text mining often consists of three steps: (1) text pre-processing, (2) text representation, and (3) knowledge discovery (Hu and Liu, 2012). We carried out text mining following these steps. First, the SPSS Modeler was used to pre-process the questions by removing stop words, combining different forms of a word (stemming), and cleaning up synonyms, acronyms, and typos in cancer questions. The American-English Dictionary and Medical Subject Headings were used to identify both generic and medical terms in questions.
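As an illustration only, the pre-processing step can be sketched in a few lines of Python. This is not the SPSS Modeler procedure used in the study: the stop-word list, the synonym map, and the suffix-stripping rule below are small hypothetical stand-ins for the dictionaries and stemming described above.

```python
import re

# Hypothetical, deliberately tiny stand-ins for the dictionaries used in the study.
STOP_WORDS = {"a", "an", "the", "i", "my", "is", "in", "and", "has", "have", "of", "to"}
SYNONYMS = {"tumor": "tumour", "chemo": "chemotherapy", "mom": "mother"}


def crude_stem(token):
    """Strip a plural 's' (a rough stand-in for real stemming)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(question):
    """Lower-case, tokenise, drop stop words, stem, and map synonyms."""
    tokens = re.findall(r"[a-z]+", question.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    stemmed = [crude_stem(t) for t in kept]
    return [SYNONYMS.get(t, t) for t in stemmed]


example = preprocess("My mom has lumps in the breast and a tumor")
print(example)
```

A production pipeline would replace each stand-in with a proper resource (a full stop-word list, a real stemmer, and a medical vocabulary), but the order of operations is the same as in the description above.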

Second, the SPSS Modeler extracted terms from the pre-processed text for text representation. If a term appeared multiple times in a question, it was counted only once in our analysis. Thus, the number of unique questions per term was counted. SPSS Modeler enables extraction of up to 5,000 concepts from a data set by default. We used z-scores to find an appropriate cut-off point to represent the common health topics discussed in questions. The calculation using a z-score cut-off of 0 resulted in a total of 751 terms, each of which appeared in more than 189 questions.
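The counting rule and the cut-off can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact computation: it assumes the z-score is taken over the per-term question counts and that terms strictly above the cut-off are kept. The function names are hypothetical. Note that a z-score of 0 corresponds to the mean, so a cut-off of 0 keeps the terms that occur in more questions than the average term does.

```python
from collections import Counter
from statistics import mean, pstdev


def question_frequencies(questions):
    """For each term, count the number of questions that contain it."""
    counts = Counter()
    for terms in questions:
        counts.update(set(terms))  # set(): a repeated term counts once per question
    return counts


def terms_above_cutoff(counts, z_cutoff=0.0):
    """Keep terms whose question count has a z-score above z_cutoff."""
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {}  # all terms equally frequent: nothing lies above the mean
    return {t: n for t, n in counts.items() if (n - mu) / sigma > z_cutoff}


# Each question is represented by its list of extracted terms.
questions = [
    ["pain", "pain", "lump"],
    ["pain", "fever"],
    ["pain", "lump"],
    ["cough"],
]
counts = question_frequencies(questions)
selected = terms_above_cutoff(counts)
print(counts, selected)
```

In this toy data set the mean question count is 1.75, so only the terms found in two or more questions are retained.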

Third, the 751 terms and a set of associated questions were reviewed to identify the contextual layers related to cancer information seeking in the questions. The five layers in Zhang's (2013) model served as the top-level categories. In the meantime, the definition of each layer as well as subcategories under some layers evolved. Also at this stage, the terms that were too general to identify contexts (e.g., people, person, health, place, life) were removed from further analysis. As a result, 420 specific terms were identified; they appeared in 79,125 questions that represent askers' cancer-related concerns.

All three authors participated in classifying the terms into categories and subcategories at the third stage of analysis, and each of them independently coded a random sample of 118 terms (about 16% of the total number of terms used for the manual review). The intercoder reliability among the three coders, measured using Cohen's (1960) κ, reached 0.81. According to Landis and Koch's (1977) scale, a κ value between 0.81 and 1.00 indicates almost perfect agreement (for comparison: fair, 0.21–0.40; moderate, 0.41–0.60; substantial, 0.61–0.80).
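For readers unfamiliar with the statistic, the following sketch computes Cohen's κ for one pair of coders as (observed agreement − chance agreement) / (1 − chance agreement). Cohen's κ is defined for two raters; how the three coders' judgements were combined into a single value is not stated here, so the pairwise form below is only an assumption about one plausible building block, and the labels are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Proportion of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each coder's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for four terms (layer names abbreviated).
coder_1 = ["cognitive", "cognitive", "social", "social"]
coder_2 = ["cognitive", "cognitive", "social", "cognitive"]
kappa = cohen_kappa(coder_1, coder_2)
print(kappa)
```

With three coders, one common approach is to compute κ for each pair and report the average, although other multi-rater statistics exist.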


Our analyses resulted in six layers of context to account for consumers' questioning behaviour: the demographic, cognitive, affective, situational, social, and technical layers.

Table 1 shows an overview of the six layers and subcategories of each layer. The percentages of the terms and questions were calculated based on the total number of terms (n= 420) and the questions (n= 79,125), respectively.

Table 1: An overview of the six layers and categories of terms under each layer

| Contextual layer | Term category | Terms (n) | Terms (%) | Questions (n) | Questions (%) |
|---|---|---|---|---|---|
| Demographic layer | Sex | 8 | 1.9% | 6,529 | 8.3% |
| | Age | 5 | 1.2% | 8,635 | 10.9% |
| | Total | 13 | 3.1% | 13,744 | 17.4% |
| Cognitive layer | Diseases and conditions | 33 | 7.9% | 56,491 | 71.4% |
| | Body parts and body systems | 85 | 20.2% | 43,324 | 54.8% |
| | Symptoms | 107 | 25.5% | 37,641 | 47.6% |
| | Treatments | 23 | 5.5% | 22,369 | 28.3% |
| | Tests | 19 | 4.5% | 13,901 | 17.6% |
| | Total | 267 | 63.6% | 75,488 | 95.4% |
| Affective layer | Negative feelings | 17 | 4.0% | 9,890 | 12.5% |
| | Positive feelings | 3 | 0.7% | 1,860 | 2.4% |
| | Total | 20 | 4.8% | 11,275 | 14.2% |
| Situational layer | Habitual or addictive behaviour | 5 | 1.4% | 8,146 | 10.3% |
| | Exposure | 18 | 4.3% | 5,781 | 7.3% |
| | Dietary | 8 | 1.9% | 4,566 | 5.8% |
| | Socio-economic | 8 | 1.9% | 4,451 | 5.6% |
| | Sexual behaviour, pregnancy, and birth | 6 | 1.4% | 2,522 | 3.2% |
| | Total | 49 | 11.7% | 23,234 | 29.4% |
| Social layer | Relationships with family members | 18 | 4.3% | 23,235 | 29.4% |
| | Relationships with health care providers/services | 14 | 3.3% | 21,319 | 26.9% |
| | Relationships with acquaintance | 2 | 0.5% | 5,172 | 6.5% |
| | Dating relationships | 3 | 0.7% | 1,723 | 2.2% |
| | Total | 37 | 8.8% | 38,082 | 48.1% |
| Technical layer | Social supports | 17 | 4.0% | 31,791 | 40.2% |
| | Information sources | 17 | 4.0% | 9,750 | 12.3% |
| | Total | 34 | 8.1% | 35,754 | 45.2% |
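The layer totals for questions in Table 1 are smaller than the sums of their categories (in the demographic layer, for example, 6,529 + 8,635 exceeds 13,744). This is what one would expect if a question that contains terms from several categories is counted only once at the layer level. The sketch below illustrates that reading together with the percentage calculation; the term-to-question mapping and the function names are hypothetical, and the counting rule is an assumption rather than a documented step of the analysis.

```python
def pct(count, total):
    """Percentage rounded to one decimal place, as reported in Table 1."""
    return round(100 * count / total, 1)


def layer_question_count(term_to_questions, layer_terms):
    """Number of distinct questions containing at least one term of the layer."""
    matched = set()
    for term in layer_terms:
        matched |= term_to_questions.get(term, set())
    return len(matched)


# Hypothetical mapping from terms to the ids of the questions containing them.
term_to_questions = {
    "male": {1, 2, 3},
    "female": {3, 4},
    "child": {4, 5},
}
demographic_terms = ["male", "female", "child"]

n = layer_question_count(term_to_questions, demographic_terms)
share = pct(n, 10)  # assuming a toy data set of 10 questions
print(n, share)

# The same arithmetic applied to two totals reported in Table 1.
print(pct(13744, 79125), pct(267, 420))
```

Here questions 3 and 4 each match two terms but are counted once, so the layer covers five distinct questions rather than seven.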

Information related to the cognitive layer appeared in 95.4% of the questions, followed by information related to the social, technical, and situational layers. Demographic and affective information appeared in relatively fewer questions. As for the number of terms, the cognitive layer contained the greatest number (267 terms), most of which concerned symptoms (107 terms) or body parts and systems (85 terms). The situational layer contained the second greatest number of terms (49 terms), followed by the social layer (37 terms) and the technical layer (34 terms).

Figure 2. Term and question distributions across the contextual layers

Figure 2 shows a comparison of the term and question distributions across contextual layers. The cognitive layer contains the greatest number of both terms and questions, compared with other categories. The terms assigned to the social and technical layers each make up less than 10% of all terms (social: 8.8%; technical: 8.1%), but these terms appeared in almost half of the questions in the data set (social: 48.1%; technical: 45.2%).

The details of the terms and the layers are explained in the rest of this section. The numbers of terms assigned to each layer ranged widely, from 13 to 267. Thus, the top five most frequently occurring terms in each term category or sub-category are presented in the paper. The rest of the terms in each layer/table are available from the project website (http://socialqa.cci.fsu.edu/cancer/).

Demographic layer

Thirteen terms, primarily concerning age groups and sex, appeared in a total of 13,744 (17.4%) questions (see Table 2). The percentages in Table 2 were calculated based on the total number of questions in the demographic layer (n = 13,744).

Table 2: The top five terms related to sex and age

| Category | Term | Questions (n) | % |
|---|---|---|---|
| Demographic layer | guy | 1,832 | 13.3% |
| Age groups | child | 2,794 | 20.3% |
| | young adult | 381 | 2.8% |

Askers may or may not specify the demographic groups of interest in questions. The analysis we made in this study was based on askers' descriptions in questions. Askers specified their interest in information pertaining to sex using terms including male, female, man, woman, guy, or lady. They specified their interest in information pertaining to age groups using terms including child, teenager, young adult or young people, and adult. Some askers also provided information about ethnicity, such as Asian (in 99 questions) and Hispanic (in 28 questions). These terms, however, were not included in the analysis due to their low frequencies of occurrence.

Cognitive layer

A total of 267 terms were identified from 75,488 questions (95.4%). Due to a high volume of terms and questions in this category, the terms pertaining to specific diseases and conditions, body parts and systems, symptoms, treatments, and tests were further classified into sub-categories.

Specific diseases and conditions

Thirty-three terms indicating cancer or other conditions were identified from 56,491 questions. Among them, 26 were cancer diseases (mentioned in 55,171 questions (73.1%)) and seven were other conditions (mentioned in 4,465 questions (5.9%)). Table 3 shows the top five most frequently occurring cancer-related conditions, excluding the general terms, cancer and tumour. The percentage in Table 3 was calculated based on the total number of questions in the cognitive layer (n = 75,488).

Table 3: The top five terms related to cancer and other conditions

| Category | Term | Questions (n) | % |
|---|---|---|---|
| Cancer | breast cancer | 8,001 | 10.6% |
| | lung cancer | 3,211 | 4.3% |
| | brain tumour | 2,333 | 3.1% |
| Other conditions | infection | 1,925 | 2.6% |
| | human papilloma virus | 527 | 0.7% |
| | sexually transmitted diseases | 302 | 0.4% |

When mentioning specific cancers, askers were attempting to specify the type of cancer that they wanted to know about. Other conditions were often mentioned as an alternative diagnosis to cancer, an accompanying disease to cancer, or a condition that may lead to cancer.

Body location and systems

Askers specified body locations or body systems to indicate the sites where the cancer originated or where symptoms appeared; 85 terms were identified from 43,324 questions (54.8%) and categorised based on the body location or system schema adapted from MedlinePlus (U.S. National Library of Medicine, 2015) (see Table 4). The percentages in Table 4 were calculated based on the total number of questions in the cognitive layer (n = 75,488).

Table 4: The top terms in each sub-category related to human body locations/systems

| Sub-category | Term | Questions (n) | % |
|---|---|---|---|
| Reproductive system | breast | 5,839 | 7.7% |
| Heads and necks | neck | 3,772 | 5.0% |
| Bones, joints, and muscles | arm | 2,191 | 2.9% |
| | bone tissue | 2,144 | 2.8% |
| Digestive system | stomach | 2,624 | 3.5% |
| Lungs and breathing | lung | 2,252 | 3.0% |
| Immune system | lymph nodes | 4,263 | 5.6% |
| | bone tissue marrow | 510 | 0.7% |
| | immune system | 335 | 0.4% |
| | spleen | 204 | 0.3% |
| Cells and genes | cell | 1,390 | 1.8% |
| | cancer cells | 960 | 1.3% |
| | tissue | 924 | 1.2% |
| | gene | 916 | 1.2% |
| | leukocyte | 499 | 0.7% |
| Skin, hair, and nails | skin | 2,807 | 3.7% |
| | hair | 2,165 | 2.9% |
| Endocrine system | thyroid gland | 1,238 | 1.6% |
| | glands | 702 | 0.9% |
| | lymph | 445 | 0.6% |
| | hormone | 432 | 0.6% |
| | pancreas | 338 | 0.4% |
| Brain and nerves | brain | 2,215 | 2.9% |
| | nerve | 328 | 0.4% |
| | lobe | 242 | 0.3% |
| Brain and nerves | heart | 1,123 | 1.5% |
| | urine | 603 | 0.8% |
| | vein | 387 | 0.5% |
| | blood platelet | 205 | 0.3% |

A variety of body parts and systems were mentioned in questions. Terms related to the reproductive system were the most frequently observed, followed by the other body parts and systems shown in Table 4 (the categories are ordered by the number of terms assigned to each). For example, askers often observed lumps or a bumpy texture in breasts, necks, and armpits. They also mentioned immune systems, endocrine systems, and cells and genes, for which they may not have directly observed physical changes or felt pain, but which they raised in relation to their medical histories, e.g., a change in the results of blood tests.


Symptoms

The greatest number of terms (107) was classified into the symptoms category, which covers a variety of symptoms that led askers to suspect that they had cancer. These terms appeared in 37,641 (47.6%) questions. The general term symptom appeared in 6,828 questions. The remaining 106 terms were classified into two categories depending on whether they are visually observable or can only be felt by the patient (see Table 5). The percentage in Table 5 was calculated based on the total number of questions in the cognitive layer (n = 75,488).

Table 5: The top five terms in each sub-category related to symptoms

| Sub-category | Term | Questions (n) | % |
|---|---|---|---|
| Visual or physical changes | lump, mass | 8,857 | 11.7% |
| | size | 3,338 | 4.4% |
| | red | 2,723 | 3.6% |
| | bump | 2,354 | 3.1% |
| | spotting | 1,870 | 2.5% |
| Physical sensations | pain | 8,736 | 11.6% |
| | headache | 2,354 | 3.1% |
| | pressure | 1,402 | 1.9% |
| | fever | 1,034 | 1.4% |
| | burning | 1,001 | 1.3% |

Askers discussed their conditions when noticing a change in their bodies, such as lumps, masses, bumps, moles, and bleeding (haemorrhage or spotting), or any changes in the colour (e.g., red, dark) or size (e.g., swelling) of these existing problems. Major symptoms that can only be felt by the patients included pain, headache, pressure, fever, and burning. In addition, askers mentioned fatigue, nausea, changes in appetite, breathing problems, diarrhoea, back pain, dizziness, hearing problems, and anaemia.


Treatments

Twenty-three treatment-related terms were identified from 22,369 (28.3%) questions. The term treatment was in 7,495 questions (9.9%). The remaining terms were grouped into three sub-categories: therapy, procedure, and medication (see Table 6). The percentage in Table 6 was calculated based on the total number of questions in the cognitive layer (n = 75,488).

Table 6: The top terms in each sub-category related to treatments

| Sub-category | Term | Questions (n) | % |
|---|---|---|---|
| Therapy | chemotherapy | 6,748 | 8.9% |
| | radiation therapy | 3,067 | 4.1% |
| | gene therapy | 712 | 0.9% |
| | electroconvulsive therapy | 227 | 0.3% |
| Procedure | surgery | 3,925 | 5.2% |
| | procedure | 698 | 0.9% |
| | removal | 425 | 0.6% |
| | transplant | 349 | 0.5% |
| | mastectomy | 322 | 0.4% |
| Medication | medicine | 3,015 | 4.0% |
| | antibiotics | 905 | 1.2% |
| | shot | 782 | 1.0% |
| | side effect | 748 | 1.0% |
| | dose | 488 | 0.6% |

Chemotherapy appeared the most frequently, followed by radiation therapy. Many surgical procedures were discussed, with the term surgery appearing the most frequently. Askers mentioned antibiotics and shots when asking about medicines they were taking, or were willing to take, before or after their cancer treatments. Askers were also concerned about side effects and dosages of medications.


Tests

A total of 19 test-related terms were identified from 13,901 (17.6%) questions. The general terms test and examination appeared in 4,632 (6.1%) questions. The remaining terms were classified into diagnostic/monitoring tests and screening tests (see Table 7). The percentage in Table 7 was calculated based on the total number of questions in the cognitive layer (n = 75,488).

Table 7: The top five terms in each sub-category related to tests

| Sub-category | Term | Questions (n) | % |
|---|---|---|---|
| Diagnostic/monitoring tests | scanning | 2,969 | 3.9% |
| | biopsy | 2,142 | 2.8% |
| | hematologic test | 1,894 | 2.5% |
| | ultrasound | 1,534 | 2.0% |
| | x-ray | 1,408 | 1.9% |
| Screening tests | mammography | 752 | 1.0% |
| | check up | 596 | 0.8% |
| | pap smear | 544 | 0.7% |
| | cbc (complete blood count) | 409 | 0.5% |
| | screening | 391 | 0.5% |

The most frequently occurring terms were common diagnostic or monitoring tests, including scanning, biopsy, haematological tests, ultrasound, and x-ray. Askers posted questions about what a test is, what it entails, or what would follow; or they simply provided tests as a piece of background information to help readers interpret their questions. Among screening tests, the most frequently mentioned was mammography. Askers primarily sought help with interpreting test results.

Affective layer

Twenty terms describing negative or positive feelings were identified from a total of 11,275 questions (see Table 8). The percentage in Table 8 was calculated based on the total number of questions in the affective layer (n = 11,275).

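Table 8: The top terms related to askers' emotional status

| Category | Term | Questions (n) | % |
|---|---|---|---|
| Negative feelings | worry | 1,895 | 16.8% |
| | freaking | 1,295 | 11.5% |
| | anxiety | 1,294 | 11.5% |
| | paranoid | 886 | 7.9% |
| | stress | 839 | 7.4% |
| Positive feelings | love | 1,318 | 11.7% |
| | trust | 319 | 2.8% |
| | fun | 254 | 2.3% |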

Most feelings were negative, with worry, freaking, and anxiety topping the list. One term, paranoid, was worth noting: it suggests that questioners acknowledged that their concerns might not be rational. Nevertheless, positive feelings also appeared, with love appearing the most frequently.

Situational layer

Forty-nine terms related to this layer were identified from 23,234 questions (29.4%). These terms were classified into five categories (see Table 9). The percentage in Table 9 was calculated based on the total number of questions in the situational layer (n = 23,234).

Table 9: The top five terms in each sub-category related to health behaviour or lifestyles

| Sub-category | Term | Questions (n) | % |
|---|---|---|---|
| Habitual or addictive behaviour | smoking | 5,975 | 25.7% |
| | drinking | 2,486 | 10.7% |
| | marijuana | 765 | 3.3% |
| | exercise | 608 | 2.6% |
| | nicotine | 264 | 1.1% |
| Dietary | eating | 3,409 | 14.7% |
| | fat | 499 | 2.1% |
| | vitamin | 328 | 1.4% |
| | fruit | 281 | 1.2% |
| | meat | 260 | 1.1% |
| Socio-economic situations | money | 2,165 | 9.3% |
| | health insurance | 1,217 | 5.2% |
| | job | 758 | 3.3% |
| | dollars | 258 | 1.1% |
| | bills | 249 | 1.1% |
| Sexual behaviour, pregnancy and birth | pregnancy | 949 | 4.1% |
| | sex | 891 | 3.8% |
| | birth | 284 | 1.2% |
| | virginity | 237 | 1.0% |
| | birth control | 229 | 1.0% |
| Exposure | sun | 702 | 3.0% |
| | vaccine | 684 | 2.9% |
| | tanning | 531 | 2.3% |
| | chemicals | 458 | 2.0% |
| | virus | 444 | 1.9% |

Three health-related activities, smoking, eating, and drinking, appeared in the greatest number of questions in this layer. The terms money, health insurance, and job topped the list related to individuals' social and economic situations, illustrating askers' financial concerns. In terms of sexual behaviour, most of the top five terms were related to pregnancy and childbirth. In terms of environmental exposure, the term vaccine mostly referred to human papilloma virus (cervical cancer) vaccines, and askers raised many contested issues about getting such vaccines. The terms sun and tanning were often associated with skin cancer.

Social layer

Thirty-seven terms designating users' social relationships were identified from 38,082 questions (48.1%). The social relationships primarily involved family members and health care providers (see Table 10). The percentage in Table 10 was calculated based on the total number of questions in the social layer (n = 38,082).

Table 10: The top terms in each sub-category related to social relationships

| Sub-category | Term | Questions (n) | % |
|---|---|---|---|
| Family members | mother | 8,169 | 21.5% |
| | family | 5,087 | 13.4% |
| | father | 4,230 | 11.1% |
| | parents | 1,625 | 4.3% |
| | grandmother | 1,637 | 4.3% |
| Health care providers/services | doctor | 16,819 | 44.2% |
| | hospital | 3,439 | 9.0% |
| | nurse | 887 | 2.3% |
| | oncologist | 812 | 2.1% |
| | specialist | 734 | 1.9% |
| Acquaintance | friend | 4,885 | 12.8% |
| | teacher | 353 | 0.9% |
| Dating relationships | boyfriend | 967 | 2.5% |
| | girlfriend | 564 | 1.5% |
| | partner | 248 | 0.7% |

In some cases, askers were concerned about family members' health conditions as care-givers. In others, family members were referred to in order to help explain the askers' own conditions. Askers mentioned health care providers, for example, to ask if they need to see doctors. In some cases, askers posted diagnoses from doctors to elicit second opinions.

Technical layer

Thirty-four terms related to this layer were identified from 35,754 questions (45.2%) (see Table 11). The percentage was calculated based on the total number of questions in the technical layer (n = 35,754).

Table 11: The top five terms related to information sources and social supports

| Category | Term | Questions (n) | % |
|---|---|---|---|
| Social supports | help | 16,212 | 45.3% |
| | need | 7,350 | 20.6% |
| | question | 5,933 | 16.6% |
| | answer | 5,692 | 15.9% |
| | idea | 2,857 | 8.0% |
| Information sources | research | 1,792 | 5.0% |
| | web | 933 | 2.6% |
| | report | 858 | 2.4% |
| | internet | 746 | 2.1% |
| | news | 520 | 1.4% |

Terms in the social supports category mostly indicated users' information goals or their expectations about answers from peers in the community. Terms that topped the list were mostly general requests such as help, need, question, and answer. The most frequently occurring terms related to information sources were research, web, report, internet, and news. Information sources were often provided as a piece of background information to set the stage for further questions.


An exploratory approach was taken in the current study to examine consumers' topics of interest and concerns about cancer by extracting and analysing terms that they used to express their needs in cancer questions. Findings demonstrated that, to seek personally-relevant cancer-related information on social question and answer sites, askers disclose multiple layers of personal information, including demographic, cognitive, affective, social, situational and technical information, to contextualise their requests. Our results confirmed earlier observations of consumers' health-related questions posted in online health communities (Weinberg and Schmale, 1996; Wilson, 1999; Zhang, 2010) in the sense that a variety of topics pertaining to health and other associated issues in life are discussed in questions.

Additionally, our analysis contributed to a comprehensive understanding of cancer information needs by revealing the most commonly appearing factors in each contextual layer. In the demographic layer, sex and age are the most frequently mentioned factors, indicating that askers believe that they could obtain more pertinent information by providing this information. This phenomenon is consistent with scientific research showing that both factors are important indicators of the risk of having certain types of cancers (Claus, Risch and Thompson, 1990; Harris, Zang, Anderson and Wynder, 1993).

The cognitive layer shows askers' representations of their current medical situations and their information needs. The diversity of the information needs of cancer patients and care-givers confirmed results from prior studies (Rutten, Arora, Bakos, Aziz and Rowland, 2005; Rutten, Squiers and Treiman, 2006). Our study further revealed that askers often mentioned cancer together with cold, infection, human papilloma virus, sexually transmitted diseases, and diabetes, indicating that people may associate these conditions with a high risk of developing cancer or a higher probability of co-occurrence with a specific type of cancer. Many of these beliefs are consistent with scientific knowledge; it is well known that human papilloma virus and sexually transmitted diseases could be causes of cancers related to reproductive systems, such as cervical cancer and vulvar cancer (Centers for Disease Control and Prevention, 2015).

In the affective layer, feelings expressed in questions were mostly negative, represented by worry, freaking, and anxiety. We draw special attention to one term, paranoid. The term suggests the cyberchondria phenomenon, where people's health concerns are escalated irrationally over the course of information searching (White and Horvitz, 2009). We also found positive feelings, represented by love, trust, and fun. This may be explained by the fact that many of the questions were asked by care-givers, who tended to express their affection for their loved ones in questions.

The situational layer includes users' health behaviour and socio-economic situations. Health behaviour is one of the important topics in cancer information seeking (Shim, Kelly and Hornik, 2006), as major health behaviour and lifestyle changes can prevent certain types of cancer (Anand et al., 2008). Our study further revealed that the askers mentioned smoking, drinking, sex, drug use, vaccines, as well as sun exposure and tanning. Additionally, askers mentioned money and health insurance, indicating their struggle with or interest in finance-related issues, which has also been revealed in previous studies (Rutten et al., 2005).

In the social layer, terms represented family, friends, and acquaintances. They could be cancer patients, care-givers, or someone sharing an asker's family medical history. Their appearance vividly suggests that cancer is not only a personal issue; rather, it affects individuals' social ties, particularly families, in many different ways. Cancer patients search for information about medical systems, primarily looking for whether or not a physician has sufficient experience or qualifications (Rutten et al., 2005). Our findings also indicate that askers post questions after consulting with their doctors, for various reasons; they may not have had enough time to fully discuss their concerns with doctors, or they may not trust their doctors (White and Horvitz, 2009).

The technical layer contains information sources and information-related social supports. Two terms, web and internet, appeared frequently, indicating their status as major sources of cancer-related information. Earlier studies on questions in social question and answer environments indicated that askers requested a variety of information, including facts, explanations, advice, personal stories and emotional support (Westbrook, 2015; Zhang, 2013). The current study corroborated these findings by revealing that askers used terms such as research, experience, idea, advice, and opinion to express their needs.

Our study has several implications. Theoretically, consumers' health information needs expressed in natural language were systematically examined and framed within the contexts of health information behaviour. The findings, in part, confirmed the layered model of contexts in health information searching by disclosing various topics that askers seek and share in questions. We adopted Zhang's model to guide our data analysis because it is specifically concerned with consumers' information needs in the context of health problems. Based on our empirical analysis, we revised the model to specify the most common health topics in each layer in the case of cancer information seeking. The revised model also highlights the intense discussions about social and technical issues, in addition to the cognitive aspects of information seeking in health questions. Correspondingly, we reorganized the layers in the original model and assigned new definitions to some of the layers (see Figure 3).

Figure 3. A comparison of Zhang's and the revised models of contexts in layers

Specifically, the demographic layer focuses on the two most frequently discussed topics in cancer-related questions, sex and age. Zhang's model specifies three cognitive contextual factors: perceived topics of interest, types of information, and consumers' cognitive abilities to articulate their needs. The current model blended the three factors and emphasized the medical aspect of users' representations of their conditions (i.e., diseases, symptoms, treatments, and tests). Non-medical information associated with health behaviour and lifestyles is moved to the situational layer. The social and environmental layer was broken down into two layers: the social layer mainly focuses on the relationships with other people, and the technical layer includes the types and channels of information and social supports that askers seek and share in questions. In the future, the revised model could be applied to analyse the context of health information needs discussed in other types of online communities or social media.

The major methodological implication of the study is that text mining, facilitated by appropriate theoretical lenses, could be an effective way to help understand information seeking behaviour at the population level. Several studies utilized the text mining method to examine users' information needs, but they mainly used cluster analysis to identify the most common topics appearing in online health communities (Chen, 2012; Kim et al., 2011). In our study, going beyond identifying the most common themes, we further analysed the extracted terms based on Zhang's (2013) layered model of context for health information searching. This model-based approach is fruitful. It allowed us to more systematically examine users' cancer-related questioning behaviour in the social question and answer context. This approach could be adopted to examine questions and answers regarding other topics in such contexts or other types of social media.
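The model-based approach described above can be illustrated with a minimal sketch: terms are grouped by contextual layer, and each question is counted once per layer when it contains at least one of that layer's terms. This is not the study's actual pipeline. The term lists are small hypothetical samples drawn from terms mentioned in this paper, the tokenizer is a simple regular expression chosen for illustration, and the sample questions are invented.

```python
import re
from collections import Counter

# Hypothetical, abbreviated term lists; the study classified 420 terms in total.
LAYER_TERMS = {
    "affective": {"worry", "anxiety", "paranoid", "love", "trust"},
    "situational": {"smoking", "drinking", "tanning", "money", "insurance"},
    "technical": {"help", "need", "research", "web", "internet", "news"},
}


def tokenize(text):
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def count_questions_by_layer(questions):
    """Count, for each layer, the questions containing at least one of its terms."""
    counts = Counter()
    for question in questions:
        tokens = tokenize(question)
        for layer, terms in LAYER_TERMS.items():
            if tokens & terms:  # non-empty intersection: the layer is present
                counts[layer] += 1
    return counts


if __name__ == "__main__":
    # Invented sample questions, for illustration only.
    sample_questions = [
        "I need help, is a mole after tanning a sign of skin cancer?",
        "My mom has breast cancer and I worry about money for treatment.",
        "Does research on the web say smoking causes lung cancer?",
    ]
    for layer, n in sorted(count_questions_by_layer(sample_questions).items()):
        print(layer, n)
```

The sketch matches exact word forms only, so a variant such as "worried" would not be counted under "worry"; a fuller treatment would need stemming or lemmatisation, which is outside what this example assumes.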

Several practical implications can be drawn from this study. First, the findings of this study could be useful for health care providers, especially physicians, to better understand their patients' concerns regarding cancer. They could learn what kinds of symptoms cause their patients to believe that they may have cancer, what makes their patients hesitate to have tests or treatments, and which situational factors cause their patients to believe they may have cancer. Health care providers could develop materials for health promotion and education, for example, to deliver information that addresses the concerns consumers express in questions. Second, the results highlighted the cyberchondria phenomenon related to health anxieties and the escalation of such anxieties in the social question and answer environment. Prior studies found that cancer patients' stress and anxiety were mitigated when they received personalised messages (Mayer et al., 2007). Social question and answer sites could be an ideal environment for seeking or receiving personalised answers due to their human computation nature; nevertheless, there is still room for developers to think about how to provide users with more personalised answers. For example, systems can provide certain metadata to describe answerers' age ranges or sex to help contextualise the answers. Our findings could also inform the design of general health information search systems. Specifically, an information search system could enable users to specify their cancer information inquiries according to various demographic, cognitive, affective, social, situational, and technical parameters to receive more tailored search results.


This study has a few limitations. First, the current study took a descriptive approach to analysing data, and it may not reveal the latent and potential relationships among the topics in health questions. The current study was, however, useful in facilitating rich data gathering and analysis and could be used to develop follow-up studies examining the distributions and applications of the topics in health information seeking and sharing. Second, the text mining technique that we used could be improved. For example, we could not infer the meaning of numbers and thus we were not able to determine whether a particular number refers to weight, height, age, or cancer stage. Also, unlike queries, which indicate the specific subjects searched, questions mix requests with context: our text mining techniques only allow us to identify what is being expressed and lack the ability to differentiate what is being requested from what constitutes background information. Third, both the original and the augmented layered model of contexts were developed based on questions collected from Yahoo! Answers. Yahoo! Answers is one of the most frequently visited social question and answer sites, but may not represent users' questioning behaviour in all such sites.


The current study analysed a large number of cancer-related questions using the text mining method coupled with a manual review of a subset of questions. The analysis identified 420 terms distributed across six layers, including demographic, cognitive, affective, situational, social, and technical layers. These terms represent topics or issues that askers were concerned about the most and characterize people's question asking in social question and answer environments. The important findings include:

In future studies, we will further develop our text mining techniques by adopting semantic approaches with which to analyse the messages embedded in health questions, and possibly the answers as well, to examine the exchange of information and social support between askers and answerers.

About the authors

Sanghee Oh is an Assistant Professor in the School of Information at Florida State University. She obtained her PhD in Information and Library Science from the University of North Carolina at Chapel Hill, and her Master of Library and Information Science from the University of California at Los Angeles. Her areas of research interest are health information behaviour, health informatics, social informatics, social media use, human-computer interaction, and digital libraries. She can be contacted at shoh@cci.fsu.edu.
Yan Zhang is an Associate Professor in the School of Information at the University of Texas at Austin. She received her PhD from the University of North Carolina at Chapel Hill. Her research focus is information behaviour with emphases on consumer health information search behaviour and consumer health information system design. She can be contacted at yanz@ischool.utexas.edu.
Min Sook Park is a doctoral candidate in the School of Information at Florida State University. She also received her Master's in Library and Information Science from Florida State University. Her research interests lie at the intersection of information behaviour, data science, social informatics, and information organization. She can be contacted at mp11j@my.fsu.edu.

  • Adams, E., Boulton, M. & Watson, E. (2009). The information needs of partners and family members of cancer patients: a systematic literature review. Patient Education and Counseling, 77(2), 179-186.
  • American Cancer Society. (2015). Cancer facts & figures 2015. Retrieved from http://www.cancer.org/acs/groups/content/@editorial/documents/document/acspc-044552.pdf (Archived by WebCite® at http://www.webcitation.org/6jiZxziab)
  • Anand, P., Kunnumakara, A. B., Sundaram, C., Harikumar, K. B., Tharakan, S. T., Lai, O. S., ... & Aggarwal, B. B. (2008). Cancer is a preventable disease that requires major lifestyle changes. Pharmaceutical Research, 25(9), 2097-2116.
  • Belkin, N. J., Oddy, R. N. & Brooks, H. M. (1982). ASK for information retrieval: Part I. background and theory. Journal of Documentation, 38(2), 61-71.
  • Blank, T.O., Schmidt, S.D., Vangsness, S.A., Monteiro, A.K. & Santagata, P.V. (2010). Differences among breast and prostate cancer online support groups. Computers in Human Behavior, 26(6), 1400-1404.
  • Centers for Disease Control and Prevention. (2015). HPV and cancer. Retrieved from http://www.cdc.gov/hpv/cancer.html (Archived by WebCite® at http://www.webcitation.org/6jiZt8Vwa)
  • Chen, A.T. (2012). Exploring online support spaces: using cluster analysis to examine breast cancer, diabetes and fibromyalgia support groups. Patient Education and Counseling, 87(2), 250-257.
  • Claus, E. B., Risch, N. J. & Thompson, W. D. (1990). Age at onset as an indicator of familial risk of breast cancer. American Journal of Epidemiology, 131(6), 961-972.
  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.
  • Eysenbach, G., Powell, J., Englesakis, M., Rizo, C. & Stern, A. (2004). Health related virtual communities and electronic support groups: systematic review of the effects of online peer to peer interactions. British Medical Journal, 328(7449), 1166-1171.
  • Feldman, R. & Dagan, I. (1995). Knowledge discovery in textual databases (KDT). In Usama M. Fayyad & Ramasamy Uthurusamy (Eds.), Proceedings of the First International Conference on Knowledge Discovery and Data Mining (pp. 112-117). Palo Alto, CA: Association for the Advancement of Artificial Intelligence. Retrieved from http://www.aaai.org/Papers/KDD/1995/KDD95-012.pdf (Archived by WebCite® at http://www.webcitation.org/6jiaUyBKh)
  • Freimuth, V. S., Stein, J. A. & Kean, T. J. (1989). Searching for health information: the cancer information service model. Philadelphia, PA: University of Pennsylvania Press.
  • Ginossar, T. (2008). Online participation: a content analysis of differences in utilization of two online cancer communities by men and women, patients and family members. Health Communication, 23(1), 1-12.
  • Gooden, R.J. & Winefield, H.R. (2007). Breast and prostate cancer online discussion boards: a thematic analysis of gender differences and similarities. Journal of Health Psychology, 12(1), 103-114.
  • Graesser, A.C., McMahen, C.L. & Johnson, B.K. (1994). Question asking and answering. In M.A. Gernsbacher (Ed.), Handbook of psycholinguistics, San Diego, CA: Academic Press.
  • Harris, R.E., Zang, E.A., Anderson, J.I. & Wynder, E.L. (1993). Race and sex differences in lung cancer risk associated with cigarette smoking. International Journal of Epidemiology, 22(4), 592-599.
  • Hesse, B. W., Arora, N. K., Burke Beckjord, E. & Rutten, L. J. (2008). Information support for cancer survivors. Cancer, 112(S11), 2529-2540.
  • Hotho, A., Nürnberger, A. & Paaß, G. (2005). A brief survey of text mining. In LDV Forum - GLDV Journal for Computational Linguistics and Language Technology, 20(1), 19-62. Retrieved from http://www.kde.cs.uni-kassel.de/hotho/pub/2005/hotho05TextMining.pdf (Archived by WebCite® at http://www.webcitation.org/6jia5nMzS)
  • Hu, X. & Liu, H. (2012). Text analytics in social media. In C.C. Aggarwal & C. Zhai (Eds.), Mining Text Data (pp. 385-414). New York, NY: Springer.
  • Johnson, J. D. (1997). Cancer-related information seeking. Chicago, IL: Hampton Press.
  • Kim, S., Pinkerton, T. & Ganesh, N. (2011). Assessment of H1N1 questions and answers posted on the Web. American Journal of Infection Control, 40(3), 211-217.
  • Klemm, P., Reppert, K. & Visich, L. (1998). A nontraditional cancer support group: the Internet. Computers in Nursing, 16(1), 31-36.
  • Landis, J.R. & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
  • Lenz, E. R. (1984). Information seeking: a component of client decisions and health behavior. Advances in Nursing Science, 6(3), 59-72.
  • Mayer, D. K., Terrin, N. C., Kreps, G. L., Menon, U., McCance, K., Parsons, S. K. & Mooney, K. H. (2007). Cancer survivors information seeking behaviors: a comparison of survivors who do and do not seek information about cancer. Patient Education and Counseling, 65(3), 342-350.
  • Nambisan, P. (2011). Information seeking and social support in online health communities: impact on patients' perceived empathy. Journal of the American Medical Informatics Association, 18(3), 298-304.
  • Oh, J. S., He, D., Jeng, W., Mattern, E. & Bowler, L. (2013). Linguistic characteristics of eating disorder questions on Yahoo! Answers–content, style, and emotion. Proceedings of the American Society for Information Science and Technology, 50(1), 1-10.
  • Quantcast. (2016). Answers.Yahoo.com. Retrieved June 7, 2016 from https://www.quantcast.com/answers.yahoo.com (Archived by WebCite® at http://www.webcitation.org/6jiaAXHL5)
  • Rozmovits, L. & Ziebland, S. (2004). What do patients with prostate or breast cancer want from an internet site? A qualitative study of information needs. Patient Education and Counseling, 53, 57-64.
  • Rubenstein, E. L. (2015). They are always there for me: the convergence of social support and information in an online breast cancer community. Journal of the Association for Information Science and Technology, 66(7), 1418-1430.
  • Rutten, L.J.F., Arora, N.K., Bakos, A.D., Aziz, N. & Rowland, J. (2005). Information needs and sources of information among cancer patients: a systematic review of research (1980-2003). Patient Education and Counseling, 57(3), 250-261.
  • Rutten, L.J.F., Squiers, L. & Treiman, K. (2006). Requests for information by family and friends of cancer patients calling the National Cancer Institute's Cancer Information Service. Psycho-Oncology, 15(8), 664-672.
  • Saracevic, T. (1997). The stratified model of information retrieval interaction: extension and applications. Proceedings of the Annual Meeting-American Society for Information Science, 34, 313-327.
  • Shim, M., Kelly, B. & Hornik, R. (2006). Cancer information scanning and seeking behavior is associated with knowledge, lifestyle choices, and screening. Journal of Health Communication, 11(S1), 157-172.
  • Slaughter, L.A., Soergel, D. & Rindflesch, T.C. (2006). Semantic representation of consumer questions and physician answers. International Journal of Medical Informatics, 75(7), 513-529.
  • Smith, C. A., Stavri, P. Z. & Chapman, W. W. (2002). In their own words? A terminological analysis of e-mail to a cancer information service. In Proceedings of American Medical Informatics Association (AMIA) Symposium (pp. 697-701). Bethesda, MD: American Medical Informatics Association.
  • U.S. National Library of Medicine. (2015). MedlinePlus: health topics. Retrieved from http://www.nlm.nih.gov/medlineplus/healthtopics.html (Archived by WebCite® at http://www.webcitation.org/6jiaMYFMA)
  • van Uden-Kraan, C. F., Drossaert, C. H. C., Taal, E., Shaw, B. R., Seydel, E. R. & van de Laar, M. A. F. J. (2008). Empowering processes and outcomes of participation in online support groups for patients with breast cancer, arthritis, or fibromyalgia. Qualitative Health Research, 18(3), 405-417.
  • Weinberg, N. & Schmale, J. (1996). Online help: Cancer patients participate in a computer-mediated support group. Health & Social Work, 21(1), 24-29.
  • Westbrook, L. (2015). Intimate partner violence online: Expectations and agency in question and answer websites. Journal of the Association for Information Science and Technology, 66(3), 599-615.
  • White, M.D. (2000). Questioning behavior on a consumer health electronic list. Library Quarterly, 70(3), 302-334. doi: 10.1086/603195
  • White, R.W. & Horvitz, E. (2009). Cyberchondria: studies of the escalation of medical concerns in Web search. ACM Transactions on Information Systems, 27(4), Article No. 23.
  • Wildemuth, B. M., de Bliek, R., Friedman, C. P. & Miya, T. S. (1994). Information-seeking behaviors of medical students: a classification of questions asked of librarians and physicians. Bulletin of the Medical Library Association, 82(3), 295-304.
  • Wilson, T. D. (1999). Models in information behaviour research. Journal of Documentation, 55(3), 249-270.
  • Yetisgen-Yildiz, M. & Pratt, W. (2006). Using statistical and knowledge-based approaches for literature-based discovery. Journal of Biomedical Informatics, 39(6), 600-611.
  • Zhang, Y. (2010). Contextualizing consumer health information searching: an analysis of questions in a social Q&A community. Proceedings of the ACM International Health Informatics Symposium (pp. 210-219). New York, NY: ACM.
  • Zhang, Y. (2013). Toward a layered model of context for health information searching: an analysis of consumer‐generated questions. Journal of the American Society for Information Science and Technology, 64(6), 1158-1172.
How to cite this paper

Oh, S., Zhang, Y. & Park, M. S. (2016). Cancer information seeking in social question and answer services: identifying health-related topics in cancer questions on Yahoo! Answers. Information Research, 21(3), paper 718. Retrieved from http://InformationR.net/ir/21-3/paper713.html (Archived by WebCite® at http://www.webcitation.org/6kRg52QGP)
