RegEx and News APIs

It's been a long time since my last blog post, although I've kept updating this one, /2017/10/12/kibana-dashboard-for-amazon-food-reviews/, most recently yesterday in the "Troubleshooting" section.

Lately I've been working on a variety of models, ranging from LSTMs for time-series forecasting on disease data (screenshot of a batch with 5 weeks as inputs) to text classification.

But what I thought deserved a short post is something less sexy but important and powerful: regular expressions.

Apply regex to news feeds
I found a service called Webhose.io that offers a free tier limited to 1,000 API requests per month.

The API Playground is easy to use and comes with pre-made queries, which I hacked to do what I needed: retrieve news articles on antimicrobial-resistant bacteria. I changed peace to superbug and 1001 to 10.

import requests
url = 'http://webhose.io/filterWebContent?token=yourtokenhere&format=json&ts=1524849790259&sort=crawled&q=superbug%20social.facebook.likes%3A%3E10'
response = requests.get(url)
webhose = response.json()
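
The query string in that URL is percent-encoded by hand. As a minimal sketch (written for Python 3, and assuming only the endpoint and parameter names that appear in the URL above), the standard library can build it from readable values. build_query_url is a hypothetical helper name, and yourtokenhere is still a placeholder.

```python
from urllib.parse import urlencode, quote

BASE_URL = "http://webhose.io/filterWebContent"  # endpoint copied from the URL above

def build_query_url(token, query, sort="crawled", ts=None):
    """Assemble the request URL, percent-encoding each parameter value."""
    params = [("token", token), ("format", "json")]
    if ts is not None:
        params.append(("ts", ts))
    params += [("sort", sort), ("q", query)]
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use +)
    return BASE_URL + "?" + urlencode(params, quote_via=quote)

url = build_query_url("yourtokenhere",
                      "superbug social.facebook.likes:>10",
                      ts=1524849790259)
print(url)
```

Keeping the search term readable makes it easier to swap superbug for another keyword later.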

Here's a sample of the text returned by a query run at exactly 3:00 PM EST on May 22, 2018.

{u'moreResultsAvailable': 0,
u'next': u'/filterWebContent?token=ebcab9ce-f298-4d36-8578-b2a63df35c8f&format=json&ts=1527010168003&q=superbug+social.facebook.likes%3A%3E10&sort=crawled',
u'posts': [{u'author': u'The Conversation',
u'crawled': u'2018-04-28T03:11:06.041+03:00',
u'entities': {u'locations': [{u'name': u'gaza', u'sentiment': u'none'},
{u'name': u'egypt', u'sentiment': u'none'},
{u'name': u'israel', u'sentiment': u'none'}],
u'organizations': [],
u'persons': [{u'name': u'alexander the great', u'sentiment': u'none'}]},
u'external_links': [u'https://www.researchgate.net/publication/321692502_Urban_Warfare_Ecology_A_Study_of_Water_Supply_in_Basrah',
u'https://thenextweb.com/syndication/2017/07/05/desalination-nation-israel-helping-world-fight-water-shortage/',
u'http://xmlns.com/foaf/0.1/',
u'http://s.opencalais.com/1/type/cat/',
u'http://rdfs.org/sioc/types#',
u'https://msf-analysis.org/conflict-medicine-manifesto/',
u'https://duckduckgo.com/?q=PNIPH+A+Systematic+Literature+Review+and+Recommendations+on+Water+Usage+in+the+Gaza+Strip&t=ffsb&ia=web',
u'http://s.opencalais.com/1/type/sys/',
u'https://imed.pub/ojs/index.php/IAJAA/article/view/2135',
u'https://www.sup.org/books/title/?id=22595',
u'http://s.opencalais.com/1/pred/',
u'http://purl.org/rss/1.0/modules/content/',
u'http://s.opencalais.com/1/type/er/',
u'https://www.hurstpublishers.com/first-world-war-gaza-battle-palestine/',
u'http://s.opencalais.com/1/type/em/r/',
u'http://s.opencalais.com/1/type/lid/',
u'http://s.opencalais.com/1/type/em/e/',
u'http://s.opencalais.com/1/type/er/Geo/',
u'https://theconversation.com/to-defeat-superbugs-everyone-will-need-access-to-clean-water-95202',
u'https://www.counterpunch.org/2017/04/28/middle-eastern-surgeon-speaks-about-ecology-of-war/',
u'http://rdfs.org/sioc/ns#',
u'http://purl.org/dc/terms/',
u'http://s.opencalais.com/1/linkeddata/pred/'],
u'highlightText': u'',
u'highlightTitle': u'',
u'language': u'english',
u'ord_in_thread': 0,
u'published': u'2018-04-28T02:38:00.000+03:00',
u'rating': None,
u'text': u"No One Can Escape the Toxic 'Biosphere of War' in Gaza The environmental effects of war can be felt for years. Comments\nGaza has often been invaded for its water. Every army leaving or entering the Sinai desert, whether Babylonians, Alexander the Great, the Ottomans, or the British , has sought relief there. But today the water of Gaza highlights a toxic situation that is spiralling out of control.\nA combination of repeated Israeli attacks and the sealing of its borders by Israel and Egypt, have left the territory unable to process its water or waste. Every drop of water swallowed in Gaza, like every toilet flushed or antibiotic imbibed, returns to the environment in a degraded state.\nWhen a hospital toilet is flushed, for instance, it seeps untreated through the sand into the aquifer. There it joins water laced with pesticides from farms, heavy metals from industry, and salt from the ocean. It is then pumped back up by municipal or private wells, joined with a small fraction of freshwater purchased from Israel, and cycled back into people\u2019s taps. This results in widespread contamination and undrinkable drinking water , about 90% of which exceeds the World Health Organisation (WHO) guidelines for salinity and chloride. AlterNet\nIncredibly, conditions are getting worse, thanks to the emergence of \u201csuperbugs\u201d. These multi-drug resistant organisms have developed thanks to an over-prescription of antibiotics by doctors desperate to treat the victims of the seemingly endless assaults. The more injury there is, the more chance there is of re-injury. Less regular access to clean water means infections will spread faster , bugs will be stronger, more antibiotics will be prescribed \u2013 and the victims will be ever-more weakened.\nThe result is what has been termed a toxic ecology or \u201c biosphere of war \u201d, of which the noxious water cycle is just one part. 
A biosphere refers to the interaction of all living things with the natural resources that sustain them. The point is that sanctions, blockades and a permanent state of war affects everything that humans might require in order to thrive, as water becomes contaminated, air is polluted, soil loses its fertility and livestock succumb to diseases . People in Gaza who may have evaded bombs or sniper fire have no escape from the biosphere.\nWar surgeons, health anthropologists and water engineers \u2013 including ourselves \u2013 have observed this situation developing wherever protracted armed conflict or economic sanctions grind on, as with water systems in Basrah and health systems throughout Iraq or Syria . It\u2019s now well past time to clean it up.\nThere is water \u2013 for some\nIt\u2019s not as if there is no fresh water nearby to alleviate the situation in Gaza. Just a few hundred metres from the border are Israeli farms that use freshwater pumped from Lake Tiberias (the Sea of Galilee) to grow herbs destined for European supermarkets. As the lake is around 200km to the north and lies 200 metres below sea level, a massive amount of energy is used to pump all that water. The lake water is also fiercely contested by Lebanon, Jordan, Syria and Palestinians in the West Bank, each of which is seeking their legal entitlement of the Jordan River basin .\nMeanwhile, Israel desalinates so much seawater these days that its municipalities are turning it down. Excess desalinated water is being used to irrigate crops, and the country\u2019s water authority is even planning to use it to refill Tiberias itself \u2013 a bizarre and irrational cycle, considering the lake water continues to be pumped the other direction into the desert. There is now so much manufactured water that some Israeli engineers can declare that \u201ctoday, no one in Israel experiences water scarcity\u201d.\nBut the same cannot be said for Palestinians, especially not those in Gaza. 
People there have resorted to various ingenious filters, boilers, or under-the-sink or neighbourhood-level desalination units to treat their water. But these sources are unregulated, often full of germs, and just another reason children are prescribed antibiotics \u2013 thus continuing the pattern of injury and re-injury. Doctors, nurses, and water maintenance crews meanwhile try to do the impossible with the minimal medical equipment at their disposal.\nThe implications for all those who invest in Gaza\u2019s repeatedly destroyed water and health projects are clear. Providing more ambulances or water tankers \u2013 the \u201ctruck and chuck\u201d strategy \u2013 might work when conflicts are at their most acute, but they are never more than a band aid. Yes, things will get better in the short term, but soon enough Gaza will be onto the next generation of antibiotics, and dealing with teflon-coated superbugs.\nDonors must instead design programmes suited to the all-pervasive and incessant biosphere of war. This means training many more doctors and nurses, providing more medicines, and infrastructure support for health and water services. More importantly, donors should build-in political \u201ccover\u201d to protect their investments (if not the local children), perhaps by calling for those who destroy the infrastructure to foot the bill for repairs.\nAnd there is an even bigger message for the rest of us. Our research shows that war is more than simply armies and geopolitics \u2013 it extends across entire ecosystems. If the dehumanising ideology behind the conflict was confronted, and if excess water was diverted to people rather than to lakes, then the easily avoidable repeated injuries suffered by people
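
The response is a nested dictionary, so the article fields can be pulled out by key. Here's a small sketch using a hand-made stand-in with the same keys as the sample above (the stand-in is heavily abbreviated, and summarize_posts is a hypothetical helper):

```python
def summarize_posts(response):
    """Return (author, published date, first 60 characters of text) per post."""
    rows = []
    for post in response.get("posts", []):
        published = post.get("published", "")[:10]  # keep only YYYY-MM-DD
        rows.append((post.get("author"), published, post.get("text", "")[:60]))
    return rows

# Abbreviated stand-in for the dictionary returned by response.json()
sample = {
    "moreResultsAvailable": 0,
    "posts": [{
        "author": "The Conversation",
        "published": "2018-04-28T02:38:00.000+03:00",
        "text": "No One Can Escape the Toxic 'Biosphere of War' in Gaza",
    }],
}

for row in summarize_posts(sample):
    print(row)
```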

Regex101
Now we'll apply some regex rules. I'm sure everyone has seen this site at one point or another, but it's one of the best out there, and people also use it to share their "lessons" with each other.

regex1 = r"hosp[a-zA-Z]+"          # hospital, hospitals, hospitalized
regex2 = r"2\d+-\d+-"              # 2018-04-
regex3 = r"countr+\w+"             # country, countries
# the whole sentence around 'hosp...': everything before and after it, up to the period
regex4 = r"([^.]*?hosp[^.]*\.)"

In the screenshot below, I'm searching for sentences that contain variations on the word 'hospital'.
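
To try the four patterns without the website, here's a quick sketch that runs them over a short made-up passage (the sentences are adapted from the article text above, not actual API output):

```python
import re

regex1 = r"hosp[a-zA-Z]+"
regex2 = r"2\d+-\d+-"
regex3 = r"countr+\w+"
regex4 = r"([^.]*?hosp[^.]*\.)"

text = ("When a hospital toilet is flushed, it seeps into the aquifer. "
        "Almost half of them have been hospitalized. "
        "Published 2018-04-28 in this country.")

words = re.findall(regex1, text, flags=re.IGNORECASE)
dates = re.findall(regex2, text)
places = re.findall(regex3, text, flags=re.IGNORECASE)
sentences = re.findall(regex4, text, flags=re.IGNORECASE)

print(words)      # ['hospital', 'hospitalized']
print(dates)      # ['2018-04-']
print(places)     # ['country']
print(sentences)  # the two sentences that mention hosp..., the second with a leading space
```

Note that regex4 treats every period as a sentence boundary, so abbreviations and decimals will cut a match short.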

With the JSON saved to a text file, read it back in as a single string:

with open('/Users/catherineordun/Documents/Data/webhose.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')
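
The snippet above only covers the read; the write step isn't shown. One way to do it, sketched for Python 3 with json.dump, is below. The temporary path stands in for the real file location, and the webhose dictionary is an abbreviated stand-in for the API response.

```python
import json
import os
import tempfile

# Abbreviated stand-in for the dictionary returned by response.json()
webhose = {"moreResultsAvailable": 0,
           "posts": [{"author": "The Conversation", "language": "english"}]}

path = os.path.join(tempfile.mkdtemp(), "webhose.txt")  # hypothetical location

# Write the parsed response out as JSON text
with open(path, "w") as outfile:
    json.dump(webhose, outfile)

# Read it back as one long string, dropping any newlines
with open(path, "r") as myfile:
    data = myfile.read().replace("\n", "")

print(data[:40])
```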

Reading it back in looks like this:

'{"moreResultsAvailable": 0, "totalResults": 47, "posts": [{"entities": {"persons": [{"name": "alexander the great", "sentiment": "none"}], "locations": [{"name": "gaza", "sentiment": "none"}, {"name": "egypt", "sentiment": "none"}, {"name": "israel", "sentiment": "none"}], "organizations": []}, "rating": null, "uuid": "804231520b8db9b0310d67e0045f2c63b5c9985d", "thread": {"social": {"gplus": {"shares": 0}, "pinterest": {"shares": 1}, "vk": {"shares": 0}, "linkedin": {"shares": 0}, "facebook": {"likes": 18, "comments": 0, "shares": 18}, "stumbledupon": {"shares": 0}}, "site_full": "www.alternet.org", "main_image": "https://www.alternet.org/sites/default/files/story_images/screen_shot_2018-04-27_at_7.43.23_pm.png", "site_section": "http://feeds.feedblitz.com/~/15426773/1ja5da/alternet~Sharron-Angle-Rape-Incest-Part-of-Gods-Plan-Opposes-Abortion-No-Matter-What", "section_title": "AlterNet.org Main RSS Feed", "url": "http://omgili.com/ri/.wHSUbtEfZR.pw9.fMbs5AVG2lziPjh1vlzG1IfB3F_v_199id2tOGI29D35hm6jvz_lTFB9VLsFK.sFevQEWEib4211EEE2wtGmuH3_RWyp8duGgv2j9w--", "country": "US", "domain_rank": 5130, "title": "No One Can Escape the Toxic 'Biosphere of War' in Gaza", "performance_score": 0, "site": "alternet.org", "site_categories": ["media"], "participants_count": 1, "title_full": "No One Can Escape the Toxic 'Biosphere of War' in Gaza", "spam_score": 0.0, "site_type": "blogs", "published": "2018-04-28T02:38:00.000+03:00", "replies_count": 0, "uuid": "804231520b8db9b0310d67e0045f2c63b5c9985d"}, "author": "The Conversation", "url": "http://omgili.com/ri/.wHSUbtEfZR.pw9.fMbs5AVG2lziPjh1vlzG1IfB3F_v_199id2tOGI29D35hm6jvz_lTFB9VLsFK.sFevQEWEib4211EEE2wtGmuH3_RWyp8duGgv2j9w--", "ord_in_thread": 0, "title": "No One Can Escape the Toxic 'Biosphere of War' in Gaza", "highlightText": "", "language": "english", "text": "No One Can Escape the Toxic 'Biosphere of War' in Gaza The environmental effects of war can be felt for years. Comments\nGaza has often been invaded for its water. 
Every army leaving or entering the Sinai desert, whether Babylonians, Alexander the Great, the Ottomans, or the British , has sought relief there. But today the water of Gaza highlights a toxic situation that is spiralling out of control.\nA combination of repeated Israeli attacks and the sealing of its borders by Israel and Egypt, have left the territory unable to process its water or waste. Every drop of water swallowed in Gaza, like every toilet flushed or antibiotic imbibed, returns to the environment in a degraded state.\nWhen a hospital toilet is flushed, for instance, it seeps untreated through the sand into the aquifer. There it joins water laced with pesticides from farms, heavy metals from industry, and salt from the ocean. It is then pumped back up by municipal or private wells, joined with a small fraction of freshwater purchased from Israel, and cycled back into people\u2019s taps. This results in widespread contamination and undrinkable drinking water , about 90% of which exceeds the World Health Organisation (WHO) guidelines for salinity and chloride.

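Use BeautifulSoup to strip any markup, then drop the non-ASCII characters and convert the result into a string.

from bs4 import BeautifulSoup
# get_text() removes any HTML tags from the raw text
cln = BeautifulSoup(data, "lxml").get_text()
# 'ignore' drops non-ASCII characters; decode() returns text so the regexes also work on Python 3
cln_string = cln.encode('ascii', 'ignore').decode('ascii')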
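
To see what the encode step does on its own, here's a small sketch (Python 3, standard library only, with made-up sample strings). Characters outside ASCII are silently dropped, while escape sequences that sit in the file as literal text, such as \u2019, are themselves ASCII and pass through untouched.

```python
def to_ascii(text):
    """Drop every non-ASCII character and return an ordinary text string."""
    return text.encode("ascii", "ignore").decode("ascii")

curly = "people\u2019s taps"     # contains a real right single quotation mark
escaped = "people\\u2019s taps"  # contains the six literal characters \u2019

print(to_ascii(curly))    # peoples taps
print(to_ascii(escaped))  # people\u2019s taps
```

That second case is why sequences like \u2018 can still show up in matches taken from the raw file.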

Join together the first three regexes:

import re

def matchreturn(text):
    # Combine the first three patterns into one alternation and return every match
    pattern = "|".join([regex1, regex2, regex3])
    return re.findall(pattern, text, flags=re.IGNORECASE)

matches = matchreturn(cln_string)

Here's a sample of the list:

['2018-04-',
'country',
'2018-04-',
'hospital',
'country',
'2018-04-',
'2018-04-',
'country',
'2018-04-',
'hospitalized',
'countries',
'2018-04-',
'2018-04-',
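
A flat list like this is easier to read once it's tallied. Here's a quick sketch with collections.Counter, lowercasing first since the search ignores case (the list is just the sample shown above):

```python
from collections import Counter

matches = ['2018-04-', 'country', '2018-04-', 'hospital', 'country',
           '2018-04-', '2018-04-', 'country', '2018-04-', 'hospitalized',
           'countries', '2018-04-', '2018-04-']

# Normalize case, then count how often each match occurs
counts = Counter(m.lower() for m in matches)
for term, n in counts.most_common():
    print(term, n)
```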

Apply the fourth regex that looked for our sentences containing variations on the word 'hospital'.

matches2 = re.findall(regex4, cln_string, flags=re.IGNORECASE)

Sample of matches2 output:

['\nWhen a hospital toilet is flushed, for instance, it seeps untreated through the sand into the aquifer.',
' As of Friday, 98 people have been infected in 22 states; almost half of them have been hospitalized, and 10 have developed kidney failure.',
' Melissa Whiteley, an 18-year-old engineering student from Hanford in Stoke-on-Trent, fell ill at Christmas and died in hospital a month later.',
' REX 12/50 Malnutrition deaths in hospitals in England and Wales at highest level for a decade\nThe number of people dying in hospital as a result of malnutrition has hit its highest level for a decade, figures from the Office for National Statistics (ONS) show.',
' Getty Images 27/50 Ketamine helps patients with severe depression \u2018when nothing else works\u2019 doctors say\nKetamine helps patients with severe depression \u2018when nothing else works\u2019 doctors say 28/50 Playing Tetris in hospital after a traumatic incident could prevent PTSD\nScientists conducted the research on 71 car crash victims as they were waiting for treatment at one hospital\u2019s accident and emergency department.',
'\nUnderstanding just how bacteria do this can help in the design of new and better antibiotics, Dantas said, and can help clean up some of the problems that help lead to antibiotic resistance in the first place, such as spills from factories that make the drugs; waste from farms where animals are fed antibiotics to make them grow; and hospital sewage.',
' They can thrive and spread in hospitals and in the community.',
' Related Superbugs lurk in hospital plumbing\nAnd understanding the mechanisms can help drug designers stay a step ahead of bacteria that constantly mutate and evolve new ways to resist the effects of antibiotics.',
'\n\"One of the big problems of antibiotic use, whether it be in hospitals or in agriculture or aquaculture, is that once you use the antibiotic, they stick around.',
'\n\"Bacteria have their own relationship with antibiotics, completely independent of how we use them in hospitals,\" she said.',
' So now we find it in in many of the most important pathogens we face in hospitals.',
'kHjI01nzvpK2EDgR7c7yjYA--", "country": "US", "domain_rank": 2350, "title": "How smaller hospitals can effectively reduce antibiotic overuse", "performance_score": 0, "site": "sciencedaily.',
'com", "site_categories": ["non_standard_content", "adult"], "participants_count": 0, "title_full": "How smaller hospitals can effectively reduce antibiotic overuse", "spam_score": 0.',
'kHjI01nzvpK2EDgR7c7yjYA--", "ord_in_thread": 0, "title": "How smaller hospitals can effectively reduce antibiotic overuse", "highlightText": "", "language": "english", "text": "Follow all of ScienceDaily's latest research news and top science headlines ! Science News How smaller hospitals can effectively reduce antibiotic overuse Date: Intermountain Medical Center Summary: Researchers completed a study identifying how community hospitals with fewer than 200 beds can develop antibiotic stewardship programs that work to prevent the growth of superbugs.',
' Share: FULL STORY Researchers at Intermountain Healthcare and University of Utah Health in Salt Lake City have completed a study identifying how community hospitals with fewer than 200 beds can develop antibiotic stewardship programs that work to prevent the growth of antibiotic-resistant organisms, or \"superbugs,\" which are becoming more common and deadly.',
' advertisement\nFor the 15 month-study, researchers compared the impact of three types of antibiotic stewardship programs in 15 small hospitals within the Intermountain Healthcare system.',
' They found the most effective program used infectious disease physicians and pharmacists at a central hospital working with local pharmacists to reduce broad-spectrum antibiotic use by nearly 25 percent and total antibiotic use by 11 percent.',
'\nAll hospitals, no matter how large or small, need antibiotic stewardship programs to help physicians use antibiotics optimally and prevent the growth of antibiotic-resistant organisms.',
" Until now, it's been unclear how small community and rural hospitals could build such programs to effectively reduce antibiotic use.",
' hospitals regardless of their size.',
' Antibiotics are also responsible for many side effects in patients in the hospital, including Clostridium difficile , or C diff .',
'\nHospitals across the country are required by The Joint Commission to implement antibiotic stewardship programs to improve antibiotic prescribing in hospitals, since experts estimate 30 to 50 percent of prescribed antibiotics could be used more effectively -- or are unnecessary.',
'\n\"The challenge has been knowing how these programs can be implemented in small hospitals, where, historically, they've been absent, even though antibiotic use rates in small hospitals are very similar to large hospitals, where the programs are typically found,\" he added.',
'\nWhile many smaller hospitals have lacked the resources to build a formal antibiotic stewardship program, researchers determined that using a centralized infectious disease support program decreased overall antibiotic use and the overuse of most broad-spectrum drugs, which are used to target a wide range of bacteria that cause diseases.',
'\nPrior to the study, each of the participating hospitals lacked antibiotic stewardship programs.', ...
' Each hospital was randomly assigned to one of three types of programs to determine which was most effective in reducing broad-spectrum antibiotic use:\nProgram 1: Implemented basic education to physicians and staff on antibiotic stewardship programs Provided a 24/7 infectious disease hotline staffed by infectious disease specialists\nProgram 2: Provided more advanced antibiotic stewardship education Provided a 24/7 infectious disease hotline staffed by infectious disease specialists Implemented a pharmacy-based initiative in which local pharmacists reviewed use of broad-spectrum antibiotics and provided recommendations for improvement to prescribers Certain broad-spectrum antibiotics were restricted and only local pharmacy staff could approve their use\nProgram 3: Provided more advanced antibiotic stewardship education Provided a 24/7 infectious disease hotline staffed by infectious disease specialists Implemented a pharmacy-based initiative in which local pharmacists reviewed most antibiotic prescriptions and provided recommendations for improvement to prescribers Certain broad-spectrum antibiotics were restricted and only centralized infectious diseases pharmacists could approve their use Infectious disease specialists reviewed selected microbiology results and spoke with local providers about recommendations for treatment\n\"For the first time, all of the participating hospitals had access to infectious diseases physicians via a hotline,\" said Dr.']
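
Several of the matches above are JSON metadata rather than sentences (the entries containing "domain_rank" or "site_categories"), because the pattern ran over the raw file contents. One possible refinement, sketched here under the assumption that the response is still available as a parsed dictionary, is to apply regex4 to each post's text field only. The posts below are abbreviated stand-ins, and hospital_sentences is a hypothetical helper.

```python
import re

regex4 = r"([^.]*?hosp[^.]*\.)"

def hospital_sentences(response):
    """Collect sentences mentioning hosp... from the article text of each post."""
    found = []
    for post in response.get("posts", []):
        text = post.get("text", "")
        for sentence in re.findall(regex4, text, flags=re.IGNORECASE):
            found.append(sentence.strip())  # trim the leading space or newline
    return found

# Abbreviated stand-in for the parsed API response
sample = {"posts": [
    {"title": "How smaller hospitals can effectively reduce antibiotic overuse",
     "text": "They can thrive and spread in hospitals and in the community. "
             "Bugs will be stronger."},
    {"title": "No One Can Escape the Toxic 'Biosphere of War' in Gaza",
     "text": "Gaza has often been invaded for its water.\n"
             "When a hospital toilet is flushed, it seeps untreated into the aquifer."},
]}

for sentence in hospital_sentences(sample):
    print(sentence)
```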

Generate a word cloud from the corpus of text about hospitalizations dealing with superbugs:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,3))

#make a corpus that the vectorizer can read as a list
corpus = matches2
matrix = vectorizer.fit_transform(corpus)

Return the IDF values:

idf = vectorizer.idf_

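# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
terms = dict(zip(vectorizer.get_feature_names(), idf))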
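
For intuition about what idf_ contains: with its default settings, TfidfVectorizer uses a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is how many of them contain the term. The sketch below reimplements that formula with the standard library on a made-up three-sentence corpus. It uses single words and naive whitespace splitting, so the numbers are illustrative and won't match the vectorizer's own tokenizer and n-grams.

```python
import math

def smoothed_idf(corpus):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1."""
    n = len(corpus)
    doc_freq = {}
    for doc in corpus:
        # set() so a term counts once per document
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}

corpus = ["hospitals need stewardship programs",
          "small hospitals lacked programs",
          "superbugs lurk in hospitals"]

idf_by_term = smoothed_idf(corpus)
for term in ("hospitals", "programs", "superbugs"):
    print(term, round(idf_by_term[term], 3))
```

Because rarer terms get larger values, a cloud drawn from these weights emphasizes the least common words and phrases in the matched sentences, not the most frequent ones.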

Time for the word cloud, drum roll... https://github.com/amueller/word_cloud

from wordcloud import WordCloud

# Initialize the word cloud

wc = WordCloud(
    background_color="white",
    max_words=2000,
    width=1024,
    height=720,
    stopwords=stopwords.words("english")
)

# Generate the cloud

wc.generate_from_frequencies(terms)
wc.to_file("word_cloud_hospitals.png")
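
One caveat: according to the wordcloud documentation, the stopwords argument is ignored when the cloud is built with generate_from_frequencies, so common words have to be removed from the dictionary beforehand. Here's a minimal sketch of that filtering step, using a tiny hand-picked stop list in place of NLTK's and made-up weights (drop_stopwords is a hypothetical helper):

```python
def drop_stopwords(frequencies, stop_words):
    """Remove any term (single word or n-gram) that contains a stop word."""
    stop_words = set(stop_words)
    return {term: weight for term, weight in frequencies.items()
            if not any(word in stop_words for word in term.split())}

# Made-up weights standing in for the terms dictionary
terms = {"hospital": 1.0, "in the": 2.1, "antibiotic stewardship": 1.7, "the": 1.2}
filtered = drop_stopwords(terms, ["the", "in", "of"])
print(filtered)
```

The filtered dictionary can then be passed to wc.generate_from_frequencies in place of terms.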