NLP and the CIA World Factbook - Earthling Interactive

Each entry in the CIA World Factbook has a multitude of factoids as well as a continent (or sub-continent) name and description. It was the availability of a description that interested me: I wanted to use it as the corpus in a Natural Language Processing (NLP) analysis to determine whether, using only the description, a K-Means clustering algorithm would put the countries into their geographic regions*.

There are 242 countries and territories covered by the CIA World Factbook. They are broken down into ten geographic regions (africa, australia-oceania, central-america-n-caribbean, central-asia, east-n-southeast-asia, europe, middle-east, north-america, south-america, south-asia). The countries are not evenly distributed by region, with africa having the most (55) and central-asia having the fewest (6).

Each country has a file in JSON format that was obtained from the GitHub repo. I wrote the script below in Python to parse the relevant fields from the JSON files and create a CSV file for input to the NLP analysis.

# package imports

import json
import os
import fnmatch
import csv

# set constant variable values
rootPath = r'/home/pitt/nlp_analysis'
pattern = '*.json'

#ingest the country codes and names file to eliminate non-country values
countrycode = []
with open('/home/pitt/nlp_analysis/countrylist.txt') as countrylist:
    for line in countrylist:
        countrycode.append(line.split(",")[0])


# create the header row for the list
data = [['countrycode','continent','countryname','description']]

# traverse the directory structure and create rows for data file
for root, dirs, files in os.walk(rootPath):
    for filename in fnmatch.filter(files, pattern):
        completefilename = os.path.join(root, filename)
        filenameroot = filename.split(".")[0]
        if filenameroot in countrycode:
            with open(completefilename) as factbookfile:
                jsonfile = json.load(factbookfile)
                # the long form is missing for some entries, so fall back to 'none'
                try:
                    countrylongname = jsonfile['Government'] \
                                              ['Country name'] \
                                              ['conventional long form'] \
                                              ['text'].encode('utf-8')
                except KeyError:
                    countrylongname = 'none'
                countryshortname = jsonfile['Government'] \
                                           ['Country name'] \
                                           ['conventional short form'] \
                                           ['text'].encode('utf-8')
                # prefer the short name unless the Factbook lists it as 'none'
                if countryshortname == 'none':
                    countryname = countrylongname
                else:
                    countryname = countryshortname
                descr = jsonfile['Introduction'] \
                                ['Background'] \
                                ['text'].encode('utf-8')
                countryrow = []
                countryrow.append(filenameroot)
                countryrow.append(root.split('/')[-1])
                countryrow.append(countryname)
                countryrow.append(descr)
            data.append(countryrow)
        else:
            pass

# write out the formatted file with the header and data rows
with open('countries.csv', 'w') as csvfile:
    datawriter = csv.writer(csvfile)
    for row in data:
        datawriter.writerow(row)
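
The chained lookups in the script above fail with a KeyError as soon as one level of the JSON is missing. Here is a minimal sketch of a helper that walks a chain of keys and falls back to a default instead. The helper name `dig` and the sample record are hypothetical (they are not part of the original script), and the sketch is written for Python 3 rather than the Python 2 used above.

```python
def dig(record, keys, default='none'):
    """Follow a chain of keys through nested dicts; return default if any key is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical record shaped like the Factbook JSON fields used in the script.
# The 'conventional long form' entry is left out on purpose.
sample = {
    'Government': {
        'Country name': {
            'conventional short form': {'text': 'Ireland'},
        }
    }
}

name_path = ['Government', 'Country name']
short_name = dig(sample, name_path + ['conventional short form', 'text'])
long_name = dig(sample, name_path + ['conventional long form', 'text'])

# Same preference as the script: use the short name unless it is 'none'.
country_name = long_name if short_name == 'none' else short_name
print(country_name)
```

With a helper like this, the try/except around the long-form lookup would no longer be needed, and a missing short form or description would not stop the whole run.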

The resulting csv file has four columns (countrycode,continent,countryname,description). In this case the continent is the region assigned to it by the CIA and not the classical definition of one of the six populated continents. I originally did the analysis in a Jupyter notebook, which is reproduced below as separate steps. For the sake of clarity I am printing out the full description and tokenized values from my father’s homeland, Ireland.

# Package imports

import string
import collections

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


# Read data in from local file

from csv import reader

with open('/home/pitt/nlp_analysis/countries.csv', 'rb') as csvFile:
    next(csvFile)
    csvReader = reader(csvFile, delimiter=',', quotechar='"')
    data = list(csvReader)


# Needed to download the NLTK data (tokenizer models and stopword lists) the first time
# import nltk; nltk.download()

# Set constants

PUNCTUATION_NUMBERS = set(string.punctuation + '0123456789')
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()


# Function to tokenize the country description

def tokenize(text):
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if not letter in PUNCTUATION_NUMBERS])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if not w in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]


# Get the number of records in the data file

print 'Number of rows in the countries.csv file is: %s' % len(data)

Number of rows in the countries.csv file is: 242

Normally this count would be used to guide the initial number of clusters.
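
As an illustration of how a row count can guide the number of clusters, one commonly cited rule of thumb is k ≈ sqrt(n / 2). This is an assumption on my part about what such guidance might look like; it is not the rule used in this analysis, and the function name below is hypothetical. A small Python 3 sketch:

```python
import math


def rule_of_thumb_k(n_documents):
    """Rough starting point for the number of clusters: sqrt(n / 2), rounded."""
    return int(round(math.sqrt(n_documents / 2.0)))


# 242 is the row count reported for countries.csv.
suggested_k = rule_of_thumb_k(242)
print(suggested_k)
```

Any value obtained this way is only a starting point; the cluster count would normally be tuned from there.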

# Extract the country description from the data file

full_sentences = []

for i in range(len(data)):
        full_sentences.append(data[i][3])

Here is the full description from Ireland:

Celtic tribes arrived on the island between 600 and 150 B.C. Invasions by Norsemen that began in the late 8th century were finally ended when King Brian BORU defeated the Danes in 1014. Norman invasions began in the 12th century and set off more than seven centuries of Anglo-Irish struggle marked by fierce rebellions and harsh repressions. The Irish famine of the mid-19th century saw the population of the island drop by one third through starvation and emigration. For more than a century after that the population of the island continued to fall only to begin growing again in the 1960s. Over the last 50 years, Ireland’s high birthrate has made it demographically one of the youngest populations in the EU. The modern Irish state traces its origins to the failed 1916 Easter Monday Uprising that touched off several years of guerrilla warfare resulting in independence from the UK in 1921 for 26 southern counties; six northern (Ulster) counties remained part of the UK. Unresolved issues in Northern Ireland erupted into years of violence known as the “Troubles” that began in the 1960s. The Government of Ireland was part of a process along with the UK and US Governments that helped broker what is known as The Good Friday Agreement in Northern Ireland in 1998. This initiated a new phase of cooperation between the Irish and British Governments. Ireland was neutral in World War II and continues its policy of military neutrality. Ireland joined the European Community in 1973 and the euro-zone currency union in 1999. The economic boom years of the Celtic Tiger (1995-2007) saw rapid economic growth, which came to an abrupt end in 2008 with the meltdown of the Irish banking system. Today the economy is recovering, fueled by large and growing foreign direct investment, especially from US multi-nationals.

And here are the tokenized words from this description (lower-cased, with punctuation, digits, and stop words removed, then stemmed with the Porter stemmer):

‘celtic’, ‘tribe’, ‘arriv’, ‘island’, ‘bc’, ‘invas’, ‘norsemen’, ‘began’, ‘late’, ‘th’, ‘centuri’, ‘final’, ‘end’, ‘king’, ‘brian’, ‘bor’, ‘defeat’, ‘dane’, ‘norman’, ‘invas’, ‘began’, ‘th’, ‘centuri’, ‘set’, ‘seven’, ‘centuri’, ‘angloirish’, ‘struggl’, ‘mark’, ‘fierc’, ‘rebellion’, ‘harsh’, ‘repress’, ‘irish’, ‘famin’, ‘midth’, ‘centuri’, ‘saw’, ‘popul’, ‘island’, ‘drop’, ‘one’, ‘third’, ‘starvat’, ‘emigr’, ‘centuri’, ‘popul’, ‘island’, ‘contin’, ‘fall’, ‘begin’, ‘grow’, ‘last’, ‘year’, ‘ireland’, ‘high’, ‘birthrat’, ‘made’, ‘demograph’, ‘one’, ‘youngest’, ‘popul’, ‘e’, ‘modern’, ‘irish’, ‘state’, ‘trace’, ‘origin’, ‘fail’, ‘easter’, ‘monday’, ‘upris’, ‘touch’, ‘sever’, ‘year’, ‘guerrilla’, ‘warfar’, ‘result’, ‘independ’, ‘uk’, ‘southern’, ‘counti’, ‘six’, ‘northern’, ‘ulster’, ‘counti’, ‘remain’, ‘part’, ‘uk’, ‘unresolv’, ‘iss’, ‘northern’, ‘ireland’, ‘erupt’, ‘year’, ‘violenc’, ‘known’, ‘troubl’, ‘began’, ‘govern’, ‘ireland’, ‘part’, ‘process’, ‘along’, ‘uk’, ‘us’, ‘govern’, ‘help’, ‘broker’, ‘known’, ‘good’, ‘friday’, ‘agreement’, ‘northern’, ‘ireland’, ‘initi’, ‘new’, ‘phase’, ‘cooper’, ‘irish’, ‘british’, ‘govern’, ‘ireland’, ‘neutral’, ‘world’, ‘war’, ‘ii’, ‘contin’, ‘polici’, ‘militari’, ‘neutral’, ‘ireland’, ‘join’, ‘european’, ‘commun’, ‘eurozon’, ‘currenc’, ‘union’, ‘econom’, ‘boom’, ‘year’, ‘celtic’, ‘tiger’, ‘saw’, ‘rapid’, ‘econom’, ‘growth’, ‘came’, ‘abrupt’, ‘end’, ‘meltdown’, ‘irish’, ‘bank’, ‘system’, ‘today’, ‘economi’, ‘recov’, ‘fuel’, ‘larg’, ‘grow’, ‘foreign’, ‘direct’, ‘invest’, ‘especi’, ‘us’, ‘multin’
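
Tokens such as ‘th’, ‘midth’, and ‘angloirish’ in the list above come from the character-stripping step of the tokenize function, which deletes punctuation and digits from inside each word before stop words are removed. The sketch below isolates that one step in Python 3, reusing the PUNCTUATION_NUMBERS constant from the notebook; the helper name `strip_punct_and_digits` is mine, and stemming is deliberately left out.

```python
import string

# Same constant as in the notebook: every punctuation character plus the digits.
PUNCTUATION_NUMBERS = set(string.punctuation + '0123456789')


def strip_punct_and_digits(word):
    """Lower-case a word and delete any punctuation or digit characters from it."""
    return ''.join(ch for ch in word.lower() if ch not in PUNCTUATION_NUMBERS)


# Words taken from the Ireland description.
for raw in ['8th', 'mid-19th', 'Anglo-Irish', '1014', 'euro-zone']:
    print(repr(raw), '->', repr(strip_punct_and_digits(raw)))
```

A free-standing year such as 1014 is reduced to an empty string, which is why the final line of tokenize filters out empty tokens and why no years survive in the token list.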

# Create a dict with the description and the country continent : country name

sentence_lookup = {}

for i in range(len(data)):
        continent_country = data[i][1] + ' : ' + data[i][2]
        sentence_lookup[data[i][3]] = continent_country


# Function to create clusters using K-Means from the country descriptions

def cluster_sentences(sentences, nb_of_clusters=7):
    tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize, 
                                       stop_words=stopwords.words('english'), 
                                       max_df=0.9, 
                                       min_df=0.1, 
                                       lowercase=True)

    #builds a tf-idf matrix for the sentences
    tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)
    kmeans = KMeans(n_clusters=nb_of_clusters)
    kmeans.fit(tfidf_matrix)
    clusters = collections.defaultdict(list)

    for i, label in enumerate(kmeans.labels_):
        clusters[label].append(i)
    return dict(clusters)


# Calculate and print out the clusters and associated country information

nclusters= 10
clusters = cluster_sentences(full_sentences, nclusters)

for cluster in range(nclusters):
    print "Cluster ",cluster,":"
    for i,sentence in enumerate(clusters[cluster]):
        print "\tCountry ",i,": ", sentence_lookup.get(full_sentences[sentence])
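
The min_df and max_df arguments passed to TfidfVectorizer in cluster_sentences, when given as floats, are document-frequency proportions: a token is kept only if the share of documents containing it lies between the two thresholds. The sketch below shows that filtering idea in plain Python 3 on a toy corpus. The function names and the toy documents are hypothetical, and this is an illustration of the concept rather than scikit-learn's actual implementation.

```python
def document_frequencies(tokenized_docs):
    """Map each token to the fraction of documents that contain it."""
    n_docs = len(tokenized_docs)
    counts = {}
    for tokens in tokenized_docs:
        for token in set(tokens):  # count a token once per document
            counts[token] = counts.get(token, 0) + 1
    return {token: count / n_docs for token, count in counts.items()}


def filter_vocabulary(tokenized_docs, min_df=0.1, max_df=0.9):
    """Keep tokens whose document frequency lies within [min_df, max_df]."""
    fractions = document_frequencies(tokenized_docs)
    return sorted(t for t, f in fractions.items() if min_df <= f <= max_df)


# Toy corpus of 20 already-tokenized documents (made up for illustration):
# 'centuri' is in every document, 'island' in 5 of them, 'tiger' in only 1.
docs = []
for i in range(20):
    tokens = ['centuri']
    if i < 5:
        tokens.append('island')
    if i == 0:
        tokens.append('tiger')
    docs.append(tokens)

vocabulary = filter_vocabulary(docs)
print(vocabulary)
```

In the toy corpus only 'island' survives: 'centuri' is too common to separate documents, and 'tiger' is too rare to link them.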

Here is a list of the clusters and their associated countries, listed as region : countryname. Note that I used a cluster count of 10, which might seem small given the number of countries in the analysis (242), but I wanted to match the number of regions the countries were segmented into geographically. Also note that the order of the countries within a cluster is not relevant.

Cluster  0 :
   Country 0 :  south-america : Guyana
   Country 1 :  south-america : Peru
   Country 2 :  south-america : Bolivia
   Country 3 :  europe : Belarus
   Country 4 :  central-asia : Kyrgyzstan
   Country 5 :  central-asia : Turkmenistan
   Country 6 :  east-n-southeast-asia : Philippines
   Country 7 :  africa : Botswana
   Country 8 :  africa : Seychelles
   Country 9 :  africa : Niger
   Country 10 :  africa : South Africa
   Country 11 :  africa : Congo (Brazzaville)
   Country 12 :  africa : Guinea-Bissau
   Country 13 :  africa : Burundi
   Country 14 :  africa : Equatorial Guinea
   Country 15 :  africa : Djibouti
   Country 16 :  africa : Egypt
   Country 17 :  africa : Malawi
   Country 18 :  africa : Togo
   Country 19 :  africa : Central African Republic
   Country 20 :  africa : Mauritania
   Country 21 :  africa : Guinea
   Country 22 :  africa : The Gambia
   Country 23 :  africa : Burkina Faso
   Country 24 :  africa : Ghana
   Country 25 :  africa : Mali
   Country 26 :  africa : Zambia
   Country 27 :  africa : Gabon
   Country 28 :  africa : Madagascar
   Country 29 :  africa : Comoros
   Country 30 :  central-america-n-caribbean : Nicaragua
   Country 31 :  central-america-n-caribbean : The Dominican Republic
   Country 32 :  south-asia : Maldives

Cluster  1 :
   Country 0 :  europe : Guernsey
   Country 1 :  europe : France
   Country 2 :  europe : Jersey
   Country 3 :  europe : Isle of Man
   Country 4 :  central-america-n-caribbean : Saint Lucia
   Country 5 :  central-america-n-caribbean : Saint Vincent and the Grenadines
   Country 6 :  australia-oceania : New Caledonia
   Country 7 :  australia-oceania : French Polynesia
   Country 8 :  north-america : Clipperton Island
   Country 9 :  north-america : Saint Pierre and Miquelon

Cluster  2 :
   Country 0 :  south-america : Venezuela
   Country 1 :  south-america : Uruguay
   Country 2 :  south-america : Colombia
   Country 3 :  south-america : Chile
   Country 4 :  south-america : Suriname
   Country 5 :  south-america : Brazil
   Country 6 :  south-america : Argentina
   Country 7 :  europe : Ukraine
   Country 8 :  europe : Macedonia
   Country 9 :  europe : Bosnia and Herzegovina
   Country 10 :  central-asia : Tajikistan
   Country 11 :  east-n-southeast-asia : South Korea
   Country 12 :  east-n-southeast-asia : North Korea
   Country 13 :  east-n-southeast-asia : Timor-Leste
   Country 14 :  east-n-southeast-asia : Vietnam
   Country 15 :  east-n-southeast-asia : Macau
   Country 16 :  east-n-southeast-asia : Japan
   Country 17 :  east-n-southeast-asia : Hong Kong
   Country 18 :  africa : Liberia
   Country 19 :  africa : Cote d'Ivoire
   Country 20 :  africa : Sudan
   Country 21 :  africa : DRC
   Country 22 :  africa : Sierra Leone
   Country 23 :  africa : Libya
   Country 24 :  africa : Eritrea
   Country 25 :  africa : South Sudan
   Country 26 :  africa : Rwanda
   Country 27 :  central-america-n-caribbean : Honduras
   Country 28 :  central-america-n-caribbean : Guatemala
   Country 29 :  central-america-n-caribbean : El Salvador
   Country 30 :  australia-oceania : Solomon Islands
   Country 31 :  middle-east : Syria
   Country 32 :  middle-east : Bahrain
   Country 33 :  middle-east : Saudi Arabia
   Country 34 :  middle-east : Turkey
   Country 35 :  middle-east : Israel
   Country 36 :  middle-east : United Arab Emirates
   Country 37 :  middle-east : Jordan
   Country 38 :  middle-east : Yemen
   Country 39 :  middle-east : Iran
   Country 40 :  middle-east : Qatar
   Country 41 :  middle-east : Lebanon
   Country 42 :  middle-east : Oman
   Country 43 :  south-asia : Afghanistan
   Country 44 :  south-asia : Bhutan

Cluster  3 :
   Country 0 :  europe : Faroe Islands
   Country 1 :  europe : Svalbard
   Country 2 :  europe : Iceland
   Country 3 :  europe : Monaco
   Country 4 :  europe : Ireland
   Country 5 :  east-n-southeast-asia : China
   Country 6 :  east-n-southeast-asia : Laos
   Country 7 :  east-n-southeast-asia : Brunei
   Country 8 :  africa : Cabo Verde
   Country 9 :  africa : Mauritius
   Country 10 :  central-america-n-caribbean : Anguilla
   Country 11 :  central-america-n-caribbean : Barbados
   Country 12 :  central-america-n-caribbean : Costa Rica
   Country 13 :  central-america-n-caribbean : Belize
   Country 14 :  central-america-n-caribbean : Grenada
   Country 15 :  central-america-n-caribbean : Trinidad and Tobago
   Country 16 :  central-america-n-caribbean : Haiti
   Country 17 :  central-america-n-caribbean : Virgin Islands
   Country 18 :  central-america-n-caribbean : Aruba
   Country 19 :  central-america-n-caribbean : Jamaica
   Country 20 :  australia-oceania : American Samoa
   Country 21 :  australia-oceania : Australia
   Country 22 :  australia-oceania : Vanuatu
   Country 23 :  south-asia : India
   Country 24 :  south-asia : Sri Lanka
   Country 25 :  north-america : Greenland
   Country 26 :  north-america : Mexico

Cluster  4 :
   Country 0 :  europe : Dhekelia
   Country 1 :  europe : Akrotiri
   Country 2 :  europe : Cyprus
   Country 3 :  europe : Malta
   Country 4 :  europe : Gibraltar
   Country 5 :  east-n-southeast-asia : Papua New Guinea
   Country 6 :  australia-oceania : New Zealand
   Country 7 :  australia-oceania : Cook Islands

Cluster  5 :
   Country 0 :  south-america : Falkland Islands (Islas Malvinas)
   Country 1 :  south-america : South Georgia and South Sandwich Islands
   Country 2 :  europe : Jan Mayen
   Country 3 :  africa : Saint Helena, Ascension, and Tristan da Cunha
   Country 4 :  central-america-n-caribbean : Turks and Caicos Islands
   Country 5 :  central-america-n-caribbean : Saint Kitts and Nevis
   Country 6 :  central-america-n-caribbean : Antigua and Barbuda
   Country 7 :  central-america-n-caribbean : Cayman Islands
   Country 8 :  central-america-n-caribbean : Dominica
   Country 9 :  central-america-n-caribbean : Montserrat
   Country 10 :  central-america-n-caribbean : Saint Martin
   Country 11 :  central-america-n-caribbean : Saint Barthelemy
   Country 12 :  australia-oceania : Coral Sea Islands
   Country 13 :  australia-oceania : Pitcairn Islands
   Country 14 :  australia-oceania : Tonga
   Country 15 :  australia-oceania : Norfolk Island
   Country 16 :  australia-oceania : Tuvalu
   Country 17 :  australia-oceania : Christmas Island
   Country 18 :  australia-oceania : Tokelau
   Country 19 :  australia-oceania : Niue
   Country 20 :  australia-oceania : Cocos (Keeling) Islands
   Country 21 :  australia-oceania : Ashmore and Cartier Islands
   Country 22 :  australia-oceania : Wallis and Futuna
   Country 23 :  australia-oceania : Samoa
   Country 24 :  south-asia : British Indian Ocean Territory
   Country 25 :  north-america : Bermuda

Cluster  6 :
   Country 0 :  central-america-n-caribbean : Navassa Island
   Country 1 :  central-america-n-caribbean : Panama
   Country 2 :  central-america-n-caribbean : The Bahamas
   Country 3 :  central-america-n-caribbean : Puerto Rico
   Country 4 :  central-america-n-caribbean : British Virgin Islands
   Country 5 :  central-america-n-caribbean : Cuba
   Country 6 :  australia-oceania : Northern Mariana Islands
   Country 7 :  australia-oceania : Guam
   Country 8 :  australia-oceania : Marshall Islands
   Country 9 :  australia-oceania : Palau
   Country 10 :  australia-oceania : Baker Island; Howland Island; Jarvis Island; Johnston Atoll; Kingman Reef; Midway Islands; Palmyra Atoll
   Country 11 :  australia-oceania : Wake Island
   Country 12 :  australia-oceania : Federated States of Micronesia
   Country 13 :  australia-oceania : Kiribati

Cluster  7 :
   Country 0 :  south-america : Paraguay
   Country 1 :  south-america : Ecuador
   Country 2 :  europe : Moldova
   Country 3 :  europe : Andorra
   Country 4 :  east-n-southeast-asia : Taiwan
   Country 5 :  east-n-southeast-asia : Burma
   Country 6 :  east-n-southeast-asia : Cambodia
   Country 7 :  east-n-southeast-asia : Mongolia
   Country 8 :  east-n-southeast-asia : Indonesia
   Country 9 :  east-n-southeast-asia : Thailand
   Country 10 :  africa : Algeria
   Country 11 :  africa : Tunisia
   Country 12 :  africa : Uganda
   Country 13 :  africa : Cameroon
   Country 14 :  africa : Morocco
   Country 15 :  africa : Nigeria
   Country 16 :  africa : Benin
   Country 17 :  africa : Somalia
   Country 18 :  africa : Senegal
   Country 19 :  africa : Ethiopia
   Country 20 :  africa : Chad
   Country 21 :  africa : Zimbabwe
   Country 22 :  africa : Tanzania
   Country 23 :  africa : Namibia
   Country 24 :  africa : Mozambique
   Country 25 :  africa : Sao Tome and Principe
   Country 26 :  africa : Lesotho
   Country 27 :  africa : Angola
   Country 28 :  africa : Swaziland
   Country 29 :  australia-oceania : Fiji
   Country 30 :  middle-east : Kuwait
   Country 31 :  middle-east : Iraq
   Country 32 :  middle-east : Georgia
   Country 33 :  south-asia : Pakistan
   Country 34 :  south-asia : Bangladesh
   Country 35 :  south-asia : Nepal

Cluster  8 :
   Country 0 :  europe : Montenegro
   Country 1 :  europe : Holy See (Vatican City)
   Country 2 :  europe : Belgium
   Country 3 :  europe : Luxembourg
   Country 4 :  europe : Croatia
   Country 5 :  europe : Liechtenstein
   Country 6 :  europe : Kosovo
   Country 7 :  europe : Slovakia
   Country 8 :  europe : Czechia
   Country 9 :  europe : Albania
   Country 10 :  europe : Netherlands
   Country 11 :  europe : Switzerland
   Country 12 :  europe : Bulgaria
   Country 13 :  europe : Serbia
   Country 14 :  europe : Slovenia
   Country 15 :  europe : Hungary
   Country 16 :  east-n-southeast-asia : Singapore
   Country 17 :  east-n-southeast-asia : Malaysia
   Country 18 :  australia-oceania : Nauru

Cluster  9 :
   Country 0 :  europe : Greece
   Country 1 :  europe : San Marino
   Country 2 :  europe : Spain
   Country 3 :  europe : Germany
   Country 4 :  europe : Estonia
   Country 5 :  europe : Sweden
   Country 6 :  europe : Finland
   Country 7 :  europe : United Kingdom
   Country 8 :  europe : Lithuania
   Country 9 :  europe : Latvia
   Country 10 :  europe : Romania
   Country 11 :  europe : Austria
   Country 12 :  europe : Portugal
   Country 13 :  europe : Norway
   Country 14 :  europe : Italy
   Country 15 :  europe : Denmark
   Country 16 :  europe : Poland
   Country 17 :  central-asia : Russia
   Country 18 :  central-asia : Uzbekistan
   Country 19 :  central-asia : Kazakhstan
   Country 20 :  middle-east : Azerbaijan
   Country 21 :  middle-east : Armenia
   Country 22 :  north-america : Canada
   Country 23 :  north-america : United States

Lengthy list! So, what can a quick scan of the cluster results tell us? Some clusters have a dominant region (defined as one region holding more than 2/3 of the records): Cluster 0 (africa) and Clusters 8 and 9 (europe). Cluster 4 is also mostly europe, but at 5 of 8 records it falls just short of that threshold. The region with the greatest cohesion is middle-east, which appears in only three clusters (2, 7, 9), with most of its countries in Cluster 2. The cluster with the least diversity in terms of regions represented is Cluster 6, with only two regions (central-america-n-caribbean, australia-oceania). Clusters 1, 4, 5, and 6 are dominated by island countries/territories spread throughout the globe.
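
The "dominant region" check can be done mechanically rather than by eye. Below is a minimal Python 3 sketch that computes the share of the most common region in a cluster; the function name is hypothetical, and the two region lists are simply the per-region counts read off the listing above for Clusters 8 and 6.

```python
from collections import Counter


def dominant_region(regions):
    """Return the most common region in a cluster and its share of the records."""
    counts = Counter(regions)
    region, count = counts.most_common(1)[0]
    return region, count / len(regions)


# Region labels per cluster, counted from the cluster listing.
cluster_8 = ['europe'] * 16 + ['east-n-southeast-asia'] * 2 + ['australia-oceania']
cluster_6 = ['central-america-n-caribbean'] * 6 + ['australia-oceania'] * 8

for name, regions in [('Cluster 8', cluster_8), ('Cluster 6', cluster_6)]:
    region, share = dominant_region(regions)
    verdict = 'dominant' if share > 2.0 / 3.0 else 'not dominant'
    print('%s: %s holds %.0f%% of the records (%s)' % (name, region, share * 100, verdict))
```

Cluster 8 clears the 2/3 threshold comfortably, while Cluster 6, despite containing only two regions, does not have a dominant one under this definition.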

So, generally speaking, clustering on the country/territory descriptions did not reproduce the geographic regions the countries belong to. Possible enhancements to the analysis include:
a) remove duplicate entries from the tokens
b) keep the free-standing years in the tokens, but truncate them by dropping the last digit so that years in the same decade match (e.g. 1916 would become 191).
c) remove any token with a length of less than three.
d) remove the name of the country from its own token list.
e) alter the settings for the K-Means clustering algorithm. The TfidfVectorizer command has many other parameters beyond what I have used here. The “df” parameters I used relate to word document frequency and remove tokens that occur in fewer than 10% or more than 90% of the documents.
f) remove the names of other countries from the token list. I think this is not a good idea, because the history of certain countries is closely linked to other countries and it would be important to preserve this. An example of this behavior is Cluster 1, which is made up almost entirely of France and countries/territories that were once, or still are, considered part of that country. It is by no means an exhaustive list of countries that have ties to France.
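
Enhancements (b), (c), and (d) could be applied as a post-processing pass over each token list. The following is a rough Python 3 sketch under two assumptions: that the tokenizer has been changed to keep four-digit years (the current one strips all digits), and that the country's own name is available as a set of stemmed tokens. The function name and sample tokens are hypothetical.

```python
import re


def refine_tokens(tokens, country_tokens=()):
    """Apply enhancements (b), (c), and (d) to a list of tokens."""
    refined = []
    for token in tokens:
        # (b) truncate a free-standing four-digit year to its decade: 1916 -> 191
        if re.fullmatch(r'\d{4}', token):
            token = token[:3]
        # (c) drop tokens shorter than three characters
        if len(token) < 3:
            continue
        # (d) drop the country's own name from its token list
        if token in country_tokens:
            continue
        refined.append(token)
    return refined


# Hypothetical tokens for Ireland, assuming years were kept by the tokenizer.
sample_tokens = ['ireland', 'th', '1916', 'easter', 'uk', '1921', 'e']
refined = refine_tokens(sample_tokens, country_tokens={'ireland'})
print(refined)
```

One side effect worth weighing: the length filter in (c) also discards meaningful two-letter tokens such as 'uk' and 'us', which carry exactly the kind of country linkage that item (f) argues for preserving.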

(*) A major unknown in this analysis is: who are the authors of these descriptions? Are they all authored by one person? Is each region written by one person? Are they a collaboration of multiple authors? A bias may be introduced depending on the answers to these questions. I do not have the answers to these questions of authorship, and the CIA is not talking!
