608.294.5460
634 W Main St., Ste 201
Madison, WI  53703

NLP and the CIA World Factbook

Each entry in the CIA World Factbook has a multitude of factoids as well as a continent (or sub-continent) name and a description. It was the availability of the description that interested me: I wanted to use it as the corpus in a Natural Language Processing (NLP) analysis to determine whether, using only the description, a K-Means clustering algorithm would put the countries into their geographic regions*.

There are 242 countries and territories covered by the CIA World Factbook. They are broken down into ten geographic regions (africa, australia-oceania, central-america-n-caribbean, central-asia, east-n-southeast-asia, europe, middle-east, north-america, south-america, south-asia). The countries are not evenly distributed by region, with africa having the most (55) and central-asia having the fewest (6).

Each country has a file in JSON format that was obtained from the GitHub repo. I wrote the Python script below to parse the relevant fields from the JSON files and create a CSV file as input to the NLP analysis.

# package imports

import json
import os
import fnmatch
import csv

# set constant variable values
rootPath = r'/home/pitt/nlp_analysis'
pattern = '*.json'

# ingest the country codes and names file to eliminate non-country values
countrycode = []
with open(os.path.join(rootPath, 'countrylist.txt')) as countrylist:
    for line in countrylist:
        countrycode.append(line.split(",")[0])


# create the header row for the list
data = [['countrycode','continent','countryname','description']]

# traverse the directory structure and create rows for data file
for root, dirs, files in os.walk(rootPath):
    for filename in fnmatch.filter(files, pattern):
        completefilename = os.path.join(root, filename)
        filenameroot = filename.split(".")[0]
        if filenameroot in countrycode:
            with open(completefilename) as factbookfile:
                jsonfile = json.load(factbookfile)
                try:
                    countrylongname = jsonfile['Government'] \
                                              ['Country name'] \
                                              ['conventional long form'] \
                                              ['text'].encode('utf-8')
                except KeyError:  # some entries have no long form
                    countrylongname = 'none'
                countryshortname = jsonfile['Government'] \
                                           ['Country name'] \
                                           ['conventional short form'] \
                                           ['text'].encode('utf-8')
                if countryshortname == 'none':
                    countryname = countrylongname
                else:
                    countryname = countryshortname
                descr = jsonfile['Introduction'] \
                                ['Background'] \
                                ['text'].encode('utf-8')
                countryrow = []
                countryrow.append(filenameroot)
                countryrow.append(root.split('/')[-1])
                countryrow.append(countryname)
                countryrow.append(descr)
            data.append(countryrow)

# write out the formatted file with the header and data rows
with open('countries.csv', 'w') as csvfile:
    datawriter = csv.writer(csvfile)
    for row in data:
        datawriter.writerow(row)
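
The chained dictionary lookups in the script above fail as soon as one level is missing, which is why the long-form name needs a try/except. A minimal sketch of an alternative, written for Python 3 (the script itself is Python 2): the helper name nested_get and the miniature sample record are hypothetical and only mimic the fields the script reads.

```python
def nested_get(record, keys, default="none"):
    """Walk a chain of dict keys, returning `default` if any level is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Hypothetical miniature record shaped like the fields the script reads.
sample = {
    "Government": {
        "Country name": {
            "conventional short form": {"text": "Ireland"},
        }
    },
    "Introduction": {"Background": {"text": "Celtic tribes arrived ..."}},
}

name_path = ["Government", "Country name"]
short_name = nested_get(sample, name_path + ["conventional short form", "text"])
long_name = nested_get(sample, name_path + ["conventional long form", "text"])

# Same fallback rule as the script: use the long form only when the short form is 'none'.
country_name = long_name if short_name == "none" else short_name
print(country_name)
```

With a helper like this, both the short-form and long-form lookups are guarded the same way, rather than only the long form.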

The resulting CSV file has four columns (countrycode, continent, countryname, description). In this case the continent is the region assigned to the country by the CIA, not the classical definition of one of the six populated continents. I originally did the analysis in a Jupyter notebook, which is reproduced below as separate steps. For the sake of clarity I am printing out the full description and tokenized values for my father's homeland, Ireland.
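
Before clustering, it can be worth checking how the rows are spread across regions. The following is a small Python 3 sketch, not part of the original notebook: the function name count_by_region and the three sample rows are made up for illustration, and only the header matches the file described above.

```python
import csv
import io
from collections import Counter

def count_by_region(csv_text):
    """Count data rows per value of the `continent` column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["continent"] for row in reader)

# Hypothetical three-row sample with the same header as countries.csv.
sample_csv = (
    "countrycode,continent,countryname,description\n"
    'ei,europe,Ireland,"Celtic tribes arrived on the island ..."\n'
    'fr,europe,France,"..."\n'
    'pe,south-america,Peru,"..."\n'
)

counts = count_by_region(sample_csv)
for region, n in counts.most_common():
    print(region, n)
```

To run it against the real file, read the file contents into a string first (or pass the open file object straight to csv.DictReader).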

# Package imports

import string
import collections

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Read data in from local file

from csv import reader

with open('/home/pitt/nlp_analysis/countries.csv', 'rb') as csvFile:
    next(csvFile)
    csvReader = reader(csvFile, delimiter=',', quotechar='"')
    data = list(csvReader)

# Needed to download the NLTK data the first time (requires `import nltk`)
# nltk.download()

# Set constants

PUNCTUATION_NUMBERS = set(string.punctuation + '0123456789')
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

# Function to tokenize the country description

def tokenize(text):
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if letter not in PUNCTUATION_NUMBERS])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if w not in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]

# Get the number of records in the data file

print 'Number of rows in the countries.csv file is: %s' % len(data)

Number of rows in the countries.csv file is: 242

Normally this count would be used to guide the initial number of clusters.

# Extract the country description from the data file

full_sentences = []

for row in data:
    full_sentences.append(row[3])

Here is the full description from Ireland:

Celtic tribes arrived on the island between 600 and 150 B.C. Invasions by Norsemen that began in the late 8th century were finally ended when King Brian BORU defeated the Danes in 1014. Norman invasions began in the 12th century and set off more than seven centuries of Anglo-Irish struggle marked by fierce rebellions and harsh repressions. The Irish famine of the mid-19th century saw the population of the island drop by one third through starvation and emigration. For more than a century after that the population of the island continued to fall only to begin growing again in the 1960s. Over the last 50 years, Ireland's high birthrate has made it demographically one of the youngest populations in the EU. The modern Irish state traces its origins to the failed 1916 Easter Monday Uprising that touched off several years of guerrilla warfare resulting in independence from the UK in 1921 for 26 southern counties; six northern (Ulster) counties remained part of the UK. Unresolved issues in Northern Ireland erupted into years of violence known as the "Troubles" that began in the 1960s. The Government of Ireland was part of a process along with the UK and US Governments that helped broker what is known as The Good Friday Agreement in Northern Ireland in 1998. This initiated a new phase of cooperation between the Irish and British Governments. Ireland was neutral in World War II and continues its policy of military neutrality. Ireland joined the European Community in 1973 and the euro-zone currency union in 1999. The economic boom years of the Celtic Tiger (1995-2007) saw rapid economic growth, which came to an abrupt end in 2008 with the meltdown of the Irish banking system. Today the economy is recovering, fueled by large and growing foreign direct investment, especially from US multi-nationals.

And here are the tokenized words from this description (lower-cased, with punctuation, digits, and stop words removed, then stemmed with the Porter stemmer):

'celtic', 'tribe', 'arriv', 'island', 'bc', 'invas', 'norsemen', 'began', 'late', 'th', 'centuri', 'final', 'end', 'king', 'brian', 'bor', 'defeat', 'dane', 'norman', 'invas', 'began', 'th', 'centuri', 'set', 'seven', 'centuri', 'angloirish', 'struggl', 'mark', 'fierc', 'rebellion', 'harsh', 'repress', 'irish', 'famin', 'midth', 'centuri', 'saw', 'popul', 'island', 'drop', 'one', 'third', 'starvat', 'emigr', 'centuri', 'popul', 'island', 'contin', 'fall', 'begin', 'grow', 'last', 'year', 'ireland', 'high', 'birthrat', 'made', 'demograph', 'one', 'youngest', 'popul', 'e', 'modern', 'irish', 'state', 'trace', 'origin', 'fail', 'easter', 'monday', 'upris', 'touch', 'sever', 'year', 'guerrilla', 'warfar', 'result', 'independ', 'uk', 'southern', 'counti', 'six', 'northern', 'ulster', 'counti', 'remain', 'part', 'uk', 'unresolv', 'iss', 'northern', 'ireland', 'erupt', 'year', 'violenc', 'known', 'troubl', 'began', 'govern', 'ireland', 'part', 'process', 'along', 'uk', 'us', 'govern', 'help', 'broker', 'known', 'good', 'friday', 'agreement', 'northern', 'ireland', 'initi', 'new', 'phase', 'cooper', 'irish', 'british', 'govern', 'ireland', 'neutral', 'world', 'war', 'ii', 'contin', 'polici', 'militari', 'neutral', 'ireland', 'join', 'european', 'commun', 'eurozon', 'currenc', 'union', 'econom', 'boom', 'year', 'celtic', 'tiger', 'saw', 'rapid', 'econom', 'growth', 'came', 'abrupt', 'end', 'meltdown', 'irish', 'bank', 'system', 'today', 'economi', 'recov', 'fuel', 'larg', 'grow', 'foreign', 'direct', 'invest', 'especi', 'us', 'multin'
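
Tokens such as 'th', 'midth', and 'angloirish' in the list above come from stripping punctuation and digits one character at a time. The Python 3 sketch below reproduces just that step in isolation. It is a simplification under stated assumptions: it splits on whitespace instead of calling NLTK's word_tokenize, uses a tiny stand-in stop-word set, and skips stemming; strip_and_filter and DEMO_STOPWORDS are hypothetical names.

```python
import string

# Same character set the post strips: all punctuation plus the ten digits.
PUNCTUATION_NUMBERS = set(string.punctuation + "0123456789")

# Tiny stand-in stop-word list (an assumption; the post uses NLTK's English list).
DEMO_STOPWORDS = {"the", "of", "in", "and", "by", "to", "a"}

def strip_and_filter(text):
    """Lower-case, strip punctuation/digits per character, drop stop words and empties."""
    cleaned = []
    for word in text.lower().split():
        kept = "".join(ch for ch in word if ch not in PUNCTUATION_NUMBERS)
        if kept and kept not in DEMO_STOPWORDS:
            cleaned.append(kept)
    return cleaned

print(strip_and_filter("the late 8th century"))
# ['late', 'th', 'century']
print(strip_and_filter("Anglo-Irish struggle in the mid-19th century, 1916"))
# ['angloirish', 'struggle', 'midth', 'century']
```

Free-standing years such as 1916 vanish entirely because every character in them is a digit, which is what enhancement (b) further down is meant to address.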

# Create a dict with the description and the country continent : country name

sentence_lookup = {}

for row in data:
    continent_country = row[1] + ' : ' + row[2]
    sentence_lookup[row[3]] = continent_country

# Function to create clusters using K-Means from the country descriptions

def cluster_sentences(sentences, nb_of_clusters=7):
    tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize, \
                                       stop_words=stopwords.words('english'), \
                                       max_df=0.9, \
                                       min_df=0.1, \
                                       lowercase=True)

    #builds a tf-idf matrix for the sentences
    tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)
    kmeans = KMeans(n_clusters=nb_of_clusters)
    kmeans.fit(tfidf_matrix)
    clusters = collections.defaultdict(list)
    
    for i, label in enumerate(kmeans.labels_):
        clusters[label].append(i)
    return dict(clusters)
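
The max_df and min_df arguments passed to TfidfVectorizer above prune the vocabulary by document frequency. As a rough illustration of that idea only (this is a plain-Python sketch for Python 3, not scikit-learn's implementation), the snippet below keeps tokens whose share of documents lies between the two thresholds. The function name and the toy corpus are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, min_df=0.1, max_df=0.9):
    """Keep tokens whose share of documents lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return {tok for tok, df in doc_freq.items() if min_df <= df / n_docs <= max_df}

# Hypothetical toy corpus of twenty already-tokenized descriptions.
docs = [["island", "coloni"] for _ in range(10)] + [["island", "war"] for _ in range(10)]
docs[0].append("celtic")  # appears in only 1 of 20 documents

vocabulary = filter_by_document_frequency(docs)
print(sorted(vocabulary))
# ['coloni', 'war']
```

Here 'island' is dropped because it appears in every document, and 'celtic' is dropped because it appears in only 5% of them; with 242 descriptions, min_df=0.1 means a token must occur in roughly 25 or more descriptions to survive.
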
# Calculate and print out the clusters and associated country information

nclusters = 10
clusters = cluster_sentences(full_sentences, nclusters)

for cluster in range(nclusters):
    print "Cluster ",cluster,":"
    for i,sentence in enumerate(clusters[cluster]):
        print "\tCountry ",i,": ", sentence_lookup.get(full_sentences[sentence])

Here is a list of the clusters and their associated countries, listed as region : countryname. Note that I used a cluster count of 10, which might seem small given the number of countries in the analysis (242), but I wanted to match the number of regions the countries were segmented into geographically. Also note that the order of the countries within a cluster is not relevant.

Cluster  0 :
	Country 0 :  south-america : Guyana
	Country 1 :  south-america : Peru
	Country 2 :  south-america : Bolivia
	Country 3 :  europe : Belarus
	Country 4 :  central-asia : Kyrgyzstan
	Country 5 :  central-asia : Turkmenistan
	Country 6 :  east-n-southeast-asia : Philippines
	Country 7 :  africa : Botswana
	Country 8 :  africa : Seychelles
	Country 9 :  africa : Niger
	Country 10 :  africa : South Africa
	Country 11 :  africa : Congo (Brazzaville)
	Country 12 :  africa : Guinea-Bissau
	Country 13 :  africa : Burundi
	Country 14 :  africa : Equatorial Guinea
	Country 15 :  africa : Djibouti
	Country 16 :  africa : Egypt
	Country 17 :  africa : Malawi
	Country 18 :  africa : Togo
	Country 19 :  africa : Central African Republic
	Country 20 :  africa : Mauritania
	Country 21 :  africa : Guinea
	Country 22 :  africa : The Gambia
	Country 23 :  africa : Burkina Faso
	Country 24 :  africa : Ghana
	Country 25 :  africa : Mali
	Country 26 :  africa : Zambia
	Country 27 :  africa : Gabon
	Country 28 :  africa : Madagascar
	Country 29 :  africa : Comoros
	Country 30 :  central-america-n-caribbean : Nicaragua
	Country 31 :  central-america-n-caribbean : The Dominican Republic
	Country 32 :  south-asia : Maldives

Cluster  1 :
	Country 0 :  europe : Guernsey
	Country 1 :  europe : France
	Country 2 :  europe : Jersey
	Country 3 :  europe : Isle of Man
	Country 4 :  central-america-n-caribbean : Saint Lucia
	Country 5 :  central-america-n-caribbean : Saint Vincent and the Grenadines
	Country 6 :  australia-oceania : New Caledonia
	Country 7 :  australia-oceania : French Polynesia
	Country 8 :  north-america : Clipperton Island
	Country 9 :  north-america : Saint Pierre and Miquelon

Cluster  2 :
	Country 0 :  south-america : Venezuela
	Country 1 :  south-america : Uruguay
	Country 2 :  south-america : Colombia
	Country 3 :  south-america : Chile
	Country 4 :  south-america : Suriname
	Country 5 :  south-america : Brazil
	Country 6 :  south-america : Argentina
	Country 7 :  europe : Ukraine
	Country 8 :  europe : Macedonia
	Country 9 :  europe : Bosnia and Herzegovina
	Country 10 :  central-asia : Tajikistan
	Country 11 :  east-n-southeast-asia : South Korea
	Country 12 :  east-n-southeast-asia : North Korea
	Country 13 :  east-n-southeast-asia : Timor-Leste
	Country 14 :  east-n-southeast-asia : Vietnam
	Country 15 :  east-n-southeast-asia : Macau
	Country 16 :  east-n-southeast-asia : Japan
	Country 17 :  east-n-southeast-asia : Hong Kong
	Country 18 :  africa : Liberia
	Country 19 :  africa : Cote d'Ivoire
	Country 20 :  africa : Sudan
	Country 21 :  africa : DRC
	Country 22 :  africa : Sierra Leone
	Country 23 :  africa : Libya
	Country 24 :  africa : Eritrea
	Country 25 :  africa : South Sudan
	Country 26 :  africa : Rwanda
	Country 27 :  central-america-n-caribbean : Honduras
	Country 28 :  central-america-n-caribbean : Guatemala
	Country 29 :  central-america-n-caribbean : El Salvador
	Country 30 :  australia-oceania : Solomon Islands
	Country 31 :  middle-east : Syria
	Country 32 :  middle-east : Bahrain
	Country 33 :  middle-east : Saudi Arabia
	Country 34 :  middle-east : Turkey
	Country 35 :  middle-east : Israel
	Country 36 :  middle-east : United Arab Emirates
	Country 37 :  middle-east : Jordan
	Country 38 :  middle-east : Yemen
	Country 39 :  middle-east : Iran
	Country 40 :  middle-east : Qatar
	Country 41 :  middle-east : Lebanon
	Country 42 :  middle-east : Oman
	Country 43 :  south-asia : Afghanistan
	Country 44 :  south-asia : Bhutan

Cluster  3 :
	Country 0 :  europe : Faroe Islands
	Country 1 :  europe : Svalbard
	Country 2 :  europe : Iceland
	Country 3 :  europe : Monaco
	Country 4 :  europe : Ireland
	Country 5 :  east-n-southeast-asia : China
	Country 6 :  east-n-southeast-asia : Laos
	Country 7 :  east-n-southeast-asia : Brunei
	Country 8 :  africa : Cabo Verde
	Country 9 :  africa : Mauritius
	Country 10 :  central-america-n-caribbean : Anguilla
	Country 11 :  central-america-n-caribbean : Barbados
	Country 12 :  central-america-n-caribbean : Costa Rica
	Country 13 :  central-america-n-caribbean : Belize
	Country 14 :  central-america-n-caribbean : Grenada
	Country 15 :  central-america-n-caribbean : Trinidad and Tobago
	Country 16 :  central-america-n-caribbean : Haiti
	Country 17 :  central-america-n-caribbean : Virgin Islands
	Country 18 :  central-america-n-caribbean : Aruba
	Country 19 :  central-america-n-caribbean : Jamaica
	Country 20 :  australia-oceania : American Samoa
	Country 21 :  australia-oceania : Australia
	Country 22 :  australia-oceania : Vanuatu
	Country 23 :  south-asia : India
	Country 24 :  south-asia : Sri Lanka
	Country 25 :  north-america : Greenland
	Country 26 :  north-america : Mexico

Cluster  4 :
	Country 0 :  europe : Dhekelia
	Country 1 :  europe : Akrotiri
	Country 2 :  europe : Cyprus
	Country 3 :  europe : Malta
	Country 4 :  europe : Gibraltar
	Country 5 :  east-n-southeast-asia : Papua New Guinea
	Country 6 :  australia-oceania : New Zealand
	Country 7 :  australia-oceania : Cook Islands

Cluster  5 :
	Country 0 :  south-america : Falkland Islands (Islas Malvinas)
	Country 1 :  south-america : South Georgia and South Sandwich Islands
	Country 2 :  europe : Jan Mayen
	Country 3 :  africa : Saint Helena, Ascension, and Tristan da Cunha
	Country 4 :  central-america-n-caribbean : Turks and Caicos Islands
	Country 5 :  central-america-n-caribbean : Saint Kitts and Nevis
	Country 6 :  central-america-n-caribbean : Antigua and Barbuda
	Country 7 :  central-america-n-caribbean : Cayman Islands
	Country 8 :  central-america-n-caribbean : Dominica
	Country 9 :  central-america-n-caribbean : Montserrat
	Country 10 :  central-america-n-caribbean : Saint Martin
	Country 11 :  central-america-n-caribbean : Saint Barthelemy
	Country 12 :  australia-oceania : Coral Sea Islands
	Country 13 :  australia-oceania : Pitcairn Islands
	Country 14 :  australia-oceania : Tonga
	Country 15 :  australia-oceania : Norfolk Island
	Country 16 :  australia-oceania : Tuvalu
	Country 17 :  australia-oceania : Christmas Island
	Country 18 :  australia-oceania : Tokelau
	Country 19 :  australia-oceania : Niue
	Country 20 :  australia-oceania : Cocos (Keeling) Islands
	Country 21 :  australia-oceania : Ashmore and Cartier Islands
	Country 22 :  australia-oceania : Wallis and Futuna
	Country 23 :  australia-oceania : Samoa
	Country 24 :  south-asia : British Indian Ocean Territory
	Country 25 :  north-america : Bermuda

Cluster  6 :
	Country 0 :  central-america-n-caribbean : Navassa Island
	Country 1 :  central-america-n-caribbean : Panama
	Country 2 :  central-america-n-caribbean : The Bahamas
	Country 3 :  central-america-n-caribbean : Puerto Rico
	Country 4 :  central-america-n-caribbean : British Virgin Islands
	Country 5 :  central-america-n-caribbean : Cuba
	Country 6 :  australia-oceania : Northern Mariana Islands
	Country 7 :  australia-oceania : Guam
	Country 8 :  australia-oceania : Marshall Islands
	Country 9 :  australia-oceania : Palau
	Country 10 :  australia-oceania : Baker Island; Howland Island; Jarvis Island; Johnston Atoll; Kingman Reef; Midway Islands; Palmyra Atoll
	Country 11 :  australia-oceania : Wake Island
	Country 12 :  australia-oceania : Federated States of Micronesia
	Country 13 :  australia-oceania : Kiribati

Cluster  7 :
	Country 0 :  south-america : Paraguay
	Country 1 :  south-america : Ecuador
	Country 2 :  europe : Moldova
	Country 3 :  europe : Andorra
	Country 4 :  east-n-southeast-asia : Taiwan
	Country 5 :  east-n-southeast-asia : Burma
	Country 6 :  east-n-southeast-asia : Cambodia
	Country 7 :  east-n-southeast-asia : Mongolia
	Country 8 :  east-n-southeast-asia : Indonesia
	Country 9 :  east-n-southeast-asia : Thailand
	Country 10 :  africa : Algeria
	Country 11 :  africa : Tunisia
	Country 12 :  africa : Uganda
	Country 13 :  africa : Cameroon
	Country 14 :  africa : Morocco
	Country 15 :  africa : Nigeria
	Country 16 :  africa : Benin
	Country 17 :  africa : Somalia
	Country 18 :  africa : Senegal
	Country 19 :  africa : Ethiopia
	Country 20 :  africa : Chad
	Country 21 :  africa : Zimbabwe
	Country 22 :  africa : Tanzania
	Country 23 :  africa : Namibia
	Country 24 :  africa : Mozambique
	Country 25 :  africa : Sao Tome and Principe
	Country 26 :  africa : Lesotho
	Country 27 :  africa : Angola
	Country 28 :  africa : Swaziland
	Country 29 :  australia-oceania : Fiji
	Country 30 :  middle-east : Kuwait
	Country 31 :  middle-east : Iraq
	Country 32 :  middle-east : Georgia
	Country 33 :  south-asia : Pakistan
	Country 34 :  south-asia : Bangladesh
	Country 35 :  south-asia : Nepal

Cluster  8 :
	Country 0 :  europe : Montenegro
	Country 1 :  europe : Holy See (Vatican City)
	Country 2 :  europe : Belgium
	Country 3 :  europe : Luxembourg
	Country 4 :  europe : Croatia
	Country 5 :  europe : Liechtenstein
	Country 6 :  europe : Kosovo
	Country 7 :  europe : Slovakia
	Country 8 :  europe : Czechia
	Country 9 :  europe : Albania
	Country 10 :  europe : Netherlands
	Country 11 :  europe : Switzerland
	Country 12 :  europe : Bulgaria
	Country 13 :  europe : Serbia
	Country 14 :  europe : Slovenia
	Country 15 :  europe : Hungary
	Country 16 :  east-n-southeast-asia : Singapore
	Country 17 :  east-n-southeast-asia : Malaysia
	Country 18 :  australia-oceania : Nauru

Cluster  9 :
	Country 0 :  europe : Greece
	Country 1 :  europe : San Marino
	Country 2 :  europe : Spain
	Country 3 :  europe : Germany
	Country 4 :  europe : Estonia
	Country 5 :  europe : Sweden
	Country 6 :  europe : Finland
	Country 7 :  europe : United Kingdom
	Country 8 :  europe : Lithuania
	Country 9 :  europe : Latvia
	Country 10 :  europe : Romania
	Country 11 :  europe : Austria
	Country 12 :  europe : Portugal
	Country 13 :  europe : Norway
	Country 14 :  europe : Italy
	Country 15 :  europe : Denmark
	Country 16 :  europe : Poland
	Country 17 :  central-asia : Russia
	Country 18 :  central-asia : Uzbekistan
	Country 19 :  central-asia : Kazakhstan
	Country 20 :  middle-east : Azerbaijan
	Country 21 :  middle-east : Armenia
	Country 22 :  north-america : Canada
	Country 23 :  north-america : United States

Lengthy list! So, what can a quick scan of the cluster results tell us? Some clusters have a dominant region (defined as one region holding more than 2/3 of the records): Cluster 0 (africa, 23 of 33) and Clusters 8 and 9 (europe, 16 of 19 and 17 of 24). Cluster 4 comes close, with europe at 5 of 8. The region with the greatest cohesion is middle-east, which appears in only three clusters (2, 7, 9) and has 12 of its 17 entries in Cluster 2. The cluster with the least diversity in terms of regions represented is Cluster 6, with only two regions (central-america-n-caribbean, australia-oceania). Clusters 1, 4, 5, and 6 are dominated by island countries/territories spread throughout the globe.
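
The "dominant region" rule used above (one region holding more than 2/3 of a cluster's records) is easy to check mechanically. Here is a small Python 3 sketch; the function name dominant_region is hypothetical, and the region counts for Clusters 0 and 6 were tallied by hand from the listing above.

```python
from collections import Counter

def dominant_region(regions, threshold=2 / 3):
    """Return (region, share, is_dominant) for the most common region label."""
    counts = Counter(regions)
    region, n = counts.most_common(1)[0]
    share = n / len(regions)
    return region, share, share > threshold

# Region labels per cluster, tallied by hand from the cluster listing.
cluster_0 = (["africa"] * 23 + ["south-america"] * 3 + ["central-asia"] * 2
             + ["central-america-n-caribbean"] * 2 + ["europe"]
             + ["east-n-southeast-asia"] + ["south-asia"])
cluster_6 = ["australia-oceania"] * 8 + ["central-america-n-caribbean"] * 6

for name, regions in [("Cluster 0", cluster_0), ("Cluster 6", cluster_6)]:
    region, share, is_dominant = dominant_region(regions)
    print(f"{name}: {region} {share:.2f} dominant={is_dominant}")
# Cluster 0: africa 0.70 dominant=True
# Cluster 6: australia-oceania 0.57 dominant=False
```

Cluster 6 shows why low diversity and dominance are different things: it contains only two regions, yet neither reaches the 2/3 threshold.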

So, generally speaking, clustering the country/territory descriptions did not reproduce the geographic regions. Possible enhancements to the analysis include:
a) remove duplicate entries from the tokens.
b) keep the free-standing years in the tokens, but truncate the last digit so that years in the same decade match (e.g. 1916 would become 191).
c) remove any token with a length of less than three.
d) remove the name of the country from its own token list.
e) alter the settings for the vectorizer and the K-Means clustering algorithm. TfidfVectorizer has many other parameters beyond the ones I have used here. The "df" parameters I used relate to document frequency: min_df=0.1 removes tokens that occur in fewer than 10% of the documents, and max_df=0.9 removes tokens that occur in more than 90% of them.
f) remove the names of other countries from the token list. I think this is not a good idea, because the history of certain countries is closely linked to that of other countries, and it would be important to preserve this. An example of this behavior is Cluster 1, which is made up almost entirely of France and countries/territories that were once, or still are, considered part of that country. It is by no means an exhaustive list of countries that have ties to France.
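
Enhancements (a) through (d) could all be applied as one post-processing pass over each description's tokens. The Python 3 sketch below is one possible shape for that pass, not something I ran as part of this analysis. It assumes the tokenizer has been changed to keep free-standing years (the tokenize function shown earlier strips all digits), and refine_tokens and country_stems are hypothetical names.

```python
def refine_tokens(tokens, country_stems=(), min_length=3):
    """Apply enhancements (a) to (d) to one description's tokens."""
    seen = set()
    refined = []
    for token in tokens:
        # (b) truncate four-digit years to their decade prefix, e.g. 1916 -> 191
        if token.isdigit() and len(token) == 4:
            token = token[:3]
        # (c) drop short tokens, (d) drop the country's own name
        if len(token) < min_length or token in country_stems:
            continue
        # (a) keep only the first occurrence of each token
        if token not in seen:
            seen.add(token)
            refined.append(token)
    return refined

tokens = ["ireland", "th", "centuri", "1916", "1914", "uk", "centuri", "irish"]
print(refine_tokens(tokens, country_stems={"ireland"}))
# ['centuri', '191', 'irish']
```

Two side effects are worth weighing: the length filter in (c) also removes 'uk' and 'us', which tie countries together, and removing duplicates in (a) flattens the term counts that TF-IDF weighting relies on.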


(*) A major unknown in this analysis is: who are the authors of these descriptions? Were they all written by one person, each region by one person, or by a collaboration of multiple authors? A bias may be introduced depending on the answers. I do not have the answers to these questions of authorship, and the CIA is not talking!

Earthling Interactive
634 W Main St., Ste 201
Madison, WI 53703