Each entry in the CIA World Factbook has a multitude of factoids as well as a continent (or sub-continent) name and a description. It was the availability of a description that interested me: I wanted to use the descriptions as the corpus for a Natural Language Processing (NLP) analysis to determine whether, using only the description, a K-Means clustering algorithm would put the countries into their geographic regions*.
There are 242 countries and territories covered by the CIA World Factbook. They are broken down into ten geographic regions (africa, australia-oceania, central-america-n-caribbean, central-asia, east-n-southeast-asia, europe, middle-east, north-america, south-america, south-asia). The countries are not evenly distributed by region, with africa having the most (55) and central-asia having the fewest (6).
Each country has a file in JSON format that was obtained from the GitHub repo. I wrote the script below in Python to parse the relevant fields from the JSON files and create a CSV file for input to the NLP analysis.
# package imports
import json
import os
import fnmatch
import csv

# set constant variable values
rootPath = r'/home/pitt/nlp_analysis'
pattern = '*.json'

# ingest the country codes and names file to eliminate non-country values
countrycode = []
with open('/home/pitt/nlp_analysis/countrylist.txt') as countrylist:
    for line in countrylist:
        countrycode.append(line.split(",")[0])

# create the header row for the list
data = [['countrycode','continent','countryname','description']]

# traverse the directory structure and create rows for data file
for root, dirs, files in os.walk(rootPath):
    for filename in fnmatch.filter(files, pattern):
        completefilename = os.path.join(root, filename)
        filenameroot = filename.split(".")[0]
        if filenameroot in countrycode:
            with open(completefilename) as factbookfile:
                jsonfile = json.load(factbookfile)
            # the long form of the name is not present for every entry
            try:
                countrylongname = jsonfile['Government']['Country name']['conventional long form']['text'].encode('utf-8')
            except:
                countrylongname = 'none'
            countryshortname = jsonfile['Government']['Country name']['conventional short form']['text'].encode('utf-8')
            # fall back to the long form when the short form is 'none'
            if countryshortname == 'none':
                countryname = countrylongname
            else:
                countryname = countryshortname
            descr = jsonfile['Introduction']['Background']['text'].encode('utf-8')
            countryrow = []
            countryrow.append(filenameroot)
            countryrow.append(root.split('/')[-1])
            countryrow.append(countryname)
            countryrow.append(descr)
            data.append(countryrow)
        else:
            pass

# write out the formatted file with the header and data rows
with open('countries.csv', 'w') as csvfile:
    datawriter = csv.writer(csvfile)
    for row in data:
        datawriter.writerow(row)
The resulting csv file has four columns (countrycode,continent,countryname,description). In this case the continent is the region assigned to it by the CIA and not the classical definition of one of the six populated continents. I originally did the analysis in a Jupyter notebook, which is reproduced below as separate steps. For the sake of clarity I am printing out the full description and tokenized values from my father’s homeland, Ireland.
# Package imports
import string
import collections
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Read data in from local file
from csv import reader
with open('/home/pitt/nlp_analysis/countries.csv', 'rb') as csvFile:
    next(csvFile)
    csvReader = reader(csvFile, delimiter=',', quotechar='"')
    data = list(csvReader)

# Needed to download the NLTK libraries the first time
# nltk.download()

# Set constants
PUNCTUATION_NUMBERS = set(string.punctuation + '0123456789')
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

# Function to tokenize the country description
def tokenize(text):
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if not letter in PUNCTUATION_NUMBERS])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if not w in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]

# Get the number of records in the data file
print 'Number of rows in the countries.csv file is: %s' % len(data)
Number of rows in the countries.csv file is: 242. Normally this would be used to guide the initial number of clusters.
# Extract the country description from the data file
full_sentences = []
for i in range(len(data)):
    full_sentences.append(data[i][3])
Here is the full description from Ireland:
Celtic tribes arrived on the island between 600 and 150 B.C. Invasions by Norsemen that began in the late 8th century were finally ended when King Brian BORU defeated the Danes in 1014. Norman invasions began in the 12th century and set off more than seven centuries of Anglo-Irish struggle marked by fierce rebellions and harsh repressions. The Irish famine of the mid-19th century saw the population of the island drop by one third through starvation and emigration. For more than a century after that the population of the island continued to fall only to begin growing again in the 1960s. Over the last 50 years, Ireland’s high birthrate has made it demographically one of the youngest populations in the EU. The modern Irish state traces its origins to the failed 1916 Easter Monday Uprising that touched off several years of guerrilla warfare resulting in independence from the UK in 1921 for 26 southern counties; six northern (Ulster) counties remained part of the UK. Unresolved issues in Northern Ireland erupted into years of violence known as the “Troubles” that began in the 1960s. The Government of Ireland was part of a process along with the UK and US Governments that helped broker what is known as The Good Friday Agreement in Northern Ireland in 1998. This initiated a new phase of cooperation between the Irish and British Governments. Ireland was neutral in World War II and continues its policy of military neutrality. Ireland joined the European Community in 1973 and the euro-zone currency union in 1999. The economic boom years of the Celtic Tiger (1995-2007) saw rapid economic growth, which came to an abrupt end in 2008 with the meltdown of the Irish banking system. Today the economy is recovering, fueled by large and growing foreign direct investment, especially from US multi-nationals.
And here are the tokenized words from this description (punctuation and digits removed, lower-cased, stop words removed, and stemmed with the Porter stemmer):
‘celtic’, ‘tribe’, ‘arriv’, ‘island’, ‘bc’, ‘invas’, ‘norsemen’, ‘began’, ‘late’, ‘th’, ‘centuri’, ‘final’, ‘end’, ‘king’, ‘brian’, ‘bor’, ‘defeat’, ‘dane’, ‘norman’, ‘invas’, ‘began’, ‘th’, ‘centuri’, ‘set’, ‘seven’, ‘centuri’, ‘angloirish’, ‘struggl’, ‘mark’, ‘fierc’, ‘rebellion’, ‘harsh’, ‘repress’, ‘irish’, ‘famin’, ‘midth’, ‘centuri’, ‘saw’, ‘popul’, ‘island’, ‘drop’, ‘one’, ‘third’, ‘starvat’, ‘emigr’, ‘centuri’, ‘popul’, ‘island’, ‘contin’, ‘fall’, ‘begin’, ‘grow’, ‘last’, ‘year’, ‘ireland’, ‘high’, ‘birthrat’, ‘made’, ‘demograph’, ‘one’, ‘youngest’, ‘popul’, ‘e’, ‘modern’, ‘irish’, ‘state’, ‘trace’, ‘origin’, ‘fail’, ‘easter’, ‘monday’, ‘upris’, ‘touch’, ‘sever’, ‘year’, ‘guerrilla’, ‘warfar’, ‘result’, ‘independ’, ‘uk’, ‘southern’, ‘counti’, ‘six’, ‘northern’, ‘ulster’, ‘counti’, ‘remain’, ‘part’, ‘uk’, ‘unresolv’, ‘iss’, ‘northern’, ‘ireland’, ‘erupt’, ‘year’, ‘violenc’, ‘known’, ‘troubl’, ‘began’, ‘govern’, ‘ireland’, ‘part’, ‘process’, ‘along’, ‘uk’, ‘us’, ‘govern’, ‘help’, ‘broker’, ‘known’, ‘good’, ‘friday’, ‘agreement’, ‘northern’, ‘ireland’, ‘initi’, ‘new’, ‘phase’, ‘cooper’, ‘irish’, ‘british’, ‘govern’, ‘ireland’, ‘neutral’, ‘world’, ‘war’, ‘ii’, ‘contin’, ‘polici’, ‘militari’, ‘neutral’, ‘ireland’, ‘join’, ‘european’, ‘commun’, ‘eurozon’, ‘currenc’, ‘union’, ‘econom’, ‘boom’, ‘year’, ‘celtic’, ‘tiger’, ‘saw’, ‘rapid’, ‘econom’, ‘growth’, ‘came’, ‘abrupt’, ‘end’, ‘meltdown’, ‘irish’, ‘bank’, ‘system’, ‘today’, ‘economi’, ‘recov’, ‘fuel’, ‘larg’, ‘grow’, ‘foreign’, ‘direct’, ‘invest’, ‘especi’, ‘us’, ‘multin’
# Create a dict with the description and the country continent : country name
sentence_lookup = {}
for i in range(len(data)):
    continent_country = data[i][1] + ' : ' + data[i][2]
    sentence_lookup[data[i][3]] = continent_country

# Function to create clusters using K-Means from the country descriptions
def cluster_sentences(sentences, nb_of_clusters=7):
    tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize,
                                       stop_words=stopwords.words('english'),
                                       max_df=0.9,
                                       min_df=0.1,
                                       lowercase=True)
    # builds a tf-idf matrix for the sentences
    tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)
    kmeans = KMeans(n_clusters=nb_of_clusters)
    kmeans.fit(tfidf_matrix)
    # group the row indexes of the descriptions by their cluster label
    clusters = collections.defaultdict(list)
    for i, label in enumerate(kmeans.labels_):
        clusters[label].append(i)
    return dict(clusters)

# Calculate and print out the clusters and associated country information
nclusters = 10
clusters = cluster_sentences(full_sentences, nclusters)
for cluster in range(nclusters):
    print "Cluster ", cluster, ":"
    for i, sentence in enumerate(clusters[cluster]):
        print "\tCountry ", i, ": ", sentence_lookup.get([full_sentences[sentence]][0])
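The min_df=0.1 and max_df=0.9 arguments passed to TfidfVectorizer drop tokens based on how many documents they appear in. To make that idea concrete, here is a minimal pure-Python sketch of document-frequency filtering. It is written in Python 3 syntax (unlike the Python 2 listing above), it is not scikit-learn's implementation, and the function name, the toy documents, and the thresholds in the example call are all hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(token_lists, min_df=0.1, max_df=0.9):
    """Keep tokens whose document frequency lies between min_df and max_df.

    token_lists: one list of tokens per document.
    min_df / max_df: proportions of the total number of documents.
    This is an illustrative simplification, not scikit-learn's code.
    """
    n_docs = len(token_lists)
    # Count each token once per document, however often it repeats there.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    kept = {token for token, df in doc_freq.items()
            if min_df * n_docs <= df <= max_df * n_docs}
    return [[t for t in tokens if t in kept] for tokens in token_lists]


# Toy data (hypothetical): 'centuri' is in every document, 'tiger' in only one.
docs = [
    ['centuri', 'island', 'tiger'],
    ['centuri', 'island', 'war'],
    ['centuri', 'war', 'independ'],
    ['centuri', 'independ', 'island'],
]
filtered = filter_by_document_frequency(docs, min_df=0.5, max_df=0.9)
print(filtered)
```

With these toy thresholds, 'centuri' is dropped for being too common and 'tiger' for being too rare, which is the same trade-off the two df settings make across the 242 descriptions.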
Here is a list of the clusters and their associated countries, listed as region : countryname. Note that I used a cluster count of 10 (nclusters), which might seem small given the number of countries in the analysis (242), but I wanted to match the number of regions the countries were segmented into geographically. Also note that the order of the countries within a cluster is not relevant.
Cluster 0 :
    Country 0 : south-america : Guyana
    Country 1 : south-america : Peru
    Country 2 : south-america : Bolivia
    Country 3 : europe : Belarus
    Country 4 : central-asia : Kyrgyzstan
    Country 5 : central-asia : Turkmenistan
    Country 6 : east-n-southeast-asia : Philippines
    Country 7 : africa : Botswana
    Country 8 : africa : Seychelles
    Country 9 : africa : Niger
    Country 10 : africa : South Africa
    Country 11 : africa : Congo (Brazzaville)
    Country 12 : africa : Guinea-Bissau
    Country 13 : africa : Burundi
    Country 14 : africa : Equatorial Guinea
    Country 15 : africa : Djibouti
    Country 16 : africa : Egypt
    Country 17 : africa : Malawi
    Country 18 : africa : Togo
    Country 19 : africa : Central African Republic
    Country 20 : africa : Mauritania
    Country 21 : africa : Guinea
    Country 22 : africa : The Gambia
    Country 23 : africa : Burkina Faso
    Country 24 : africa : Ghana
    Country 25 : africa : Mali
    Country 26 : africa : Zambia
    Country 27 : africa : Gabon
    Country 28 : africa : Madagascar
    Country 29 : africa : Comoros
    Country 30 : central-america-n-caribbean : Nicaragua
    Country 31 : central-america-n-caribbean : The Dominican Republic
    Country 32 : south-asia : Maldives
Cluster 1 :
    Country 0 : europe : Guernsey
    Country 1 : europe : France
    Country 2 : europe : Jersey
    Country 3 : europe : Isle of Man
    Country 4 : central-america-n-caribbean : Saint Lucia
    Country 5 : central-america-n-caribbean : Saint Vincent and the Grenadines
    Country 6 : australia-oceania : New Caledonia
    Country 7 : australia-oceania : French Polynesia
    Country 8 : north-america : Clipperton Island
    Country 9 : north-america : Saint Pierre and Miquelon
Cluster 2 :
    Country 0 : south-america : Venezuela
    Country 1 : south-america : Uruguay
    Country 2 : south-america : Colombia
    Country 3 : south-america : Chile
    Country 4 : south-america : Suriname
    Country 5 : south-america : Brazil
    Country 6 : south-america : Argentina
    Country 7 : europe : Ukraine
    Country 8 : europe : Macedonia
    Country 9 : europe : Bosnia and Herzegovina
    Country 10 : central-asia : Tajikistan
    Country 11 : east-n-southeast-asia : South Korea
    Country 12 : east-n-southeast-asia : North Korea
    Country 13 : east-n-southeast-asia : Timor-Leste
    Country 14 : east-n-southeast-asia : Vietnam
    Country 15 : east-n-southeast-asia : Macau
    Country 16 : east-n-southeast-asia : Japan
    Country 17 : east-n-southeast-asia : Hong Kong
    Country 18 : africa : Liberia
    Country 19 : africa : Cote d'Ivoire
    Country 20 : africa : Sudan
    Country 21 : africa : DRC
    Country 22 : africa : Sierra Leone
    Country 23 : africa : Libya
    Country 24 : africa : Eritrea
    Country 25 : africa : South Sudan
    Country 26 : africa : Rwanda
    Country 27 : central-america-n-caribbean : Honduras
    Country 28 : central-america-n-caribbean : Guatemala
    Country 29 : central-america-n-caribbean : El Salvador
    Country 30 : australia-oceania : Solomon Islands
    Country 31 : middle-east : Syria
    Country 32 : middle-east : Bahrain
    Country 33 : middle-east : Saudi Arabia
    Country 34 : middle-east : Turkey
    Country 35 : middle-east : Israel
    Country 36 : middle-east : United Arab Emirates
    Country 37 : middle-east : Jordan
    Country 38 : middle-east : Yemen
    Country 39 : middle-east : Iran
    Country 40 : middle-east : Qatar
    Country 41 : middle-east : Lebanon
    Country 42 : middle-east : Oman
    Country 43 : south-asia : Afghanistan
    Country 44 : south-asia : Bhutan
Cluster 3 :
    Country 0 : europe : Faroe Islands
    Country 1 : europe : Svalbard
    Country 2 : europe : Iceland
    Country 3 : europe : Monaco
    Country 4 : europe : Ireland
    Country 5 : east-n-southeast-asia : China
    Country 6 : east-n-southeast-asia : Laos
    Country 7 : east-n-southeast-asia : Brunei
    Country 8 : africa : Cabo Verde
    Country 9 : africa : Mauritius
    Country 10 : central-america-n-caribbean : Anguilla
    Country 11 : central-america-n-caribbean : Barbados
    Country 12 : central-america-n-caribbean : Costa Rica
    Country 13 : central-america-n-caribbean : Belize
    Country 14 : central-america-n-caribbean : Grenada
    Country 15 : central-america-n-caribbean : Trinidad and Tobago
    Country 16 : central-america-n-caribbean : Haiti
    Country 17 : central-america-n-caribbean : Virgin Islands
    Country 18 : central-america-n-caribbean : Aruba
    Country 19 : central-america-n-caribbean : Jamaica
    Country 20 : australia-oceania : American Samoa
    Country 21 : australia-oceania : Australia
    Country 22 : australia-oceania : Vanuatu
    Country 23 : south-asia : India
    Country 24 : south-asia : Sri Lanka
    Country 25 : north-america : Greenland
    Country 26 : north-america : Mexico
Cluster 4 :
    Country 0 : europe : Dhekelia
    Country 1 : europe : Akrotiri
    Country 2 : europe : Cyprus
    Country 3 : europe : Malta
    Country 4 : europe : Gibraltar
    Country 5 : east-n-southeast-asia : Papua New Guinea
    Country 6 : australia-oceania : New Zealand
    Country 7 : australia-oceania : Cook Islands
Cluster 5 :
    Country 0 : south-america : Falkland Islands (Islas Malvinas)
    Country 1 : south-america : South Georgia and South Sandwich Islands
    Country 2 : europe : Jan Mayen
    Country 3 : africa : Saint Helena, Ascension, and Tristan da Cunha
    Country 4 : central-america-n-caribbean : Turks and Caicos Islands
    Country 5 : central-america-n-caribbean : Saint Kitts and Nevis
    Country 6 : central-america-n-caribbean : Antigua and Barbuda
    Country 7 : central-america-n-caribbean : Cayman Islands
    Country 8 : central-america-n-caribbean : Dominica
    Country 9 : central-america-n-caribbean : Montserrat
    Country 10 : central-america-n-caribbean : Saint Martin
    Country 11 : central-america-n-caribbean : Saint Barthelemy
    Country 12 : australia-oceania : Coral Sea Islands
    Country 13 : australia-oceania : Pitcairn Islands
    Country 14 : australia-oceania : Tonga
    Country 15 : australia-oceania : Norfolk Island
    Country 16 : australia-oceania : Tuvalu
    Country 17 : australia-oceania : Christmas Island
    Country 18 : australia-oceania : Tokelau
    Country 19 : australia-oceania : Niue
    Country 20 : australia-oceania : Cocos (Keeling) Islands
    Country 21 : australia-oceania : Ashmore and Cartier Islands
    Country 22 : australia-oceania : Wallis and Futuna
    Country 23 : australia-oceania : Samoa
    Country 24 : south-asia : British Indian Ocean Territory
    Country 25 : north-america : Bermuda
Cluster 6 :
    Country 0 : central-america-n-caribbean : Navassa Island
    Country 1 : central-america-n-caribbean : Panama
    Country 2 : central-america-n-caribbean : The Bahamas
    Country 3 : central-america-n-caribbean : Puerto Rico
    Country 4 : central-america-n-caribbean : British Virgin Islands
    Country 5 : central-america-n-caribbean : Cuba
    Country 6 : australia-oceania : Northern Mariana Islands
    Country 7 : australia-oceania : Guam
    Country 8 : australia-oceania : Marshall Islands
    Country 9 : australia-oceania : Palau
    Country 10 : australia-oceania : Baker Island; Howland Island; Jarvis Island; Johnston Atoll; Kingman Reef; Midway Islands; Palmyra Atoll
    Country 11 : australia-oceania : Wake Island
    Country 12 : australia-oceania : Federated States of Micronesia
    Country 13 : australia-oceania : Kiribati
Cluster 7 :
    Country 0 : south-america : Paraguay
    Country 1 : south-america : Ecuador
    Country 2 : europe : Moldova
    Country 3 : europe : Andorra
    Country 4 : east-n-southeast-asia : Taiwan
    Country 5 : east-n-southeast-asia : Burma
    Country 6 : east-n-southeast-asia : Cambodia
    Country 7 : east-n-southeast-asia : Mongolia
    Country 8 : east-n-southeast-asia : Indonesia
    Country 9 : east-n-southeast-asia : Thailand
    Country 10 : africa : Algeria
    Country 11 : africa : Tunisia
    Country 12 : africa : Uganda
    Country 13 : africa : Cameroon
    Country 14 : africa : Morocco
    Country 15 : africa : Nigeria
    Country 16 : africa : Benin
    Country 17 : africa : Somalia
    Country 18 : africa : Senegal
    Country 19 : africa : Ethiopia
    Country 20 : africa : Chad
    Country 21 : africa : Zimbabwe
    Country 22 : africa : Tanzania
    Country 23 : africa : Namibia
    Country 24 : africa : Mozambique
    Country 25 : africa : Sao Tome and Principe
    Country 26 : africa : Lesotho
    Country 27 : africa : Angola
    Country 28 : africa : Swaziland
    Country 29 : australia-oceania : Fiji
    Country 30 : middle-east : Kuwait
    Country 31 : middle-east : Iraq
    Country 32 : middle-east : Georgia
    Country 33 : south-asia : Pakistan
    Country 34 : south-asia : Bangladesh
    Country 35 : south-asia : Nepal
Cluster 8 :
    Country 0 : europe : Montenegro
    Country 1 : europe : Holy See (Vatican City)
    Country 2 : europe : Belgium
    Country 3 : europe : Luxembourg
    Country 4 : europe : Croatia
    Country 5 : europe : Liechtenstein
    Country 6 : europe : Kosovo
    Country 7 : europe : Slovakia
    Country 8 : europe : Czechia
    Country 9 : europe : Albania
    Country 10 : europe : Netherlands
    Country 11 : europe : Switzerland
    Country 12 : europe : Bulgaria
    Country 13 : europe : Serbia
    Country 14 : europe : Slovenia
    Country 15 : europe : Hungary
    Country 16 : east-n-southeast-asia : Singapore
    Country 17 : east-n-southeast-asia : Malaysia
    Country 18 : australia-oceania : Nauru
Cluster 9 :
    Country 0 : europe : Greece
    Country 1 : europe : San Marino
    Country 2 : europe : Spain
    Country 3 : europe : Germany
    Country 4 : europe : Estonia
    Country 5 : europe : Sweden
    Country 6 : europe : Finland
    Country 7 : europe : United Kingdom
    Country 8 : europe : Lithuania
    Country 9 : europe : Latvia
    Country 10 : europe : Romania
    Country 11 : europe : Austria
    Country 12 : europe : Portugal
    Country 13 : europe : Norway
    Country 14 : europe : Italy
    Country 15 : europe : Denmark
    Country 16 : europe : Poland
    Country 17 : central-asia : Russia
    Country 18 : central-asia : Uzbekistan
    Country 19 : central-asia : Kazakhstan
    Country 20 : middle-east : Azerbaijan
    Country 21 : middle-east : Armenia
    Country 22 : north-america : Canada
    Country 23 : north-america : United States
Lengthy list! So, what can a quick scan of the cluster results tell us? There are some clusters with a dominant region (defined as holding more than 2/3 of the cluster's records): Cluster 0 (africa) and Clusters 8 and 9 (europe). Cluster 4 comes close, with europe holding 5 of its 8 records. The region with the greatest cohesion is middle-east, which appears in only three clusters (2, 7, 9). The cluster with the least diversity in terms of regions represented is Cluster 6, with only two regions (central-america-n-caribbean, australia-oceania). Clusters 1, 4, 5 and 6 are dominated by island countries/territories spread throughout the globe.
So, generally speaking, clustering the country/territory descriptions did not reproduce the geographic regions the countries belong to. Possible enhancements to the analysis include:
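The "dominant region" check does not have to be done by eye. The following is a minimal sketch, in Python 3 syntax, of how the share of the largest region in a cluster could be computed. The function name dominant_region is hypothetical, and the region counts for Clusters 8 and 6 were tallied by hand from the listing above rather than produced by the notebook.

```python
from collections import Counter


def dominant_region(regions, threshold=2.0 / 3.0):
    """Return (region, share, is_dominant) for one cluster's region labels."""
    counts = Counter(regions)
    region, count = counts.most_common(1)[0]
    share = float(count) / len(regions)
    return region, share, share > threshold


# Region labels tallied by hand from the Cluster 8 and Cluster 6 listings.
cluster_8 = ['europe'] * 16 + ['east-n-southeast-asia'] * 2 + ['australia-oceania']
cluster_6 = ['central-america-n-caribbean'] * 6 + ['australia-oceania'] * 8

for name, regions in [('Cluster 8', cluster_8), ('Cluster 6', cluster_6)]:
    region, share, is_dominant = dominant_region(regions)
    print('%s: %s holds %.0f%% of records, dominant=%s'
          % (name, region, share * 100, is_dominant))
```

By this measure Cluster 8 is dominated by europe, while Cluster 6, despite containing only two regions, is not dominated by either of them.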
a) remove duplicate entries from the tokens
b) keep the free-standing years in the tokens, but truncate them by dropping the last digit, so that years in the same decade would match (e.g. 1916 would become 191).
c) remove any token with a length of less than three.
d) remove the name of the country from its own token list.
e) alter the settings for the vectorizer and the K-Means clustering algorithm. The TfidfVectorizer class has many other parameters beyond what I have used here. The “df” parameters I used relate to document frequency: they remove tokens that occur in fewer than 0.1 (min_df) or more than 0.9 (max_df) of the documents.
f) remove the names of other countries from the token list. I think this is not a good idea, because the history of certain countries is closely linked to other countries, and it is important to preserve this. An example of this behavior is Cluster 1, which is made up almost entirely of France and countries/territories that were once, or still are, considered part of that country. It is by no means an exhaustive list of countries that have ties to France.
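As a rough illustration of enhancements (a) through (d), here is a minimal sketch in Python 3 syntax. I have not run it against the Factbook data. The function name refine_tokens and its parameters are hypothetical, and it assumes a tokenizer that keeps digits, which the tokenize function used in this analysis does not (it strips them).

```python
import re


def refine_tokens(tokens, country_tokens=(), min_length=3):
    """Apply enhancements (a)-(d) to one country's token list."""
    own_name = set(country_tokens)
    seen = set()
    refined = []
    for token in tokens:
        # (b) Truncate four-digit years to a decade prefix: '1916' -> '191'.
        if re.fullmatch(r'\d{4}', token):
            token = token[:3]
        # (c) Drop tokens shorter than min_length characters.
        if len(token) < min_length:
            continue
        # (d) Drop the tokens that make up the country's own name.
        if token in own_name:
            continue
        # (a) Drop duplicates, keeping the first occurrence.
        if token in seen:
            continue
        seen.add(token)
        refined.append(token)
    return refined


# Hypothetical token list, loosely based on the Ireland description.
tokens = ['ireland', 'irish', 'uk', '1916', 'easter',
          '1921', 'uk', 'irish', '1998', '1999']
print(refine_tokens(tokens, country_tokens=['ireland']))
```

One design caveat: removing duplicates as in (a) discards the term counts that TF-IDF relies on, so it changes the weighting and not just the vocabulary.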
(*) A major unknown in this analysis is: who are the authors of these descriptions? Are they all authored by one person? each region by one person? a collaboration of multiple authors? A bias may be introduced depending on the answers to these questions. I do not have the answers to these questions of authorship, and the CIA is not talking!