Searching For Anonymous With Natural Language Processing, Part II

A previous post discussed the speculation surrounding the identity of the anonymous author of a 2018 New York Times op-ed and recent book critical of President Trump. At the time, attention swirled around Guy M. Snodgrass, a former speechwriter for erstwhile Secretary of Defense James Mattis.

In the ensuing months, the spotlight swiveled away from the speechwriter, who lacked the access Anonymous appeared to enjoy, and onto former Deputy National Security Advisor Victoria Coates. Reports indicated that White House officials differed over whether Coates was Anonymous. She denied it, but ultimately transferred, or was transferred, from her National Security Council position in the White House complex to the Energy Department.

Suspicion of Coates reportedly hinged, among other reasons, on forensic linguistic analysis, although another report indicated that authorship attribution software acquired by the White House “was difficult to use and the effort failed.” It is unclear what software or analyses the White House tried.

As described below, natural language processing (“NLP”) analysis shows that Coates’ style resembles that of Anonymous noticeably less than does Snodgrass’.

By way of refresher, the earlier discussion mapped six stylistic traits and six parts of speech on a radar graph to facilitate comparison. The stylistic traits include average words per sentence, standard deviation of words per sentence, lexical diversity (the ratio of number of different words used to the total number of words used), commas per sentence, semicolons per sentence, and colons per sentence. The parts of speech include per-sentence frequencies of the following: singular noun, plural noun, singular proper noun, determiner, preposition or subordinating conjunction, and adjective.
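
As a rough illustration of how the six stylistic traits could be computed from plain text, here is a minimal sketch using only the Python standard library. The function name, the naive sentence splitter, the word pattern, and the choice of population (rather than sample) standard deviation are all assumptions made for this example; they are not necessarily what produced the figures in this post.

```python
import re
from statistics import mean, pstdev

WORD = r"[A-Za-z0-9']+"  # assumed definition of a "word"

def stylistic_traits(text):
    """Compute the six stylistic traits for a block of plain text."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words_per_sentence = [len(re.findall(WORD, s)) for s in sentences]
    all_words = [w.lower() for w in re.findall(WORD, text)]
    n = len(sentences)
    return {
        "words_per_sentence_mean": mean(words_per_sentence),
        # Population standard deviation; a sample SD is equally defensible.
        "words_per_sentence_sd": pstdev(words_per_sentence),
        # Distinct words divided by total words.
        "lexical_diversity": len(set(all_words)) / len(all_words),
        "commas_per_sentence": text.count(",") / n,
        "semicolons_per_sentence": text.count(";") / n,
        "colons_per_sentence": text.count(":") / n,
    }

sample = "The cat sat, quietly. The dog barked; the cat ran: fast."
traits = stylistic_traits(sample)
for name, value in traits.items():
    print(f"{name}: {value:.3f}")
```

A production analysis would likely swap in a proper sentence tokenizer, since abbreviations such as “Mr.” defeat the simple split used here.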

Coates, like Snodgrass, has written one attributed, published book, David’s Sling: A History of Democracy in Ten Works of Art, which I bought and had scanned to text using OCR. I haven’t read the books discussed here, other than a cursory check on the OCR quality.

I then compared the stylistic traits and part-of-speech characteristics of David’s Sling with those of Anonymous’ book, A Warning. As before, A Warning sets the baseline value for each trait, with the corresponding value for the other books expressed as a percentage of the value for A Warning. The results, depicted in the tables and charts below and aggregated in the radar graph atop this post, show that Coates’ writing differs from that of Anonymous considerably more than does that of Snodgrass:

Stylistic Trait	A Warning, by Anonymous	Holding the Line, by Guy M. Snodgrass	David’s Sling, by Victoria Coates	**Holding the Line as a percentage of A Warning**	**David’s Sling as a percentage of A Warning**
Words per sentence, mean	17.28	17.98	20.78	104	120
Words per sentence, standard deviation	12.58	12.49	15.64	99	124
Lexical diversity (number of different words used:total number of words used)	0.147	0.123	0.144	84	98
Commas per sentence	0.8	0.98	1.46	123	183
Semicolons per sentence	0.02	0.014	0.116	70	580
Colons per sentence	0.04	0.11	0.11	275	275
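
The percentage columns in the table above follow from dividing each book’s value by the A Warning baseline. The sketch below reproduces that arithmetic from the rounded values printed in the table; the variable and function names are chosen for this example only.

```python
# Values copied from the stylistic-trait table above.
A_WARNING = {
    "Words per sentence, mean": 17.28,
    "Words per sentence, standard deviation": 12.58,
    "Lexical diversity": 0.147,
    "Commas per sentence": 0.8,
    "Semicolons per sentence": 0.02,
    "Colons per sentence": 0.04,
}
OTHERS = {
    "Holding the Line": {
        "Words per sentence, mean": 17.98,
        "Words per sentence, standard deviation": 12.49,
        "Lexical diversity": 0.123,
        "Commas per sentence": 0.98,
        "Semicolons per sentence": 0.014,
        "Colons per sentence": 0.11,
    },
    "David's Sling": {
        "Words per sentence, mean": 20.78,
        "Words per sentence, standard deviation": 15.64,
        "Lexical diversity": 0.144,
        "Commas per sentence": 1.46,
        "Semicolons per sentence": 0.116,
        "Colons per sentence": 0.11,
    },
}

def percent_of_baseline(value, baseline):
    """Express value as a percentage of the baseline value."""
    return 100 * value / baseline

relative = {
    book: {t: percent_of_baseline(v, A_WARNING[t]) for t, v in row.items()}
    for book, row in OTHERS.items()
}

for book, row in relative.items():
    for trait, pct in row.items():
        print(f"{book:18} {trait:40} {pct:6.1f}")
```

Because the table values are themselves rounded, a recomputed percentage can land a point away from the published one; commas per sentence for Holding the Line, for instance, works out to 122.5 from the rounded inputs.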

Part of Speech	A Warning, by Anonymous	Holding the Line, by Guy M. Snodgrass	David’s Sling, by Victoria Coates	**Holding the Line as a percentage of A Warning**	**David’s Sling as a percentage of A Warning**
Noun, singular	4.04	4.63	5.03	115	125
Noun, plural	0.93	0.74	1.09	80	117
Proper noun, singular	0.06	0.07	0.05	117	83
Determiner	1.59	1.54	2.16	97	136
Preposition or subordinating conjunction	1.7	1.8	2.5	106	147
Adjective	1.8	1.9	2.2	106	122
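
For readers curious how per-sentence part-of-speech rates like those above might be derived, the following sketch counts tags in text that has already been tagged. It assumes Penn Treebank tag names (NN, NNS, NNP, DT, IN, JJ), which match the category labels in the table but are not confirmed as the tag set used here; the tagged sample sentences and the function name are invented for illustration, and the tagging step itself is left to whatever tagger is at hand.

```python
from collections import Counter

# Assumed mapping from Penn Treebank tags to the table's six categories.
TAGS = {
    "NN": "Noun, singular",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "DT": "Determiner",
    "IN": "Preposition or subordinating conjunction",
    "JJ": "Adjective",
}

def pos_per_sentence(tagged_sentences):
    """Return the average count per sentence for each tracked tag.

    tagged_sentences is a list of sentences, each a list of (word, tag) pairs.
    """
    counts = Counter(
        tag
        for sentence in tagged_sentences
        for _, tag in sentence
        if tag in TAGS
    )
    n = len(tagged_sentences)
    return {label: counts[tag] / n for tag, label in TAGS.items()}

# Hand-tagged toy input, standing in for the output of a real tagger.
tagged = [
    [("The", "DT"), ("old", "JJ"), ("dog", "NN"), ("slept", "VBD"),
     ("in", "IN"), ("Paris", "NNP"), (".", ".")],
    [("Cats", "NNS"), ("like", "VBP"), ("the", "DT"), ("sun", "NN"),
     (".", ".")],
]
rates = pos_per_sentence(tagged)
for label, rate in rates.items():
    print(f"{label}: {rate:.2f}")
```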

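For four of the six stylistic traits, Coates’ David’s Sling diverges from A Warning more than does Holding the Line. The exception is lexical diversity, where David’s Sling (98 percent of A Warning’s value) sits closer to the baseline than Holding the Line (84 percent). In the sixth category, colons per sentence, both books nearly triple A Warning’s rate. Among the parts of speech, Snodgrass matches Anonymous more closely than does Coates in four of the six categories; the two are equidistant from A Warning on singular proper nouns, and Coates is marginally closer on plural nouns. Yet it seems unlikely that Snodgrass is Anonymous given the likely limited White House access of a Pentagon speechwriter.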
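
One simple way to tally which book sits closer to A Warning on each trait is to compare how far each percentage lies from 100. The sketch below does that with the rounded percentages from the two tables; measuring divergence as absolute distance in percentage points is an assumption of this example, and other measures (a ratio, say) could rank close calls differently.

```python
from collections import Counter

# (Holding the Line, David's Sling) as a percentage of A Warning,
# copied from the two tables above.
PERCENTAGES = {
    "Words per sentence, mean": (104, 120),
    "Words per sentence, standard deviation": (99, 124),
    "Lexical diversity": (84, 98),
    "Commas per sentence": (123, 183),
    "Semicolons per sentence": (70, 580),
    "Colons per sentence": (275, 275),
    "Noun, singular": (115, 125),
    "Noun, plural": (80, 117),
    "Proper noun, singular": (117, 83),
    "Determiner": (97, 136),
    "Preposition or subordinating conjunction": (106, 147),
    "Adjective": (106, 122),
}

def closer_book(snodgrass_pct, coates_pct):
    """Name the book whose percentage is nearer to the 100 percent baseline."""
    snodgrass_gap = abs(snodgrass_pct - 100)
    coates_gap = abs(coates_pct - 100)
    if snodgrass_gap < coates_gap:
        return "Holding the Line"
    if coates_gap < snodgrass_gap:
        return "David's Sling"
    return "tie"

verdicts = {trait: closer_book(*pcts) for trait, pcts in PERCENTAGES.items()}
tally = Counter(verdicts.values())

for trait, verdict in verdicts.items():
    print(f"{trait:42} {verdict}")
print(dict(tally))
```

On the rounded table values, this tally puts Holding the Line closer on eight of the twelve measures and David’s Sling closer on two, with two ties.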

The radar graphs above and atop this post make the point visually. The closer the objects of comparison, the more nearly the graphs will overlap. It is easy to see that Coates’ book differs from A Warning more starkly, and across more traits, than does Holding the Line.

As with my earlier analysis, this exercise was designed to demonstrate some characteristics and applications of NLP, not to engage in a serious detection effort. As noted in my earlier post, a more comprehensive investigation could encompass numerous other factors, such as additional parts of speech, or finer-grained measurements taken per page and per chapter. In addition, potentially insuperable obstacles may hamper any authorship attribution analysis, whether or not it involves NLP, such as an intentional change of style to obscure authorship.
