Justin Zhang

Noam - Word List Generator

Avram Noam Chomsky

Noam was a project created for the purpose of learning about web scraping, cleaning datasets, and handling/manipulating tabular data.

Who is Noam Chomsky?

Noam Chomsky is an American linguist responsible for the creation of "the theory of generative grammar".

This is one of the most important contributions made to the field of linguistics in the 20th century, and it has earned him the title of the father of modern linguistics.

Chomsky pioneered many theories in psychology's cognitive revolution.



He rejected the idea that newborn babies were born with blank minds and that children acquired language via learning and mimicry.

Chomsky argued that human beings are born with the innate ability to realize the "generative grammars" (i.e., the structures of languages) that constitute every human language. Children make use of this innate ability to learn the languages that they are exposed to.

His influence has equipped many researchers with ways of thinking and ways to innovate within the fields of cognitive science, neuroscience, and artificial intelligence.


Noam Chomsky | Biography

Famed scholar Noam Chomsky is known for both his groundbreaking contributions to linguistics and his penetrating critiques of political systems.



What is this project about?

Noam was a project created for the purpose of learning about web scraping, cleaning datasets, and handling/manipulating tabular data. Furthermore, my brother used to participate in spelling bee competitions and often had trouble finding clean wordlists to practice with. That is why I thought scraping words, definitions, etc. would be a great way to solve this problem. It would also be a perfect opportunity for me to practice dealing with big data.

Noam generates wordlists used for spelling bee purposes. It provides a way for spellers to test their spelling skills and improve their vocabulary. It also provides a way for teachers to create wordlists for their students to practice. Wordlists contain the following categories:

  • Words
  • Definitions
  • Pronunciations
  • Parts of Speech
  • Sentences
  • Etymology

Source Code


Word Scrapper | Github Repository

Noam generates wordlists used for spelling bee purposes. It provides a way for spellers to test their spelling skills and improve their vocabulary. It also provides a way for teachers to create wordlists for their students to practice.

Scraping for Words

When I first started scraping for words along with their definitions, parts of speech, etc., I needed a list of words, so I used this GitHub repo by dwyl as a starting point from which to scrape.


English Words Dataset | Github Repository

A text file containing 479k English words for all your dictionary/word-based projects


I used the Python library beautifulsoup4 to help with statically scraping Merriam-Webster. Initially, I didn't have any issues, but it wasn't long before I ran into my first problem. A lot of the words you scrape are missing many of the categories you want.

  1. Many of the pronunciations were just incorrect.
  2. The example sentences weren't reader-friendly and were complicated to understand.
  3. The word you're looking for doesn't always give you the correct page (e.g., searching for 'quickly' will give you the page for 'quick').
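Problem 3 can at least be detected before bad data gets stored. Below is a minimal sketch, assuming the scraper has already pulled the headword out of the returned page. The function name `is_exact_match` and the sample pairs are hypothetical and not taken from the repository.

```python
def is_exact_match(requested: str, headword: str) -> bool:
    """Return True when the page's headword is the word that was requested."""
    # Compare case-insensitively and ignore surrounding whitespace.
    return requested.strip().casefold() == headword.strip().casefold()


# Hypothetical (requested word, headword found on the returned page) pairs.
lookups = [
    ("quickly", "quick"),
    ("Quick", "quick"),
    ("serendipity", "serendipity"),
]

# Collect the lookups whose page describes a different word than requested.
mismatches = [(req, head) for req, head in lookups if not is_exact_match(req, head)]
print(mismatches)
```

Flagged words could then be skipped or looked up on another source instead of silently inheriting the wrong page's data.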

As great as Merriam-Webster was, a lot of the information I needed just wasn't there. My next solution was to scrape from multiple sources. I ended up grabbing words from here, using this for etymologies, this for sentences, Merriam-Webster for pronunciation, and the remaining categories from here, then parsing everything into a JSON-formatted file (which you can find in the GitHub repo).
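Combining several sources into one JSON entry per word could look roughly like the sketch below. The field names, the `merge_sources` helper, and the sample values are all assumptions made for illustration. The actual scripts and JSON layout live in the repository and may differ.

```python
import json

# The six wordlist categories; these key names are assumptions.
FIELDS = ("word", "definition", "pronunciation", "part_of_speech", "sentence", "etymology")


def merge_sources(word, *sources):
    """Combine per-source dicts into one entry.

    Sources are checked in order, and the first non-empty value for a
    field wins. Fields that no source provides stay None.
    """
    entry = {field: None for field in FIELDS}
    entry["word"] = word
    for source in sources:
        for field in FIELDS[1:]:
            if entry[field] is None and source.get(field):
                entry[field] = source[field]
    return entry


# Placeholder data standing in for what two different sites returned.
dictionary_site = {"definition": "fast", "part_of_speech": "adjective", "pronunciation": ""}
pronunciation_site = {"pronunciation": "ˈkwik"}

entry = merge_sources("quick", dictionary_site, pronunciation_site)
print(json.dumps(entry, ensure_ascii=False, indent=2))
```

Ordering the sources by how much you trust them for a given category keeps the merge logic simple: an empty or missing value just falls through to the next source.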


Project Link

Here you can generate wordlists yourself. The form requires a wordlist length ranging from 1 to 600 words and an API key, provided below.

Noam Application

Noam | Word Scapper - Wordlist Creator

Noam creates wordlists for spelling bees, aiding spellers in skill improvement and vocabulary expansion, and assisting teachers in crafting practice materials for students.



What isn't perfect

I wasn't able to fix the issues related to problem number 3 for pronunciation, so some of the words in a generated list might come with the pronunciation of a variant (e.g., the word "quickly" will provide the pronunciation for "quick").

Around 40-50% of the words have an unknown etymology. My method of getting the etymology is to grab the first block of text at etymonline.com and filter it for a set of origin keywords.



Example page of origins scraped

If none of these keywords were found, the etymology would be labeled as unknown. I could have added more origins to lower this percentage, but I found that adding too many would give certain words too many origins. The issue was that some words were documented too well, while others had little to no documentation (e.g., just a couple of dates).
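The keyword filter described above can be sketched as follows. The keyword list here is hypothetical, since the actual list used in the project is not reproduced in this post, and the sample text is made up rather than copied from etymonline.com.

```python
import re

# Hypothetical origin keywords; the project's real list may differ.
ORIGIN_KEYWORDS = ("Latin", "Greek", "French", "Old English", "German")


def find_origins(first_block: str) -> list[str]:
    """Return every origin keyword found in the text, or ["unknown"]."""
    found = [
        keyword
        for keyword in ORIGIN_KEYWORDS
        # Word boundaries stop "German" from matching "Germanic".
        if re.search(rf"\b{re.escape(keyword)}\b", first_block)
    ]
    return found or ["unknown"]


# Made-up text standing in for the first block of an etymology page.
print(find_origins("c. 1300, from Old French, from Latin"))
print(find_origins("1590s, of uncertain origin"))
```

This also shows the trade-off mentioned above: every keyword added to the list lowers the share of unknown etymologies, but well-documented words start matching several origins at once.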

You can find the scripts I wrote to scrape these sites in the utils folder in the repository.

Issues with databases

I initially wanted to store each word in a key-value database like AWS DynamoDB, but I found it to be incredibly inefficient, since each generation required 1 to 600 random reads. It was also just too slow.

I ended up just storing the data as a JSON text file that is loaded into memory.
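With the whole dataset in memory, generating a wordlist becomes a single random sample instead of up to 600 separate database reads. Here is a minimal sketch, assuming the JSON file holds a list of entry objects. The `WORDS` stand-in and the `generate_wordlist` name are hypothetical, and the 1 to 600 limit mirrors the form on the project page.

```python
import json
import random

# Stand-in for the contents of the JSON file, parsed once at startup.
WORDS = json.loads(
    '[{"word": "alpha"}, {"word": "beta"}, {"word": "gamma"}, {"word": "delta"}]'
)


def generate_wordlist(length: int, max_length: int = 600) -> list[dict]:
    """Return `length` distinct random entries from the in-memory list."""
    if not 1 <= length <= max_length:
        raise ValueError(f"length must be between 1 and {max_length}")
    # random.sample never picks the same entry twice; cap the size so a
    # small dataset cannot raise an error.
    return random.sample(WORDS, min(length, len(WORDS)))


print([entry["word"] for entry in generate_wordlist(3)])
```

The cost is that the full dataset has to fit in the server's memory, which is reasonable for a word list of this kind but would not scale to arbitrarily large datasets.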

In hindsight, looking at all of the database paradigms, it might have been better to use a document database like MongoDB or Firebase. I didn't use a relational database because there was no need to relate any data at all, since each word along with all its categories is independent. It also didn't make sense to have a single table with each word as its own entry.

What is the meaning behind the project name?

I've written a post about the meaning behind my project names. You can read about it here.



Built with Next.js and Tailwind CSS. Made with ❤️ by Justin Zhang.