Ecuadorian Linguistic Corpus

GitHub Repository

Backend structure

To design the structure of a corpus, the first thing to think about is the database. Since a corpus has to manage thousands of words, lemmas, forms and documents, it needs a robust database manager, so I chose Postgres. I also chose Django because it lets you build a large website out of modular parts, which means the site can grow without one module conflicting with another. It currently has four apps: one for the Spanish corpus, one for user management, one for contact and one for managing the site's statistics. More apps that handle other languages are expected to be added soon.

The database has two main tables (or models, as Django calls them): one for documents and one for forms. Each of them has auxiliary tables for standard information, such as the grammatical category or the document type. Forms can be queried with filters applied to the documents they come from.
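As a rough illustration of that layout, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Postgres and Django models. Every table name, column name and sample row below is hypothetical and made up for the example; the real models live in the repository.

```python
import sqlite3

# In-memory database used only for illustration (the project itself uses Postgres).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document_type (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE grammar_category (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE document (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    type_id INTEGER NOT NULL REFERENCES document_type(id)
);
CREATE TABLE form (
    id INTEGER PRIMARY KEY,
    surface TEXT NOT NULL,
    lemma TEXT NOT NULL,
    grammar_id INTEGER NOT NULL REFERENCES grammar_category(id),
    document_id INTEGER NOT NULL REFERENCES document(id)
);
""")

# Invented sample rows, just enough to run a query.
conn.executemany("INSERT INTO document_type VALUES (?, ?)", [(1, "news"), (2, "novel")])
conn.executemany("INSERT INTO grammar_category VALUES (?, ?)", [(1, "NOUN"), (2, "VERB")])
conn.executemany("INSERT INTO document VALUES (?, ?, ?)", [(1, "Doc A", 1), (2, "Doc B", 2)])
conn.executemany(
    "INSERT INTO form (surface, lemma, grammar_id, document_id) VALUES (?, ?, ?, ?)",
    [("casa", "casa", 1, 1), ("casa", "casar", 2, 2), ("canta", "cantar", 2, 1)],
)


def find_forms(conn, surface, document_type=None):
    """Return occurrences of a form, optionally filtered by document type."""
    sql = """
        SELECT f.surface, f.lemma, g.name, d.title
        FROM form AS f
        JOIN grammar_category AS g ON g.id = f.grammar_id
        JOIN document AS d ON d.id = f.document_id
        JOIN document_type AS t ON t.id = d.type_id
        WHERE f.surface = ?
    """
    params = [surface]
    if document_type is not None:
        # The document filter is applied on the auxiliary table.
        sql += " AND t.name = ?"
        params.append(document_type)
    sql += " ORDER BY d.title"
    return conn.execute(sql, params).fetchall()
```

Calling `find_forms(conn, "casa")` returns every occurrence of the form, while `find_forms(conn, "casa", "novel")` keeps only the occurrences found in documents of that type.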

spaCy as a backend tool

Adding thousands of forms can be really complicated, because a human would have to determine the grammar of each word. With spaCy the work becomes much easier: it can tokenize the text and perform POS tagging, in other words, determine the part of speech of each word. Humans then only have to check the result. The accuracy of spaCy for Spanish is above 98%.
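A minimal sketch of this tagging step is shown below. It assumes that spaCy and the Spanish model `es_core_news_sm` are installed; the helper names `load_spanish_pipeline` and `tag_forms` are hypothetical and are not taken from the project.

```python
def load_spanish_pipeline(model_name="es_core_news_sm"):
    """Load a Spanish spaCy pipeline (assumes the model has been downloaded)."""
    import spacy  # imported lazily so tag_forms can be used without spaCy installed
    return spacy.load(model_name)


def tag_forms(nlp, text):
    """Return (form, lemma, part of speech) for each word, skipping punctuation and spaces."""
    return [
        (token.text, token.lemma_, token.pos_)
        for token in nlp(text)
        if not (token.is_punct or token.is_space)
    ]


# Usage, assuming spaCy and the model are installed:
#   nlp = load_spanish_pipeline()
#   for form, lemma, pos in tag_forms(nlp, "Los niños cantan en la escuela."):
#       print(form, lemma, pos)
```

The triples produced this way are what a human reviewer would then check before the forms are stored.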

All the documents are therefore processed and tokenized with spaCy. The queries also use spaCy to find the 10 words before and after each queried form.
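The context lookup can be sketched as follows. This is only an illustration: it works on a plain list of tokens (a whitespace split stands in for spaCy's tokenizer here), and the function name `contexts` and its `window` parameter are hypothetical.

```python
def contexts(tokens, target, window=10):
    """Return (words before, match, words after) for each occurrence of target."""
    target = target.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            before = tokens[max(0, i - window):i]   # up to `window` words to the left
            after = tokens[i + 1:i + 1 + window]    # up to `window` words to the right
            hits.append((before, token, after))
    return hits


# Whitespace splitting is a stand-in for real tokenization.
tokens = "la casa de mi abuela es la casa más antigua del barrio".split()
for before, match, after in contexts(tokens, "casa", window=2):
    print(" ".join(before), f"[{match}]", " ".join(after))
```

Near the start or end of a document the window is simply shorter, since slicing never runs past the boundaries of the token list.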

Statistics

When you query a form, it is useful to know which words appear before and after it, as well as the most representative lemmas, topics and zones where the form has been used. All of this information is displayed with the query results. It is collected and processed with spaCy, Pandas and Matplotlib.
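To give an idea of the kind of summary involved, here is a small sketch that counts the most frequent neighbouring words, lemmas, topics and zones for a set of occurrences. It uses `collections.Counter` from the standard library as a stand-in for the Pandas processing, and the field names and sample records are invented for the example.

```python
from collections import Counter

# Hypothetical fields describing one occurrence of a queried form.
FIELDS = ("previous", "next", "lemma", "topic", "zone")


def summarize(occurrences, top=3):
    """Return the `top` most frequent values of each field across all occurrences."""
    return {
        field: Counter(o[field] for o in occurrences).most_common(top)
        for field in FIELDS
    }


# Invented sample records, not real corpus data.
occurrences = [
    {"previous": "la", "next": "de", "lemma": "casa", "topic": "family", "zone": "Sierra"},
    {"previous": "la", "next": "más", "lemma": "casa", "topic": "family", "zone": "Costa"},
    {"previous": "se", "next": "con", "lemma": "casar", "topic": "news", "zone": "Sierra"},
]

for field, counts in summarize(occurrences, top=1).items():
    print(field, counts)
```

Counts like these are the sort of table that could then be handed to a plotting library to draw the charts shown with a query.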

The main statistics of the site are also computed when you open the “actual status” tab. There you can see how many documents, lemmas, forms, topics and zones are in the database.

Bootstrap

To make the web page responsive I used Bootstrap. For the dynamic forms I used AJAX, so the form fields change depending on what you choose.

Project link: https://github.com/dfmoscoso23/corpus
