A simple service return top used words in a given text and how many times they occur
Input: a text of string contains many word without punctuation and number n represent top n words occur in text
Output: return a list of key-value with represent for top n most used words and the number of time they occur in text, if there is more than one word with same number occur then choose any of them.
Example:
ex1
input: 2, "the sky is the great black sky for the camera"
output: {"the" : 3, "sky" : 2}
ex2
input: 2, "no one no why"
output: {"no" : 2, "why" : 1} or {"no" : 2, "one" : 1}
To install project, you need to install Go and set your Go workspace first, then download repository or clone.
git clone https://github.com/tangducthinh456/most-used-word.git
First go to root directory of project
cd most-used-word
To start service, use go command
go run src/main.go
To start unit test, use go test command
go test -v ./src/test/unit_test/
To start integration test, use go test command
go test -v ./src/test/integration_test/
└───src
├───config
├───model
├───server
│ ├───handle
│ └───router
├───services
└───test
├───integration_test
└───unit_test
Config package used to configure all configuration of service, include server config (IP, Port, HTTP/HTTPS), Database config (IP, Port, Username, Password, Table) if any, etc ... By default, config package will get default config from file config.yaml in root directory of project. Config is implement in Singleton pattern because config object of package is initial at once while program run, and it has private constructor and a public method GetServerConfig() to access from outside package.
Package model defines model of service used, include data transfer object between server and client, model to save to database if any.
Server folder will contain 2 packages handle and router.
Handle package define all handler used in router.
Router contains object route that is implement in singleton pattern because it also initial once.
Services package provide all bussiness of service and method that other package call. If service connect to a database, services package also have responsibility for connection and all method CRUD on it. On this service, the business is function that get text and return top used word.
Test folder contains 2 package unit_test and integration_test.
Unit_test is used to test unit of project which is business function, ...
Integration_test is used to test integration of package and final result.
The main purpose of this service is to get input as a string together with a number of top words want to know be used in string and return a list of key-value pairs with key is words and value is the number it occurs in string.
To solve this problem, we can have a very simple approach that is loop for each word that contains in the string and then do another loop inside to count how many times it occur. There will be 2 nested loop so the time complexity is O(n^2) and we have to save the state of words counted to an array of key-value pair so the space complexity is O(n), after that we have to do another loop to get top used words. This approach costs a lot of time complexity, which is unacceptable in real massive scale system.
Another approach that is more efficient is use data struct map with key is the word in string and value is the number it occur. First we iterate to all words in string and for each word, the initial value of map is 0, when we meet a word we add it to our map and increase the value corresponding to the word we meet which is the key to represent it's occurent in string. Finally we we will have a map with word key and occurent value. The time complexity to this step is O(n) because we just have to loop one time. But map in Go is not implement as order map, which means the largest value is not at first, so you can implement an order map which will save the state of key-value in head of map in value order, but the implementation of order map is not easy. The problem now is find n top largest value in the map, which will be easy solved by sorting then choose top element from it. But Go does not allow to sort map, so we have to implement a struct that map the key and value from map, and use sort function in Go built-in. The sort function in Go will be implement efficiently as the quicksort with average time complexity in O(n.logn). So the overall time complexity of this approach is O(nlogn) and space complexity is O(2n) ~ O(n).
Detail of the code is implement in function TopUsedWord in package services.
With unit test, for simple I just use blackbox test, I divide all posibility in 2 range which is 0 < top < 5 and top >= 5. With this division, I use robust boundary testing techic, which will test the value before, in and after point of range. There are 2 points 0 and 5 so all test are in -1, 0, 1 and 4, 5, 6. I also test in some special case like if we need top 2 words but the case is {"a" : 3, "b" : 2, "c" : 2, "d" : 2}, it means there are many seconds top word that equals.
With integration test, I test with 2 case. Case success return value and check the value and status code, case fail I check the status code.
Thanks for reading