Skip to content

tangducthinh456/most-used-word

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

most-used-word

A simple service return top used words in a given text and how many times they occur

Requirement

Input: a text of string contains many word without punctuation and number n represent top n words occur in text

Output: return a list of key-value with represent for top n most used words and the number of time they occur in text, if there is more than one word with same number occur then choose any of them.

Example:

ex1

input: 2, "the sky is the great black sky for the camera"

output: {"the" : 3, "sky" : 2}


ex2

input: 2, "no one no why"

output: {"no" : 2, "why" : 1} or {"no" : 2, "one" : 1}

Installation

To install project, you need to install Go and set your Go workspace first, then download repository or clone.

git clone https://github.com/tangducthinh456/most-used-word.git

How to run

First go to root directory of project

cd most-used-word

To start service, use go command

go run src/main.go

To start unit test, use go test command

go test -v ./src/test/unit_test/

To start integration test, use go test command

go test -v ./src/test/integration_test/

Projec structure

└───src
    ├───config
    ├───model
    ├───server
    │   ├───handle
    │   └───router
    ├───services
    └───test
        ├───integration_test
        └───unit_test

Config

Config package used to configure all configuration of service, include server config (IP, Port, HTTP/HTTPS), Database config (IP, Port, Username, Password, Table) if any, etc ... By default, config package will get default config from file config.yaml in root directory of project. Config is implement in Singleton pattern because config object of package is initial at once while program run, and it has private constructor and a public method GetServerConfig() to access from outside package.

Model

Package model defines model of service used, include data transfer object between server and client, model to save to database if any.

Server

Server folder will contain 2 packages handle and router.

Handle package define all handler used in router.

Router contains object route that is implement in singleton pattern because it also initial once.

Services

Services package provide all bussiness of service and method that other package call. If service connect to a database, services package also have responsibility for connection and all method CRUD on it. On this service, the business is function that get text and return top used word.

Test

Test folder contains 2 package unit_test and integration_test.

Unit_test is used to test unit of project which is business function, ...

Integration_test is used to test integration of package and final result.

Get detail

The main purpose of this service is to get input as a string together with a number of top words want to know be used in string and return a list of key-value pairs with key is words and value is the number it occurs in string.

To solve this problem, we can have a very simple approach that is loop for each word that contains in the string and then do another loop inside to count how many times it occur. There will be 2 nested loop so the time complexity is O(n^2) and we have to save the state of words counted to an array of key-value pair so the space complexity is O(n), after that we have to do another loop to get top used words. This approach costs a lot of time complexity, which is unacceptable in real massive scale system.

Another approach that is more efficient is use data struct map with key is the word in string and value is the number it occur. First we iterate to all words in string and for each word, the initial value of map is 0, when we meet a word we add it to our map and increase the value corresponding to the word we meet which is the key to represent it's occurent in string. Finally we we will have a map with word key and occurent value. The time complexity to this step is O(n) because we just have to loop one time. But map in Go is not implement as order map, which means the largest value is not at first, so you can implement an order map which will save the state of key-value in head of map in value order, but the implementation of order map is not easy. The problem now is find n top largest value in the map, which will be easy solved by sorting then choose top element from it. But Go does not allow to sort map, so we have to implement a struct that map the key and value from map, and use sort function in Go built-in. The sort function in Go will be implement efficiently as the quicksort with average time complexity in O(n.logn). So the overall time complexity of this approach is O(nlogn) and space complexity is O(2n) ~ O(n).

Detail of the code is implement in function TopUsedWord in package services.

Test

With unit test, for simple I just use blackbox test, I divide all posibility in 2 range which is 0 < top < 5 and top >= 5. With this division, I use robust boundary testing techic, which will test the value before, in and after point of range. There are 2 points 0 and 5 so all test are in -1, 0, 1 and 4, 5, 6. I also test in some special case like if we need top 2 words but the case is {"a" : 3, "b" : 2, "c" : 2, "d" : 2}, it means there are many seconds top word that equals.

With integration test, I test with 2 case. Case success return value and check the value and status code, case fail I check the status code.

Finally

Thanks for reading

About

A service return 10 most used words and how many times they occur

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages