This past summer I worked as a data insights intern at Mercedes-Benz Financial Services, tackling a problem that any company with data struggles with: data governance.
With so many employees depending on such an extensive database, it is hard to guarantee data integrity and to spread knowledge of where the data comes from. As a result, both efficiency and accuracy are compromised.
An effective in-house solution would be a data catalog application that lets a user look up relevant information on the specific subsets of data needed to identify trends and gain insight. We wanted it to be easy to use and efficient, searching a huge volume of data quickly while still returning accurate results.
We decided on a simple search engine. It takes user input, either free-form or data-specific, and returns the 5 to 10 results most similar to that input. The user can then manually choose a result to get more detailed information on it.
The first step was to prepare the underlying dataset that the search engine would query for the information it displays. We manually created a new column in the dataset called "Revised_Column_Name", which is essentially an unabbreviated version of "COLNAME". This makes the later step of matching user input to the information in the dataset easier. Although tedious, doing it by hand meant the matching would be as accurate as possible.
The application depends on two main functions: one that extracts the main idea of the user's search, and another that scores how similar that main idea is to each entry in the dataset.
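To make that two-function split concrete, here is a minimal sketch in Python. It is not the code we shipped, only an illustration under stated assumptions: the catalog rows are invented, the function names (extract_main_idea, similarity, search) are hypothetical, the "main idea" is approximated by dropping common stopwords, and similarity is a plain Jaccard overlap of word sets against "Revised_Column_Name".

```python
import re

# Hypothetical catalog rows. The keys "COLNAME" and "Revised_Column_Name"
# follow the column names described above; the values are invented.
CATALOG = [
    {"COLNAME": "CUST_ACCT_NBR", "Revised_Column_Name": "customer account number"},
    {"COLNAME": "CNTRCT_STRT_DT", "Revised_Column_Name": "contract start date"},
    {"COLNAME": "VEH_MDL_YR", "Revised_Column_Name": "vehicle model year"},
    {"COLNAME": "PMT_DUE_DT", "Revised_Column_Name": "payment due date"},
]

# Assumption: a small hand-picked stopword list stands in for "main idea" extraction.
STOPWORDS = {"a", "an", "the", "of", "for", "is", "what", "when", "where",
             "show", "me", "to", "in", "on", "does"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_main_idea(query):
    """Function 1: reduce the user's search to a set of keywords."""
    return {token for token in tokenize(query) if token not in STOPWORDS}


def similarity(keywords, entry_name):
    """Function 2: Jaccard overlap between the keywords and one entry's words."""
    entry_tokens = set(tokenize(entry_name))
    union = keywords | entry_tokens
    if not union:
        return 0.0
    return len(keywords & entry_tokens) / len(union)


def search(query, catalog=CATALOG, top_n=10):
    """Score every entry and return up to top_n matching rows, best first."""
    keywords = extract_main_idea(query)
    scored = [(similarity(keywords, row["Revised_Column_Name"]), row)
              for row in catalog]
    scored = [(score, row) for score, row in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:top_n]]


if __name__ == "__main__":
    for row in search("When does the contract start?"):
        print(row["COLNAME"], "->", row["Revised_Column_Name"])
```

Scoring against the unabbreviated column is what makes a query like "contract start" match at all; the same query shares no tokens with an abbreviated name such as the made-up "CNTRCT_STRT_DT".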