Jessica Zhang

Problem Statement

How can MBFS more easily understand its data?

With so many employees dependent on such an extensive database, it is hard to guarantee data integrity and disseminate knowledge of the underlying sourcing. Thus, both efficiency and accuracy are both compromised.

Current Issues:

Heavy dependency on "data masters" for explanations and sourcing
Not having an agreed upon definition of data between business users and the data department
Redundancy and duplication of columns across tables
Time searching for reliable information without easy self-service tool

Value Proposition

What would a self service tool to solve these problems look like?

Enterprise Data Catalog

EPM licensing
Enterprise money > 1 million + 100k per license
Heavy maintenance

SharePoint

Difficult to search
Not easily scalable
Cumbersome to update

Building it in-house

Inexpensive
Easy to search
Ownership of product
Capable of adding features/functionality
No licensing issues/cross team usage

Ideation

An effective, in house solution would be a data catalog application that a user can use to grab relevant information on specific subsets of data needed to identify trends and gain insight. We want it to be easy to use and efficient, querying through a huge volume of data quickly, but also accurate.

Wireframing

We decided on a simple search engine. It takes in user input, either free form or data specific, and returns 5-10 results that are most similar to the input. The user can then manually choose a result and get more detailed information on his or her search.

Backend Analysis

The first step was to tackle the underlying dataset the search engine would be querying for the information it would display. We went in and manually created a new column in the dataset called "Revised_Column_Name" which was basically an unabbreviated version of "COLNAME". This would make the later process of matching the user input to the information in the dataset easier. Although tedious, it ensured the matching would be as accurate as possible.

How it Works

The application depends on two main functions, one that gets the main idea of the user's search and another that scores how similar the main idea is to each entry in the dataset.

First, the search engine will take in free form text or exact column names. This is where the first function comes into play:
- Main Idea Function
  - Takes the user's input
  - Lowers everything and removes the punctuation
  - Splits it into separated words
  - Iterates through the words and gets rid of any also present in the stop-word list, a list of commonly used words (such as "the", "a", "an", "in") that we program the search engine to ignore.
  - Joins the remaining words together to form the main idea of the user's input
Next, the application will query Vertica, where the company dataset resides, to pull out the columns of the dataset the user is most likely looking for. This is where the second function comes into play:
- Scoring System Function
  - Iterates through the dataset and calculates a:
    - score for how similar the main idea is to the revised column name
    - score for how similar the main idea is to the definion
  - Uses a percentage of each score to calculate a composite score:
    - (0.7 * column name score) + (0.3 * definition score)
Lastly, the user will be allowed to manually choose which column they want more information on and the application will display that

Mercedes-Benz Internship

S.M.A.R.T data catalog application

Introduction