I am a Data Scientist with over 7 years of experience applying analytics to uncover insights and drive business results. I have a strong background in product analytics, primarily with sales and marketing teams, and growing NLP experience.
I am skilled in data analysis and storytelling, advanced statistical methods, and deploying scalable solutions, with a proven track record of researching and implementing innovative solutions to business problems.
This site showcases my early projects: decision trees to discover the factors behind football game attendance, document classification with text mining, and Twitter sentiment analysis.
Other projects I've done include lead prioritization with logistic regression, sales forecasting with ARIMA time series modeling, model-year life cycle prediction using various time series methods, and marketing optimization using multiple regression. Download my resume below for more info!
Please feel free to email me at chrisgbergin@gmail.com. Thank you! Looking forward to connecting.
Goal:
Determine the factors that affect attendance of Georgia Southern students at their home football game vs. Appalachian State.
There are 20,550 instances, the total number of Georgia Southern University (GSU) students enrolled as of September 25, 2014 (the date of the GSU vs. Appalachian State game). The matchup is a historic rivalry, which makes it an interesting game to analyze. There were originally 43 attributes, which we reduced to the 18 displayed below.
Georgia Southern's Event Management System records when a student swipes their Eagle ID card to gain entry to a football game. Attendance data is collected for event statistics and NCAA reporting. This binary outcome becomes our FOOTBALL attribute that we used to classify students in the dataset.
Georgia Southern's Banner System is used to gather demographic data about students. The Banner data was joined to the Event Management System data on the student's Eagle ID number, which was replaced with a generic INSTANCE number for confidentiality. We received Institutional Review Board (IRB) approval under exempt status because the data was anonymous, which allowed us to research and present our findings.
Attribute | Description | Attribute | Description |
---|---|---|---|
INSTANCE | Student’s unique identifier | FOOTBALL | Did the student attend the game? |
HOUSING INDICATOR | If the student lives on campus or not | MEAL PLAN | Which meal plan the student has if any |
AGE | The student’s age | ETHNIC DESC | The student’s ethnicity |
CURRENT TERM CREDIT HOURS | The number of credit hours the student is taking this term | LEVEL CODE | Type of degree that the student is pursuing (undergraduate, graduate, doctoral) |
DEGREE DESC | The degree that the student is pursuing | COLLEGE DESC | The college the student is a part of |
MAJOR DESC | The student’s major | EXP GRAD DATE | The student’s expected graduation term |
FIRST TERM | The term the student started at school | CLASSIFICATION DESC | Student’s year in school |
HOURS TOTAL | The number of credit hours the student has completed | OVERALL GPA | The student’s GPA |
SOR_FRAT STATUS | If the student is a part of a sorority or fraternity | SEX | The student’s gender |
For attributes with many missing values, such as SAT TOTAL and OVERALL GPA, we used mean replacement. For attributes such as MEAL PLAN, where a missing value just means the student did not opt in to an optional university plan, we replaced missing values with “None”.
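The cleaning itself was done in SQL Server, but as a rough R sketch of the same imputation logic (the data frame name `students` and the underscore-style column names are assumptions based on the attribute table above):

```r
# Hypothetical sketch of the missing-value handling; `students` is the joined
# Banner/Event data with columns named after the attribute table above.

# Mean replacement for numeric attributes with many missing values
students$OVERALL_GPA[is.na(students$OVERALL_GPA)] <- mean(students$OVERALL_GPA, na.rm = TRUE)
students$SAT_TOTAL[is.na(students$SAT_TOTAL)]     <- mean(students$SAT_TOTAL, na.rm = TRUE)

# A missing MEAL_PLAN simply means the student did not opt in, so code it as "None"
students$MEAL_PLAN[is.na(students$MEAL_PLAN)] <- "None"
```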
We used Microsoft SQL Server (managed through SQL Server Management Studio, SSMS) as our database engine and SQL Server Analysis Services (SSAS) for developing the data mining models.
We used 70 percent of the data to create the model and the remaining 30 percent for testing the model's performance.
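In R, an equivalent holdout split might look like this sketch (the actual split was configured in SSAS; `students` is the assumed cleaned data frame from above):

```r
# Illustrative 70/30 holdout split of the cleaned student data
set.seed(2014)                                   # make the split reproducible
n          <- nrow(students)
train_rows <- sample(n, size = round(0.7 * n))   # 70% of instances
train      <- students[train_rows, ]             # used to build the model
test       <- students[-train_rows, ]            # held out to test performance
```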
Based on Bayes' theorem, the Naive Bayes algorithm handles categorical data, missing values, and outliers well, and it makes a good exploratory model because it indicates which input attributes are most important in predicting the output.
The accompanying attribute profiles are an example of the output of the Bayes algorithm. They visually display how different states of the input attributes (Banner data) affect the outcome of the classification attribute (Football Attendance).
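The model itself was built in SSAS, but a minimal R equivalent using the e1071 package could look like the sketch below (the `train`/`test` frames come from the split above; excluding INSTANCE is my assumption, since it is only an identifier):

```r
library(e1071)   # provides naiveBayes()

# Fit Naive Bayes on the training split; FOOTBALL is the attendance outcome
nb_model <- naiveBayes(FOOTBALL ~ . - INSTANCE, data = train)

# Predict attendance for the 30% holdout and inspect the confusion matrix
nb_pred <- predict(nb_model, newdata = test)
table(Predicted = nb_pred, Actual = test$FOOTBALL)
```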
Decision trees are easy to understand and visualize, and they work well with categorical data.
This flow diagram is the main result of the Microsoft Decision Tree algorithm. Each rectangle is a node, containing a miniature bar graph that shows what percentage of each Football Attendance classification falls inside it. Each node corresponds to a set of rules, or conditions, that an instance must meet to end up in that particular node.
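Outside of SSAS, the same idea can be sketched in R with the rpart package (again assuming the `train`/`test` split and column names from above):

```r
library(rpart)        # CART-style decision tree
library(rpart.plot)   # simple tree plotting

# Grow a classification tree for football attendance, dropping the ID column
tree_model <- rpart(FOOTBALL ~ . - INSTANCE, data = train, method = "class")

# Visualize the nodes and their attendance proportions
rpart.plot(tree_model)

# Evaluate on the 30% holdout
tree_pred <- predict(tree_model, newdata = test, type = "class")
mean(tree_pred == test$FOOTBALL)   # holdout accuracy
```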
A student’s age was the primary factor in determining if they attended the football game. Students under 21 years of age were more likely to attend the game.
The next factor was a student's classification (Freshman, Sophomore, etc.). Attendance decreases with progression through college. It was interesting to see that a student's age and classification were not always related; there are many non-traditionally aged students at GSU.
The third influencing factor on GSU student football attendance was the number of credit hours taken in the semester. Students taking 12 or more hours (the minimum to be considered a full-time student) were more likely to attend.
We suggested marketing targeted towards older students, upperclassmen, and non-traditional students because they are the ones less likely to go to the games.
My group submitted our research paper, titled Analysis of Georgia Southern University Student College Football Attendance, to the Southeast Decision Sciences Institute (SEDSI) conference in Savannah, GA and won the Best Paper Award in Undergraduate Student Research. It was an incredible experience to present at a conference as undergrads, and Georgia Southern even honored us with an article.
Goal:
Automatically classify documents based on their content using machine learning text analytics.
Textron is a Fortune 500 multi-industry conglomerate whose businesses include military products and government work. Some of its documents contain sensitive material and are subject to export regulations (controlled), while others are fine for anyone to view (uncontrolled). Misclassifying these documents can lead to legal action, which costs the company time and money, and to red flags in the marketplace, which reduce business opportunities. Because of these consequences, the people who manually classify documents tend to overclassify them as containing restricted information to stay on the safe side, which withholds data from people who could make use of it.
My goal was to address these issues by developing a model that could predict the classification of a document (Controlled vs. Uncontrolled) within the target accuracy range of 65-75%, as seen in the accompanying figure.
A collection of documents known as a corpus was gathered containing both controlled and uncontrolled documents. We assumed that the documents were classified correctly to begin with and that there are determining factors of classification within the contents of the documents. I incorporated Apache Tika to convert the corpus into plain text so the model could work more efficiently.
Documents that are known to be controlled are stamped with a disclaimer stating that they contain sensitive data (see the figure below). This is one example of something that must be completely removed during preprocessing so it does not bias the way the model is trained.
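A rough sketch of that preprocessing in R (the Tika command-line app, file names, and disclaimer pattern below are placeholders, not Textron's actual setup or wording):

```r
# Convert a document to plain text with the Apache Tika command-line app
# (assumes tika-app.jar has been downloaded; the jar path and file name are placeholders)
raw_lines <- system2("java",
                     args   = c("-jar", "tika-app.jar", "--text", "document.pdf"),
                     stdout = TRUE)
raw_text  <- paste(raw_lines, collapse = "\n")

# Strip the export-control disclaimer stamp so it cannot bias training.
# The pattern below is a made-up placeholder for the real disclaimer wording.
clean_text <- gsub("THIS DOCUMENT CONTAINS EXPORT[- ]CONTROLLED TECHNICAL DATA[^\n]*",
                   "", raw_text, ignore.case = TRUE)
```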
The models were developed using both the R programming language and a free GUI analytics program called RapidMiner.
Just like the development of the Football Attendance model, I used 70 percent of the data to create the model and the remaining 30 percent for testing the model's performance.
The models are based on a multidimensional space in which each word is its own axis and a document's weight for each word is its coordinate on that axis. Taken together, these coordinates place each document as a single point in the space.
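In R, that word-as-axis representation corresponds to a weighted document-term matrix, which the tm package can build; a sketch, assuming the Tika plain-text output sits in a local `corpus/` folder:

```r
library(tm)

# Load the plain-text corpus produced by the Tika step (folder name assumed)
docs <- VCorpus(DirSource("corpus/", encoding = "UTF-8"))

# Basic cleanup before weighting the words
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))

# Each document becomes a point whose coordinates are TF-IDF word weights
dtm <- DocumentTermMatrix(docs, control = list(weighting = weightTfIdf))
```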
The K Nearest Neighbor algorithm makes a prediction based on where the evaluated case falls among its "neighbors", the past cases. The majority classification of the closest neighbors becomes the predicted classification of the evaluated case.
Support Vector Machines find an optimized divider that splits the space into two regions, one per classification. The region that the evaluated case falls in becomes its predicted classification.
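Both classifiers can be sketched in R on that document-term matrix (`dtm` from the sketch above; `doc_labels`, a factor of Controlled/Uncontrolled per document, is assumed):

```r
library(class)   # knn()
library(e1071)   # svm()

X <- as.matrix(dtm)                  # documents as points in the word space
set.seed(42)
train_idx <- sample(nrow(X), size = round(0.7 * nrow(X)))   # 70/30 split

# K Nearest Neighbors: majority vote of the k closest training documents
knn_pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
                cl = doc_labels[train_idx], k = 5)

# Support Vector Machine: learn the separating boundary, then predict the side
svm_model <- svm(x = X[train_idx, ], y = doc_labels[train_idx], kernel = "linear")
svm_pred  <- predict(svm_model, X[-train_idx, ])
```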
Some of the analysis of the results included: top influential words of each classification, dendrograms, area under the curve, and accuracy.
It was interesting that the top words were not especially revealing of the documents' sensitive subjects. Many people expected they would be, but the natural language processing behind the model showed that mentions of these ordinary words were actually where the heaviest influence lay. Everyday words and the way the documents were composed gave the strongest clues to the classification of the documents.
Dendrograms cluster words that appear together frequently in the documents. They reveal trends and valuable information about the text.
The included dendrograms were generated in the preliminary stages of my process. The outlined clusters revealed words that frequently appeared together, which turned out to be disclaimers we had originally overlooked. Filtering them out improved the preprocessing stage and the overall model.
Accuracy is a measure of the predictive power of the model on the testing cases held out from the original sample. It is important to note that the model misclassifies uncontrolled documents as controlled more often than the other way around. This was a goal I discussed with my managers: the model cannot be perfect, but we preferred it to err in that direction.
Area Under the Curve (AUC) is essentially a measure of how well the model ranks documents: the likelihood that a randomly chosen controlled document receives a higher controlled score than a randomly chosen uncontrolled one.
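A minimal sketch of computing both measures in R, continuing from the SVM sketch above (pROC is one package that computes AUC; the decision-value extraction assumes e1071's interface):

```r
library(pROC)   # roc() and auc()

actual <- doc_labels[-train_idx]

# Accuracy and confusion matrix: check which direction the errors fall in
table(Predicted = svm_pred, Actual = actual)
mean(svm_pred == actual)

# AUC from the SVM decision scores: how well the model ranks one class above the other
svm_scores <- attr(predict(svm_model, X[-train_idx, ], decision.values = TRUE),
                   "decision.values")
auc(roc(response = actual, predictor = as.numeric(svm_scores)))
```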
I completed my internship by successfully developing a model that proved the concept of using text analytics to automatically classify the company's documents.
The next steps for the company are to continue building an accurate, random, representative sample of Textron's documents across its business units. The structure of the model will be the one I developed, but with additional business-unit-specific information Textron can build models that distinguish specific types of controlled and uncontrolled documents. The technology also needs to be packaged into a clean application so it can be adopted by the company as a whole.
The iterative process of determining where the model is failing will last the model's entire lifespan. Feeding misclassified results back into the model and examining its performance on small datasets will be important for seeing where improvements can be made. Maintenance also includes pulling in new documents to keep the model up to date with the changing content of controlled and uncontrolled documents.
Goal:
Explore social media for interesting findings on Game 1 of the 2015 World Series: New York Mets vs. Kansas City Royals.
My team logged tweets from Twitter based on the search term “world series”. We chose a search instead of a hashtag to get broader results. We collected tweets throughout the entire first game of the 2015 World Series, which turned out to be quite interesting: it lasted over 5 hours, included a delay of game, and ran 14 innings, making it the longest Game 1 in World Series history.
Tweet Archivist logged basic data about the tweets, such as the timestamp, the user, any interactions with the tweet, and of course the content of the tweet itself. We exported this data as a CSV file and imported it into SAP HANA, which we used as our database engine and for exploratory analytics. The only preprocessing we did was to remove non-English tweets.
Using the timestamp of the tweets, we classified the tweets into three “buckets” of time: intervals of hours, half hours, and quarter hours.
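The bucketing was done in HANA, but the idea is simple; here is an R sketch assuming a `tweets` data frame with the timestamp already parsed as POSIXct:

```r
# Assign each tweet to hour, half-hour, and quarter-hour buckets
tweets$hour_bucket    <- cut(tweets$timestamp, breaks = "1 hour")
tweets$half_bucket    <- cut(tweets$timestamp, breaks = "30 min")
tweets$quarter_bucket <- cut(tweets$timestamp, breaks = "15 min")

# Tweets per 15-minute interval, comparable to the HANA calculation view below
table(tweets$quarter_bucket)
```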
We generated calculation views with these new columns in SAP HANA, where we could see the number of tweets for each bucket of time. The accompanying picture shows the number of tweets over time in 15-minute intervals.
Using SAP HANA’s built-in functionality to aid linguistic analysis by automatically extracting meaning from text data, we applied the Voice of the Customer function to our dataset.
This function extracts entity types from a predefined list (examples include facility, general request, organization, entertainment, sentiment, vehicle, product, people, country, etc.), performs sentiment analysis (the attitude of the tweets), and tokenizes the tweets (breaks them into individual terms that can be used as variables for analysis).
Using the columns generated by the Voice of the Customer function, we were able to count the entity types and determine the stance and influence of our tweeters, which we used to cluster the tweeters into a few different groups.
We could visualize the number of each type of entity mentioned in our dataset, and after removing the obvious and miscellaneous ones we generated the accompanying word cloud.
We created a table of the top 1,000 tokens along with their frequency counts. We connected R to HANA with an Open Database Connectivity (ODBC) bridge and used the wordcloud package to generate a word cloud of the top tokens, as seen in the accompanying picture.
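A minimal version of that step (the ODBC data source name, credentials, and table/column names are placeholders):

```r
library(RODBC)       # ODBC bridge to HANA
library(wordcloud)   # word cloud plotting

# Pull the top-token table from HANA; DSN, login, and table name are placeholders
ch     <- odbcConnect("HANA_DSN", uid = "user", pwd = "password")
tokens <- sqlQuery(ch, "SELECT TOKEN, FREQUENCY FROM TOP_TOKENS")
odbcClose(ch)

# Plot the word cloud of the top tokens, sized by frequency
wordcloud(words = tokens$TOKEN, freq = tokens$FREQUENCY,
          max.words = 200, random.order = FALSE)
```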
Seeing that "delay" was a frequently mentioned term, we decided to dig deeper. Using the burst detection algorithm, we generated the accompanying graphic showing the burst level associated with the token "delay".
Burst level is a measure that increases when there is an intense "burst" of a given token. The algorithm uses a Markov model to detect periods of increased activity in a series of events, so if something like a power outage happens and many people use the word "delay" in their tweets, the burst level will be high.
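The R bursts package implements Kleinberg's burst detection; a sketch of applying it to the tweets that mention "delay" (the `tweets` frame and its `text` column are assumptions carried over from the bucketing sketch):

```r
library(bursts)   # Kleinberg burst detection

# Timestamps of tweets whose text mentions "delay"
delay_times <- tweets$timestamp[grepl("delay", tweets$text, ignore.case = TRUE)]

# Detect bursts: periods where "delay" tweets arrive unusually fast
# (duplicate timestamps are dropped so the offsets are strictly increasing)
delay_bursts <- kleinberg(sort(unique(as.numeric(delay_times))))

# Plot the burst hierarchy over time for the token "delay"
plot(delay_bursts)
```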
We also mapped the frequency of the mention of the term "delay" over time and got similar results. We added our inferences to the accompanying graph.
HANA classifies tweets as Strong Positive (1), Weak Positive (0.5), Neutral (0), Weak Negative (-0.5), and Strong Negative (-1).
I assigned the parenthetical number values above to each sentiment type, took the average sentiment of the users in our 15-minute interval buckets, and plotted this average sentiment over time.
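A sketch of that aggregation in R (the sentiment label strings are assumptions about how the HANA output is coded; `quarter_bucket` comes from the earlier bucketing sketch):

```r
# Map the sentiment labels to the numeric scale above (label strings assumed)
score_map <- c("StrongPositiveSentiment" =  1.0,
               "WeakPositiveSentiment"   =  0.5,
               "NeutralSentiment"        =  0.0,
               "WeakNegativeSentiment"   = -0.5,
               "StrongNegativeSentiment" = -1.0)
tweets$score <- score_map[as.character(tweets$sentiment)]

# Average sentiment per 15-minute bucket, then plot it over time
avg_sent <- aggregate(score ~ quarter_bucket, data = tweets, FUN = mean)
plot(avg_sent$score, type = "l",
     xlab = "15-minute interval", ylab = "Average sentiment")
```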
Notice how the lowest point matches up with the highest frequency of the term "delay". The post-game positivity could be the zeal of an eventful World Series finally getting underway.
Using Excel's Power View map feature, we plotted the number of tweets of each sentiment by location. The size of each pie chart represents the number of tweets, and the colors correspond to their respective sentiments.
It was interesting to see just how global the World Series audience is, as characterized by the geographical spread of opinions on the first game.
We derived a tweeter stance attribute from the sum of the sentiment a user tweeted: a tweeter who tweets more positively than negatively overall has a higher stance score. We derived a tweeter influence attribute from a tweeter's interactions with others: a tweeter who gets more replies and retweets has a higher influence score.
We appended the number of tweets for each user, then took the top 300 tweeters to see if we could segment them in any significant way, using SAP Predictive Analytics to cluster with R's K-Means algorithm.
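A minimal R sketch of that clustering step (the actual run used SAP Predictive Analytics calling R's k-means; the `tweeters` data frame and its column names are assumptions):

```r
# Assumed data frame: one row per tweeter with stance, influence, and tweet count
top300 <- head(tweeters[order(-tweeters$tweet_count), ], 300)

# Standardize the features so no single scale dominates the distance measure
features <- scale(top300[, c("stance", "influence", "tweet_count")])

# R's built-in k-means with 3 clusters
set.seed(2015)
km <- kmeans(features, centers = 3, nstart = 25)
top300$cluster <- km$cluster
table(top300$cluster)   # cluster sizes
```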
We came up with 3 clusters and two main findings. We were able to segment our tweeters into the influential, the positive, and everybody else, and the influential cluster had relatively neutral tweets. Further exploration revealed that tweeters in this cluster were mostly news sources that remained unbiased and therefore had mostly neutral sentiment.