Description of the Dataset

Language learning becomes harder as you grow older. Studies have shown that children learn languages far more effectively than adults: up to about the age of 5, a child can learn any language regardless of its consonant inventory. Babies are not yet tied to a single phonological system and can therefore pick up pronunciation differences. As we become more attached to one language and use it more often, our pronunciation in additional languages will sound accented compared to that of native speakers.

A study was conducted on raw, anonymized data that pairs learners’ scores on the State Examination of Dutch as a Second Language (STEX) with their backgrounds: additional language fluency, national and linguistic background, educational background, and more. The goal is to understand the impact of past linguistic experiences on the process of learning a new language. To do this, several linguistic comparison metrics were used to measure the linguistic similarity between languages; these are explained in the variable descriptions below. The dataset itself is available for download at https://zenodo.org/records/2863533#.Y9Y3pNJBwUE

There are around 50,000 observations in this dataset. Each observation is a Dutch language learner that filled out their information when taking the State Examination of Dutch as a Second Language. The 16 variables are as follows:

  • L1: The first language of the learner
  • C: The learner’s country of birth
  • L1L2: The string combination of first and best additional language besides Dutch
  • L2: The learner’s best additional language besides Dutch (either lists another language or has monolingual if the learner only speaks one language)
  • AaA: The learner’s age at arrival in the Netherlands in years (starting date of residence)
  • LoR: Length of residence in the Netherlands in years
  • Edu.day: The amount of formal education the learner has received since the age of 6. This variable is categorized as 1-4 (1 low, 2 middle, 3 high, 4 very high): 1) 0 through 5 years; 2) 6 through 10 years; 3) 11 through 15 years; 4) 16 years or more.
  • Sex: The learner’s gender
  • Family: The language family of the learner’s L1
  • ISO639.3: The Language ID code of the learner’s L1 according to Ethnologue
  • Enroll: The proportion of school-aged youth enrolled in secondary education (Gross Secondary School Enrollment Rate) according to the World Bank when the learner left their country, used as a measure of a country’s educational accessibility
  • Speaking: The STEX (State Examination of Dutch as a Second Language) test score for speaking proficiency.
  • morph: The morphological similarity of L1 and Dutch. This pertains to the similarity of the formation of words in the two languages.
  • lex: The lexical similarity of L1 and Dutch. This pertains to the similarity in the written languages.
  • new_feat: The phonological similarity (in terms of new features). This pertains to the speaking aspect in terms of the new ways of creating sounds between L1 and Dutch.
  • new_sounds: The phonological similarity (in terms of new sounds). This pertains to the speaking aspect in terms of the number of new sounds based on the phonetic inventories of L1 and Dutch.

Our overarching question for exploring this dataset is whether innate linguistic differences between a learner’s first language and a target foreign language matter more for proficiency in the new language than the learner’s background (length of stay in the foreign country, nationality, and education level). This expands upon the original research question for this dataset, which focused purely on linguistic similarity.

The three questions we are exploring with this dataset are as follows:

  • Question 1: How do a Dutch language learner’s first language and country of origin interact to affect their Dutch speaking proficiency?

  • Question 2: How do different residential and educational experiences in the Netherlands impact Dutch speaking proficiency?

  • Question 3: How does individuals’ past exposure to languages (monolingual vs. multilingual, languages learned, similarity between known language and target language, etc.) affect their Dutch speaking proficiency?

Question 1: How do a Dutch Language Learner’s First Language and Country of Origin Interact to Affect their Dutch Speaking Score?

The process of learning a foreign language is complicated, especially for immigrants who must also adapt to living in their new country of residence while learning its primary language. While linguistic similarity has long been hypothesized to play a role in the relative difficulty of learning a foreign language, it’s also important to analyze cultural similarity and the role it plays as a mediator in the language learning process. For speakers of global languages like English and French, the country of origin can greatly influence the linguistic traits of their first language, which may change the way their prior linguistic knowledge affects their ability to learn a new language.

The language learner dataset offers us three variables to answer our sub-question of interest:

  • C: A language learner’s country of origin

  • L1: A language learner’s primary language

  • Speaking: Their score on a standardized Dutch Speaking Assessment

Since a Dutch language learner’s culture and the specific dialect of their L1 are difficult to capture, their country of origin will be used as a proxy for that information.

Does Country Matter for Speakers of the Same First Language?

To begin, we’ll investigate whether language learners with the same L1 but different countries of origin have different Dutch speaking scores, using grouped boxplots as shown below. After inspecting all of the L1s in the dataset during EDA, Spanish was chosen for this plot because several distinct countries have a large number of Spanish speakers. This makes it possible to compare the Dutch speaking scores of Spanish-speaking learners across several countries with reasonable sample sizes. To avoid visual clutter, only countries with 90+ learners were placed into this plot.

From pure visual inspection, the median speaking scores of Argentina and Spain (both 520) appear higher than those of the other Spanish-speaking countries, and Chile has the lowest median Dutch speaking score (505.5). Overall, however, the spread of the data is quite large, as shown by the long whiskers (“tails”) of each country’s box, and the overlap of the countries’ interquartile ranges suggests that their mean Dutch speaking scores might actually be the same.

To test our graphical intuition, let’s conduct a formal one-way ANOVA. The ratio of the largest variance (1040.1717 for Venezuela) to the smallest variance (715.8257 for Argentina) is 1.453108, which is less than 2, so the equal variance assumption is reasonably satisfied. As we’ll see later, the distribution of Speaking for each country is approximately normal, which justifies the use of ANOVA.
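The equal-variance check can be reproduced with a few lines of R. This is a minimal sketch that uses only the two variances quoted in the text; the variable names are illustrative, and the threshold of 2 is the rule of thumb the text applies.

```r
# Variances of Speaking quoted in the text (largest and smallest group only).
var_largest  <- 1040.1717  # Venezuela
var_smallest <- 715.8257   # Argentina

# Rule of thumb used in the text: largest / smallest variance should be below 2.
variance_ratio    <- var_largest / var_smallest
equal_variance_ok <- variance_ratio < 2

print(variance_ratio)
print(equal_variance_ok)
```

With the full data frame, the per-country variances themselves could be obtained with something like `tapply(Speaking, C, var)` on the Spanish-speaking subset; the subset and column usage here are assumptions based on the variable list.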

## Analysis of Variance Table
## 
## Response: Speaking
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## C            8   78554  9819.2  11.084 1.685e-15 ***
## Residuals 2508 2221855   885.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the output, since the p-value is less than 0.05, we reject the null hypothesis that the mean speaking scores of the countries are all equal, in favor of the alternative that at least one country’s mean speaking score differs from the others. For Spanish speakers at least, linguistic similarity therefore does not appear to be the only influence on their ability to learn Dutch: country of origin is also associated with their speaking scores.
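As a sanity check on the ANOVA table, the F value and p-value can be rebuilt from the sums of squares and degrees of freedom it reports. The sketch below copies those four numbers from the output; the eta-squared line is an added illustration (not part of the original analysis) of how much of the total variation is attributable to country.

```r
# Values copied from the ANOVA table.
ss_between <- 78554;   df_between <- 8     # row "C"
ss_within  <- 2221855; df_within  <- 2508  # row "Residuals"

# Mean squares and the F statistic.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_value    <- ms_between / ms_within

# Upper-tail area of the F distribution gives the p-value.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Illustrative addition: share of total variation explained by country.
eta_squared <- ss_between / (ss_between + ss_within)

print(c(F = f_value, p = p_value, eta_squared = eta_squared))
```

The eta-squared value is a useful companion to the p-value here: with roughly 2,500 learners, even a modest share of explained variation can produce a very small p-value.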

The boxplot, however, tells us little about the shape of the distribution of Dutch speaking scores for Spanish speakers from the countries above, so let’s investigate with a grouped ridgeline plot.

As expected, the centers of the speaking score distributions for Argentina and Venezuela are higher than those of the other countries, and the center of the distribution for Chile is lower than that of the other countries. The distributions of Peru and Colombia are the most similar, which might suggest an effect of geographic proximity. Furthermore, the distributions for all the countries closely resemble a normal distribution, justifying our ANOVA as stated earlier.

Can Language Learners be Grouped Based on their First Language?

Now that we have established that, within the population of Dutch learners who speak a given first language, there can be differences in speaking score based on country of origin, we’ll investigate just how similar Dutch learners who speak the same first language are. It might be that even if language learners from different countries who speak the same first language aren’t perfectly similar, they are more similar to each other than to Dutch learners who have other first languages. For example, Argentinian Spanish speakers might have different speaking scores from Chilean Spanish speakers, but that difference could be smaller than the difference between Argentinian Spanish speakers and French speakers from France.

To answer this question, we’ll use hierarchical clustering on country-language pairs to build a dendrogram visualizing potential groups.

The graph above is a dendrogram built using hierarchical clustering, where each leaf represents a unique First Language + Country of Origin pair. The leaves are grouped together based on the average Dutch speaking score of all Dutch language learners associated with that pairing. The leaf for English speakers from Afghanistan, for example, has an average Dutch speaking score of 511. To make the final dendrogram easier to interpret, only a subset of the most common European languages (English, French, Spanish, and German) is included in the graph above. Given that there are four languages, the dendrogram is colored to create 4 clusters.
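The clustering step described here can be sketched in base R as follows. The data are simulated, the language-country pairs are hypothetical, and average linkage is an assumption, since the text does not say which linkage method was used.

```r
set.seed(1)

# Hypothetical toy data: six L1 + country pairs with simulated speaking scores.
pairs <- c("Spanish.Spain", "Spanish.Chile", "German.Germany",
           "German.Austria", "French.France", "English.Canada")
toy <- data.frame(
  pair     = rep(pairs, each = 30),
  Speaking = rnorm(180, mean = rep(c(520, 505, 550, 548, 525, 535), each = 30),
                   sd = 30)
)

# One leaf per pair: the mean speaking score of its learners.
pair_means <- sapply(split(toy$Speaking, toy$pair), mean)

# Hierarchical clustering on distances between the pair means
# (average linkage is an assumption, not taken from the source).
tree     <- hclust(dist(pair_means), method = "average")
clusters <- cutree(tree, k = 4)  # 4 clusters, matching the 4 languages in the text

print(clusters)
if (interactive()) plot(tree, main = "Toy dendrogram", xlab = "", sub = "")
```

Because each leaf is summarized by a single number (its mean score), the distance between two leaves is simply the absolute difference of their means.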

Based on the plot above, it seems that overall, German and Spanish speakers from different countries tend to have similar Dutch speaking scores. The leftmost (red) cluster and the second from the left (lime) cluster contain the majority of the orange-colored German leaves, which is evidence that German speakers’ Dutch speaking scores are similar to those of other German speakers, regardless of the speakers’ countries of origin. This effect is also seen for the Spanish leaves, which occur most often in the third (turquoise) and fourth (purple) clusters. The zoomed-in snapshot of the plot reveals one more interesting insight. On a smaller scale (increasing the number of clusters), the micro-clusters (4 leaves or fewer) tend to be language uniform, which means that the first groups that emerge from hierarchical clustering only contain leaves of a single language. Manual inspection further reveals that a sizable portion of the micro-clusters contain countries that are geographically close to each other. One hypothesis for this phenomenon is that any given language has many regional versions, and learners who share a specific version of a specific language tend to have similar Dutch speaking scores.

This plot gives us powerful intuition into the specifics of how a language learner’s first language can influence their performance on Dutch Speaking exams. Two language learners may speak the exact same language, but perform vastly differently on the Dutch Speaking exam based on what version of the language they speak. This plot also acts as EDA for the effect of geography on Dutch speaking score and motivates further investigation using geographical methods.

To test the hypothesis proposed above, we’ll use linear regression in an unexpected way. First, let’s fit two models with Speaking as the response: one with C (Country) as the predictor and one with L1 (First Language) as the predictor.

The plots above visualize the distribution of the effect size of various Countries or First Languages on speaking score. For example, Albania has an effect size of 17.43 which means holding all else constant, Dutch Language Learners from Albania will have a speaking score that is 17.43 points higher than that of Dutch Language Learners from Afghanistan (the base country) on average. The base language is Afrikaans. Only countries and languages significantly associated with speaking scores (based on p-values) are included in both plots.
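To make the notion of an effect size relative to a base level concrete, here is a minimal sketch with made-up scores for three countries. With R’s default treatment coding, the alphabetically first level (here Afghanistan) becomes the baseline, and each coefficient is the difference between that country’s mean and the baseline mean.

```r
# Hypothetical toy data; the scores are made up for illustration.
toy <- data.frame(
  C = factor(rep(c("Afghanistan", "Albania", "Brazil"), each = 4)),
  Speaking = c(500, 510, 495, 505,   # Afghanistan, mean 502.5
               520, 525, 515, 530,   # Albania,     mean 522.5
               510, 500, 515, 505)   # Brazil,      mean 507.5
)

fit <- lm(Speaking ~ C, data = toy)

# Intercept = baseline mean; other coefficients = differences from the baseline.
print(coef(fit))

group_means <- tapply(toy$Speaking, toy$C, mean)
print(group_means - group_means["Afghanistan"])
```

In this toy example the coefficient `CAlbania` is 20, which is read the same way as the 17.43 reported for Albania in the text.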

We note from the plot above that the median effect size for language is higher than that of countries. This suggests that overall, while not all languages have grouping effects, a speaker’s first language does significantly affect their Dutch speaking score on average more than their country of origin does.

Geography and Dutch Speaking Scores

As mentioned earlier, here is a brief analysis of the effect of geography on Dutch speaking scores in the form of a map displaying the average Dutch speaking score of learners from each country as the fill of the country.

As expected, geography seems to have a notable grouping effect on Dutch speaking scores: aside from a few exceptions like South Africa (colonial Dutch influence), the Dutch speaking scores of countries are similar to those of their closest neighbors. Furthermore, “Western Countries” (US, Canada, Australia, and Europe) had higher scores on average than African, Asian, and Latin American countries, likely because the languages of the latter are less similar to Dutch.

Question 2: How do different living and educational experiences in the Netherlands impact Dutch speaking proficiency?

The next main question is how much learners’ prior education and their experiences living in the Netherlands affect their proficiency in the language. The main variables we will focus on in this section are age of arrival, education level, and length of residence.

How much do age of arrival and length of residency matter in terms of speaking proficiency?

The age of arrival variable contains a large range of values, from 0 to 88. The outlier boundaries in the boxplot during the EDA stage were [10, 43]. Given this, it is important to see how much age of arrival actually matters for how well learners can learn, and be tested on, the language. We split the age of arrival variable at its median of 26, with one group containing all learners who arrived before age 26 and the other containing all learners who arrived at age 26 or older. Adding length of residence lets us see how both variables relate to the speaking score. We therefore made a scatterplot with length of residence on the x-axis and speaking score on the y-axis, coloring the points by age group.

## 
## Call:
## lm(formula = Speaking ~ LoR * median_comp, data = language)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -246.078  -22.913   -1.192   22.114  169.029 
## 
## Coefficients:
##                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)      521.50136    0.31335 1664.280  < 2e-16 ***
## LoR                0.19219    0.04787    4.015 5.95e-05 ***
## median_comp1      -6.80004    0.46646  -14.578  < 2e-16 ***
## LoR:median_comp1  -0.58637    0.08397   -6.983 2.92e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.63 on 50231 degrees of freedom
## Multiple R-squared:  0.0152, Adjusted R-squared:  0.01514 
## F-statistic: 258.5 on 3 and 50231 DF,  p-value: < 2.2e-16

In terms of the distribution, if we only compare the scores of those who arrived below the median age with those who arrived at or above it, there is not much of a difference. The center of the score distribution for those who arrived below the median age is slightly higher, by around 10 points. On the basis of age alone, then, there is only a slight difference of about 10 points.

However, when we take length of residence into consideration, the picture changes. (To avoid visual clutter, the graph shows a sample of 2000 learners, but the linear regression uses the entire dataset.) For those who arrived at a younger age (before the median age of 26), the speaking score on the STEX is predicted to increase, on average, as length of residence increases. For those who arrived at an older age (at or after the median age of 26), the speaking score is predicted to decrease, on average, as length of residence increases. The linear regression model with an interaction term tells us more about the lines themselves. The slopes are significantly different, since the interaction term LoR:median_comp1 is statistically significant (p < 0.05): the slope is 0.19219 for the below-median age group and 0.19219 - 0.58637 = -0.39418 for the above-median age group.

The slopes of both lines are very shallow, and at 0 years of residency both age groups are predicted to score in a similar range (roughly 515 to 522), as the intercepts are fairly close to each other. From the regression model, the median_comp1 term tells us that, at 0 years of residency, the predicted STEX score for those who arrived later is about 6.8 points lower than for those who arrived earlier, and this term is statistically significant (p < 0.05). The intercept is 521.5 for the below-median age group and 514.7 for the above-median age group.
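The group-specific lines can be recovered directly from the coefficient table. The sketch below hard-codes the four estimates from the output and assumes, as the text implies, that `median_comp1` marks the group that arrived at or above the median age; the helper `predict_speaking` is a hypothetical name introduced for illustration.

```r
# Estimates copied from the regression output.
coefs <- c("(Intercept)"      = 521.50136,
           "LoR"              = 0.19219,
           "median_comp1"     = -6.80004,
           "LoR:median_comp1" = -0.58637)

# Below-median arrivals: the baseline line.
intercept_below <- coefs[["(Intercept)"]]
slope_below     <- coefs[["LoR"]]

# At-or-above-median arrivals: baseline plus the group and interaction terms.
intercept_above <- intercept_below + coefs[["median_comp1"]]
slope_above     <- slope_below + coefs[["LoR:median_comp1"]]

# Hypothetical helper: predicted Speaking for a given length of residence.
predict_speaking <- function(lor, above_median) {
  intercept <- if (above_median) intercept_above else intercept_below
  slope     <- if (above_median) slope_above else slope_below
  intercept + slope * lor
}

print(c(intercept_above = intercept_above, slope_above = slope_above))
print(predict_speaking(10, above_median = FALSE))
print(predict_speaking(10, above_median = TRUE))
```

Because the slopes have opposite signs, the predicted gap between the two groups widens with length of residence: it is about 6.8 points at 0 years and about 12.7 points at 10 years.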

This suggests that length of residency does play a role in the language learning process, at least for younger arrivals: being exposed to the culture and to people who speak the language, and being forced to use it in everyday life, helps increase proficiency. The difference between the two groups may reflect different tendencies across age groups. Younger people may tend to be more active in learning and exploring because they have fewer responsibilities, or because they are learning the language for fun or as part of an experience abroad. Older people would probably be able to use most common everyday words but may have less incentive to learn more because of time constraints or other priorities.

How does education level impact people based on whether or not they started their education in the Netherlands?

Because the education level variable counts the years of education the learner has received since the age of 6, it is natural to split the age of arrival variable at that point and compare those raised in the Dutch education system with those raised in other education systems. Adding the education level variable lets us see the impact of education for those who grew up in education systems outside the Netherlands and those who grew up within it. We therefore made a density plot with one distribution per education level, faceted by age group.

For those who arrived after the age of 6, there is little difference between the distributions for the different levels of formal education. The distribution for 0-5 years of formal education deviates slightly from the rest, but the others are very similar to one another. This suggests that, regardless of prior educational background, people who did not go through the Dutch education system from the start tend to score in about the same range.

On the other hand, the distributions for those who arrived before the age of 6 show more interesting results. The centers of the distributions for this group, from lowest to highest, are: 0-5, 16+, 6-10, 11-15. It is interesting that the highest education group does not have the highest center. It is unsurprising that those with only 0-5 years of formal education do not score as high on the STEX, since more Dutch is acquired as time goes on. The progression from 6-10 years to 11-15 years of formal education can be explained in the same way. The lower center for the 16+ group, however, may not be attributable entirely to the amount of education received, as people can have many years of higher education without complete speaking proficiency. It could also be related to the age at which learners took the exam, which could range from young adulthood to older adulthood.

It is also interesting that, by the nature of the dataset, people who arrived between ages 0 and 6 had to take a proficiency exam in Dutch despite having lived in the Netherlands nearly their entire lives. Possible explanations are that their parents were not fluent in Dutch and they acquired the language only when they started formal education at the age of 6, or that they received formal schooling in a different language.

How do education level and length of residence in the Netherlands relate?

To explore the distributional differences for the two groups (those who arrived before age 6 and those who arrived after age 6) based on length of residence and education level, we’ll make two stacked histograms.

From this, we can see that the skewness in the length of residence plot from the EDA was mostly due to people who arrived after the age of 6. The plot for those who arrived before the age of 6 is somewhat right-skewed, with more instances of 16+ years of education once we reach 20+ years of residency. Up to 20 years of residency, most people have 11-15 years of formal education, which corresponds to anything from a high school diploma to a college degree. One possible reason is that people who pursue higher education in the Netherlands are more likely to stay longer, given the years spent in education and in employment afterwards. There is a large peak at around 20 years of residency and very few instances of 50-60 years of residency.

On the other hand, the heavily right-skewed plot for those who arrived after the age of 6 shows a peak at around 1-2 years, with a majority having 11-15 years of formal education, followed by 16+ years of formal education. This distribution is interesting, as it shows a larger proportion of everyday people (people with high school diplomas and college degrees) moving to the Netherlands and gaining proficiency, compared to those pursuing higher education. There are more instances of people with 0-5 and 6-10 years of formal education, most likely because this subset is much bigger. We can also see how long learners tend to have stayed in the Netherlands: about 22000 learners with 0-1 year of residence, 14000 with 1-2 years, 5000 with 2-3 years, and so on. The numbers continue to dwindle after 3 years, with fewer than about 2500 learners for each of the following years.

Question 3: How do individuals’ past exposure to languages affect their ability to learn and speak a new language (Dutch)?

The motivation behind this question is the intuitive understanding that how difficult it is for an individual to learn a new language depends on their linguistic experiences. For example, do multilingual people have an advantage over their monolingual counterparts in learning new languages? Intuitively, we would expect so, because people who have been exposed to and understand multiple languages at a high level will probably be able to pick up new languages more easily. Similarly, how much of a difference does one’s native language actually make in learning a new target language? Again, we would expect that the more similar one’s native language is to the target language, the better a person would be able to learn that target. This section delves deeper into these questions in hopes of quantifying how much these factors matter for speaking a new language.

Do multilingual learners have an advantage?

To investigate the effect of being multilingual on learners’ ability to learn to speak Dutch, a density plot is created below. The turquoise density curve represents individuals who are monolingual, while the coral density curve represents those who reported that they knew at least one other language. As seen from the vertical lines corresponding to the means of each group, there appears to be a slight advantage for poly-linguals when it comes to learning to speak Dutch. We proceed by conducting a two-sample t-test to investigate whether this difference is significant.

## 
##  Welch Two Sample t-test
## 
## data:  mono_data$Speaking and poly_data$Speaking
## t = -20.357, df = 12745, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.935311 -8.190071
## sample estimates:
## mean of x mean of y 
##  510.3189  519.3816

Our null hypothesis is that the mean speaking scores of monolinguals and poly-linguals are equal. Our alternative hypothesis is that the two means differ. The p-value from our t-test is less than 2.2E-16, lower than our significance level of 0.05. As a result, we have sufficient evidence to conclude that the difference in means is statistically significant. The difference in mean speaking scores is around 9 points, corresponding to a 1.8% higher score for poly-linguals.
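The quantities quoted in this paragraph follow from the t-test output by simple arithmetic. The sketch below copies the group means, t statistic, and confidence interval from the output and derives the difference, its implied standard error, and the percentage increase; the variable names are illustrative.

```r
# Values copied from the Welch two-sample t-test output.
mean_mono <- 510.3189               # mean of x (monolingual)
mean_poly <- 519.3816               # mean of y (poly-lingual)
t_stat    <- -20.357
ci        <- c(-9.935311, -8.190071)

# Difference in means (x minus y, matching the sign of the t statistic).
diff_means <- mean_mono - mean_poly

# Standard error implied by t = difference / standard error.
std_error <- diff_means / t_stat

# Relative advantage of poly-linguals over monolinguals, in percent.
pct_increase <- 100 * (mean_poly - mean_mono) / mean_mono

# The confidence interval is centered on the observed difference.
ci_midpoint <- mean(ci)

print(c(diff = diff_means, se = std_error, pct = pct_increase, mid = ci_midpoint))
```

The output’s `data:` line indicates that the test itself was run as `t.test(mono_data$Speaking, poly_data$Speaking)`, which performs the Welch (unequal variance) version by default.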

What first languages are associated with high speaking scores?

Now that we have observed that subjects in this dataset who reported knowing more than one language performed slightly better on the Dutch speaking test than those who are monolingual, we proceed by investigating the effects of individual languages on learning to speak Dutch. To do this, a dumbbell plot was created in which each line segment corresponds to the interquartile range of Dutch speaking scores for one native language present in the dataset. The red endpoint represents the 25% quantile and the green endpoint represents the 75% quantile of speaking scores. The y-axis labels, which correspond to native languages, are colored by whether or not the language belongs to the Indo-European family.
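The per-language quartiles behind such a dumbbell plot can be computed in base R as sketched below. The data are simulated and the three languages are chosen only for illustration; the plotting code is a bare-bones stand-in for the original figure, not a reproduction of it.

```r
set.seed(2)

# Hypothetical toy data: simulated speaking scores for three first languages.
toy <- data.frame(
  L1       = rep(c("German", "Spanish", "Somali"), each = 50),
  Speaking = rnorm(150, mean = rep(c(550, 515, 490), each = 50), sd = 30)
)

# 25% and 75% quantiles of Speaking per language (one row per language).
q <- t(sapply(split(toy$Speaking, toy$L1), quantile, probs = c(0.25, 0.75)))
colnames(q) <- c("q25", "q75")

# Sort languages by their lower quartile so the best performers end up on top.
q <- q[order(q[, "q25"]), ]
print(q)

# Minimal dumbbell plot: one segment per language, red = 25%, green = 75%.
if (interactive()) {
  y <- seq_len(nrow(q))
  plot(range(q), range(y), type = "n", yaxt = "n", xlab = "Speaking", ylab = "")
  axis(2, at = y, labels = rownames(q), las = 1)
  segments(q[, "q25"], y, q[, "q75"], y)
  points(q[, "q25"], y, col = "red", pch = 19)
  points(q[, "q75"], y, col = "darkgreen", pch = 19)
}
```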

From this plot, we can clearly observe that native language is a big factor in how well one was able to learn to speak Dutch. For example, there appears to be a wide gap between the IQRs of scores for German, Norwegian, and Swedish and that of the next best language (Finnish). The IQR of the best languages is approximately [525, 575], compared with IQRs of around [465, 510] for the worst performing languages (Igbo, Somali). These observations indicate that an individual’s native language plays a big role in how well they can learn to speak Dutch, and likely any language.

The most likely explanation for why language plays such a big role is that similarities between the native language and the target language (Dutch) make it easier for individuals to learn the intricacies of the new language. For example, if a Spanish-speaking person wants to learn Portuguese, it would be much easier for them than for someone who speaks Chinese, which has practically no similarity with Portuguese. We can observe these trends in the dumbbell plot through the language family classification of these languages. The languages labeled orange are in the Indo-European family, and those labeled purple belong to every other family. This simplification was made because most of the other language families contain only one or two languages. Observing the labels, we see that the top half of the languages are orange, indicating a clear trend: those whose native language is Indo-European do much better on the Dutch speaking test than those whose native language is not.

What can PCA reveal about variable correlation?

In order to investigate why native language is such a strong indicator of how well one learns to speak Dutch, we proceed by analyzing several of the dissimilarity metrics between Dutch and the learner’s native language. The motivation for using PCA is to better understand the relationships among these dissimilarity factors, as well as their relationships with the other quantitative variables in this dataset. The scree plot for this PCA is shown below; it highlights how the variance explained by each principal component drops to 10.4% at the third component but hovers in the 6-10% range up until component 7.
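The variance-explained values that a scree plot displays can be obtained from `prcomp` as sketched below. The data are simulated so that four dissimilarity-like variables share a common latent factor and Speaking runs against it; only the column names are borrowed from the dataset, and standardizing the variables first is an assumption about the original analysis.

```r
set.seed(3)
n <- 200
latent <- rnorm(n)  # hypothetical shared "distance from Dutch" factor

# Simulated stand-ins for the dissimilarity metrics and the speaking score.
toy <- data.frame(
  morph      = latent + rnorm(n, sd = 0.5),
  lex        = latent + rnorm(n, sd = 0.5),
  new_feat   = latent + rnorm(n, sd = 0.5),
  new_sounds = latent + rnorm(n, sd = 0.5),
  Speaking   = -latent + rnorm(n, sd = 1)
)

# PCA on standardized variables (scaling is an assumption).
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component: the scree plot values.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings on the first two components, as shown by the arrows of a biplot.
print(round(pca$rotation[, 1:2], 2))
```

Variables whose loadings point in the same direction on the leading components are positively correlated, and variables pointing in opposite directions are negatively correlated, which is how a biplot is read.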

To analyze the correlation of variables and investigate any potential groupings among the points, a PCA biplot is created below. The graph is colored by whether an observation is monolingual or not. Additionally, a sample of 1000 points from the original dataset was taken to reduce the crowding of points that would otherwise occur. From this plot, we do not observe any obvious groupings based on monolinguality. This is consistent with the observation above: despite there being a statistically significant difference in speaking scores between the two groups, the effect size was small. When observing the correlation between variables, we note that the dissimilarity metrics “morph”, “new_sounds”, “new_feat”, and “lex” are all strongly positively correlated with each other. This is somewhat expected, as we suspect languages that are lexically similar to Dutch to also be phonetically similar. The other main observation from this PCA is that all of these dissimilarity factors are strongly negatively correlated with “Speaking”, as expected. A surprising observation is that “Enroll”, an indicator of a country’s educational accessibility, is strongly negatively correlated with the dissimilarity factors as well. A possible explanation is that countries with more educational resources are typically well-developed countries in Europe or the Americas whose languages are linguistically similar to Dutch.

How does linguistic similarity affect speaking scores?

To further investigate the relationship between a person’s Dutch speaking score and the dissimilarity between their native language and Dutch, we will use a scatterplot to visualize the same sample of data. The plot has speaking score on the y-axis and phonological dissimilarity (new_sounds) on the x-axis. The points are shaped, and the graphs are faceted, by whether or not the subject is monolingual. Further, the points are colored by the lexical dissimilarity between Dutch and their reported native language. A regression line highlights the negative correlation we observe between speaking score and phonological dissimilarity. It should be noted that because phonological dissimilarity is a discrete variable, there are distinct bands in the plot.

From the graph, there appears to be a moderately weak negative correlation between an individual’s Dutch speaking score and the phonological dissimilarity between their languages. This trend does not appear to change when grouping by monolinguality; both groups seem to exhibit a similar trend between the two variables. Looking at the color gradients of the points, we observe that points with a lower phonological dissimilarity are darker, corresponding to a lower lexical dissimilarity. This is consistent with our conclusions from the PCA, highlighting how the dissimilarity metrics are positively correlated with each other.

## 
## Call:
## lm(formula = Speaking ~ new_sounds + IsMonolingual + new_sounds * 
##     IsMonolingual, data = language_data_sample)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -152.538  -22.542   -0.494   21.909  140.462 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 586.8323     6.4738  90.647   <2e-16 ***
## new_sounds                   -3.2147     0.3181 -10.105   <2e-16 ***
## IsMonolingualYes             14.3556    20.5733   0.698    0.485    
## new_sounds:IsMonolingualYes  -1.2185     1.0120  -1.204    0.229    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.1 on 996 degrees of freedom
## Multiple R-squared:  0.1212, Adjusted R-squared:  0.1186 
## F-statistic:  45.8 on 3 and 996 DF,  p-value: < 2.2e-16

A linear regression model is computed and summarized above. The fit is consistent with our visual observations: there is a statistically significant but weak relationship between speaking score and phonological dissimilarity, with the model as a whole reaching an R^2 of only 0.121. Additionally, the p-values for the “IsMonolingual” term (0.485) and for the interaction between phonological dissimilarity and “IsMonolingual” (0.229) are not significant. With this smaller sample of 1,000 observations, we were unable to determine whether simply being monolingual or multilingual plays a significant role in one’s ability to learn to speak Dutch.
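To make the coefficient table concrete, the sketch below plugs the reported estimates into the model equation for one learner of each type. The helper `predict_speaking` and the input value `new_sounds = 20` are hypothetical and chosen only for illustration. The second half uses synthetic data to show that R expands `new_sounds * IsMonolingual` into both main effects plus the interaction, so the shorter formula describes the same model as the one in the call above.

```r
# Estimates copied from the summary output above.
coefs <- c(intercept   = 586.8323,
           new_sounds  = -3.2147,
           mono        = 14.3556,
           interaction = -1.2185)

# Hypothetical helper: fitted speaking score for a given dissimilarity value.
predict_speaking <- function(new_sounds, monolingual) {
  m <- as.numeric(monolingual)  # 1 if monolingual, 0 otherwise
  unname(coefs["intercept"] +
         coefs["new_sounds"] * new_sounds +
         coefs["mono"] * m +
         coefs["interaction"] * new_sounds * m)
}

predict_speaking(20, monolingual = FALSE)  # 522.5383
predict_speaking(20, monolingual = TRUE)   # 512.5239

# Synthetic data (assumption: NOT the STEX sample) to compare the two formulas.
set.seed(42)
toy <- data.frame(
  new_sounds    = sample(10:30, 200, replace = TRUE),
  IsMonolingual = factor(sample(c("No", "Yes"), 200, replace = TRUE))
)
toy$Speaking <- 580 - 3 * toy$new_sounds + rnorm(200, sd = 35)

fit_long  <- lm(Speaking ~ new_sounds + IsMonolingual +
                  new_sounds * IsMonolingual, data = toy)
fit_short <- lm(Speaking ~ new_sounds * IsMonolingual, data = toy)

all.equal(coef(fit_long), coef(fit_short))  # TRUE
```

At this illustrative value the two fitted scores differ by about 10 points, but because neither the “IsMonolingual” term nor the interaction is significant in the sample, that gap should not be read as evidence of a real difference.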

Discussion and Future Direction

Having completed our analysis of the effects of both lingual and non-lingual factors on the speaking scores of Dutch learners, there are several conclusions we wish to highlight.

  1. First, while learners who share a first language do tend to have similar speaking scores in comparison with learners who speak other first languages, this effect has another layer to it. It’s likely, at least for global languages like Spanish, that country of origin can cause differences in the speaking scores of learners who share a first language. First language, however, is still a more powerful predictor than country of origin for speaking score, a fact reinforced by the geographic distribution of speaking scores favoring European countries with high linguistic similarity to Dutch.

  2. Second, we learned that the effect of several variables on speaking scores depends on the learner’s age when they came to the Netherlands. Confirming previous research, learners who came to the Netherlands at an early age have higher speaking scores than learners who arrived later. The quality of a learner’s education only significantly affected speaking scores for learners who arrived in the Netherlands at an early age (6 or younger).

  3. Finally, from exploring the lingual variables, we confirmed our intuition that learners whose first language is Indo-European (the family to which Dutch belongs) have higher speaking scores, and that linguistic dissimilarity is negatively (and, in the PCA, strongly) associated with speaking scores. Most interestingly, we also found that multilingual learners had a small (but significant) advantage in speaking scores compared to monolingual learners.

While our original goal was to compare the influence of lingual and non-lingual factors on Dutch speaking scores, our analysis has shown that they are tightly intermingled, at least for this data set. Both are critical to determining how well a learner can acquire a foreign language, and both act as proxies for more complicated linguistic and cultural phenomena that can influence language learning.

These more complicated phenomena (cultural traits, linguistic characteristics of specific languages, other metrics of language proficiency like writing and comprehension) are candidates for future research. Our first priority is to perform a similar analysis on the other key metrics of language proficiency not in this data set (Writing, Listening, and Reading), which correspond directly to other sections of the same State Examination of Dutch as a Second Language. After that, we’d like to answer the question, “Does an individual’s cultural background (based on ethnic distributions in sub-regions) influence their ability to learn Dutch?” This was impossible to answer with this dataset because Country was the most detailed level of geographic information and culture was absent altogether.

We’ll also note a few shortcomings of this analysis:

  • There is no information in this data set on the age at which an individual took the exam.

  • The only obvious response variable in this dataset is quantitative, which limited the graphs we could make. Furthermore, without more advanced statistical methods for working with categorical data that have a huge number of factor levels, we couldn’t explore the full range of this dataset (graphs become impossible to interpret with hundreds of colors corresponding to categories).

  • The Kaggle description of the dataset was incredibly wrong, which cost us valuable time.

Language is a complicated phenomenon that requires much more research. It is our hope that this analysis has contributed to the larger body of knowledge on language pedagogy and can contribute to the development of better techniques for teaching foreign languages.