Analysis for Translating science fiction in a CAT tool: machine translation and segmentation settings

Loading Packages

library(reshape2)
library(generics)
library(doBy)
library(pracma)
library(ggplot2)
library(gridExtra)
library(car)
library(MuMIn)
library(testit)
library(tidyverse)
library(glue)
library(ggeffects)

Study 1

Comparing T (translation) and P (post-editing)

Load data

Reads in and cleans data from the concatenated CRITT SG tables.

Cdata <- read.csv("../data/creative_1.csv") #Concatenated CRITT SG tables

# Excluding errors (e.g. participants who forgot to use Pinyin) + cases where there's evidence suggesting MT use in translation condition
Cdata <- subset(Cdata, Part != "P01" & Part != "P04" & Part != "P09" & Part != "P10" & Part != "P12" & Part != "P14" & Part != "P11") 

#Fix data type for categorical columns:
Cdata$Part <- as.factor(Cdata$Part)
Cdata$Task <- as.factor(Cdata$Task)

obs_1 <-nrow(Cdata) #156
print(glue("Study 1 data has {obs_1} observations."))

## Study 1 data has 156 observations.

unedited_segs_1 <-nrow(Cdata[ which(Cdata$FDur == 0),]) 
print(glue("MT output segments remained unedited in Study 1 {unedited_segs_1} times."))

## MT output segments remained unedited in Study 1 3 times.

Renaming variables

Pauses (cognitive effort)

The TB300 variable represents the number of typing bursts interspersed by 300-millisecond intervals. The number of typing bursts is largely equivalent to the number of pauses except for a probable initial and final pause. We therefore calculate the total number of pauses by adding 2 to the TB300 variable.

Cdata$Pauses <- Cdata$TB300 + 2

Nkeys (number of insertions and deletions – technical effort)

The Ins and Del variables represent the number of insertions and deletions as recorded by the Qualitivity plugin. We add these two variables together as a proxy for the number of keystrokes, which represents technical effort.

Cdata$Nkeys <- Cdata$Ins + Cdata$Del

FDur_s (number of seconds between first and last keystroke excluding pauses of 200 seconds or more – temporal effort)

We use the FDur variable as our measure of temporal effort. It counts the number of milliseconds between the first and last keystroke in each segment, excluding pauses (breaks) longer than 200 seconds. We convert it from milliseconds (FDur) to seconds (FDur_s). Where a segment is not edited, and therefore has no keystrokes, FDur = 0. We exclude cases where FDur = 0 from calculations involving the FDur variable.

Cdata$FDur_s <- Cdata$FDur/1000
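The three derived variables can be illustrated on a toy table. This is a minimal sketch: the data frame `toy` and its values are invented for illustration and are not taken from the study data; only the column names (TB300, Ins, Del, FDur) follow the CRITT tables used above.

```r
# Toy segment-level table; all values are invented for illustration only.
toy <- data.frame(
  TB300 = c(4, 0, 10),          # typing bursts separated by 300 ms pauses
  Ins   = c(30, 0, 120),        # insertions
  Del   = c(5, 0, 40),          # deletions
  FDur  = c(45000, 0, 180000)   # milliseconds between first and last keystroke
)

toy$Pauses <- toy$TB300 + 2      # bursts plus an initial and a final pause
toy$Nkeys  <- toy$Ins + toy$Del  # keystroke proxy (technical effort)
toy$FDur_s <- toy$FDur / 1000    # milliseconds to seconds (temporal effort)

# Segments with FDur == 0 were never edited; drop them for time-based measures.
toy_timed <- toy[toy$FDur > 0, ]
toy_timed
```

Keeping the filtered table in its own object avoids repeating the `which(... > 0)` condition in every later call that involves time.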

Correlations between variables used as proxies for cognitive (Pauses), temporal (FDur_s) and technical (Nkeys) effort

###Correlations

cor.test(Cdata$Pauses,Cdata$Nkeys) #Pauses-Keystrokes 0.9302716

## 
##  Pearson's product-moment correlation
## 
## data:  Cdata$Pauses and Cdata$Nkeys
## t = 31.467, df = 154, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9055001 0.9487244
## sample estimates:
##       cor 
## 0.9302716

cor.test(Cdata[ which(Cdata$FDur_s >0),]$FDur_s,Cdata[ which(Cdata$FDur_s >0),]$Nkeys) #Time-Keystrokes 0.8125572

## 
##  Pearson's product-moment correlation
## 
## data:  Cdata[which(Cdata$FDur_s > 0), ]$FDur_s and Cdata[which(Cdata$FDur_s > 0), ]$Nkeys
## t = 17.13, df = 151, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7506663 0.8603111
## sample estimates:
##       cor 
## 0.8125572

cor.test(Cdata[ which(Cdata$FDur_s >0),]$FDur_s,Cdata[ which(Cdata$FDur_s >0),]$Pauses) #Time-Pauses 0.8900539

## 
##  Pearson's product-moment correlation
## 
## data:  Cdata[which(Cdata$FDur_s > 0), ]$FDur_s and Cdata[which(Cdata$FDur_s > 0), ]$Pauses
## t = 23.993, df = 151, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8516571 0.9189472
## sample estimates:
##       cor 
## 0.8900539

cor.test(Cdata[ which(Cdata$FDur_s >0),]$FDur_s,Cdata[ which(Cdata$FDur_s >0),]$Nedit) #Time-Visits 0.5933315

## 
##  Pearson's product-moment correlation
## 
## data:  Cdata[which(Cdata$FDur_s > 0), ]$FDur_s and Cdata[which(Cdata$FDur_s > 0), ]$Nedit
## t = 9.0576, df = 151, p-value = 6.385e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4798286 0.6873011
## sample estimates:
##       cor 
## 0.5933315

cor.test(Cdata$Nkeys,Cdata$Nedit)#Keystrokes-Visits 0.5420038

## 
##  Pearson's product-moment correlation
## 
## data:  Cdata$Nkeys and Cdata$Nedit
## t = 8.0037, df = 154, p-value = 2.729e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4206939 0.6442714
## sample estimates:
##       cor 
## 0.5420038

cor.test(Cdata$Pauses,Cdata$Nedit)#Pauses-Visits 0.4764527

## 
##  Pearson's product-moment correlation
## 
## data:  Cdata$Pauses and Cdata$Nedit
## t = 6.725, df = 154, p-value = 3.236e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3451536 0.5894604
## sample estimates:
##       cor 
## 0.4764527

Pauses and NKeys are very highly correlated (r = 0.93). For further analysis, we focus on pauses.
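The pairwise coefficients above can also be read off a single correlation matrix. The sketch below uses simulated data (the data frame `sim` and its generating formulas are assumptions made purely for illustration), so its numbers will not match the reported ones. Note that `cor()` returns only the coefficients; `cor.test()` is still needed for confidence intervals and p-values.

```r
set.seed(1)
n <- 50

# Simulated stand-in for the segment table; not the study data.
sim <- data.frame(Nkeys = rpois(n, 200))
sim$Pauses <- round(sim$Nkeys / 10 + rnorm(n, sd = 2)) + 2
sim$FDur_s <- sim$Pauses * 3 + rnorm(n, sd = 5)
sim$FDur_s[1:3] <- 0  # pretend three segments were left unedited

# Time-based correlations exclude unedited segments (FDur_s == 0).
timed <- sim[sim$FDur_s > 0, ]

cor_all   <- cor(sim[, c("Pauses", "Nkeys")])
cor_timed <- cor(timed[, c("FDur_s", "Pauses", "Nkeys")])
round(cor_timed, 2)
```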

Exploratory Data Analysis

NKeys

cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7") #colour-blind friendly from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette 
ggplot(Cdata, aes(x=Task,y=Nkeys/LenS,color=Task))+geom_boxplot()+geom_jitter()+scale_colour_manual(values=cbPalette) + ylab("Keystrokes per source character")

mean_keys_per_char_P <- mean(Cdata[ which(Cdata$Task == "P"),]$Nkeys/Cdata[ which(Cdata$Task == "P"),]$LenS) #P: 2.25397724051884 keys/character

mean_keys_per_char_T <- mean(Cdata[ which(Cdata$Task == "T"),]$Nkeys/Cdata[ which(Cdata$Task == "T"),]$LenS) #T: 3.71381669728488 keys/character

sd_keys_per_char_P <- sd(Cdata[ which(Cdata$Task == "P"),]$Nkeys/Cdata[ which(Cdata$Task == "P"),]$LenS) #P: 1.34950220614149  keys/character

sd_keys_per_char_T <- sd(Cdata[ which(Cdata$Task == "T"),]$Nkeys/Cdata[ which(Cdata$Task == "T"),]$LenS) #T: 1.75024294114716 keys/character

median_keys_per_char_P <- median(Cdata[ which(Cdata$Task == "P"),]$Nkeys/Cdata[ which(Cdata$Task == "P"),]$LenS) #P: median keys/character

median_keys_per_char_T <- median(Cdata[ which(Cdata$Task == "T"),]$Nkeys/Cdata[ which(Cdata$Task == "T"),]$LenS) #T: median keys/character


reduced_pc <- 100*(1-round(mean_keys_per_char_P/mean_keys_per_char_T, digits=3))

reduced_pc_median <- 100*(1-round(median_keys_per_char_P/median_keys_per_char_T, digits=3))

print(glue("On average there are more keystrokes per character in translation ({mean_keys_per_char_T}, SD = {sd_keys_per_char_T}) than in post-editing ({mean_keys_per_char_P}, SD = {sd_keys_per_char_P}). This is a {reduced_pc}% reduction for the post-editing condition. Based on the medians, this difference changes to {reduced_pc_median}%."))

## On average there are more keystrokes per character in translation (3.71381669728488, SD = 1.75024294114716) than in post-editing (2.25397724051884, SD = 1.34950220614149). This is a 39.3% reduction for the post-editing condition. Based on the medians, this difference changes to 35.8%.
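The percentage reduction is computed as 100 × (1 − P/T), with the ratio rounded to three decimal places first. The sketch below wraps that arithmetic in a helper and shows a compact way to get the per-group mean, SD and median in one call. The function names `pct_reduction` and `per_char_summary` and the toy data frame are hypothetical and introduced only for illustration; the two means are the values reported above.

```r
# Percentage reduction of `new` relative to `baseline`, rounded as in the report.
pct_reduction <- function(new, baseline, digits = 3) {
  100 * (1 - round(new / baseline, digits = digits))
}

# Reported Study 1 means (keystrokes per source character).
mean_P <- 2.25397724051884
mean_T <- 3.71381669728488
pct_reduction(mean_P, mean_T)  # 39.3, up to floating-point representation

# Mean, SD and median of value / LenS for each level of a grouping column.
per_char_summary <- function(df, value, group) {
  ratio <- df[[value]] / df$LenS
  sapply(split(ratio, df[[group]]),
         function(x) c(mean = mean(x), sd = sd(x), median = median(x)))
}

# Toy data, invented for illustration only.
toy <- data.frame(
  Task  = c("P", "P", "T", "T"),
  Nkeys = c(10, 30, 40, 80),
  LenS  = c(10, 10, 10, 20)
)
stats <- per_char_summary(toy, "Nkeys", "Task")
stats
```

The same helper would apply to any per-character measure and grouping column, which avoids repeating the subsetting expression for each statistic.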

ggplot(Cdata, aes(x=LenS, y=Nkeys,color=Task,shape=Part)) + geom_point() +geom_smooth(aes(group=Task),formula='y~x',method='loess',size=0.5, alpha=0.2)+scale_colour_manual(values=cbPalette) + xlab("Source segment length in characters") + ylab("Keystrokes")

ggplot(Cdata, aes(x=Part,y=Nkeys/LenS,fill=Task)) + geom_boxplot() + geom_jitter(alpha=0.25) + scale_fill_manual(values=cbPalette) + ylab("Keystrokes per source character") + xlab("Participants")

The median number of pauses per character is lower for 5 out of 6 participants in the post-editing (P) task.

Checking if results are similar for Pauses

ggplot(Cdata, aes(x=Part,y=Pauses/LenS,fill=Task)) + geom_boxplot() + geom_jitter(alpha=0.25) + scale_fill_manual(values=cbPalette) + ylab("Pauses per source character") + xlab("Participants")

There were more keystrokes for translation for all translators.

FDur_s Seconds

cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7") #colour-blind friendly from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette 
ggplot(Cdata[ which(Cdata$FDur_s >0),], aes(x=Task,y=log(FDur_s/LenS),color=Task))+geom_boxplot()+geom_jitter()+scale_colour_manual(values=cbPalette) + ylab("Seconds per source character (log)")

The median number of seconds per character spent on the task is slightly higher for translation, but the difference is much less pronounced for seconds than it is for keystrokes (above).

mean_seconds_per_char_P <- mean(Cdata[ which(Cdata$FDur_s >0 & Cdata$Task == "P"),]$FDur_s/Cdata[ which(Cdata$FDur_s >0 & Cdata$Task == "P"),]$LenS)

mean_seconds_per_char_T <- mean(Cdata[ which(Cdata$FDur_s >0 & Cdata$Task == "T"),]$FDur_s/Cdata[ which(Cdata$FDur_s >0 & Cdata$Task == "T"),]$LenS) 

sd_seconds_per_char_P <- sd(Cdata[ which(Cdata$FDur_s >0 & Cdata$Task == "P"),]$FDur_s/Cdata[ which(Cdata$FDur_s >0 & Cdata$Task == "P"),]$LenS)

sd_seconds_per_char_T <- sd(Cdata[ which(Cdata$FDur_s >0 & Cdata$Task == "T"),]$FDur_s/Cdata[ which(Cdata$FDur_s >0 & Cdata$Task == "T"),]$LenS)

median_seconds_per_char_P <- median(Cdata[ which(Cdata$FDur_s >0 & Cdata$Task == "P"),]$FDur_s/Cdata[ which(Cdata$FDur_s >0 & Cdata$Task == "P"),]$LenS)

median_seconds_per_char_T <- median(Cdata[ which(Cdata$FDur_s >0 & Cdata$Task == "T"),]$FDur_s/Cdata[ which(Cdata$FDur_s >0 & Cdata$Task == "T"),]$LenS)

reduced_pc <- 100*(1-round(mean_seconds_per_char_P/mean_seconds_per_char_T, digits=3))

reduced_pc_median <- 100*(1-round(median_seconds_per_char_P/median_seconds_per_char_T, digits=3))

print(glue("On average there are more seconds per character in translation ({mean_seconds_per_char_T}, SD = {sd_seconds_per_char_T}) than in post-editing ({mean_seconds_per_char_P}, SD = {sd_seconds_per_char_P}). This is a {reduced_pc}% reduction for the post-editing condition. Based on the medians, the post-editing reduction changes to {reduced_pc_median}%."))

## On average there are more seconds per character in translation (3.18448348541411, SD = 3.89891863561509) than in post-editing (2.6560316339579, SD = 2.14051978962565). This is a 16.6% reduction for the post-editing condition. Based on the medians, the post-editing reduction changes to 9.3%.

We plot seconds (FDur_s) as a function of source-segment length (in characters, LenS) and draw loess regression lines for post-editing (P) and translation (T) with 95% confidence intervals (shaded areas).

ggplot(Cdata[ which(Cdata$FDur_s >0),], aes(x=LenS, y=log(FDur_s),color=Task,shape=Part)) + geom_point() +geom_smooth(aes(group=Task),formula='y~x',method='loess',size=0.5, alpha=0.2)+scale_colour_manual(values=cbPalette) + ylab("Seconds (log)") + xlab("Source segment length in characters")

Post-editing was faster than translation for shorter segments, though slower for longer segments.
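To inspect the smoothed curves numerically rather than visually, a loess fit and its confidence band can be computed directly. The sketch below is an approximation of what `geom_smooth(method = 'loess')` draws, under the assumption that the band is the fit plus or minus a t-quantile times the standard error. The data are simulated (the data frame `sim`, its coefficients and the helper `fit_band` are invented for illustration), so the curves do not reproduce the study results.

```r
set.seed(42)
n <- 60

# Simulated stand-in: log seconds as a function of source length, per task.
sim <- data.frame(
  LenS = rep(seq(20, 300, length.out = n), times = 2),
  Task = rep(c("P", "T"), each = n)
)
sim$log_FDur_s <- 2 + 0.01 * sim$LenS +
  ifelse(sim$Task == "P", 0.002 * sim$LenS - 0.3, 0) +
  rnorm(2 * n, sd = 0.3)

# Fit loess for one task and return the fit with a 95% band on a grid.
fit_band <- function(df, grid) {
  fit  <- loess(log_FDur_s ~ LenS, data = df)
  pred <- predict(fit, newdata = data.frame(LenS = grid), se = TRUE)
  half <- qt(0.975, pred$df) * pred$se.fit
  data.frame(LenS = grid, fit = pred$fit,
             lower = pred$fit - half, upper = pred$fit + half)
}

# Keep the grid inside the observed range: loess does not extrapolate.
grid  <- seq(40, 280, by = 20)
bands <- lapply(split(sim, sim$Task), fit_band, grid = grid)
bands$P
```

Comparing `bands$P` and `bands$T` row by row shows at which segment lengths one fitted curve lies above the other and whether the bands overlap there.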

ggplot(Cdata[ which(Cdata$FDur_s >0),], aes(x=Part,y=log(FDur_s/LenS),fill=Task)) + geom_boxplot() + geom_jitter(alpha=0.25) + scale_fill_manual(values=cbPalette) + ylab("Seconds per source character (log)") + xlab("Participants")

Post-editing (P) was faster only for 2 out of 6 participants. Post-editing required lower effort only in terms of pauses and keystrokes.

Study 2

Comparing different ways of presenting the texts on screen (sentence- and paragraph-level segmentation).

Loading Data

Cdata2 <- read.csv("../data/creative_2.csv") #Concatenated CRITT sg tables
Cdata2 <- subset(Cdata2, Part != "P06") #excluding participant who split paragraphs

obs_2 <-nrow(Cdata2) 
print(glue("Study 2 data has {obs_2} observations."))

## Study 2 data has 210 observations.

unedited_segs_2 <-nrow(Cdata2[ which(Cdata2$FDur == 0),])
unedited_segs_2_sent <-nrow(Cdata2[ which(Cdata2$FDur == 0 & Cdata2$Segmentation == "sentence"),])
unedited_segs_2_para <-nrow(Cdata2[ which(Cdata2$FDur == 0 & Cdata2$Segmentation == "paragraph"),])

print(glue("MT output segments remained unedited in Study 2 {unedited_segs_2} times, of which {unedited_segs_2_sent} were in the sentence condition and {unedited_segs_2_para} were in the paragraph condition."))

## MT output segments remained unedited in Study 2 19 times, of which 13 were in the sentence condition and 6 were in the paragraph condition.
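The per-condition counts of unedited segments can also be obtained in one step with `table()`. This is a minimal sketch on a toy data frame (`toy2` and its values are invented for illustration); only the column names Segmentation and FDur follow the data used above.

```r
# Toy data, invented for illustration only.
toy2 <- data.frame(
  Segmentation = c("sentence", "sentence", "paragraph", "paragraph", "sentence"),
  FDur         = c(0, 1200, 0, 5300, 0)
)

# Count unedited segments (FDur == 0) per segmentation condition.
unedited_by_cond <- table(toy2$Segmentation[toy2$FDur == 0])
unedited_by_cond

unedited_by_cond[["sentence"]]  # count for one condition
sum(unedited_by_cond)           # total unedited segments
```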

Renaming variables

Pauses (cognitive effort)

Cdata2$Pauses <- Cdata2$TB300 + 2

Nkeys (number of insertions and deletions – technical effort)

Cdata2$Nkeys <- Cdata2$Ins + Cdata2$Del

FDur_s (number of seconds between first and last keystroke excluding pauses of 200 seconds or more – temporal effort)

Cdata2$FDur_s <- Cdata2$FDur/1000

Correlations between variables used as proxies for cognitive (Pauses), temporal (FDur_s) and technical (Nkeys) effort

###Correlations

cor.test(Cdata2$Pauses,Cdata2$Nkeys) #Pauses-Keystrokes 0.9590925

## 
##  Pearson's product-moment correlation
## 
## data:  Cdata2$Pauses and Cdata2$Nkeys
## t = 48.861, df = 208, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9466228 0.9686959
## sample estimates:
##       cor 
## 0.9590925

cor.test(Cdata2[ which(Cdata2$FDur_s >0),]$FDur_s,Cdata2[ which(Cdata2$FDur_s >0),]$Nkeys) #Time-Keystrokes 0.8650908

## 
##  Pearson's product-moment correlation
## 
## data:  Cdata2[which(Cdata2$FDur_s > 0), ]$FDur_s and Cdata2[which(Cdata2$FDur_s > 0), ]$Nkeys
## t = 23.709, df = 189, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8243640 0.8969075
## sample estimates:
##       cor 
## 0.8650908

cor.test(Cdata2[ which(Cdata2$FDur_s >0),]$FDur_s,Cdata2[ which(Cdata2$FDur_s >0),]$Pauses) #Time-Pauses 0.9113697

## 
##  Pearson's product-moment correlation
## 
## data:  Cdata2[which(Cdata2$FDur_s > 0), ]$FDur_s and Cdata2[which(Cdata2$FDur_s > 0), ]$Pauses
## t = 30.441, df = 189, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8837428 0.9326661
## sample estimates:
##       cor 
## 0.9113697

cor.test(Cdata2[ which(Cdata2$FDur_s >0),]$FDur_s,Cdata2[ which(Cdata2$FDur_s >0),]$Nedit) #Time-Visits 0.4537433

## 
##  Pearson's product-moment correlation
## 
## data:  Cdata2[which(Cdata2$FDur_s > 0), ]$FDur_s and Cdata2[which(Cdata2$FDur_s > 0), ]$Nedit
## t = 7, df = 189, p-value = 4.336e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3332315 0.5596676
## sample estimates:
##       cor 
## 0.4537433

cor.test(Cdata2$Nkeys,Cdata2$Nedit) #Keystrokes-Visits 0.4353859

## 
##  Pearson's product-moment correlation
## 
## data:  Cdata2$Nkeys and Cdata2$Nedit
## t = 6.975, df = 208, p-value = 3.999e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3187871 0.5390037
## sample estimates:
##       cor 
## 0.4353859

cor.test(Cdata2$Pauses,Cdata2$Nedit) #Pauses-Visits 0.3944248

## 
##  Pearson's product-moment correlation
## 
## data:  Cdata2$Pauses and Cdata2$Nedit
## t = 6.1903, df = 208, p-value = 3.151e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2736475 0.5029566
## sample estimates:
##       cor 
## 0.3944248

Pauses are very highly correlated with both keystrokes (r = 0.96) and seconds (r = 0.91). We therefore focus on seconds and keystrokes for further analysis.

FDur_s Seconds

cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7") #colour-blind friendly from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette 
ggplot(Cdata2[ which(Cdata2$FDur_s >0),], aes(x=Segmentation,y=log(FDur_s/LenS),color=Segmentation))+geom_boxplot()+geom_jitter()+scale_colour_manual(values=cbPalette) + ylab("Seconds per source character (log)")

The median number of seconds per character was only marginally higher when the texts were segmented into sentences.

mean_seconds_per_char_para <- mean((Cdata2[ which(Cdata2$FDur_s >0 & Cdata2$Segmentation == "paragraph"),]$FDur_s)/Cdata2[ which(Cdata2$FDur_s >0 & Cdata2$Segmentation == "paragraph"),]$LenS)

mean_seconds_per_char_sent <- mean((Cdata2[ which(Cdata2$FDur_s >0 & Cdata2$Segmentation == "sentence"),]$FDur_s)/Cdata2[ which(Cdata2$FDur_s >0 & Cdata2$Segmentation == "sentence"),]$LenS) 

sd_seconds_per_char_para <- sd((Cdata2[ which(Cdata2$FDur_s >0 & Cdata2$Segmentation == "paragraph"),]$FDur_s)/Cdata2[ which(Cdata2$FDur_s >0 & Cdata2$Segmentation == "paragraph"),]$LenS)

sd_seconds_per_char_sent <- sd((Cdata2[ which(Cdata2$FDur_s >0 & Cdata2$Segmentation == "sentence"),]$FDur_s)/Cdata2[ which(Cdata2$FDur_s >0 & Cdata2$Segmentation == "sentence"),]$LenS) 

median_seconds_per_char_para <- median((Cdata2[ which(Cdata2$FDur_s >0 & Cdata2$Segmentation == "paragraph"),]$FDur_s)/Cdata2[ which(Cdata2$FDur_s >0 & Cdata2$Segmentation == "paragraph"),]$LenS)

median_seconds_per_char_sent <- median((Cdata2[ which(Cdata2$FDur_s >0 & Cdata2$Segmentation == "sentence"),]$FDur_s)/Cdata2[ which(Cdata2$FDur_s >0 & Cdata2$Segmentation == "sentence"),]$LenS)

reduced_pc <- 100*(1-round(mean_seconds_per_char_para/mean_seconds_per_char_sent, digits=3))

reduced_pc_median <- 100*(1-round(median_seconds_per_char_para/median_seconds_per_char_sent, digits=3))

print(glue("On average there are more seconds per character for sentence segmentation ({mean_seconds_per_char_sent}, SD = {sd_seconds_per_char_sent}) than for paragraph segmentation ({mean_seconds_per_char_para}, SD = {sd_seconds_per_char_para}). This is a {reduced_pc}% reduction for the paragraph condition. Based on the medians, this difference changes to {reduced_pc_median}%."))

## On average there are more seconds per character for sentence segmentation (2.7219708576529, SD = 2.65140700953951) than for paragraph segmentation (2.29590941327593, SD = 1.77051453035775). This is a 15.7% reduction for the paragraph condition. Based on the medians, this difference changes to 5.3%.

ggplot(Cdata2[ which(Cdata2$FDur_s >0),], aes(x=Part,y=log(FDur_s/LenS),fill=Segmentation)) + geom_boxplot() + geom_jitter(alpha=0.25) + scale_fill_manual(values=cbPalette) + ylab("Seconds per source character (log)") + xlab("Participants")

The text was faster to translate/post-edit when segmented into sentences for five out of ten participants, so translators are evenly split.

NKeys

cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7") #colour-blind friendly from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette 
ggplot(Cdata2, aes(x=Segmentation,y=sqrt(Nkeys/LenS),color=Segmentation))+geom_boxplot()+geom_jitter()+scale_colour_manual(values=cbPalette) + ylab("Keystroke count per source character (sqrt)")

The number of keystrokes per character is slightly higher when the texts are segmented into sentences.

mean_Nkeys_per_char_para <- mean((Cdata2[ which(Cdata2$Segmentation == "paragraph"),]$Nkeys)/Cdata2[ which(Cdata2$Segmentation == "paragraph"),]$LenS)

mean_Nkeys_per_char_sent <- mean((Cdata2[ which(Cdata2$Segmentation == "sentence"),]$Nkeys)/Cdata2[ which(Cdata2$Segmentation == "sentence"),]$LenS) 

sd_Nkeys_per_char_para <- sd((Cdata2[ which(Cdata2$Segmentation == "paragraph"),]$Nkeys)/Cdata2[ which(Cdata2$Segmentation == "paragraph"),]$LenS)

sd_Nkeys_per_char_sent <- sd((Cdata2[ which(Cdata2$Segmentation == "sentence"),]$Nkeys)/Cdata2[ which(Cdata2$Segmentation == "sentence"),]$LenS) 

median_Nkeys_per_char_para <- median((Cdata2[ which(Cdata2$Segmentation == "paragraph"),]$Nkeys)/Cdata2[ which(Cdata2$Segmentation == "paragraph"),]$LenS)

median_Nkeys_per_char_sent <- median((Cdata2[ which(Cdata2$Segmentation == "sentence"),]$Nkeys)/Cdata2[ which(Cdata2$Segmentation == "sentence"),]$LenS)


reduced_pc <- 100*(1-round(mean_Nkeys_per_char_para/mean_Nkeys_per_char_sent, digits=3))

reduced_pc_median <- 100*(1-round(median_Nkeys_per_char_para/median_Nkeys_per_char_sent, digits=3))

print(glue("On average there are more keystrokes per character for sentence segmentation ({mean_Nkeys_per_char_sent}, SD = {sd_Nkeys_per_char_sent}) than for paragraph segmentation ({mean_Nkeys_per_char_para}, SD = {sd_Nkeys_per_char_para}). This is a {reduced_pc}% reduction for the paragraph condition. Based on the medians, this difference changes to {reduced_pc_median}%."))

## On average there are more keystrokes per character for sentence segmentation (2.24974361552381, SD = 2.54993671994802) than for paragraph segmentation (1.61895213781313, SD = 1.2818125627208). This is a 28% reduction for the paragraph condition. Based on the medians, this difference changes to 13.8%.

The difference between sentence and paragraph segmentation is more pronounced in terms of keystrokes than in terms of translation time.

ggplot(Cdata2, aes(x=Part,y=sqrt(Nkeys/LenS),fill=Segmentation)) + geom_boxplot() + geom_jitter(alpha=0.25) + scale_fill_manual(values=cbPalette) + ylab("Keystroke count per source character (sqrt)") + xlab("Participants")

For six out of ten translators, there are more keystrokes per character when the texts are segmented into sentences.

Checking results for Pauses

ggplot(Cdata2, aes(x=Part,y=log(Pauses/LenS),fill=Segmentation)) + geom_boxplot() + geom_jitter(alpha=0.25) + scale_fill_manual(values=cbPalette) + ylab("Pauses per source character (log)") + xlab("Participants")

For five out of ten translators, there are more pauses per character when the texts are segmented into sentences.

Number of edits

We use the Nedit variable (number of times a segment is edited) to compare sentence and paragraph segmentation in terms of how often translators returned to the segments (paragraphs or sentences, depending on the task condition) to edit them further. We plot the number of edits (Nedit) per source character (LenS) on the y-axis as a function of the number of seconds (FDur_s) per source character on the x-axis. We take the logs of both variables to facilitate visualisation.

ggplot(Cdata2[ which(Cdata2$FDur_s >0),], aes(x=log(FDur_s/LenS), y=log(Nedit/LenS),color=Segmentation)) + geom_point() +geom_smooth(aes(group=Segmentation),formula='y~x',method='lm',size=0.5, alpha=0.2)+scale_colour_manual(values=cbPalette) + ylab("Number of editing visits per source character (log)") + xlab("Seconds per source character (log)")

Sentences with more seconds per character are also associated with more visits. This is also the case for paragraphs, but translators returned less often to paragraphs than to sentences.
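The straight lines in the plot correspond to separate linear fits per segmentation condition on the log-log scale. The sketch below shows how such fits could be obtained and compared numerically. It uses simulated data (the data frame `sim`, the column names `log_sec` and `log_edits`, and all coefficients are assumptions for illustration), so it says nothing about the actual effect sizes in the study.

```r
set.seed(7)
n <- 40

# Simulated stand-in for log(FDur_s/LenS) and log(Nedit/LenS); not study data.
sim <- data.frame(
  Segmentation = rep(c("paragraph", "sentence"), each = n),
  log_sec      = rnorm(2 * n, mean = 0.8, sd = 0.6)
)
sim$log_edits <- -4 + 0.5 * sim$log_sec +
  ifelse(sim$Segmentation == "sentence", 0.7, 0) +
  rnorm(2 * n, sd = 0.3)

# One linear model per condition, as drawn by geom_smooth(method = 'lm')
# with a group aesthetic.
fits  <- lapply(split(sim, sim$Segmentation),
                function(d) lm(log_edits ~ log_sec, data = d))
coefs <- sapply(fits, coef)
coefs
```

A positive `log_sec` slope in both columns corresponds to "more time, more visits", and a higher intercept for one condition corresponds to more visits at the same time per character.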

Reproducibility

Testing reproducibility

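The following code snippets check that the input data and results match ours. Values are compared after rounding to 14 decimal places: sig_fig is passed to round() as digits, which counts decimal places rather than significant figures.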

sig_fig <- 14

# Check descriptive stats:
m_ks_P <- 2.25397724051884
testit::assert(glue("Mean keystrokes/character for post-editing is {m_ks_P}"), round(mean_keys_per_char_P, digits=sig_fig) == m_ks_P)

m_ks_T <- 3.71381669728488
testit::assert(glue("Mean keystrokes/character for translation is {m_ks_T}"), round(mean_keys_per_char_T, digits=sig_fig) == m_ks_T)
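Exact `==` comparisons on doubles can fail on another platform or R version even when the results agree for all practical purposes. A tolerance-based check is a possible alternative. The sketch below is a suggestion rather than part of the original checks: the helper `agrees`, the tolerance `1e-7` and the stand-in value `computed` are assumptions for illustration. It also shows the difference between `round()` and `signif()`.

```r
expected <- 2.25397724051884
computed <- 2.253977240518841  # stand-in for a freshly computed mean

# digits means different things in the two functions:
round(123.456789, digits = 2)   # 123.46 (decimal places)
signif(123.456789, digits = 2)  # 120    (significant figures)

# TRUE when x and y agree within a relative tolerance.
agrees <- function(x, y, tol = 1e-7) {
  isTRUE(all.equal(x, y, tolerance = tol))
}

agrees(computed, expected)  # TRUE
agrees(2.2540, expected)    # FALSE: relative difference is about 1e-5
```

A call such as `testit::assert("means agree", agrees(mean_keys_per_char_P, m_ks_P))` would then replace the exact comparison.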

Snapshot R library environment

renv::clean()

## * No stale lockfiles were found.
## * No temporary directories were found in the project library.
## The following non-system packages are installed in the system library:
## 
##  abind, acepack, askpass, assertive, assertive.base,
##  assertive.code, assertive.data, assertive.data.uk,
##  assertive.data.us, assertive.datetimes, assertive.files,
##  assertive.matrices, assertive.models, assertive.numbers,
##  assertive.properties, assertive.reflection, assertive.sets,
##  assertive.strings, assertive.types, assertthat, automap,
##  backports, base64enc, BH, bindr, bindrcpp, bitops, brew,
##  broom, callr, caTools, cellranger, checkmate, classInt,
##  cli, clipr, clisymbols, coda, colorspace, crayon,
##  crosstalk, curl, data.table, DBI, dbplyr, desc, deSolve,
##  devtools, DiagrammeR, dichromat, digest, doParallel,
##  dotCall64, downloader, dplyr, DT, e1071, editData,
##  ellipsis, evaluate, fansi, fasterize, fields, FNN, forcats,
##  foreach, Formula, fs, gapminder, gdtools, generics,
##  geosphere, ggdendro, ggformula, ggfortify, ggmap, ggplot2,
##  ggrepel, ggstance, gh, GISTools, git2r, glue, gmp, gplots,
##  gridExtra, gstat, gtable, haven, hexbin, HH, highr, Hmisc,
##  hms, htmlTable, htmltools, htmlwidgets, httpuv, httr,
##  hydroTSM, igraph, influenceR, ini, iterators, jpeg,
##  jsonlite, knitr, labeling, later, latticeExtra, lazyeval,
##  leaflet, leafsync, leaps, lifecycle, lmtest, lubridate,
##  lwgeom, magrittr, mapdata, mapproj, maps, maptools,
##  maptree, mapview, markdown, memisc, memoise, mime, miniUI,
##  modelr, mosaic, mosaicCore, mosaicData, multcomp, munsell,
##  mvtnorm, ncdf4, nycflights13, openair, openssl,
##  OpenStreetMap, packrat, pillar, pkgbuild, pkgconfig,
##  pkgload, plogr, plyr, png, prettyunits, processx, progress,
##  promises, proto, ps, purrr, R2MLwiN, R2WinBUGS, R6, raster,
##  rasterVis, rcmdcheck, RColorBrewer, Rcpp, RcppArmadillo,
##  RCurl, readr, readxl, rematch, remotes, repr, reprex,
##  reshape, reshape2, rgdal, rgeos, rgexf, RgoogleMaps, rjags,
##  rJava, rjson, rlang, rmarkdown, Rmisc, Rmpfr,
##  rnaturalearthdata, RNetCDF, rprojroot, rsconnect,
##  rstudioapi, rvest, rworldmap, rworldxtra, sandwich,
##  satellite, scales, selectr, sessioninfo, sf, shapefiles,
##  shiny, sourcetools, sp, spacetime, spam, spData, stringi,
##  stringr, svglite, sys, texreg, TH.data, tibble, tidyr,
##  tidyselect, tidyverse, tinyProject, tinytex, tmap,
##  tmaptools, translations, tsc, tufte, tufterhandout, units,
##  usethis, utf8, vcd, vctrs, viridis, viridisLite,
##  visNetwork, webshot, whisker, withr, xfun, XML, xml2,
##  xopen, xtable, xts, yaml, zeallot, zoo
## 
## Normally, only packages distributed with R should be installed in the system library.
## These packages will be removed.
## If necessary, consider re-installing these packages in your site library.
## 
## * Removing package(s) from library 'C:/Program Files/R/R-3.6.1/library' ...
## Removing package 'abind' ... Done!
## Removing package 'acepack' ... Done!
## Removing package 'askpass' ... Done!
## Removing package 'assertive' ... Done!
## Removing package 'assertive.base' ... Done!
## Removing package 'assertive.code' ... Done!
## Removing package 'assertive.data' ... Done!
## Removing package 'assertive.data.uk' ... Done!
## Removing package 'assertive.data.us' ... Done!
## Removing package 'assertive.datetimes' ... Done!
## Removing package 'assertive.files' ... Done!
## Removing package 'assertive.matrices' ... Done!
## Removing package 'assertive.models' ... Done!
## Removing package 'assertive.numbers' ... Done!
## Removing package 'assertive.properties' ... Done!
## Removing package 'assertive.reflection' ... Done!
## Removing package 'assertive.sets' ... Done!
## Removing package 'assertive.strings' ... Done!
## Removing package 'assertive.types' ... Done!
## Removing package 'assertthat' ... Done!
## Removing package 'automap' ... Done!
## Removing package 'backports' ... Done!
## Removing package 'base64enc' ... Done!
## Removing package 'BH' ... Done!
## Removing package 'bindr' ... Done!
## Removing package 'bindrcpp' ... Done!
## Removing package 'bitops' ... Done!
## Removing package 'brew' ... Done!
## Removing package 'broom' ... Done!
## Removing package 'callr' ... Done!
## Removing package 'caTools' ... Done!
## Removing package 'cellranger' ... Done!
## Removing package 'checkmate' ... Done!
## Removing package 'classInt' ... Done!
## Removing package 'cli' ... Done!
## Removing package 'clipr' ... Done!
## Removing package 'clisymbols' ... Done!
## Removing package 'coda' ... Done!
## Removing package 'colorspace' ... Done!
## Removing package 'crayon' ... Done!
## Removing package 'crosstalk' ... Done!
## Removing package 'curl' ... Done!
## Removing package 'data.table' ... Done!
## Removing package 'DBI' ... Done!
## Removing package 'dbplyr' ... Done!
## Removing package 'desc' ... Done!
## Removing package 'deSolve' ... Done!
## Removing package 'devtools' ... Done!
## Removing package 'DiagrammeR' ... Done!
## Removing package 'dichromat' ... Done!
## Removing package 'digest' ... Done!
## Removing package 'doParallel' ... Done!
## Removing package 'dotCall64' ... Done!
## Removing package 'downloader' ... Done!
## Removing package 'dplyr' ... Done!
## Removing package 'DT' ... Done!
## Removing package 'e1071' ... Done!
## Removing package 'editData' ... Done!
## Removing package 'ellipsis' ... Done!
## Removing package 'evaluate' ... Done!
## Removing package 'fansi' ... Done!
## Removing package 'fasterize' ... Done!
## Removing package 'fields' ... Done!
## Removing package 'FNN' ... Done!
## Removing package 'forcats' ... Done!
## Removing package 'foreach' ... Done!
## Removing package 'Formula' ... Done!
## Removing package 'fs' ... Done!
## Removing package 'gapminder' ... Done!
## Removing package 'gdtools' ... Done!
## Removing package 'generics' ... Done!
## Removing package 'geosphere' ... Done!
## Removing package 'ggdendro' ... Done!
## Removing package 'ggformula' ... Done!
## Removing package 'ggfortify' ... Done!
## Removing package 'ggmap' ... Done!
## Removing package 'ggplot2' ... Done!
## Removing package 'ggrepel' ... Done!
## Removing package 'ggstance' ... Done!
## Removing package 'gh' ... Done!
## Removing package 'GISTools' ... Done!
## Removing package 'git2r' ... Done!
## Removing package 'glue' ... Done!
## Removing package 'gmp' ... Done!
## Removing package 'gplots' ... Done!
## Removing package 'gridExtra' ... Done!
## Removing package 'gstat' ... Done!
## Removing package 'gtable' ... Done!
## Removing package 'haven' ... Done!
## Removing package 'hexbin' ... Done!
## Removing package 'HH' ... Done!
## Removing package 'highr' ... Done!
## Removing package 'Hmisc' ... Done!
## Removing package 'hms' ... Done!
## Removing package 'htmlTable' ... Done!
## Removing package 'htmltools' ... Done!
## Removing package 'htmlwidgets' ... Done!
## Removing package 'httpuv' ... Done!
## Removing package 'httr' ... Done!
## Removing package 'hydroTSM' ... Done!
## Removing package 'igraph' ... Done!
## Removing package 'influenceR' ... Done!
## Removing package 'ini' ... Done!
## Removing package 'iterators' ... Done!
## Removing package 'jpeg' ... Done!
## Removing package 'jsonlite' ... Done!
## Removing package 'knitr' ... Done!
## Removing package 'labeling' ... Done!
## Removing package 'later' ... Done!
## Removing package 'latticeExtra' ... Done!
## Removing package 'lazyeval' ... Done!
## Removing package 'leaflet' ... Done!
## Removing package 'leafsync' ... Done!
## Removing package 'leaps' ... Done!
## Removing package 'lifecycle' ... Done!
## Removing package 'lmtest' ... Done!
## Removing package 'lubridate' ... Done!
## Removing package 'lwgeom' ... Done!
## Removing package 'magrittr' ... Done!
## Removing package 'mapdata' ... Done!
## Removing package 'mapproj' ... Done!
## Removing package 'maps' ... Done!
## Removing package 'maptools' ... Done!
## Removing package 'maptree' ... Done!
## Removing package 'mapview' ... Done!
## Removing package 'markdown' ... Done!
## Removing package 'memisc' ... Done!
## Removing package 'memoise' ... Done!
## Removing package 'mime' ... Done!
## Removing package 'miniUI' ... Done!
## Removing package 'modelr' ... Done!
## Removing package 'mosaic' ... Done!
## Removing package 'mosaicCore' ... Done!
## Removing package 'mosaicData' ... Done!
## Removing package 'multcomp' ... Done!
## Removing package 'munsell' ... Done!
## Removing package 'mvtnorm' ... Done!
## Removing package 'ncdf4' ... Done!
## Removing package 'nycflights13' ... Done!
## Removing package 'openair' ... Done!
## Removing package 'openssl' ... Done!
## Removing package 'OpenStreetMap' ... Done!
## Removing package 'packrat' ... Done!
## Removing package 'pillar' ... Done!
## Removing package 'pkgbuild' ... Done!
## Removing package 'pkgconfig' ... Done!
## Removing package 'pkgload' ... Done!
## Removing package 'plogr' ... Done!
## Removing package 'plyr' ... Done!
## Removing package 'png' ... Done!
## Removing package 'prettyunits' ... Done!
## Removing package 'processx' ... Done!
## Removing package 'progress' ... Done!
## Removing package 'promises' ... Done!
## Removing package 'proto' ... Done!
## Removing package 'ps' ... Done!
## Removing package 'purrr' ... Done!
## Removing package 'R2MLwiN' ... Done!
## Removing package 'R2WinBUGS' ... Done!
## Removing package 'R6' ... Done!
## Removing package 'raster' ... Done!
## Removing package 'rasterVis' ... Done!
## Removing package 'rcmdcheck' ... Done!
## Removing package 'RColorBrewer' ... Done!
## Removing package 'Rcpp' ... Done!
## Removing package 'RcppArmadillo' ... Done!
## Removing package 'RCurl' ... Done!
## Removing package 'readr' ... Done!
## Removing package 'readxl' ... Done!
## Removing package 'rematch' ... Done!
## Removing package 'remotes' ... Done!
## Removing package 'repr' ... Done!
## Removing package 'reprex' ... Done!
## Removing package 'reshape' ... Done!
## Removing package 'reshape2' ... Done!
## Removing package 'rgdal' ... Done!
## Removing package 'rgeos' ... Done!
## Removing package 'rgexf' ... Done!
## Removing package 'RgoogleMaps' ... Done!
## Removing package 'rjags' ... Done!
## Removing package 'rJava' ... Done!
## Removing package 'rjson' ... Done!
## Removing package 'rlang' ... Done!
## Removing package 'rmarkdown' ... Done!
## Removing package 'Rmisc' ... Done!
## Removing package 'Rmpfr' ... Done!
## Removing package 'rnaturalearthdata' ... Done!
## Removing package 'RNetCDF' ... Done!
## Removing package 'rprojroot' ... Done!
## Removing package 'rsconnect' ... Done!
## Removing package 'rstudioapi' ... Done!
## Removing package 'rvest' ... Done!
## Removing package 'rworldmap' ... Done!
## Removing package 'rworldxtra' ... Done!
## Removing package 'sandwich' ... Done!
## Removing package 'satellite' ... Done!
## Removing package 'scales' ... Done!
## Removing package 'selectr' ... Done!
## Removing package 'sessioninfo' ... Done!
## Removing package 'sf' ... Done!
## Removing package 'shapefiles' ... Done!
## Removing package 'shiny' ... Done!
## Removing package 'sourcetools' ... Done!
## Removing package 'sp' ... Done!
## Removing package 'spacetime' ... Done!
## Removing package 'spam' ... Done!
## Removing package 'spData' ... Done!
## Removing package 'stringi' ... Done!
## Removing package 'stringr' ... Done!
## Removing package 'svglite' ... Done!
## Removing package 'sys' ... Done!
## Removing package 'texreg' ... Done!
## Removing package 'TH.data' ... Done!
## Removing package 'tibble' ... Done!
## Removing package 'tidyr' ... Done!
## Removing package 'tidyselect' ... Done!
## Removing package 'tidyverse' ... Done!
## Removing package 'tinyProject' ... Done!
## Removing package 'tinytex' ... Done!
## Removing package 'tmap' ... Done!
## Removing package 'tmaptools' ... Done!
## Removing package 'translations' ... Done!
## Removing package 'tsc' ... Done!
## Removing package 'tufte' ... Done!
## Removing package 'tufterhandout' ... Done!
## Removing package 'units' ... Done!
## Removing package 'usethis' ... Done!
## Removing package 'utf8' ... Done!
## Removing package 'vcd' ... Done!
## Removing package 'vctrs' ... Done!
## Removing package 'viridis' ... Done!
## Removing package 'viridisLite' ... Done!
## Removing package 'visNetwork' ... Done!
## Removing package 'webshot' ... Done!
## Removing package 'whisker' ... Done!
## Removing package 'withr' ... Done!
## Removing package 'xfun' ... Done!
## Removing package 'XML' ... Done!
## Removing package 'xml2' ... Done!
## Removing package 'xopen' ... Done!
## Removing package 'xtable' ... Done!
## Removing package 'xts' ... Done!
## Removing package 'yaml' ... Done!
## Removing package 'zeallot' ... Done!
## Removing package 'zoo' ... Done!
## * Done! Removed 245 packages.
## * No unused packages were found in the project library.
## * The project has been cleaned.

renv::snapshot()

## The following package(s) will be updated in the lockfile:
## 
## # CRAN ===============================
## - BH             [* -> 1.69.0-1]
## - DBI            [* -> 1.0.0]
## - Deriv          [* -> 4.0]
## - DiagrammeR     [* -> 1.0.1]
## - MASS           [* -> 7.3-51.4]
## - Matrix         [* -> 1.2-17]
## - MatrixModels   [* -> 0.4-1]
## - MuMIn          [* -> 1.43.15]
## - R6             [* -> 2.4.0]
## - RColorBrewer   [* -> 1.1-2]
## - Rcpp           [* -> 1.0.2]
## - RcppEigen      [* -> 0.3.3.7.0]
## - Rook           [* -> 1.1-1]
## - SparseM        [* -> 1.78]
## - XML            [* -> 3.98-1.20]
## - abind          [* -> 1.4-5]
## - askpass        [* -> 1.1]
## - assertthat     [* -> 0.2.1]
## - backports      [* -> 1.1.4]
## - base64enc      [* -> 0.1-3]
## - boot           [* -> 1.3-23]
## - brew           [* -> 1.0-6]
## - broom          [* -> 0.5.6]
## - callr          [* -> 3.3.2]
## - car            [* -> 3.0-6]
## - carData        [* -> 3.0-3]
## - cellranger     [* -> 1.1.0]
## - cli            [* -> 1.1.0]
## - clipr          [* -> 0.7.0]
## - colorspace     [* -> 1.4-1]
## - crayon         [* -> 1.3.4]
## - curl           [* -> 4.2]
## - data.table     [* -> 1.12.2]
## - dbplyr         [* -> 1.4.2]
## - digest         [* -> 0.6.21]
## - doBy           [* -> 4.6-4.1]
## - downloader     [* -> 0.4]
## - dplyr          [* -> 0.8.3]
## - ellipsis       [* -> 0.3.0]
## - evaluate       [* -> 0.14]
## - fansi          [* -> 0.4.0]
## - forcats        [* -> 0.4.0]
## - foreign        [* -> 0.8-72]
## - fs             [* -> 1.3.1]
## - generics       [* -> 0.0.2]
## - ggeffects      [* -> 0.16.0]
## - ggplot2        [* -> 3.2.1]
## - glue           [* -> 1.3.1]
## - gridExtra      [* -> 2.3]
## - gtable         [* -> 0.3.0]
## - haven          [* -> 2.1.1]
## - highr          [* -> 0.8]
## - hms            [* -> 0.5.1]
## - htmltools      [* -> 0.3.6]
## - htmlwidgets    [* -> 1.3]
## - httr           [* -> 1.4.1]
## - igraph         [* -> 1.2.4.1]
## - influenceR     [* -> 0.1.0]
## - insight        [* -> 0.9.6]
## - irr            [* -> 0.84.1]
## - jsonlite       [* -> 1.6]
## - knitr          [* -> 1.25]
## - labeling       [* -> 0.3]
## - lattice        [* -> 0.20-38]
## - lazyeval       [* -> 0.2.2]
## - lifecycle      [* -> 0.2.0]
## - lme4           [* -> 1.1-23]
## - lmerTest       [* -> 3.1-1]
## - lpSolve        [* -> 5.6.15]
## - lubridate      [* -> 1.7.4]
## - magrittr       [* -> 1.5]
## - maptools       [* -> 0.9-5]
## - markdown       [* -> 1.1]
## - mgcv           [* -> 1.8-29]
## - mime           [* -> 0.7]
## - minqa          [* -> 1.2.4]
## - modelr         [* -> 0.1.5]
## - munsell        [* -> 0.5.0]
## - nlme           [* -> 3.1-141]
## - nloptr         [* -> 1.2.1]
## - nnet           [* -> 7.3-12]
## - numDeriv       [* -> 2016.8-1.1]
## - openssl        [* -> 1.4.1]
## - openxlsx       [* -> 4.1.4]
## - pbkrtest       [* -> 0.4-7]
## - pillar         [* -> 1.4.4]
## - pkgconfig      [* -> 2.0.3]
## - plogr          [* -> 0.2.0]
## - plyr           [* -> 1.8.4]
## - pracma         [* -> 2.2.9]
## - prettyunits    [* -> 1.0.2]
## - processx       [* -> 3.4.1]
## - progress       [* -> 1.2.2]
## - ps             [* -> 1.3.0]
## - purrr          [* -> 0.3.2]
## - quantreg       [* -> 5.54]
## - readr          [* -> 1.3.1]
## - readxl         [* -> 1.3.1]
## - rematch        [* -> 1.0.1]
## - reprex         [* -> 0.3.0]
## - reshape2       [* -> 1.4.3]
## - rgexf          [* -> 0.15.3]
## - rio            [* -> 0.5.16]
## - rlang          [* -> 0.4.6]
## - rmarkdown      [* -> 1.15]
## - rstudioapi     [* -> 0.10]
## - rvest          [* -> 0.3.4]
## - scales         [* -> 1.0.0]
## - selectr        [* -> 0.4-1]
## - sjlabelled     [* -> 1.1.7]
## - sp             [* -> 1.3-1]
## - statmod        [* -> 1.4.34]
## - stringi        [* -> 1.4.3]
## - stringr        [* -> 1.4.0]
## - sys            [* -> 3.3]
## - testit         [* -> 0.12]
## - tibble         [* -> 3.0.1]
## - tidyr          [* -> 1.0.0]
## - tidyselect     [* -> 1.1.0]
## - tidyverse      [* -> 1.2.1]
## - tinytex        [* -> 0.16]
## - utf8           [* -> 1.1.4]
## - vctrs          [* -> 0.3.1]
## - viridis        [* -> 0.5.1]
## - viridisLite    [* -> 0.3.0]
## - visNetwork     [* -> 2.0.8]
## - whisker        [* -> 0.4]
## - withr          [* -> 2.1.2]
## - xfun           [* -> 0.9]
## - xml2           [* -> 1.2.2]
## - yaml           [* -> 2.2.0]
## - zip            [* -> 2.0.4]
## 
## * Lockfile written to 'C:/Users/ln15242/Downloads/translating-science-fiction-2/scripts/renv.lock'.
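
The lockfile written above can, in principle, be used to rebuild the same package library on another machine with `renv::restore()`. The following is a minimal sketch rather than part of the original analysis: the helper name `restore_from_lockfile` and its default path are assumptions made for illustration.

```r
# Hypothetical helper (not part of the original analysis): restore the
# package library recorded in a lockfile, provided that both the lockfile
# and the renv package are available.
restore_from_lockfile <- function(lockfile = "renv.lock") {
  if (!file.exists(lockfile)) {
    message("No lockfile found at: ", lockfile)
    return(FALSE)
  }
  if (!requireNamespace("renv", quietly = TRUE)) {
    message("Package 'renv' is not installed; install it before restoring.")
    return(FALSE)
  }
  # prompt = FALSE skips the interactive confirmation step
  renv::restore(lockfile = lockfile, prompt = FALSE)
  TRUE
}

# Example call, assuming the working directory is the scripts folder:
# restore_from_lockfile("renv.lock")
```

The helper returns `FALSE` without touching the library when the lockfile or `renv` is missing, so it is safe to call from a script that may run outside the project.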

Software citations

knitr::write_bib(c(.packages(), "bookdown"), "../software-citations.bib")

## Warning in knitr::write_bib(c(.packages(), "bookdown"), "../software-
## citations.bib"): package(s) bookdown not found
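
The warning above indicates that `bookdown` was not installed in the project library when the bibliography was written. One possible way to avoid it is to keep only the packages that are actually installed before calling `knitr::write_bib()`. This is a sketch under that assumption; the helper name `installed_only` is hypothetical and does not appear in the original script.

```r
# Hypothetical helper: keep only the package names that can be found in
# the current library paths. system.file() returns "" for a package that
# is not installed.
installed_only <- function(pkgs) {
  is_installed <- vapply(
    pkgs,
    function(p) nzchar(system.file(package = p)),
    logical(1)
  )
  unname(pkgs[is_installed])
}

# Attached packages plus bookdown, filtered to those that are installed
pkgs_to_cite <- installed_only(c(.packages(), "bookdown"))

# Usage, with the same output path as the original call:
# knitr::write_bib(pkgs_to_cite, "../software-citations.bib")
```

Filtering silently drops `bookdown` from the bibliography if it is absent, so installing the package remains the better option when its citation is needed.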
