...

I don't know these languages, but I ran [SQL7] and then put the contents of TOP_10_UNIQUE_TOKEN_DIFFS_A and TOP_10_UNIQUE_TOKEN_DIFFS_B through Google Translate. For example, the top 10 unique words in commoncrawl3_refetched/XH/XHYIWIBT5QPY64UYUPLXZXAYC2I5JPZS:

No Format

ميں: 532 | ہے: 520 | كے: 450 | ہيں: 370 | كہ: 365 | كو: 343 | سے: 342 | كا: 297 | ہم: 280 | جناب: 254

are translated as:

No Format
I: 532 | Is: 520 | Of: 450 | Are: 370 | Yes: 365 | Who: 343 | From: 342 | : 297 | We: 280 | Mr.: 254
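
To line up the original tokens with their translations, the two strings can be split into (token, count) pairs and matched by position. The following is a minimal sketch, assuming the `token: count | token: count` layout shown above; `parse_top_tokens` and `align_translations` are hypothetical helpers for illustration, not part of tika-eval.

```python
def parse_top_tokens(field):
    """Split a 'token: count | token: count' string into (token, count) pairs.

    Assumes the layout shown in the examples above. A token may be empty,
    as happens when the translation step returns nothing for a word.
    """
    pairs = []
    for chunk in field.split("|"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Split on the last colon so a token that itself contains ':' survives.
        token, _, count = chunk.rpartition(":")
        pairs.append((token.strip(), int(count)))
    return pairs


def align_translations(original, translated):
    """Pair each original token with its translation by position."""
    src = parse_top_tokens(original)
    dst = parse_top_tokens(translated)
    return [(s_tok, d_tok, count) for (s_tok, count), (d_tok, _) in zip(src, dst)]
```

Matching by position only works if the translation step preserves the order and number of entries, which is an assumption here.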

...

  • We observed a handful of cases where the number of "common words" increased between two tools even though the extracted content was probably worse. This happened when one tool inserted spaces incorrectly and the resulting fragments happened to be real words in the language. See, for example, commoncrawl3/TF/TFNFGXL27M77Q6X42ECYWJNSJ32WES74 (Russian) and commoncrawl3/FS/FSEHYPPOEV6EUYND5BRP3BBNAI5FVPYP (German) ("fachgruppe" vs. "fach gruppe", and "ermoglicht" vs. "ermog licht").
  • If there's an "extract exception" (an empty file or an incomplete JSON file), we record that information in the containers table, but we don't include a row for that file in the profiles table. As a result, some of the SQL that ships with tika-eval makes not-quite-fair comparisons: queries that account for "runtime exceptions" do not account for "extract exceptions."
  • See the point above about improving the "junk" metric.
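
The first limitation above, where wrongly inserted spaces inflate the "common words" count, can be illustrated with a toy example. This is only a sketch: `COMMON_WORDS` is a hypothetical three-word list chosen to make the effect visible, not tika-eval's actual German word list, and `count_common` is a simplified stand-in for the real tokenization and counting.

```python
# Hypothetical toy list of "common words"; NOT the list tika-eval ships with.
COMMON_WORDS = {"fach", "gruppe", "licht"}


def count_common(text, common=COMMON_WORDS):
    """Count whitespace-separated tokens that appear in the common-word set."""
    return sum(1 for token in text.lower().split() if token in common)


# Correct extraction: the compound words are kept whole.
correct = "fachgruppe ermoglicht"
# Faulty extraction: spaces were inserted inside the words.
faulty = "fach gruppe ermog licht"

correct_count = count_common(correct)
faulty_count = count_common(faulty)
```

Under this toy list the faulty extraction scores higher than the correct one, which is the pattern described in the bullet: more "common words" without better content.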

SQL

[SQL1]

No Format

select sum(cb.num_common_tokens) from contents_b cb
join profiles_b pb on pb.id=cb.id
left join profiles_a pa on pb.id=pa.id
left join contents_a ca on pa.id=ca.id
where pa.is_embedded = false and pb.is_embedded=false
and (ca.lang_id_1 = cb.lang_id_1
or ca.lang_id_1 is null)

[SQL2]

No Format

select sum(ca.num_common_tokens) from contents_a ca
join profiles_a pa on pa.id=ca.id
left join profiles_b pb on pa.id=pb.id
left join contents_b cb on pb.id=cb.id
where pa.is_embedded = false and pb.is_embedded=false
and (cb.lang_id_1 = ca.lang_id_1
or cb.lang_id_1 is null)

[SQL3]

No Format

select lang_id_1, sum(num_common_tokens) as total_common_tokens
from contents_b
group by lang_id_1
order by lang_id_1

[SQL4]

No Format

select ca.lang_id_1, sum(ca.num_common_tokens)
from contents_a ca
join contents_b  cb on ca.id=cb.id
where ca.lang_id_1=cb.lang_id_1
group by ca.lang_id_1
order by ca.lang_id_1

[SQL5]

No Format

select ca.lang_id_1, ca.top_n_tokens, cb.top_n_tokens from contents_b cb
join contents_a ca on cb.id=ca.id
where cb.lang_id_1 = 'bn'
order by rand()
limit 100;

[SQL6]

No Format

select  ca.id, file_path, 
1-(cast(ca.num_common_tokens as float) / cast(ca.num_alphabetic_tokens as float)) as OOV_A,
ca.num_alphabetic_tokens,
1-(cast(cb.num_common_tokens as float) / cast(cb.num_alphabetic_tokens as float)) as OOV_B,
cb.num_alphabetic_tokens,
ca.lang_id_1, ca.lang_id_prob_1,
cb.lang_id_1, cb.lang_id_prob_1,
ca.top_n_tokens, cb.top_n_tokens 
from contents_b cb
join contents_a ca on cb.id=ca.id
join profiles_a pa on ca.id=pa.id
join containers c on pa.container_id=c.container_id
where cb.lang_id_1 = 'bn' and
ca.num_alphabetic_tokens > 0
and cb.num_alphabetic_tokens > 0
order by OOV_B asc
limit 100;
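
The OOV_A and OOV_B columns in the query above are one minus the ratio of common tokens to alphabetic tokens. A minimal sketch of that arithmetic follows; `oov_rate` is a hypothetical helper name, and returning `None` when there are no alphabetic tokens is an assumption that mirrors the query's `num_alphabetic_tokens > 0` filter.

```python
def oov_rate(num_common_tokens, num_alphabetic_tokens):
    """Return 1 - common/alphabetic, or None when there are no alphabetic tokens.

    Mirrors the OOV_A / OOV_B expressions in the query above; the None
    return is an assumption standing in for the rows the query filters out.
    """
    if num_alphabetic_tokens <= 0:
        return None
    return 1.0 - (float(num_common_tokens) / float(num_alphabetic_tokens))
```

A lower value means a larger share of the alphabetic tokens were found in the common-word list, which is why the query sorts by OOV_B ascending.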

[SQL7]

No Format

select file_path, ca.top_n_tokens, cb.top_n_tokens,
(cb.num_common_tokens-ca.num_common_tokens) as delta_common_tokens,
top_10_unique_token_diffs_a, top_10_unique_token_diffs_b
from contents_a ca
join contents_b cb on ca.id=cb.id
join content_comparisons cc on cc.id=ca.id
join profiles_a pa on ca.id=pa.id
join containers c on pa.container_id=c.container_id
where ca.lang_id_1='ur'
and cb.lang_id_1='ur'
order by delta_common_tokens asc

How to make sense of the tika-eval reports

...