Collecting Social Associations from Chinese Official Histories

從中國的正史記錄中收集社會網路關係資料

Description:

The Chinese official histories contain a large amount of social association data. The main challenge in collecting these data automatically is that they are recorded in complex narrative passages written in Classical Chinese.

中國的正史中記載了許多社會網路關係數據,但由於這些數據是敘述性的文言文,它們很難被自動抓取。

Yinde Group, the commercial branch of the China Biographical Database (CBDB) project, is now funding a crowdsourcing team to create social association training data from the texts of the Chinese official histories, for use by future machine learning models that extract these associations automatically. In this sub-project, we use Markus to crowdsource judgments on whether a sentence contains the information we want to collect. (These sentences have been automatically tagged in advance as either containing information or not.)

中國歷代人物傳記資料庫的商業化分支引得小組正在資助一個子項目來為未來自動抓取社會網路關係的機器學習模型收集訓練集資料。此子項目中,我們使用 Markus 平台甄別句子中是否有我們所需要的資訊(這些句子被事先自動識別為有資訊的句子或無資訊的句子):

create_association_data_by_using_markus

We have also developed a program that automatically classifies the types of data a sentence contains, and an interface that lets our crowdsourcing editors validate and correct the results.

此外,我們開發了自動識別句子中包含數據類型的程式,並設計了眾包平台,由眾包編輯對自動識別結果進行修正。

check_association_data_results
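
The classify-then-validate workflow described above can be illustrated with a minimal Python sketch. It is not the project's actual program: the category names, the cue strings in `CUES`, the `SentenceRecord` structure, and the simple substring matching are all assumptions made for illustration, since the real classification method is not described here.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set

# Hypothetical categories and cue strings, for illustration only.
CUES = {
    "teacher_student": ["師事", "受業"],
    "recommendation": ["薦", "舉"],
    "friendship": ["友善", "交遊"],
}


@dataclass
class SentenceRecord:
    text: str
    predicted: Set[str] = field(default_factory=set)  # automatic labels
    validated: Optional[Set[str]] = None              # set by an editor


def classify(text: str) -> Set[str]:
    """Return every category whose cue strings occur in the sentence."""
    return {
        label
        for label, cues in CUES.items()
        if any(cue in text for cue in cues)
    }


def validate(record: SentenceRecord, editor_labels: Iterable[str]) -> bool:
    """Store the editor's labels; return True if they match the prediction."""
    record.validated = set(editor_labels)
    return record.predicted == record.validated


# Usage with a made-up sentence (甲 and 乙 are placeholder names).
sentence = "甲與乙友善,乙薦甲於朝"
record = SentenceRecord(sentence, classify(sentence))
agreed = validate(record, ["friendship"])
```

Keeping the automatic prediction and the editor's decision in separate fields means disagreements remain visible, so they could later be used to evaluate or retrain the classifier.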

We will report more when we release the first batch of crowdsourced data. All source code and training data will be made openly available on GitHub.

未來我們公佈第一批眾包數據時,會向您帶來更多資訊。所有程式碼及訓練集數據都將在 GitHub 上開源、開放。

 

GitHub Links:

Crowdsourcing data generator

Cleaned named entities from CBDB