To do that, we designed a two-pass method. First, we compute all 1-, 2-, and 3-grams of each sentence while leaving out the first word. We then keep all isolated proper nouns. For each of them, we store an index of the sentences they appear in for further processing. This approach allows us to recover all occurrences of each capitalized word, as long as they do not systematically appear at the start of sentences. In practice, the resulting list needs to be refined afterward, as some capitalized common nouns still end up in it.
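A minimal sketch of this first pass might look as follows (helper and variable names are illustrative, and sentences are assumed to be already tokenized into word lists):

```python
from collections import defaultdict

def candidate_proper_nouns(sentences):
    """Collect capitalized words occurring outside sentence-initial
    position, with the indices of the sentences they appear in.
    `sentences` is a list of token lists."""
    index = defaultdict(set)
    for i, tokens in enumerate(sentences):
        # Skip the first word: its capitalization is not informative.
        for tok in tokens[1:]:
            if tok[:1].isupper() and tok[1:].islower():
                index[tok].add(i)
    return index

sents = [["Paris", "is", "big"],
         ["He", "lives", "in", "Paris"],
         ["She", "saw", "Marie", "in", "Paris"]]
idx = candidate_proper_nouns(sents)
# "Paris" and "Marie" are kept; sentence-initial capitals are ignored.
```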
The reasons for this may be multiple, ranging from sentence tokenization errors and typos in the source text to stylistic effects that may influence punctuation and case. This refining can be done accurately and efficiently by combining three strategies. Figure 1 illustrates that the typical distributions often allow for easy separation, with very few outliers.

Figure 1. Typical mean positions of uppercased words in their respective tokenized sentences.

Once identified, proper nouns usually fall into three main categories that serve different purposes in narration: they can be characters, places, or others (brands, abstract concepts, acronyms…).
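One of these refinement strategies can be sketched with the mean-position statistic behind Figure 1 (a hypothetical helper; a word whose mean relative position is close to 0 is likely a capitalized sentence opener rather than a proper noun):

```python
def mean_relative_position(word, sentences, occurrences):
    """Mean relative position (0 = sentence start, 1 = sentence end)
    of `word` across the sentences listed in `occurrences`."""
    positions = []
    for i in occurrences:
        tokens = sentences[i]
        positions.append(tokens.index(word) / max(len(tokens) - 1, 1))
    return sum(positions) / len(positions)

sents = [["We", "met", "Anna"], ["Go", "to", "Anna", "now"]]
pos = mean_relative_position("Anna", sents, [0, 1])
```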
We designed and evaluated six independent classifiers. Each classifier receives one word at a time as input, together with the context that is necessary and relevant for its way of processing data, and returns the predicted category (namely character, place, or other). We first present the implementation characteristics of each component before looking in detail at the resulting scores. When one encounters a proper noun in a sentence, a good guess about its nature can sometimes easily be made from the immediate context.
The simplest case, which we will refer to here as obvious context, occurs when the noun is immediately preceded by a title or a predicate that hints at what it refers to. For this classifier, we compiled a simple list of obvious context words that allow good guesses about the nature of the proper noun that immediately or closely follows.
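A sketch of this obvious-context check (the marker lists here are tiny illustrative samples, not the compiled lists used in the paper):

```python
# Hypothetical marker lists; the actual compiled lists are larger.
CHARACTER_MARKERS = {"monsieur", "madame", "docteur", "capitaine"}
PLACE_MARKERS = {"ville", "rue", "province", "royaume"}

def obvious_context(tokens, noun_index):
    """Guess the category of tokens[noun_index] from the word right
    before it; return None when the context gives no hint."""
    if noun_index == 0:
        return None
    prev = tokens[noun_index - 1].lower()
    if prev in CHARACTER_MARKERS:
        return "character"
    if prev in PLACE_MARKERS:
        return "place"
    return None
```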
In French, as in many other languages, the grammatical structure makes it more likely for sentences to follow a pattern that puts the subject of the action at the beginning and the location toward the end. Given enough examples, this characteristic can be used to make a simple, yet quite powerful guess about the global roles of the proper nouns.
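Such a positional rule could be sketched as follows (the thresholds are illustrative, not the values used in the paper):

```python
def position_vote(mean_pos, char_cut=0.45, place_cut=0.6):
    """Classify a noun from its mean relative position in sentences:
    subjects (characters) tend to appear early, places late.
    Cut-off values are assumptions for illustration only."""
    if mean_pos < char_cut:
        return "character"
    if mean_pos > place_cut:
        return "place"
    return "other"
```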
The accuracy of this classifier is strongly dependent on the writing style of the author, as the frequent use of specific figures of speech may break its working hypothesis, and longer sentences may narrow the gap between the categories or blur their boundaries. This can be seen clearly in Figure 2, where we show the relative positions of the identified classes of names for three different stories.
Figure 2. Relative mean position of character and place names for three classical French novels.

This approach is very different from section 3.
For this implementation, we compiled lists of words that are more likely (but not exclusively) to appear, respectively, near characters, places, or abstract concepts. For instance, we expect names of characters to be more often surrounded by words related to emotions, bodily functions, speech, or professions, whereas names of places would be more closely related to motion verbs, place features, and prepositions.
Starting from common nouns that are unambiguously related to one of the categories we are interested in, we used a French synonym dictionary service 10 to build a list aiming to be as extensive as possible. The final files resulted in 4, words for characters, for places, and 50 for concepts (see Appendix B for the complete list). The script then looks for these words in the neighborhood of the nouns to be disambiguated and returns the most probable category.

As characters and places serve different narrative purposes, one may expect the grammatical constructs surrounding them to differ in a significant way.
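The neighborhood lookup could be sketched as below (the seed lexicons are tiny illustrative samples of the Appendix B lists, and the window size is an assumption):

```python
from collections import Counter

# Illustrative seed lexicons; the paper's lists are far larger.
LEXICONS = {
    "character": {"dit", "pensa", "sourit", "médecin"},
    "place": {"alla", "vers", "quartier", "chemin"},
    "other": {"idée", "notion", "principe"},
}

def neighborhood_vote(sentences, word, window=5):
    """Count lexicon hits within `window` tokens of each occurrence
    of `word`; return the most frequent category, or None if no hit."""
    scores = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            for cat, lexicon in LEXICONS.items():
                scores[cat] += sum(1 for t in ctx if t.lower() in lexicon)
    if not scores or scores.most_common(1)[0][1] == 0:
        return None
    return scores.most_common(1)[0][0]
```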
For instance, place names are often preceded by prepositions or determiners, whereas character names are expected to be more often directly followed by verbs. We thus introduced a script classifying names based on its knowledge of the full text, grammatically tagged using TreeTagger, 11 and tokenized in sentences. To guess the nature of the names, it then matches all sentences containing them against a set of rules capturing typical constructions one uses when writing about a person or a place.
We tried out a set of seven manually established rules covering the most straightforward grammatical constructs (described in detail in Table 1), plus two that help filter out tokenization errors at the sentence level by flagging words that are preceded by a punctuation mark or that are alone in their sentence. When a rule is matched, it increases or decreases the probability score of one or several classifications, and the category yielding the highest score is chosen in the end.

Many proper nouns can be related, unambiguously or with a high probability, to one or several categories based on general knowledge.
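Two of the simplest such rules can be sketched over (token, POS-tag) pairs; the tag prefixes follow the TreeTagger French tagset (PRP for prepositions, VER:* for verbs), while the rule weights are illustrative, not those of Table 1:

```python
def rule_vote(tagged_sentences, word):
    """Score categories with simple grammatical rules: a preposition
    before the name suggests a place, a verb right after it suggests
    a character. Weights are assumptions for illustration."""
    scores = {"character": 0.0, "place": 0.0, "other": 0.0}
    for sent in tagged_sentences:
        for i, (tok, tag) in enumerate(sent):
            if tok != word:
                continue
            if i > 0 and sent[i - 1][1].startswith("PRP"):
                scores["place"] += 1.0
            if i + 1 < len(sent) and sent[i + 1][1].startswith("VER"):
                scores["character"] += 1.0
    return max(scores, key=scores.get)
```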
But the same knowledge may equivocally tell that those same words could also be related to a ship (RMS Queen Elisabeth), an abstract concept (project Manhattan), or a place (Amur river), probably with a lower likelihood if no other context is available. For many nouns, the knowledge we are looking for is well captured in the categorization of their related Wikipedia pages.
Using categories instead of the text of the articles also has the advantage of being very straightforward, and it greatly reduces the noise that text processing techniques would otherwise introduce. To test this idea, we implemented a simple algorithm that gathers the categories of the page whose name is closest to the noun we are looking for and searches them for tags denoting people, places, or abstract concepts.
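A sketch of this category lookup follows; the keyword lists are illustrative, and `get_categories` is an injected callable (page name -> list of category names) so the sketch stays independent of any particular Wikipedia API client:

```python
# Hypothetical keyword fragments matched against category names.
CATEGORY_KEYWORDS = {
    "character": ("naissance", "personnalité", "personnage"),
    "place": ("ville", "commune", "lieu", "pays"),
    "other": ("concept", "entreprise", "sigle"),
}

def category_vote(page, get_categories, depth=2):
    """Look for people/place/concept tags among the categories of
    `page`, recursively walking up the category hierarchy when no
    direct category gives a hint."""
    for cat in get_categories(page):
        low = cat.lower()
        for label, keywords in CATEGORY_KEYWORDS.items():
            if any(k in low for k in keywords):
                return label
    if depth > 0:
        for cat in get_categories(page):
            found = category_vote(cat, get_categories, depth - 1)
            if found:
                return found
    return None
```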
In case no category gives a hint (which tends to happen with both very complex and very specific pages), it tries to recursively walk up the hierarchy until the necessary clues are found.

Several works have already shown the relevance of locating direct and indirect speech parts to identify characters in novels (Glass and Bangay, ; Goh et al., ). Most of these approaches rely heavily on the lexical database WordNet 12 to find speech-related verbs and refine their accuracy, but for performance reasons, and since we wanted the classifiers to remain efficient even on very long texts, we implemented a simpler version that only checks the proximity of detected proper nouns to quotation marks.
For each proper noun w appearing m_w times, the system counts the number q_w of mentions that appear near quotations. It then computes the ratio r_w = q_w / m_w.

Once all classifiers have returned their answer for a given word, the last step is to compare these results and decide on a final answer. This meta-classification step can be done by voting systems, choosing the final result according to the majority of predictions using various strategies, or by a meta-recognition system aiming to discard classifiers that seem to have encountered a problem on the considered text file.
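The ratio of quote-adjacent mentions, q_w over m_w, can be computed as in this minimal sketch (quotation marks are assumed to be kept as separate tokens, and the window size is an assumption):

```python
def quote_ratio(sentences, word, window=3):
    """Fraction of `word`'s mentions occurring within `window` tokens
    of a quotation mark — a rough proxy for dialogue involvement."""
    quotes = {'"', "«", "»", "“", "”"}
    mentions = near = 0
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            mentions += 1
            ctx = tokens[max(0, i - window):i + 1 + window]
            if any(t in quotes for t in ctx):
                near += 1
    return near / mentions if mentions else 0.0
```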
We implemented four distinct meta-classification methods and discuss their performance below. The easiest and most obvious solution to combine the different classifications is a simple voting system, i.e., choosing the category predicted by the majority of classifiers.
However, since there is an even number of classifiers, ties are to be expected. This situation is quite unlikely, since it would require exactly three classifiers deciding correctly and the three others agreeing on the same wrong categorization. Still, in case this situation occurred, the final choice would be non-deterministic, for lack of a model to support one option over the other. For this reason, we introduced a second meta-classification, which requires each classifier to compute a confidence self-assessment score. For most classifiers, their internal mechanics allow them to evaluate the extent to which the strategy they are using seems likely to return reliable results given the current working context.
Hence, a simple strategy to help the voting process in the case of ties is for each classifier to return a confidence index between 0 and 1. This index is expected to equal 1 if the decision was made with no ambiguity and 0 if the clues were equally distributed. For instance, considering the Quotes classifier computes a ratio of 0.
Again, this index is expected to tend toward 0 for ambiguous cases and toward 1 for the more definite ones. On top of that, some classifiers are given the possibility to return 0 to mark their results as known to be invalid, and thus irrelevant at voting time.
This can happen, for instance, when no known title precedes a word throughout the text, when no grammatical rule could be matched, or when Wikipedia has no result for the searched word. The improved voting algorithm then first discards all classifications that have a confidence mark of 0 and proceeds to a simple vote among the remaining ones for each noun.
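Putting the pieces together, the confidence-aware vote might look like this (a sketch; classifier outputs are modeled as (category, confidence) pairs, and ties are broken by the highest confidence behind each tied category):

```python
from collections import Counter

def confident_vote(predictions):
    """`predictions` maps classifier name -> (category, confidence).
    Zero-confidence votes are discarded; the majority wins, and ties
    go to the category backed by the most confident classifier."""
    valid = {n: p for n, p in predictions.items() if p[1] > 0}
    if not valid:
        return None
    counts = Counter(cat for cat, _ in valid.values())
    best = max(counts.values())
    tied = [cat for cat, c in counts.items() if c == best]
    if len(tied) == 1:
        return tied[0]
    return max(tied, key=lambda cat: max(
        conf for c, conf in valid.values() if c == cat))
```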
In case of a tie, the results rated with the highest confidence are privileged.

Not all classifiers exhibit the same behavior regarding precision and recall. It can thus be justified to put more confidence in some of them in cases where we know they are more likely to succeed. For this test, we used manually set weights putting more importance on the obvious context classifier (section 3. ). With the help of confidence rating (section 4. ).
Hence, those cases will be discarded regardless of the coefficient. A good compromise can be reached by giving three times more weight to the obvious context classifier, allowing the others to still easily overpower it in the unlikely case a majority of them reaches a contradictory agreement.

A meta-recognition algorithm follows the idea of improving accuracy by entirely removing one classifier when it is detected to be consistently failing, typically due to stylistic biases or other broken assumptions on the considered book.
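The weighted variant can be sketched as below, with the tripled weight on the obvious-context classifier as described above (the classifier name used as key is illustrative):

```python
from collections import defaultdict

# Illustrative weight table: the obvious-context vote counts triple.
WEIGHTS = {"obvious_context": 3.0}

def weighted_vote(predictions):
    """`predictions` maps classifier name -> (category, confidence).
    Zero-confidence votes are discarded regardless of their weight;
    each remaining vote counts WEIGHTS.get(name, 1.0) times."""
    scores = defaultdict(float)
    for name, (cat, conf) in predictions.items():
        if conf > 0:
            scores[cat] += WEIGHTS.get(name, 1.0)
    return max(scores, key=scores.get) if scores else None
```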
Our hypothesis here is that if the remaining classifiers reach a higher agreement, the discarded one must have globally failed in some way and should be put aside.

Let us consider in Figure 3 the precision vs. recall results of each classifier. One can immediately see a pattern typical of any information retrieval system: one parameter is detrimental to the other, and no two classifiers behave in a similar way. We can also see that for each of them, some books get remarkably good results, and a few others turn out very bad.
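This leave-one-out idea can be sketched as follows (names and the agreement measure are assumptions: agreement is taken as the mean fraction of classifiers backing the majority category, and a classifier is dropped only if its exclusion raises that agreement):

```python
from collections import Counter

def drop_failing_classifier(all_votes):
    """`all_votes` maps classifier name -> {word: category}. Return
    the classifier whose exclusion most increases the agreement of
    the remaining ones, or None if no exclusion helps."""
    names = list(all_votes)

    def agreement(subset):
        if not subset:
            return 0.0
        words = set.intersection(*(set(all_votes[n]) for n in subset))
        if not words:
            return 0.0
        total = 0.0
        for w in words:
            counts = Counter(all_votes[n][w] for n in subset)
            total += counts.most_common(1)[0][1] / len(subset)
        return total / len(words)

    base = agreement(names)
    best_name, best_gain = None, 0.0
    for name in names:
        gain = agreement([n for n in names if n != name]) - base
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name
```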
Interestingly, and as backed up by the full numerical values shown in Table 2, those are almost never the same, confirming our hypothesis that some methods may work much better or worse on some texts and giving a strong justification for the multi-classifier approach. The averaged results seem to confirm this intuition.
In Figures 4 and 5, we can see that all meta-classification schemes overall pushed the results toward the top and, at the same time, made the clustering denser, hence reducing the differences between the books and producing more consistent results by removing the worst outliers.

Figure 3. Comparison between precision and recall for each classifier, on each book.