Custom character sets #59

osfans · 2015-10-22T01:27:45Z

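At present, the only filtering available is for the extended character set, via extended_charset, right?
https://github.com/rime/librime/blob/develop/src/gear/charset_filter.cc#L18

Could the Boost.Locale library be used to filter by standard character sets such as gbk and big5?
Or even by a custom character set: https://github.com/rime-aca/character_set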

osfans · 2015-10-22T03:04:03Z

OK, an experiment with a standard character set (gb2312, for example) worked:

diff --git a/src/gear/charset_filter.cc b/src/gear/charset_filter.cc
index edf3a90..b8072dc 100644
--- a/src/gear/charset_filter.cc
+++ b/src/gear/charset_filter.cc
@@ -12,6 +12,7 @@
 #include <rime/engine.h>
 #include <rime/dict/vocabulary.h>
 #include <rime/gear/charset_filter.h>
+#include <boost/locale/encoding.hpp>

 namespace rime {

@@ -32,6 +33,15 @@ bool contains_extended_cjk(const string& text)
 {
   const char *p = text.c_str();
   uint32_t ch;
+  string charset = "GB2312";
+
+  try {
+    boost::locale::conv::from_utf(text, charset, boost::locale::conv::method_type::stop);
+    return false;
+  } catch (...){
+    LOG(INFO)<<text<<"not in "<<charset;
+    return true;
+  }

   while ((ch = utf8::unchecked::next(p)) != 0) {
     if (is_extended_cjk(ch)) {

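If we go ahead with this, how should the YAML syntax be defined?
Follow the extended_charset model and add gb2312_charset, big5_charset, and so on?
And how would a custom character set work? For example, one that points at a file in https://github.com/rime-aca/character_set.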
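
The experiment in the diff above boils down to one question: does a UTF-8 string survive conversion into the target charset? A minimal standalone sketch of that check follows. Note that it swaps in POSIX iconv for Boost.Locale so that it needs no extra library, and that the helper name representable_in is hypothetical. It assumes the platform's iconv recognizes the charset name.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <string>

// Hypothetical helper: true if every character of the UTF-8 string `text`
// can be encoded in `charset` (for example "GB2312", "GBK" or "BIG5").
// An unknown charset name also yields false.
bool representable_in(const std::string& text, const char* charset) {
  iconv_t cd = iconv_open(charset, "UTF-8");
  if (cd == (iconv_t)-1) return false;

  std::string input = text;  // iconv takes a non-const input pointer
  char* in = &input[0];
  std::size_t in_left = input.size();
  char buf[256];             // scratch output; the converted bytes are discarded
  bool ok = true;

  while (in_left > 0) {
    char* out = buf;
    std::size_t out_left = sizeof(buf);
    std::size_t r = iconv(cd, &in, &in_left, &out, &out_left);
    if (r == static_cast<std::size_t>(-1)) {
      if (errno == E2BIG) continue;  // scratch buffer full: keep converting
      ok = false;                    // EILSEQ or EINVAL: not representable, or bad input
      break;
    }
    if (r > 0) {                     // some iconv builds substitute instead of failing
      ok = false;
      break;
    }
  }
  iconv_close(cd);
  return ok;
}
```

With this helper, a filter would drop a candidate whenever representable_in(candidate, "GB2312") is false, which mirrors the `return true` in the catch branch of the diff.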

lotem · 2015-10-31T08:09:25Z

A custom character set would probably need a dictionary file, with the file name used as the value of this setting.

charset_filter:
  charset: gbk  # or custom_charset

@osfans Do you feel that the current architecture is lacking in some way?

osfans · 2015-11-02T04:09:22Z

I think it is already very powerful.

osfans · 2015-11-02T05:13:38Z

switches:
  - options: [ utf8, big5, gbk ]
    states:
      - UTF-8
      - BIG5
      - GBK
    reset: 0
engine:
  filters:
    - charset_filter@big5
    - charset_filter@gbk
big5:
  option_name: big5
  charset: big5
gbk:
  option_name: gbk
  charset: gbk

What if you need to switch among several character sets?
Is charset alone enough to tell them apart? Wouldn't option_name still be needed?

lotem · 2015-11-02T15:00:59Z

You could also take the part after @ from Ticket::name_space and use it directly as the option_name, and even as the value of charset (if it is recognized as a known character set).
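
As an illustration of that suggestion, the sketch below splits a component spec such as charset_filter@big5 at the @ and derives both settings from the trailing part. It is plain standalone C++ and does not use librime's Ticket class; ComponentSpec, parse_spec, is_known_charset and the list of known names are all hypothetical.

```cpp
#include <string>

struct ComponentSpec {
  std::string klass;       // e.g. "charset_filter"
  std::string name_space;  // e.g. "big5"; empty when there is no '@'
};

// Split "class@name_space" at the first '@'.
ComponentSpec parse_spec(const std::string& spec) {
  ComponentSpec result;
  std::string::size_type at = spec.find('@');
  if (at == std::string::npos) {
    result.klass = spec;
  } else {
    result.klass = spec.substr(0, at);
    result.name_space = spec.substr(at + 1);
  }
  return result;
}

// Illustrative list only; a real implementation would ask its
// conversion library which names it recognizes.
bool is_known_charset(const std::string& name) {
  return name == "gbk" || name == "gb2312" || name == "big5";
}

// The name_space doubles as option_name; it is also used as the charset
// when it names a known character set, otherwise the charset stays empty.
std::string charset_from(const ComponentSpec& spec) {
  return is_known_charset(spec.name_space) ? spec.name_space : std::string();
}
```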

osfans · 2015-11-03T09:21:30Z

I just noticed that extended_charset takes effect when it is false, so it does not seem well suited to switches/options.
All the other charsets take effect when true.
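
The asymmetry can be written down as two small predicates. This is only a sketch of the behavior described in this comment: the Options map and the function names are hypothetical, and treating an unset option as false is an assumption.

```cpp
#include <map>
#include <string>

using Options = std::map<std::string, bool>;

// Assumption: an option that was never set counts as false.
bool get_option(const Options& options, const std::string& name) {
  auto it = options.find(name);
  return it != options.end() && it->second;
}

// Legacy filter: extended characters are hidden while
// extended_charset is OFF, so the filter is active on false.
bool legacy_filter_active(const Options& options) {
  return !get_option(options, "extended_charset");
}

// Named filter: restricts output to `charset` while the option
// named after that charset is ON, so the filter is active on true.
bool named_filter_active(const Options& options, const std::string& charset) {
  return get_option(options, charset);
}
```

A radio group in switches turns exactly one option on, which fits the second predicate but not the first.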

osfans · 2015-11-04T06:25:00Z

Standard character sets are now supported; name_space is used directly as both option_name and charset.

switches:
  - options: [ utf8, big5, gbk ]
    states:
      - UTF-8
      - BIG5
      - GBK
engine:
  filters:
    - charset_filter@big5
    - charset_filter@gbk

However, translator/enable_charset_filter still cannot specify a charset, and I'm not sure how to handle that.

Test build: https://github.com/osfans/trime/releases/download/v3.0-beta/trime-20151105-charset.apk

lotem · 2015-11-04T07:10:31Z

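@osfans Right. That is one of the limitations of the current architecture. Sentence composition should instead be implemented by a separate translator, and the dictionary it uses should be configurable on its own. Also, at present a filter can only be added after the results of all translators have been merged; it cannot be applied to an individual translation before the merge. That makes composition very weak. As another example, it is not easy to make reverse lookup perform simplified/traditional conversion first and then look up. Adding options to control behavior is really the wrong approach; these things should be achieved by combining small components.
So I would like to design a new architecture. Are you in favor or not?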

osfans · 2015-11-04T07:28:00Z

In favor! I'll learn from it!

ghost · 2016-02-27T17:09:03Z

Shouldn't test code be added as well?

zaqzrh · 2016-06-06T09:13:13Z

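I want to define two character sets: one is gb2312 and the other is the full set (utf8).
Following the method above:

switches:
  - options: [ gb2312, utf8 ]
    states:
      - gb2312
      - UTF-8

engine:
  filters:
    - charset_filter@gb2312

gb2312:
  option_name: gb2312
  charset: gb2312

What if you need to switch among several character sets?

It had no effect.

@osfans, could you provide example code?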

osfans · 2016-06-06T09:18:14Z

@zaqzrh
This is all you need:

switches:
  - options: [ utf8, gb2312]
    states:
      - UTF-8
      - GB2312
engine:
  filters:
    - charset_filter@gb2312
translator:
  enable_charset_filter: false

zaqzrh · 2016-06-21T15:04:43Z

translator:
  enable_charset_filter: false

@osfans With pinyin input, this did not work.

What does the table in "table/enable_charset_filter" above refer to?

osfans · 2016-06-22T01:21:38Z

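I made a mistake; it should be translator. Could your librime be too old? Or does the Windows build need additional libraries? Your configuration works for me: typing rong2 yields 镕 under UTF, but not under gb2312.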

osfans · 2016-06-22T01:47:51Z

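In other words, there are two kinds of charset_filter.

The original one distinguishes common from extended characters; its switch is translator/enable_charset_filter and its option is extended_charset.
The new one targets a specified character set; its switch is engine/filters/charset_filter@gb2312 and its option is gb2312.

If you do not use the first kind, using the second directly should work. If you keep the first one as well, both take effect at the same time.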
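
For comparison with the charset-based filter, here is a sketch of what the first kind checks: whether a candidate contains a character from an extended CJK block. The function names match those visible in the diff earlier in this thread, but this is a standalone re-implementation with its own minimal decoder rather than the utf8 library. The block ranges are the Unicode ranges for CJK Extension A and B; which blocks librime actually treats as extended is an assumption here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Assumption: only CJK Extension A and Extension B count as "extended".
bool is_extended_cjk(std::uint32_t ch) {
  return (ch >= 0x3400 && ch <= 0x4DBF) ||   // CJK Unified Ideographs Extension A
         (ch >= 0x20000 && ch <= 0x2A6DF);   // CJK Unified Ideographs Extension B
}

// Decode the code point that starts at s[i] and advance i past it.
// Minimal decoder: assumes well-formed UTF-8.
std::uint32_t next_code_point(const std::string& s, std::size_t& i) {
  unsigned char lead = static_cast<unsigned char>(s[i++]);
  int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
  std::uint32_t cp = extra == 0 ? lead : lead & (0x3F >> extra);
  while (extra-- > 0 && i < s.size())
    cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
  return cp;
}

// True if any character of the UTF-8 string lies in an extended block.
bool contains_extended_cjk(const std::string& text) {
  for (std::size_t i = 0; i < text.size();)
    if (is_extended_cjk(next_code_point(text, i))) return true;
  return false;
}
```

When both kinds of filter are enabled, a candidate has to pass this check and the charset conversion check to be shown.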

osfans · 2016-07-01T05:00:06Z

As discussed in rime/home#91, closing this for now.

alswl · 2017-01-24T07:38:36Z

I would like to use charset_filter@gbk + Emoji. How should I configure that?

osfans · 2017-01-24T09:44:37Z

That is indeed a problem.

alswl · 2018-07-05T04:02:53Z

@osfans I'll take care of it myself; I have submitted a PR.

alswl · 2019-07-01T07:47:38Z

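I would like to use charset_filter@gbk + Emoji. How should I configure that?

Two years later, I am finally back to finish what I started.

#293 has been merged into master. You can now write charset_filter@gbk+emoji in the configuration file to add Emoji support.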
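
One way to picture the gbk+emoji form is as a base charset plus a flag that lets emoji bypass the charset test. The sketch below is not taken from #293: the parsing rule, the names CharsetSpec, parse_charset_spec, is_emoji and keep_code_point, and the choice of emoji blocks are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <string>

struct CharsetSpec {
  std::string charset;       // e.g. "gbk"
  bool allow_emoji = false;  // true when the name ends in "+emoji"
};

// Split a name such as "gbk+emoji" into base charset and emoji flag.
CharsetSpec parse_charset_spec(const std::string& name_space) {
  CharsetSpec spec;
  const std::string suffix = "+emoji";
  if (name_space.size() >= suffix.size() &&
      name_space.compare(name_space.size() - suffix.size(),
                         suffix.size(), suffix) == 0) {
    spec.charset = name_space.substr(0, name_space.size() - suffix.size());
    spec.allow_emoji = true;
  } else {
    spec.charset = name_space;
  }
  return spec;
}

// Illustrative subset of the Unicode blocks that hold emoji.
bool is_emoji(std::uint32_t ch) {
  return (ch >= 0x1F300 && ch <= 0x1F5FF) ||  // Miscellaneous Symbols and Pictographs
         (ch >= 0x1F600 && ch <= 0x1F64F) ||  // Emoticons
         (ch >= 0x1F680 && ch <= 0x1F6FF) ||  // Transport and Map Symbols
         (ch >= 0x2600 && ch <= 0x26FF);      // Miscellaneous Symbols
}

// `in_charset` is the result of the ordinary charset test for `ch`.
bool keep_code_point(std::uint32_t ch, bool in_charset,
                     const CharsetSpec& spec) {
  return in_charset || (spec.allow_emoji && is_emoji(ch));
}
```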

f4nyc · 2019-09-20T02:09:13Z

@zaqzrh
This is all you need:

switches:
  - options: [ utf8, gb2312]
    states:
      - UTF-8
      - GB2312
engine:
  filters:
    - charset_filter@gb2312
translator:
  enable_charset_filter: false

@osfans With the configuration above, I get this error:
E0920 10:08:02.380353 1627 engine.cc:366] error creating filter: 'charset_filter'
What could be the cause? My librime is the latest version, 1.5.3.

IceCodeNew · 2020-04-15T02:37:43Z

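The error in the comment above is unrelated to what I am about to say, but I want to warn anyone who comes later and wants to tinker with charset_filter: in a commit on December 27, 2019, librime removed support for charset_filter (going by the commit message, "downgraded" would be the more accurate word).
If your input method backend was not built with the librime-charcode component, charset_filter may not work.

See: Issue#335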

osfans added the enhancement label Oct 22, 2015

osfans added a commit that referenced this issue Nov 4, 2015

Issue #59: add standard charset in charset_filter

1b9ea97

osfans referenced this issue Feb 18, 2016

Link to iconv library.

20cec49

osfans closed this as completed Jul 1, 2016

osfans mentioned this issue Aug 27, 2016

The "rare character filter" does not work in the development build rime/squirrel#120

Closed

lotem mentioned this issue May 27, 2017

Garbled rime candidates under ibus rime/ibus-rime#28

Closed

alswl mentioned this issue Jul 5, 2018

feat: chareset_filter always allow emoji #213

Closed

ianzhuo mentioned this issue Dec 11, 2018

How do I bind a shortcut key when a switch is defined with options? #231

Open

nameoverflow mentioned this issue Dec 28, 2018

feature request: make the default character set for Simplified Chinese smaller rime/weasel#314

Closed

nameoverflow mentioned this issue May 20, 2019

Many characters are displayed as boxes rime/home#368

Closed

alswl mentioned this issue Jun 29, 2019

Question: charset_filter has no effect in rime_api_console #290

Closed

oniondelta mentioned this issue Apr 18, 2023

Help wanted: how to filter out CJK extension characters in a charset script while keeping Emoticons hchunhui/librime-lua#241

Open

Lawrence-of-AnKing mentioned this issue Jan 12, 2024

The rtt code has an unknown character KyleBing/rime-wubi86-jidian#115

Closed
