"結巴"中文分詞:做最好的 PHP 中文分詞、中文斷詞組件。 / "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best PHP Chinese word segmentation module.

Overview

jieba-php


"結巴"中文分詞:做最好的 PHP 中文分詞、中文斷詞組件,目前翻譯版本為 jieba-0.33 版本,未來再慢慢往上升級,效能也需要再改善,請有興趣的開發者一起加入開發!若想使用 Python 版本請前往 fxsjy/jieba

現在已經可以支援繁體中文!只要將字典切換為 big 模式即可!

"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best PHP Chinese word segmentation module.

Scroll down for English documentation.

Online Demo

Features

  • Supports three segmentation modes:

    1. Accurate mode (the default): attempts to cut the sentence into the most accurate segmentation; suitable for text analysis.

    2. Full mode: scans out all the words in the sentence that can possibly form a word; fast, but cannot resolve ambiguity (requires a sufficiently complete dictionary).

    3. Search engine mode: based on accurate mode, re-segments long words to improve recall; suitable for search engine segmentation.

  • Supports Traditional Chinese segmentation

  • Supports custom dictionaries

Usage

  • Automatic installation: install with composer, then include via autoload

Example

composer require fukuball/jieba-php:dev-master

Example

require_once "/path/to/your/vendor/autoload.php";
  • Manual installation: place jieba-php in a suitable directory, then include via require_once

Example

require_once "/path/to/your/vendor/multi-array/MultiArray.php";
require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "/path/to/your/class/Jieba.php";
require_once "/path/to/your/class/Finalseg.php";

Algorithm

  • Based on a Trie structure, performs an efficient word-graph scan and generates a directed acyclic graph (DAG) of every possible word segmentation of the sentence's Chinese characters
  • Uses dynamic programming to find the maximum-probability path, yielding the best segmentation based on word frequencies
  • For unknown words, applies an HMM based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm
  • For an explanation of the BEMS states, see https://github.com/fxsjy/jieba/issues/7
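
To make the word-graph scan and the dynamic-programming step concrete, here is a minimal, self-contained PHP sketch. It is not jieba-php's actual implementation: the tiny dictionary and its frequencies are invented for illustration, and real code loads dict.txt instead.

$dict = array(
    "我" => 314000, "来到" => 4993, "到" => 205341,
    "北京" => 34488, "清华" => 2091, "清华大学" => 2053, "大学" => 20025,
);
$total = array_sum($dict);

$sentence = "我来到北京清华大学";
$chars = preg_split('//u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
$n = count($chars);

// 1) Word-graph scan: $dag[$i] lists every index $j such that chars i..j form a dictionary word.
$dag = array();
for ($i = 0; $i < $n; $i++) {
    $dag[$i] = array($i); // a single character is always a fallback
    $frag = $chars[$i];
    for ($j = $i + 1; $j < $n; $j++) {
        $frag .= $chars[$j];
        if (isset($dict[$frag])) {
            $dag[$i][] = $j;
        }
    }
}

// 2) Dynamic programming, right to left: $route[$i] = array(best log-probability, end index).
$route = array($n => array(0.0, 0));
$unknown = log(1.0 / $total); // crude weight for characters not in the dictionary
for ($i = $n - 1; $i >= 0; $i--) {
    $best = null;
    foreach ($dag[$i] as $j) {
        $word = implode('', array_slice($chars, $i, $j - $i + 1));
        $logp = isset($dict[$word]) ? log($dict[$word] / $total) : $unknown;
        $score = $logp + $route[$j + 1][0];
        if ($best === null || $score > $best[0]) {
            $best = array($score, $j);
        }
    }
    $route[$i] = $best;
}

// 3) Walk the best path: prints 我 / 来到 / 北京 / 清华大学.
for ($i = 0; $i < $n; $i = $j + 1) {
    $j = $route[$i][1];
    echo implode('', array_slice($chars, $i, $j - $i + 1)), "\n";
}

In jieba-php itself, unknown spans are handed to the HMM (Finalseg) rather than scored with this crude per-character weight.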

Interface

  • The component provides the cut method for segmentation
  • cut accepts two parameters: 1) the string to segment; 2) cut_all, which controls the segmentation mode
  • The string to segment can be a UTF-8 string
  • cut returns an iterable array

Function 1) Segmentation

  • cut accepts two parameters: 1) the string to segment; 2) cut_all, which controls the segmentation mode
  • cutForSearch accepts one parameter, the string to segment; it is suited to building inverted indexes for search engines, using a finer granularity
  • Note: the string to segment must be a UTF-8 string
  • Both cut and cutForSearch return an iterable array

Example (Tutorial)

ini_set('memory_limit', '1024M');

require_once "/path/to/your/vendor/multi-array/MultiArray.php";
require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "/path/to/your/class/Jieba.php";
require_once "/path/to/your/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init();
Finalseg::init();

$seg_list = Jieba::cut("怜香惜玉也得要看对象啊!");
var_dump($seg_list);

$seg_list = Jieba::cut("我来到北京清华大学", true);
var_dump($seg_list); # full mode

$seg_list = Jieba::cut("我来到北京清华大学", false);
var_dump($seg_list); # default (accurate) mode

$seg_list = Jieba::cut("他来到了网易杭研大厦");
var_dump($seg_list);

$seg_list = Jieba::cutForSearch("小明硕士毕业于中国科学院计算所,后在日本京都大学深造"); # search engine mode
var_dump($seg_list);

Output:

array(7) {
  [0]=>
  string(12) "怜香惜玉"
  [1]=>
  string(3) "也"
  [2]=>
  string(3) "得"
  [3]=>
  string(3) "要"
  [4]=>
  string(3) "看"
  [5]=>
  string(6) "对象"
  [6]=>
  string(3) "啊"
}

Full Mode:
array(15) {
  [0]=>
  string(3) "我"
  [1]=>
  string(3) "来"
  [2]=>
  string(6) "来到"
  [3]=>
  string(3) "到"
  [4]=>
  string(3) "北"
  [5]=>
  string(6) "北京"
  [6]=>
  string(3) "京"
  [7]=>
  string(3) "清"
  [8]=>
  string(6) "清华"
  [9]=>
  string(12) "清华大学"
  [10]=>
  string(3) "华"
  [11]=>
  string(6) "华大"
  [12]=>
  string(3) "大"
  [13]=>
  string(6) "大学"
  [14]=>
  string(3) "学"
}

Default Mode:
array(4) {
  [0]=>
  string(3) "我"
  [1]=>
  string(6) "来到"
  [2]=>
  string(6) "北京"
  [3]=>
  string(12) "清华大学"
}
array(6) {
  [0]=>
  string(3) "他"
  [1]=>
  string(6) "来到"
  [2]=>
  string(3) "了"
  [3]=>
  string(6) "网易"
  [4]=>
  string(6) "杭研"
  [5]=>
  string(6) "大厦"
}
(Here, "杭研" is not in the dictionary, but the Viterbi algorithm still identified it.)

Search Engine Mode:
array(18) {
  [0]=>
  string(6) "小明"
  [1]=>
  string(6) "硕士"
  [2]=>
  string(6) "毕业"
  [3]=>
  string(3) "于"
  [4]=>
  string(6) "中国"
  [5]=>
  string(6) "科学"
  [6]=>
  string(6) "学院"
  [7]=>
  string(9) "科学院"
  [8]=>
  string(15) "中国科学院"
  [9]=>
  string(6) "计算"
  [10]=>
  string(9) "计算所"
  [11]=>
  string(3) "后"
  [12]=>
  string(3) "在"
  [13]=>
  string(6) "日本"
  [14]=>
  string(6) "京都"
  [15]=>
  string(6) "大学"
  [16]=>
  string(18) "日本京都大学"
  [17]=>
  string(6) "深造"
}

Function 2) Adding a custom dictionary

  • Developers can supply a custom dictionary to cover words missing from the jieba vocabulary. jieba can recognize new words on its own, but adding them yourself guarantees higher accuracy

  • Usage: Jieba::loadUserDict(file_name) # file_name is the absolute path of the custom dictionary

  • The dictionary format is the same as dict.txt: one word per line, with each line split into three space-separated parts: the word, its frequency, and its part of speech

  • Example:

    云计算 5 n
    李小福 2 n
    创新办 3 n

    Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
    After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /

說明:"通过用户自定义词典来增强歧义纠错能力" --- https://github.com/fxsjy/jieba/issues/14

Function 3) Keyword extraction

  • JiebaAnalyse::extractTags($content, $top_k)
  • $content: the text to extract keywords from
  • $top_k: how many of the highest TF-IDF-weighted keywords to return; defaults to 20
  • Use setStopWords to add custom stop words

Example (keyword extraction)

ini_set('memory_limit', '600M');

require_once "/path/to/your/vendor/multi-array/MultiArray.php";
require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "/path/to/your/class/Jieba.php";
require_once "/path/to/your/class/Finalseg.php";
require_once "/path/to/your/class/JiebaAnalyse.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\JiebaAnalyse;
Jieba::init(array('mode'=>'test','dict'=>'small'));
Finalseg::init();
JiebaAnalyse::init();

$top_k = 10;
$content = file_get_contents("/path/to/your/dict/lyric.txt");

$tags = JiebaAnalyse::extractTags($content, $top_k);

var_dump($tags);

JiebaAnalyse::setStopWords('/path/to/your/dict/stop_words.txt');

$tags = JiebaAnalyse::extractTags($content, $top_k);

var_dump($tags);

Output:

array(10) {
  '沒有' =>
  double(1.0592831964595)
  '所謂' =>
  double(0.90795702553671)
  '是否' =>
  double(0.66385043195443)
  '一般' =>
  double(0.54607060161899)
  '雖然' =>
  double(0.30265234184557)
  '來說' =>
  double(0.30265234184557)
  '肌迫' =>
  double(0.30265234184557)
  '退縮' =>
  double(0.30265234184557)
  '矯作' =>
  double(0.30265234184557)
  '怯懦' =>
  double(0.24364586159392)
}
array(10) {
  '所謂' =>
  double(1.1569129841516)
  '一般' =>
  double(0.69579963754677)
  '矯作' =>
  double(0.38563766138387)
  '來說' =>
  double(0.38563766138387)
  '退縮' =>
  double(0.38563766138387)
  '雖然' =>
  double(0.38563766138387)
  '肌迫' =>
  double(0.38563766138387)
  '怯懦' =>
  double(0.31045198493419)
  '隨便說說' =>
  double(0.19281883069194)
  '一場' =>
  double(0.19281883069194)
}

Function 4) Word Segmentation and Tagging

Example (Tutorial)

ini_set('memory_limit', '600M');

require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
require_once dirname(dirname(__FILE__))."/class/Posseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\Posseg;
Jieba::init();
Finalseg::init();
Posseg::init();

$seg_list = Posseg::cut("这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。");
var_dump($seg_list);

Output:

array(21) {
  [0]=>
  array(2) {
    ["word"]=>
    string(3) "这"
    ["tag"]=>
    string(1) "r"
  }
  [1]=>
  array(2) {
    ["word"]=>
    string(3) "是"
    ["tag"]=>
    string(1) "v"
  }
  [2]=>
  array(2) {
    ["word"]=>
    string(6) "一个"
    ["tag"]=>
    string(1) "m"
  }
  [3]=>
  array(2) {
    ["word"]=>
    string(18) "伸手不见五指"
    ["tag"]=>
    string(1) "i"
  }
  [4]=>
  array(2) {
    ["word"]=>
    string(3) "的"
    ["tag"]=>
    string(2) "uj"
  }
  [5]=>
  array(2) {
    ["word"]=>
    string(6) "黑夜"
    ["tag"]=>
    string(1) "n"
  }
  [6]=>
  array(2) {
    ["word"]=>
    string(3) "。"
    ["tag"]=>
    string(1) "x"
  }
  [7]=>
  array(2) {
    ["word"]=>
    string(3) "我"
    ["tag"]=>
    string(1) "r"
  }
  [8]=>
  array(2) {
    ["word"]=>
    string(3) "叫"
    ["tag"]=>
    string(1) "v"
  }
  [9]=>
  array(2) {
    ["word"]=>
    string(9) "孙悟空"
    ["tag"]=>
    string(2) "nr"
  }
  [10]=>
  array(2) {
    ["word"]=>
    string(3) ","
    ["tag"]=>
    string(1) "x"
  }
  [11]=>
  array(2) {
    ["word"]=>
    string(3) "我"
    ["tag"]=>
    string(1) "r"
  }
  [12]=>
  array(2) {
    ["word"]=>
    string(3) "爱"
    ["tag"]=>
    string(1) "v"
  }
  [13]=>
  array(2) {
    ["word"]=>
    string(6) "北京"
    ["tag"]=>
    string(2) "ns"
  }
  [14]=>
  array(2) {
    ["word"]=>
    string(3) ","
    ["tag"]=>
    string(1) "x"
  }
  [15]=>
  array(2) {
    ["word"]=>
    string(3) "我"
    ["tag"]=>
    string(1) "r"
  }
  [16]=>
  array(2) {
    ["word"]=>
    string(3) "爱"
    ["tag"]=>
    string(1) "v"
  }
  [17]=>
  array(2) {
    ["word"]=>
    string(6) "Python"
    ["tag"]=>
    string(3) "eng"
  }
  [18]=>
  array(2) {
    ["word"]=>
    string(3) "和"
    ["tag"]=>
    string(1) "c"
  }
  [19]=>
  array(2) {
    ["word"]=>
    string(3) "C++"
    ["tag"]=>
    string(3) "eng"
  }
  [20]=>
  array(2) {
    ["word"]=>
    string(3) "。"
    ["tag"]=>
    string(1) "x"
  }
}

Function 5) Switching to the Traditional Chinese dictionary

Example (Tutorial)

ini_set('memory_limit', '1024M');

require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init(array('mode'=>'default','dict'=>'big'));
Finalseg::init();

$seg_list = Jieba::cut("怜香惜玉也得要看对象啊!");
var_dump($seg_list);

$seg_list = Jieba::cut("憐香惜玉也得要看對象啊!");
var_dump($seg_list);

Output:

array(7) {
  [0]=>
  string(12) "怜香惜玉"
  [1]=>
  string(3) "也"
  [2]=>
  string(3) "得"
  [3]=>
  string(3) "要"
  [4]=>
  string(3) "看"
  [5]=>
  string(6) "对象"
  [6]=>
  string(3) "啊"
}
array(7) {
  [0]=>
  string(12) "憐香惜玉"
  [1]=>
  string(3) "也"
  [2]=>
  string(3) "得"
  [3]=>
  string(3) "要"
  [4]=>
  string(3) "看"
  [5]=>
  string(6) "對象"
  [6]=>
  string(3) "啊"
}


Function 6) Keeping Japanese or Korean text unfiltered

Example (Tutorial)

ini_set('memory_limit', '1024M');

require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init(array('cjk'=>'all'));
Finalseg::init();

$seg_list = Jieba::cut("한국어 또는 조선말은 제주특별자치도를 제외한 한반도 및 그 부속 도서와 한민족 거주 지역에서 쓰이는 언어로");
var_dump($seg_list);

$seg_list = Jieba::cut("日本語は、主に日本国内や日本人同士の間で使われている言語である。");
var_dump($seg_list);

// Loading a Japanese dictionary enables simple Japanese word segmentation
Jieba::loadUserDict("/path/to/your/japanese/dict.txt");
$seg_list = Jieba::cut("日本語は、主に日本国内や日本人同士の間で使われている言語である。");
var_dump($seg_list);

Output:

array(15) {
  [0]=>
  string(9) "한국어"
  [1]=>
  string(6) "또는"
  [2]=>
  string(12) "조선말은"
  [3]=>
  string(24) "제주특별자치도를"
  [4]=>
  string(9) "제외한"
  [5]=>
  string(9) "한반도"
  [6]=>
  string(3) "및"
  [7]=>
  string(3) "그"
  [8]=>
  string(6) "부속"
  [9]=>
  string(9) "도서와"
  [10]=>
  string(9) "한민족"
  [11]=>
  string(6) "거주"
  [12]=>
  string(12) "지역에서"
  [13]=>
  string(9) "쓰이는"
  [14]=>
  string(9) "언어로"
}
array(21) {
  [0]=>
  string(6) "日本"
  [1]=>
  string(3) "語"
  [2]=>
  string(3) "は"
  [3]=>
  string(3) "主"
  [4]=>
  string(3) "に"
  [5]=>
  string(6) "日本"
  [6]=>
  string(6) "国内"
  [7]=>
  string(3) "や"
  [8]=>
  string(6) "日本"
  [9]=>
  string(3) "人"
  [10]=>
  string(6) "同士"
  [11]=>
  string(3) "の"
  [12]=>
  string(3) "間"
  [13]=>
  string(3) "で"
  [14]=>
  string(3) "使"
  [15]=>
  string(3) "わ"
  [16]=>
  string(6) "れて"
  [17]=>
  string(6) "いる"
  [18]=>
  string(6) "言語"
  [19]=>
  string(3) "で"
  [20]=>
  string(6) "ある"
}
array(17) {
  [0]=>
  string(9) "日本語"
  [1]=>
  string(3) "は"
  [2]=>
  string(6) "主に"
  [3]=>
  string(9) "日本国"
  [4]=>
  string(3) "内"
  [5]=>
  string(3) "や"
  [6]=>
  string(9) "日本人"
  [7]=>
  string(6) "同士"
  [8]=>
  string(3) "の"
  [9]=>
  string(3) "間"
  [10]=>
  string(3) "で"
  [11]=>
  string(3) "使"
  [12]=>
  string(3) "わ"
  [13]=>
  string(6) "れて"
  [14]=>
  string(6) "いる"
  [15]=>
  string(6) "言語"
  [16]=>
  string(9) "である"
}

Function 7) Returning each word's start and end position in the original text

Example (Tutorial)

ini_set('memory_limit', '1024M');

require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init(array('mode'=>'test','dict'=>'big'));
Finalseg::init();

$seg_list = Jieba::tokenize("永和服装饰品有限公司");
var_dump($seg_list);

Output:

array(4) {
  [0] =>
  array(3) {
    'word' =>
    string(6) "永和"
    'start' =>
    int(0)
    'end' =>
    int(2)
  }
  [1] =>
  array(3) {
    'word' =>
    string(6) "服装"
    'start' =>
    int(2)
    'end' =>
    int(4)
  }
  [2] =>
  array(3) {
    'word' =>
    string(6) "饰品"
    'start' =>
    int(4)
    'end' =>
    int(6)
  }
  [3] =>
  array(3) {
    'word' =>
    string(12) "有限公司"
    'start' =>
    int(6)
    'end' =>
    int(10)
  }
}
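
The start and end values are character offsets, not byte offsets, so each word can be sliced back out of the original string with mb_substr. A small sketch, assuming Jieba has been initialized as above:

$text = "永和服装饰品有限公司";
foreach (Jieba::tokenize($text) as $token) {
    // 'start'/'end' are character offsets, so mb_substr recovers the word.
    echo mb_substr($text, $token['start'], $token['end'] - $token['start'], 'UTF-8'), "\n";
}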

Other dictionaries

  1. A dictionary with a smaller memory footprint: https://github.com/fukuball/jieba-php/blob/master/src/dict/dict.small.txt

  2. A dictionary that supports Traditional Chinese segmentation: https://github.com/fukuball/jieba-php/blob/master/src/dict/dict.big.txt
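
Either dictionary is selected at initialization time via the dict option, as used elsewhere in this document; a short sketch:

use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;

Jieba::init(array('dict' => 'small')); // smaller memory footprint
// Jieba::init(array('dict' => 'big')); // Traditional Chinese support
Finalseg::init();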

FAQ

  1. How was the model data generated? https://github.com/fxsjy/jieba/issues/7
  2. What license does this library use? https://github.com/fxsjy/jieba/issues/2

jieba-php English Document

Online Demo

Features

  • Supports three segmentation modes:
    1. Accurate mode: attempts to cut the sentence into the most accurate segmentation; suitable for text analysis.
    2. Full mode: scans out all the words in the sentence that can possibly form a word; does not resolve ambiguity.
    3. Search engine mode: based on accurate mode, cuts long words into several short words to improve recall; suitable for search engines.

Usage

  • Installation: Use composer to install jieba-php, then require the autoload file to use jieba-php.
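
The install command and the autoload include are the same as shown earlier:

composer require fukuball/jieba-php:dev-master

require_once "/path/to/your/vendor/autoload.php";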

Algorithm

  • Based on the Trie tree structure to achieve efficient word graph scanning; sentences using Chinese characters constitute a directed acyclic graph (DAG).
  • Employs memory search to calculate the maximum probability path, in order to identify the maximum tangential points based on word frequency combination.
  • For unknown words, the character position HMM-based model is used, using the Viterbi algorithm.
  • The meaning of BEMS https://github.com/fxsjy/jieba/issues/7.

Interface

  • The cut method accepts two parameters: 1) the string to segment; 2) cut_all, which controls the segmentation mode.
  • The string to segment should be a UTF-8 string.
  • cutForSearch accepts only one parameter, the string to segment, and cuts the sentence into short words.
  • cut and cutForSearch return a segmented array.

Function 1) Segmentation

Example (Tutorial)

ini_set('memory_limit', '1024M');

require_once "/path/to/your/vendor/multi-array/MultiArray.php";
require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "/path/to/your/class/Jieba.php";
require_once "/path/to/your/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init();
Finalseg::init();

$seg_list = Jieba::cut("怜香惜玉也得要看对象啊!");
var_dump($seg_list);

seg_list = jieba.cut("我来到北京清华大学", true)
var_dump($seg_list); #全模式

seg_list = jieba.cut("我来到北京清华大学", false)
var_dump($seg_list); #默認精確模式

seg_list = jieba.cut("他来到了网易杭研大厦")
var_dump($seg_list);

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") #搜索引擎模式
var_dump($seg_list);

Output:

array(7) {
  [0]=>
  string(12) "怜香惜玉"
  [1]=>
  string(3) "也"
  [2]=>
  string(3) "得"
  [3]=>
  string(3) "要"
  [4]=>
  string(3) "看"
  [5]=>
  string(6) "对象"
  [6]=>
  string(3) "啊"
}

Full Mode:
array(15) {
  [0]=>
  string(3) "我"
  [1]=>
  string(3) "来"
  [2]=>
  string(6) "来到"
  [3]=>
  string(3) "到"
  [4]=>
  string(3) "北"
  [5]=>
  string(6) "北京"
  [6]=>
  string(3) "京"
  [7]=>
  string(3) "清"
  [8]=>
  string(6) "清华"
  [9]=>
  string(12) "清华大学"
  [10]=>
  string(3) "华"
  [11]=>
  string(6) "华大"
  [12]=>
  string(3) "大"
  [13]=>
  string(6) "大学"
  [14]=>
  string(3) "学"
}

Default Mode:
array(4) {
  [0]=>
  string(3) "我"
  [1]=>
  string(6) "来到"
  [2]=>
  string(6) "北京"
  [3]=>
  string(12) "清华大学"
}
array(6) {
  [0]=>
  string(3) "他"
  [1]=>
  string(6) "来到"
  [2]=>
  string(3) "了"
  [3]=>
  string(6) "网易"
  [4]=>
  string(6) "杭研"
  [5]=>
  string(6) "大厦"
}
(Here, "杭研" is not in the dictionary, but the Viterbi algorithm still identified it.)

Search Engine Mode:
array(18) {
  [0]=>
  string(6) "小明"
  [1]=>
  string(6) "硕士"
  [2]=>
  string(6) "毕业"
  [3]=>
  string(3) "于"
  [4]=>
  string(6) "中国"
  [5]=>
  string(6) "科学"
  [6]=>
  string(6) "学院"
  [7]=>
  string(9) "科学院"
  [8]=>
  string(15) "中国科学院"
  [9]=>
  string(6) "计算"
  [10]=>
  string(9) "计算所"
  [11]=>
  string(3) "后"
  [12]=>
  string(3) "在"
  [13]=>
  string(6) "日本"
  [14]=>
  string(6) "京都"
  [15]=>
  string(6) "大学"
  [16]=>
  string(18) "日本京都大学"
  [17]=>
  string(6) "深造"
}

Function 2) Add a custom dictionary

  • Developers can specify their own custom dictionary to include words that are not in the default jieba dictionary. jieba has the ability to identify new words, but adding your own ensures a higher rate of correct segmentation.

  • Usage: Jieba::loadUserDict(file_name) # file_name is the absolute path of the custom dictionary.

  • The dictionary format is the same as that of dict.txt: one word per line; each line is divided into two parts, the word itself and the word frequency, separated by a space.

  • Example:

    云计算 5
    李小福 2
    创新办 3

    Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
    After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
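
A short sketch of the full flow, mirroring the Chinese section above (the dictionary path is a placeholder):

require_once "/path/to/your/vendor/autoload.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init();
Finalseg::init();

// Load the custom dictionary before segmenting.
Jieba::loadUserDict("/path/to/your/userdict.txt");
$seg_list = Jieba::cut("李小福是创新办主任也是云计算方面的专家");
var_dump($seg_list);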

Function 3) Keyword Extraction

  • JiebaAnalyse::extractTags($content, $top_k)
  • $content: the text to extract keywords from
  • $top_k: how many of the highest TF-IDF-weighted keywords to return; the default value is 20

Example (keyword extraction)

ini_set('memory_limit', '600M');

require_once "/path/to/your/vendor/multi-array/MultiArray.php";
require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "/path/to/your/class/Jieba.php";
require_once "/path/to/your/class/Finalseg.php";
require_once "/path/to/your/class/JiebaAnalyse.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\JiebaAnalyse;
Jieba::init(array('mode'=>'test','dict'=>'small'));
Finalseg::init();
JiebaAnalyse::init();

$top_k = 10;
$content = file_get_contents("/path/to/your/dict/lyric.txt");

$tags = JiebaAnalyse::extractTags($content, $top_k);

var_dump($tags);

Output:

array(10) {
  ["是否"]=>
  float(1.2196321889395)
  ["一般"]=>
  float(1.0032459890209)
  ["肌迫"]=>
  float(0.64654314660465)
  ["怯懦"]=>
  float(0.44762844339349)
  ["藉口"]=>
  float(0.32327157330233)
  ["逼不得已"]=>
  float(0.32327157330233)
  ["不安全感"]=>
  float(0.26548304656279)
  ["同感"]=>
  float(0.23929673812326)
  ["有把握"]=>
  float(0.21043366018744)
  ["空洞"]=>
  float(0.20598261709442)
}
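
Custom stop words can be applied before extraction with setStopWords, as demonstrated in the Chinese section above:

JiebaAnalyse::setStopWords('/path/to/your/dict/stop_words.txt');

$tags = JiebaAnalyse::extractTags($content, $top_k);
var_dump($tags);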

Function 4) Word Segmentation and Tagging

Example (word tagging)

ini_set('memory_limit', '600M');

require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
require_once dirname(dirname(__FILE__))."/class/Posseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\Posseg;
Jieba::init();
Finalseg::init();
Posseg::init();

$seg_list = Posseg::cut("这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。");
var_dump($seg_list);

Output:

array(21) {
  [0]=>
  array(2) {
    ["word"]=>
    string(3) "这"
    ["tag"]=>
    string(1) "r"
  }
  [1]=>
  array(2) {
    ["word"]=>
    string(3) "是"
    ["tag"]=>
    string(1) "v"
  }
  [2]=>
  array(2) {
    ["word"]=>
    string(6) "一个"
    ["tag"]=>
    string(1) "m"
  }
  [3]=>
  array(2) {
    ["word"]=>
    string(18) "伸手不见五指"
    ["tag"]=>
    string(1) "i"
  }
  [4]=>
  array(2) {
    ["word"]=>
    string(3) "的"
    ["tag"]=>
    string(2) "uj"
  }
  [5]=>
  array(2) {
    ["word"]=>
    string(6) "黑夜"
    ["tag"]=>
    string(1) "n"
  }
  [6]=>
  array(2) {
    ["word"]=>
    string(3) "。"
    ["tag"]=>
    string(1) "w"
  }
  [7]=>
  array(2) {
    ["word"]=>
    string(3) "我"
    ["tag"]=>
    string(1) "r"
  }
  [8]=>
  array(2) {
    ["word"]=>
    string(3) "叫"
    ["tag"]=>
    string(1) "v"
  }
  [9]=>
  array(2) {
    ["word"]=>
    string(9) "孙悟空"
    ["tag"]=>
    string(2) "nr"
  }
  [10]=>
  array(2) {
    ["word"]=>
    string(3) ","
    ["tag"]=>
    string(1) "w"
  }
  [11]=>
  array(2) {
    ["word"]=>
    string(3) "我"
    ["tag"]=>
    string(1) "r"
  }
  [12]=>
  array(2) {
    ["word"]=>
    string(3) "爱"
    ["tag"]=>
    string(1) "v"
  }
  [13]=>
  array(2) {
    ["word"]=>
    string(6) "北京"
    ["tag"]=>
    string(2) "ns"
  }
  [14]=>
  array(2) {
    ["word"]=>
    string(3) ","
    ["tag"]=>
    string(1) "w"
  }
  [15]=>
  array(2) {
    ["word"]=>
    string(3) "我"
    ["tag"]=>
    string(1) "r"
  }
  [16]=>
  array(2) {
    ["word"]=>
    string(3) "爱"
    ["tag"]=>
    string(1) "v"
  }
  [17]=>
  array(2) {
    ["word"]=>
    string(6) "Python"
    ["tag"]=>
    string(3) "eng"
  }
  [18]=>
  array(2) {
    ["word"]=>
    string(3) "和"
    ["tag"]=>
    string(1) "c"
  }
  [19]=>
  array(2) {
    ["word"]=>
    string(3) "C++"
    ["tag"]=>
    string(3) "eng"
  }
  [20]=>
  array(2) {
    ["word"]=>
    string(3) "。"
    ["tag"]=>
    string(1) "w"
  }
}

Function 5) Using the Traditional Chinese dictionary

Example (Tutorial)

ini_set('memory_limit', '1024M');

require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init(array('mode'=>'default','dict'=>'big'));
Finalseg::init();

$seg_list = Jieba::cut("怜香惜玉也得要看对象啊!");
var_dump($seg_list);

$seg_list = Jieba::cut("憐香惜玉也得要看對象啊!");
var_dump($seg_list);

Output:

array(7) {
  [0]=>
  string(12) "怜香惜玉"
  [1]=>
  string(3) "也"
  [2]=>
  string(3) "得"
  [3]=>
  string(3) "要"
  [4]=>
  string(3) "看"
  [5]=>
  string(6) "对象"
  [6]=>
  string(3) "啊"
}
array(7) {
  [0]=>
  string(12) "憐香惜玉"
  [1]=>
  string(3) "也"
  [2]=>
  string(3) "得"
  [3]=>
  string(3) "要"
  [4]=>
  string(3) "看"
  [5]=>
  string(6) "對象"
  [6]=>
  string(3) "啊"
}

Function 6) Keeping Japanese or Korean text unfiltered

Example (Tutorial)

ini_set('memory_limit', '1024M');

require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init(array('cjk'=>'all'));
Finalseg::init();

$seg_list = Jieba::cut("한국어 또는 조선말은 제주특별자치도를 제외한 한반도 및 그 부속 도서와 한민족 거주 지역에서 쓰이는 언어로");
var_dump($seg_list);

$seg_list = Jieba::cut("日本語は、主に日本国内や日本人同士の間で使われている言語である。");
var_dump($seg_list);

// Loading custom Japanese dictionary can do a simple word segmentation
Jieba::loadUserDict("/path/to/your/japanese/dict.txt");
$seg_list = Jieba::cut("日本語は、主に日本国内や日本人同士の間で使われている言語である。");
var_dump($seg_list);

Output:

array(15) {
  [0]=>
  string(9) "한국어"
  [1]=>
  string(6) "또는"
  [2]=>
  string(12) "조선말은"
  [3]=>
  string(24) "제주특별자치도를"
  [4]=>
  string(9) "제외한"
  [5]=>
  string(9) "한반도"
  [6]=>
  string(3) "및"
  [7]=>
  string(3) "그"
  [8]=>
  string(6) "부속"
  [9]=>
  string(9) "도서와"
  [10]=>
  string(9) "한민족"
  [11]=>
  string(6) "거주"
  [12]=>
  string(12) "지역에서"
  [13]=>
  string(9) "쓰이는"
  [14]=>
  string(9) "언어로"
}
array(21) {
  [0]=>
  string(6) "日本"
  [1]=>
  string(3) "語"
  [2]=>
  string(3) "は"
  [3]=>
  string(3) "主"
  [4]=>
  string(3) "に"
  [5]=>
  string(6) "日本"
  [6]=>
  string(6) "国内"
  [7]=>
  string(3) "や"
  [8]=>
  string(6) "日本"
  [9]=>
  string(3) "人"
  [10]=>
  string(6) "同士"
  [11]=>
  string(3) "の"
  [12]=>
  string(3) "間"
  [13]=>
  string(3) "で"
  [14]=>
  string(3) "使"
  [15]=>
  string(3) "わ"
  [16]=>
  string(6) "れて"
  [17]=>
  string(6) "いる"
  [18]=>
  string(6) "言語"
  [19]=>
  string(3) "で"
  [20]=>
  string(6) "ある"
}
array(17) {
  [0]=>
  string(9) "日本語"
  [1]=>
  string(3) "は"
  [2]=>
  string(6) "主に"
  [3]=>
  string(9) "日本国"
  [4]=>
  string(3) "内"
  [5]=>
  string(3) "や"
  [6]=>
  string(9) "日本人"
  [7]=>
  string(6) "同士"
  [8]=>
  string(3) "の"
  [9]=>
  string(3) "間"
  [10]=>
  string(3) "で"
  [11]=>
  string(3) "使"
  [12]=>
  string(3) "わ"
  [13]=>
  string(6) "れて"
  [14]=>
  string(6) "いる"
  [15]=>
  string(6) "言語"
  [16]=>
  string(9) "である"
}

POS tag reference

a adjective (first letter of English "adjective")
  ad adverbial adjective (an adjective used directly as an adverbial; adjective code a combined with adverb code d)
  ag adjectival morpheme (adjective code a placed before morpheme code g)
  an nominal adjective (an adjective with noun function; adjective code a combined with noun code n)
b distinguishing word (from the initial of the character 别)
c conjunction (first letter of English "conjunction")
d adverb (second letter of "adverb"; the first letter is already used for adjectives)
  df adverb*
  dg adverbial morpheme (adverb code d placed before morpheme code g)
e exclamation (first letter of English "exclamation")
eng foreign word
f locative word (from the initial of the character 方)
g morpheme (most morphemes can serve as the "root" of a compound word; from the initial of the character 根)
h prefix-like element (first letter of English "head")
i idiom (first letter of English "idiom")
j abbreviation (from the initial of the character 简)
k suffix-like element
l fixed expression (not yet an idiom, somewhat "provisional"; from the initial of the character 临)
m numeral (third letter of English "numeral"; n and u are already in use)
  mg numeral morpheme
  mq numeral*
n noun (first letter of English "noun")
  ng nominal morpheme (noun code n placed before morpheme code g)
  nr person name (noun code n combined with the initial of 人 (ren))
  nrfg noun*
  nrt noun*
  ns place name (noun code n combined with locative code s)
  nt organization (the initial of 团 is t; noun code n combined with t)
  nz other proper noun (the first letter of the initial of 专 is z; noun code n combined with z)
o onomatopoeia (first letter of English "onomatopoeia")
p preposition (first letter of English "prepositional")
q measure word (first letter of English "quantity")
r pronoun (second letter of English "pronoun"; p is already used for prepositions)
  rg pronoun morpheme
  rr pronoun*
  rz pronoun*
s location word (first letter of English "space")
t time word (first letter of English "time")
  tg time-word morpheme (time-word code t placed before morpheme code g)
u particle (second letter of English "auxiliary"; a is already used for adjectives)
  ud particle*
  ug particle*
  uj particle*
  ul particle*
  uv particle*
  uz particle*
v verb (first letter of English "verb")
  vd adverbial verb (a verb used directly as an adverbial; verb and adverb codes combined)
  vg verbal morpheme
  vi verb*
  vn nominal verb (a verb with noun function; verb and noun codes combined)
  vq verb*
w punctuation
x non-morpheme character (merely a symbol; the letter x conventionally denotes unknowns and symbols)
y modal particle (from the initial of the character 语)
z status word (from a letter in the initial of the character 状)
  zg status word*
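
Since Posseg::cut returns word/tag pairs, these tags can be used to filter results. A minimal sketch that keeps only nouns, person names, and place names (the tag whitelist is just an example):

use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\Posseg;
Jieba::init();
Finalseg::init();
Posseg::init();

$keep = array('n', 'nr', 'ns'); // noun, person name, place name
$words = array();
foreach (Posseg::cut("我叫孙悟空,我爱北京。") as $pair) {
    if (in_array($pair['tag'], $keep)) {
        $words[] = $pair['word'];
    }
}
var_dump($words); // 孙悟空 (nr), 北京 (ns)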

Donate

If you find jieba-php useful, please consider a donation. Thank you!

  • bitcoin: 1BbihQU3CzSdyLSP9bvQq7Pi1z1jTdAaq9
  • eth: 0x92DA3F837bf2F79D422bb8CEAC632208F94cdE33

License

The MIT License (MIT)

Copyright (c) 2015 fukuball

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
